18. Iterative Proportional Fitting, Higher Dimensions
Previously, we showed how to use Iterative Proportional Fitting (IPF)
to sample a two-dimensional data set so that the resulting proportions match a defined target. In this notebook, we will show how to do the same with 3 or more dimensions. First, here are the US national percentages of race, age and gender.
- race
  - white: 58%
  - other: 42%
- age
  - minor: 28%
  - adult: 72%
- gender
  - male: 49%
  - female: 51%
Second, we will synthesize data whose distributions do not match the national ones; this synthetic data stands in for something like survey data. Lastly, we will provide code that applies IPF to this 3-dimensional data. The code is general enough to be applied to 2, 3 or more dimensions.
18.1. Synthetic data
We will generate synthetic data based on race, age and gender as follows.
white, minor, male \(\sim \mathcal{N}(5.5, 1.0)\)
white, minor, female \(\sim \mathcal{N}(5.3, 1.0)\)
white, adult, male \(\sim \mathcal{N}(5.9, 1.0)\)
white, adult, female \(\sim \mathcal{N}(5.7, 1.0)\)
other, minor, male \(\sim \mathcal{N}(5.3, 1.0)\)
other, minor, female \(\sim \mathcal{N}(5.2, 1.0)\)
other, adult, male \(\sim \mathcal{N}(5.8, 1.0)\)
other, adult, female \(\sim \mathcal{N}(5.5, 1.0)\)
[1]:
import pandas as pd
import numpy as np
import random
import itertools

np.random.seed(37)
random.seed(37)

# heights sampled from the normal distributions above,
# with deliberately unbalanced group sizes
height = [
    np.random.normal(5.5, 1.0, 100),
    np.random.normal(5.3, 1.0, 200),
    np.random.normal(5.9, 1.0, 300),
    np.random.normal(5.7, 1.0, 200),
    np.random.normal(5.3, 1.0, 400),
    np.random.normal(5.2, 1.0, 500),
    np.random.normal(5.8, 1.0, 300),
    np.random.normal(5.5, 1.0, 200)
]

# one (race, age, gender) triple per sampling group above
demographic = [
    ['white', 'minor', 'male'],
    ['white', 'minor', 'female'],
    ['white', 'adult', 'male'],
    ['white', 'adult', 'female'],
    ['other', 'minor', 'male'],
    ['other', 'minor', 'female'],
    ['other', 'adult', 'male'],
    ['other', 'adult', 'female']
]

# flatten the groups into one record per person
data = [[{'race': d[0], 'age': d[1], 'gender': d[2], 'height': h} for h in s]
        for d, s in zip(demographic, height)]
data = list(itertools.chain(*data))

df = pd.DataFrame(data)
df.head()
[1]:
    race    age gender    height
0  white  minor   male  5.445536
1  white  minor   male  6.174308
2  white  minor   male  5.846647
3  white  minor   male  4.199654
4  white  minor   male  7.018512
The percentages of each demographic dimension are as follows.
[2]:
df.race.value_counts().sort_index() / df.shape[0]
[2]:
other 0.636364
white 0.363636
Name: race, dtype: float64
[3]:
df.age.value_counts().sort_index() / df.shape[0]
[3]:
adult 0.454545
minor 0.545455
Name: age, dtype: float64
[4]:
df.gender.value_counts().sort_index() / df.shape[0]
[4]:
female 0.5
male 0.5
Name: gender, dtype: float64
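None of these synthetic marginals match the national targets, which is exactly the gap IPF will close. Here is a small sketch (not part of the original workflow) that tabulates observed versus target proportions side by side:

# observed synthetic marginals vs the national target proportions (a sketch)
national = {'race': {'white': 0.58, 'other': 0.42},
            'age': {'minor': 0.28, 'adult': 0.72},
            'gender': {'male': 0.49, 'female': 0.51}}
for col, t in national.items():
    obs = df[col].value_counts(normalize=True).to_dict()
    print(col, {k: (round(obs[k], 3), v) for k, v in t.items()})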
18.2. Contingency table
Now we will create the contingency table \(X\) from the synthetic data.

- \(X\) is the contingency table, a 3-dimensional matrix
- \(u\) is the target marginals, which directly reflect the target percentages
- \(f\) is a helper vector that keeps track of how the dimensions of the matrix map back to the demographic variables
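For intuition, the marginal of an n-dimensional table along dimension \(i\) is just the sum over all the other axes. A toy numpy sketch (the factor names in the comments are illustrative):

# toy 2x2x2 table: each marginal collapses the other two axes
T = np.arange(8).reshape(2, 2, 2)
print(T.sum(axis=(1, 2)))  # marginal along dimension 0 (e.g. race)
print(T.sum(axis=(0, 2)))  # marginal along dimension 1 (e.g. age)
print(T.sum(axis=(0, 1)))  # marginal along dimension 2 (e.g. gender)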
[5]:
def get_target_marginals(d):
    """Converts a {factor: {level: count}} dict into factor names and a
    matrix of target marginals (levels sorted alphabetically per factor)."""
    factors = list(d.keys())
    targets = [sorted([(k2, v2) for k2, v2 in v.items()]) for k, v in d.items()]
    targets = np.array([[v for _, v in item] for item in targets])
    return factors, targets

def get_table(df, targets):
    """Builds the n-dimensional contingency table X plus the target marginals."""
    factors, target_marginals = get_target_marginals(targets)
    cross_tab = pd.crosstab(df[factors[0]], [df[c] for c in factors[1:]])
    shape = tuple([df[c].unique().shape[0] for c in factors])
    table = cross_tab.values.reshape(shape)
    return factors, target_marginals, table

f, u, X = get_table(df, {
    'race': {'white': 5800, 'other': 4200},
    'age': {'minor': 2800, 'adult': 7200},
    'gender': {'male': 4900, 'female': 5100}
})
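Before fitting, it is worth sanity-checking the table. The axis order of \(X\) follows \(f\) (race, age, gender), so each observed marginal can be compared against its target; a minimal sketch:

# current marginals of X along each dimension vs the targets u (a sketch)
for i, name in enumerate(f):
    axes = tuple(j for j in range(X.ndim) if j != i)
    print(name, X.sum(axis=axes), 'target:', u[i])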
18.3. IPF algorithm
Using the IPF algorithm, we will learn the weights.
[6]:
def get_coordinates(M):
    """Enumerates every cell coordinate of the n-dimensional table M."""
    return list(itertools.product(*[list(range(n)) for n in M.shape]))

def get_marginals(M, i):
    """Computes the marginal totals of M along dimension i."""
    coordinates = get_coordinates(M)
    key = lambda tup: tup[0]
    counts = [(c[i], M[c]) for c in coordinates]
    counts = sorted(counts, key=key)
    counts = itertools.groupby(counts, key=key)
    counts = {k: sum([v[1] for v in g]) for k, g in counts}
    return counts

def get_all_marginals(M):
    """Stacks the marginals of every dimension into one array."""
    return np.array([[v for _, v in get_marginals(M, i).items()]
                     for i in range(len(M.shape))])

def get_counts(M, i):
    """Groups every cell (value, coordinate) by its index along dimension i."""
    coordinates = get_coordinates(M)
    key = lambda tup: tup[0]
    counts = [(c[i], M[c], c) for c in coordinates]
    counts = sorted(counts, key=key)
    counts = itertools.groupby(counts, key=key)
    counts = {k: [(tup[1], tup[2]) for tup in g] for k, g in counts}
    return counts

def update_values(M, i, u):
    """Rescales every cell so the marginals along dimension i hit the target u."""
    marg = get_marginals(M, i)
    vals = get_counts(M, i)
    d = [[(c, n * u[k] / marg[k]) for n, c in v] for k, v in vals.items()]
    d = itertools.chain(*d)
    d = list(d)
    return d

def get_deltas(o, t):
    """L2 distance between observed marginals o and targets t, per dimension."""
    return np.array([np.linalg.norm(o[r] - t[r], 2) for r in range(o.shape[0])])

def ipf_update(M, u):
    """One full IPF sweep: rescale along each dimension in turn."""
    for i in range(len(M.shape)):
        values = update_values(M, i, u[i])
        for idx, v in values:
            M[idx] = v
    o = get_all_marginals(M)
    d = get_deltas(o, u)
    return M, d

def get_weights(X, u, max_iters=50, zero_threshold=0.0001, convergence_threshold=3, debug=True):
    """Runs IPF until the change in marginal deltas stays near zero."""
    # note: X holds integer counts, so the in-place updates are truncated
    # to integers; this is why the per-dimension deltas below plateau at a
    # small non-zero value instead of reaching exactly zero
    M = X.copy()
    d_prev = np.zeros(len(M.shape))
    count_zero = 0
    for _ in range(max_iters):
        M, d_next = ipf_update(M, u)
        d = np.linalg.norm(d_prev - d_next, 2)
        if d < zero_threshold:
            count_zero += 1
        if debug:
            print(','.join([f'{v:.5f}' for v in d_next]), d)
        d_prev = d_next
        if count_zero >= convergence_threshold:
            break
    w = M / M.sum()
    return w
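For reference, the same fit can be expressed far more compactly with numpy broadcasting. The sketch below (ipf_numpy is a hypothetical name, not part of the code above) works on floats, so it avoids the integer truncation noted in get_weights:

def ipf_numpy(X, u, n_iters=50):
    # scale the table along each axis in turn until its marginals match u
    M = X.astype(float).copy()
    for _ in range(n_iters):
        for i, target in enumerate(u):
            axes = tuple(j for j in range(M.ndim) if j != i)
            marg = M.sum(axis=axes)                 # current marginals along axis i
            shape = [1] * M.ndim
            shape[i] = -1
            M = M * (target / marg).reshape(shape)  # proportional rescaling
    return M / M.sum()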
[7]:
w = get_weights(X, u)
758.02375,123.06909,3.16228 767.9557278906121
75.74299,7.28011,3.60555 692.0363565714724
6.70820,2.23607,2.23607 69.23235546955618
2.23607,2.23607,2.23607 4.47213595499958
2.23607,2.23607,2.23607 0.0
2.23607,2.23607,2.23607 0.0
2.23607,2.23607,2.23607 0.0
[8]:
w
[8]:
array([[[0.113334 , 0.13604081],
[0.10423127, 0.0663199 ]],
[[0.21406422, 0.256677 ],
[0.0783235 , 0.0310093 ]]])
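As a sanity check (a sketch), the marginals of the learned weights should sit close to the normalized targets, up to the small integer-truncation residual noted above:

# marginals of w along each dimension vs the normalized targets (a sketch)
for i, name in enumerate(f):
    axes = tuple(j for j in range(w.ndim) if j != i)
    print(name, w.sum(axis=axes), 'target:', u[i] / u[i].sum())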
18.4. Sampling
Now that we have learned the weights, we can use them to sample (with replacement) from the synthetic data.
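pandas normalizes sampling weights internally, so each row's selection probability is proportional to its weight. A minimal illustration on a toy frame (not part of the workflow below):

# rows with weight 3.0 are drawn about three times as often as weight 1.0
toy = pd.DataFrame({'x': ['a', 'b'], 'w': [3.0, 1.0]})
print(toy.sample(n=1000, replace=True, weights='w')['x'].value_counts(normalize=True))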
[9]:
import functools

def get_sampling_weights(df, f, w):
    """Maps each demographic combination to a per-row sampling weight.

    Each cell weight from IPF is divided by that cell's row count so the
    weight is spread evenly across the rows belonging to that cell."""
    get_filters = lambda df, fields, values: [df[c] == v for c, v in zip(fields, values)]
    get_total = lambda df, fields, values: df[functools.reduce(
        lambda a, b: a & b, get_filters(df, fields, values))].shape[0]
    combos = itertools.product(*[sorted(df[c].unique()) for c in f])
    return {k: v / get_total(df, f, k) for k, v in zip(list(combos), np.ravel(w))}

def get_samples(df, f, w, n=10_000):
    """Samples n rows with replacement according to the IPF weights."""
    weights = get_sampling_weights(df, f, w)
    s = df.apply(lambda r: weights[tuple([r[c] for c in f])], axis=1)
    return df.sample(n=n, replace=True, weights=s)

sample_df = get_samples(df, f, w)
sample_df
[9]:
       race    age  gender    height
1461  other  minor  female  5.337720
2034  other  adult  female  6.243500
1715  other  adult    male  5.261089
310   white  adult    male  5.682793
1884  other  adult    male  5.192252
...     ...    ...     ...       ...
2140  other  adult  female  6.026459
732   white  adult  female  5.155105
34    white  minor    male  5.273937
117   white  minor  female  5.599598
649   white  adult  female  6.546797

10000 rows × 4 columns
Here is the cross-tabulation of the resulting sample.
[10]:
ct = pd.crosstab(sample_df.race, [sample_df.age, sample_df.gender])
ct
[10]:
age      adult         minor
gender  female  male  female  male
race
other     1165  1344    1041   628
white     2225  2492     791   314
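The joint sample proportions should likewise track the learned weights; a quick sketch, relying on the fact that ct uses the same sorted level order that w was built with:

# joint sample proportions vs the learned weights w (a sketch)
print(ct.values.reshape(w.shape) / ct.values.sum())
print(w)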
Let’s check and verify the marginals.
[11]:
sample_df.race.value_counts().sort_index() / sample_df.shape[0]
[11]:
other 0.4178
white 0.5822
Name: race, dtype: float64
[12]:
sample_df.age.value_counts().sort_index() / sample_df.shape[0]
[12]:
adult 0.7226
minor 0.2774
Name: age, dtype: float64
[13]:
sample_df.gender.value_counts().sort_index() / sample_df.shape[0]
[13]:
female 0.5222
male 0.4778
Name: gender, dtype: float64
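As a final programmatic check (a sketch with a hypothetical 2-percentage-point tolerance), the sampled marginals sit near the 58/42, 72/28 and 49/51 targets:

# the tolerance of 2 percentage points is an assumption for a 10,000-row sample
assert abs(sample_df.race.value_counts(normalize=True)['white'] - 0.58) < 0.02
assert abs(sample_df.age.value_counts(normalize=True)['adult'] - 0.72) < 0.02
assert abs(sample_df.gender.value_counts(normalize=True)['male'] - 0.49) < 0.02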