18. Iterative Proportional Fitting, Higher Dimensions

Previously, we showed how to use Iterative Proportional Fitting (IPF) to sample a two-dimensional data set so that the resulting proportions match a defined target. In this notebook, we show how to do the same with three or more dimensions. First, here are the US national percentages for race, age and gender.

  • race

    • white: 58%

    • other: 42%

  • age

    • minor: 28%

    • adult: 72%

  • gender

    • male: 49%

    • female: 51%

Second, we will synthesize data whose distributions do not match the national ones. This synthetic data stands in for something like survey data. Lastly, we will provide code that applies IPF to this 3-dimensional data. The code is written to work for 2, 3 or more dimensions, although the helpers below stack the marginals into a single NumPy array and therefore assume every dimension has the same number of levels.

18.1. Synthetic data

We will generate a synthetic height for each combination of race, age and gender, drawn from the following normal distributions.

  • white, minor, male \(\sim \mathcal{N}(5.5, 1.0)\)

  • white, minor, female \(\sim \mathcal{N}(5.3, 1.0)\)

  • white, adult, male \(\sim \mathcal{N}(5.9, 1.0)\)

  • white, adult, female \(\sim \mathcal{N}(5.7, 1.0)\)

  • other, minor, male \(\sim \mathcal{N}(5.3, 1.0)\)

  • other, minor, female \(\sim \mathcal{N}(5.2, 1.0)\)

  • other, adult, male \(\sim \mathcal{N}(5.8, 1.0)\)

  • other, adult, female \(\sim \mathcal{N}(5.5, 1.0)\)

[1]:
import pandas as pd
import numpy as np
import random
import itertools

np.random.seed(37)
random.seed(37)

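# One array of heights per demographic group, in the same order as `demographic` below.
# The group sizes are unequal, so the sample marginals differ from the national targets.
height = [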
    np.random.normal(5.5, 1.0, 100),
    np.random.normal(5.3, 1.0, 200),
    np.random.normal(5.9, 1.0, 300),
    np.random.normal(5.7, 1.0, 200),
    np.random.normal(5.3, 1.0, 400),
    np.random.normal(5.2, 1.0, 500),
    np.random.normal(5.8, 1.0, 300),
    np.random.normal(5.5, 1.0, 200)
]

demographic = [
    ['white', 'minor', 'male'],
    ['white', 'minor', 'female'],
    ['white', 'adult', 'male'],
    ['white', 'adult', 'female'],
    ['other', 'minor', 'male'],
    ['other', 'minor', 'female'],
    ['other', 'adult', 'male'],
    ['other', 'adult', 'female']
]

data = [[{'race': d[0], 'age': d[1], 'gender': d[2], 'height': h} for h in s]
        for d, s in zip(demographic, height)]
data = list(itertools.chain(*data))

df = pd.DataFrame(data)
df.head()
[1]:
    race    age gender    height
0  white  minor   male  5.445536
1  white  minor   male  6.174308
2  white  minor   male  5.846647
3  white  minor   male  4.199654
4  white  minor   male  7.018512

The proportions for each demographic dimension are as follows.

[2]:
df.race.value_counts().sort_index() / df.shape[0]
[2]:
other    0.636364
white    0.363636
Name: race, dtype: float64
[3]:
df.age.value_counts().sort_index() / df.shape[0]
[3]:
adult    0.454545
minor    0.545455
Name: age, dtype: float64
[4]:
df.gender.value_counts().sort_index() / df.shape[0]
[4]:
female    0.5
male      0.5
Name: gender, dtype: float64

18.2. Contingency table

Now we will create the contingency table \(X\) from the synthetic data.

  • \(X\) is the contingency table, a 3-dimensional array with one axis per demographic variable

  • \(u\) holds the target marginals, which directly reflect the target percentages (expressed here as counts out of 10,000)

  • \(f\) is a helper list that tracks how the axes of the array map back to the demographic variables
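To make the layout of \(X\) concrete, here is a minimal sketch on a toy data frame. The frame `toy` and the names `levels`, `table` and `idx` are illustrative and not part of the notebook. It relies on `pd.crosstab` sorting its row and column labels, so that reshaping the 2-D cross-tabulation yields one axis per factor with levels in sorted order.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): every race/age/gender combination appears at least once.
toy = pd.DataFrame({
    'race':   ['white', 'white', 'other', 'other', 'other', 'white', 'other', 'white', 'white'],
    'age':    ['minor', 'adult', 'minor', 'adult', 'adult', 'minor', 'minor', 'adult', 'minor'],
    'gender': ['male', 'female', 'female', 'male', 'female', 'female', 'male', 'male', 'male'],
})

factors = ['race', 'age', 'gender']
# Sorted levels per factor; these label the axes of the reshaped table.
levels = [sorted(toy[c].unique()) for c in factors]

# Rows: first factor. Columns: MultiIndex over the remaining factors.
ct = pd.crosstab(toy[factors[0]], [toy[c] for c in factors[1:]])

# One axis per factor.
table = ct.values.reshape(tuple(len(lv) for lv in levels))

# Look up the cell for (white, minor, male) through the sorted levels.
idx = tuple(levels[k].index(v) for k, v in enumerate(['white', 'minor', 'male']))
print(levels)
print(idx, table[idx])
```

The reshape only works when every combination of levels occurs in the data; a missing combination would drop a column from the cross-tabulation and change its shape.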

[5]:
def get_target_marginals(d):
    # factor names, in the order given
    factors = list(d.keys())
    # sort each factor's levels alphabetically so they line up with pd.crosstab's ordering
    targets = [sorted([(k2, v2) for k2, v2 in v.items()]) for k, v in d.items()]
    targets = np.array([[v for _, v in item] for item in targets])
    return factors, targets

def get_table(df, targets):
    factors, target_marginals = get_target_marginals(targets)

    # first factor on the rows, remaining factors as a column MultiIndex
    cross_tab = pd.crosstab(df[factors[0]], [df[c] for c in factors[1:]])
    # one axis per factor, sized by its number of levels
    shape = tuple([df[c].unique().shape[0] for c in factors])
    table = cross_tab.values.reshape(shape)

    return factors, target_marginals, table

f, u, X = get_table(df, {
    'race': {'white': 5800, 'other': 4200},
    'age': {'minor': 2800, 'adult': 7200},
    'gender': {'male': 4900, 'female': 5100}
})

18.3. IPF algorithm

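Using the IPF algorithm, we will learn the cell weights. Each pass rescales the table along one dimension at a time so that its marginals match the targets in \(u\).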
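As a sketch of the update that the code in this section applies, write \(c = (c_1, \ldots, c_d)\) for a cell coordinate and \(m_{k,j}\) for the current marginal of level \(j\) on dimension \(k\). The symbols \(c\), \(d\) and \(m\) are notation introduced here for illustration. Fitting dimension \(k\) then amounts to

\[
m_{k,j} = \sum_{c \,:\, c_k = j} X_c, \qquad X_c \leftarrow X_c \cdot \frac{u_{k, c_k}}{m_{k, c_k}} .
\]

After this step the marginals of dimension \(k\) equal \(u_k\) exactly, but the marginals of the other dimensions may shift, which is why the update cycles over \(k = 1, \ldots, d\) repeatedly.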

[6]:
def get_coordinates(M):
    # all index tuples of M, e.g. (0, 0, 0), (0, 0, 1), ... for a 2x2x2 table
    return list(itertools.product(*[list(range(n)) for n in M.shape]))

def get_marginals(M, i):
    # total of M for every level of dimension i, as {level index: total}
    coordinates = get_coordinates(M)

    key = lambda tup: tup[0]
    counts = [(c[i], M[c]) for c in coordinates]
    counts = sorted(counts, key=key)
    counts = itertools.groupby(counts, key=key)
    counts = {k: sum([v[1] for v in g]) for k, g in counts}

    return counts

def get_all_marginals(M):
    # marginals of every dimension, one row per dimension
    return np.array([[v for _, v in get_marginals(M, i).items()]
                     for i in range(len(M.shape))])

def get_counts(M, i):
    # cells grouped by their level on dimension i, as {level index: [(value, coordinate), ...]}
    coordinates = get_coordinates(M)

    key = lambda tup: tup[0]
    counts = [(c[i], M[c], c) for c in coordinates]
    counts = sorted(counts, key=key)
    counts = itertools.groupby(counts, key=key)
    counts = {k: [(tup[1], tup[2]) for tup in g] for k, g in counts}

    return counts

def update_values(M, i, u):
    # rescale every cell so that the marginals of dimension i equal the targets u
    marg = get_marginals(M, i)
    vals = get_counts(M, i)

    d = [[(c, n * u[k] / marg[k]) for n, c in v] for k, v in vals.items()]
    d = itertools.chain(*d)
    d = list(d)

    return d

def ipf_update(M, u):
    # one IPF sweep: fit each dimension in turn, then measure the remaining error
    for i in range(len(M.shape)):
        values = update_values(M, i, u[i])
        for idx, v in values:
            M[idx] = v

    o = get_all_marginals(M)
    d = get_deltas(o, u)

    return M, d

def get_deltas(o, t):
    # Euclidean distance between observed and target marginals, per dimension
    return np.array([np.linalg.norm(o[r] - t[r], 2) for r in range(o.shape[0])])

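def get_weights(X, u, max_iters=50, zero_threshold=0.0001, convergence_threshold=3, debug=True):
    # X: contingency table; u: target marginals, one row per dimension.
    # NOTE: X holds integer counts, so this copy is integer too and every update is
    # truncated to a whole number when written back into M. That is why the deltas
    # printed below level off at 2.23607 instead of approaching 0. Using
    # X.astype(float) here would keep the fractional values.
    M = X.copy()

    d_prev = np.zeros(len(M.shape))
    count_zero = 0

    for _ in range(max_iters):
        M, d_next = ipf_update(M, u)
        # change in the per-dimension deltas since the previous iteration
        d = np.linalg.norm(d_prev - d_next, 2)

        if d < zero_threshold:
            count_zero += 1

        if debug:
            # per-dimension deltas, followed by the change since the previous iteration
            print(','.join([f'{v:.5f}' for v in d_next]), d)
        d_prev = d_next

        # stop once the deltas have stopped changing convergence_threshold times
        if count_zero >= convergence_threshold:
            break

    # normalize the fitted table into cell weights that sum to 1
    w = M / M.sum()
    return w
[7]:
w = get_weights(X, u)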
758.02375,123.06909,3.16228 767.9557278906121
75.74299,7.28011,3.60555 692.0363565714724
6.70820,2.23607,2.23607 69.23235546955618
2.23607,2.23607,2.23607 4.47213595499958
2.23607,2.23607,2.23607 0.0
2.23607,2.23607,2.23607 0.0
2.23607,2.23607,2.23607 0.0
[8]:
w
[8]:
array([[[0.113334  , 0.13604081],
        [0.10423127, 0.0663199 ]],

       [[0.21406422, 0.256677  ],
        [0.0783235 , 0.0310093 ]]])
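For comparison, the same update can be sketched in vectorized form on a floating-point copy of the table, so that no fractional value is lost between sweeps. The names `ipf_float`, `axis_marginal`, `n_iters` and `tol` are illustrative and not part of the notebook, and the sketch assumes every marginal stays strictly positive. The counts are the group sizes from section 18.1 arranged in sorted level order (race: other, white; age: adult, minor; gender: female, male).

```python
import numpy as np

def axis_marginal(M, k):
    # Sum over every axis except k.
    return M.sum(axis=tuple(a for a in range(M.ndim) if a != k))

def ipf_float(table, targets, n_iters=200, tol=1e-9):
    """Sketch of IPF on a float copy of `table`; targets[k] is the target marginal of axis k."""
    M = np.asarray(table, dtype=float).copy()
    targets = [np.asarray(t, dtype=float) for t in targets]
    for _ in range(n_iters):
        for k, t in enumerate(targets):
            # Scale each slice along axis k by target / current marginal.
            shape = [1] * M.ndim
            shape[k] = -1
            M *= (t / axis_marginal(M, k)).reshape(shape)
        # Largest absolute gap between any marginal and its target.
        err = max(np.abs(axis_marginal(M, k) - t).max() for k, t in enumerate(targets))
        if err < tol:
            break
    return M

# Group sizes from section 18.1, indexed as [race][age][gender] in sorted level order.
X_counts = np.array([[[200, 300], [500, 400]],
                     [[200, 300], [200, 100]]])
# Targets in the same sorted order: (other, white), (adult, minor), (female, male).
u_targets = [[4200, 5800], [7200, 2800], [5100, 4900]]

fitted = ipf_float(X_counts, u_targets)
weights = fitted / fitted.sum()
for k in range(fitted.ndim):
    print(axis_marginal(fitted, k))
```

Because the table stays in floating point, the fitted marginals can approach the targets to within `tol` rather than to the nearest whole number.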

18.4. Sampling

Now that we have learned the weights, we can use them to sample (with replacement) from the synthetic data.

[9]:
import functools

def get_sampling_weights(df, f, w):
    # number of rows of df that fall in the cell identified by `values`
    def get_total(values):
        mask = functools.reduce(lambda a, b: a & b, [df[c] == v for c, v in zip(f, values)])
        return df[mask].shape[0]

    # cells in the same (sorted) order as the flattened weight array
    cells = itertools.product(*[sorted(df[c].unique()) for c in f])

    # per-row weight = cell weight / number of rows in the cell
    return {k: v / get_total(k) for k, v in zip(cells, np.ravel(w))}

def get_samples(df, f, w, n=10_000):
    weights = get_sampling_weights(df, f, w)
    s = df.apply(lambda r: weights[tuple([r[c] for c in f])], axis=1)
    return df.sample(n=n, replace=True, weights=s)

sample_df = get_samples(df, f, w)
sample_df
[9]:
       race    age  gender    height
1461  other  minor  female  5.337720
2034  other  adult  female  6.243500
1715  other  adult    male  5.261089
 310  white  adult    male  5.682793
1884  other  adult    male  5.192252
 ...    ...    ...     ...       ...
2140  other  adult  female  6.026459
 732  white  adult  female  5.155105
  34  white  minor    male  5.273937
 117  white  minor  female  5.599598
 649  white  adult  female  6.546797

10000 rows × 4 columns
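The per-row weighting can be seen on a one-dimensional toy example. Dividing a cell's weight by the number of rows in that cell means the rows of the cell are together drawn with probability equal to the cell weight, however many rows it contains. The frame `toy` and the shares in `target_share` are hypothetical values chosen for illustration.

```python
import pandas as pd

# Toy frame (hypothetical): group 'a' is over-represented relative to the target.
toy = pd.DataFrame({'group': ['a'] * 8 + ['b'] * 2})
target_share = {'a': 0.3, 'b': 0.7}  # assumed target cell weights

# Per-row weight = cell weight / number of rows in the cell.
cell_sizes = toy['group'].value_counts()
row_weights = toy['group'].map(lambda g: target_share[g] / cell_sizes[g])

# Summed per group, the row weights recover the cell weights.
group_totals = row_weights.groupby(toy['group']).sum()

# Weighted sampling with replacement; shares should land near the targets.
sample = toy.sample(n=20_000, replace=True, weights=row_weights, random_state=0)
shares = sample['group'].value_counts(normalize=True)
print(group_totals)
print(shares)
```

The sampled shares match the targets only approximately, since the draw is random.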

Here is the cross-tabulation of the resulting sample.

[10]:
ct = pd.crosstab(sample_df.race, [sample_df.age, sample_df.gender])
ct
[10]:
age     adult        minor
gender female  male female  male
race
other    1165  1344   1041   628
white    2225  2492    791   314

Let’s verify the marginals against the targets (race 58/42, age 28/72, gender 49/51).

[11]:
sample_df.race.value_counts().sort_index() / sample_df.shape[0]
[11]:
other    0.4178
white    0.5822
Name: race, dtype: float64
[12]:
sample_df.age.value_counts().sort_index() / sample_df.shape[0]
[12]:
adult    0.7226
minor    0.2774
Name: age, dtype: float64
[13]:
sample_df.gender.value_counts().sort_index() / sample_df.shape[0]
[13]:
female    0.5222
male      0.4778
Name: gender, dtype: float64
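The comparison against the targets can also be done in one step. The sketch below defines a hypothetical helper, `marginal_gaps`, that returns the observed share minus the target share for every level. So that it runs on its own, it rebuilds a frame from the cell counts in the cross-tabulation above rather than using `sample_df`.

```python
import pandas as pd

def marginal_gaps(frame, targets):
    """Return {factor: Series of (observed share - target share)} per level."""
    gaps = {}
    for factor, target in targets.items():
        observed = frame[factor].value_counts(normalize=True)
        target = pd.Series(target, dtype=float)
        target = target / target.sum()  # counts -> shares
        gaps[factor] = observed.reindex(target.index, fill_value=0.0) - target
    return gaps

# Cell counts copied from the cross-tabulation of the sample above.
cells = {
    ('other', 'adult', 'female'): 1165, ('other', 'adult', 'male'): 1344,
    ('other', 'minor', 'female'): 1041, ('other', 'minor', 'male'): 628,
    ('white', 'adult', 'female'): 2225, ('white', 'adult', 'male'): 2492,
    ('white', 'minor', 'female'): 791, ('white', 'minor', 'male'): 314,
}
rows = [key for key, n in cells.items() for _ in range(n)]
frame = pd.DataFrame(rows, columns=['race', 'age', 'gender'])

gaps = marginal_gaps(frame, {
    'race': {'white': 5800, 'other': 4200},
    'age': {'minor': 2800, 'adult': 7200},
    'gender': {'male': 4900, 'female': 5100},
})
for factor, gap in gaps.items():
    print(factor, gap.round(4).to_dict())
```

A gap near zero means the sampled share is close to its target; the sign shows whether the level is over- or under-represented in the sample.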