18. Iterative Proportional Fitting, Higher Dimensions
Previously, we showed how to use Iterative Proportional Fitting (IPF)
to sample a two-dimensional data set so that the resulting proportions match a defined target. In this notebook, we will show how to do the same with 3 or more dimensions. First, here are the US national percentages of race, age and gender.
- race
  - white: 58%
  - other: 42%
- age
  - minor: 28%
  - adult: 72%
- gender
  - male: 49%
  - female: 51%
Second, we will synthesize data whose distributions do not match the national ones; this synthetic data stands in for something like survey data. Lastly, we will provide code that applies IPF to this 3-dimensional data. The code is general enough to be applied to 2, 3 or more dimensions.
18.1. Synthetic data
We will generate synthetic data based on race, age and gender as follows.
white, minor, male \(\sim \mathcal{N}(5.5, 1.0)\)
white, minor, female \(\sim \mathcal{N}(5.3, 1.0)\)
white, adult, male \(\sim \mathcal{N}(5.9, 1.0)\)
white, adult, female \(\sim \mathcal{N}(5.7, 1.0)\)
other, minor, male \(\sim \mathcal{N}(5.3, 1.0)\)
other, minor, female \(\sim \mathcal{N}(5.2, 1.0)\)
other, adult, male \(\sim \mathcal{N}(5.8, 1.0)\)
other, adult, female \(\sim \mathcal{N}(5.5, 1.0)\)
[1]:
import pandas as pd
import numpy as np
import random
import itertools

np.random.seed(37)
random.seed(37)

# heights sampled from the normal distributions above,
# with deliberately unbalanced group sizes
height = [
    np.random.normal(5.5, 1.0, 100),
    np.random.normal(5.3, 1.0, 200),
    np.random.normal(5.9, 1.0, 300),
    np.random.normal(5.7, 1.0, 200),
    np.random.normal(5.3, 1.0, 400),
    np.random.normal(5.2, 1.0, 500),
    np.random.normal(5.8, 1.0, 300),
    np.random.normal(5.5, 1.0, 200)
]

# one (race, age, gender) triple per sampling group above
demographic = [
    ['white', 'minor', 'male'],
    ['white', 'minor', 'female'],
    ['white', 'adult', 'male'],
    ['white', 'adult', 'female'],
    ['other', 'minor', 'male'],
    ['other', 'minor', 'female'],
    ['other', 'adult', 'male'],
    ['other', 'adult', 'female']
]

# flatten the groups into one record per person
data = [[{'race': d[0], 'age': d[1], 'gender': d[2], 'height': h} for h in s]
        for d, s in zip(demographic, height)]
data = list(itertools.chain(*data))

df = pd.DataFrame(data)
df.head()
[1]:
    race    age gender    height
0  white  minor   male  5.445536
1  white  minor   male  6.174308
2  white  minor   male  5.846647
3  white  minor   male  4.199654
4  white  minor   male  7.018512
The percentages of each demographic dimension are as follows.
[2]:
df.race.value_counts().sort_index() / df.shape[0]
[2]:
other 0.636364
white 0.363636
Name: race, dtype: float64
[3]:
df.age.value_counts().sort_index() / df.shape[0]
[3]:
adult 0.454545
minor 0.545455
Name: age, dtype: float64
[4]:
df.gender.value_counts().sort_index() / df.shape[0]
[4]:
female 0.5
male 0.5
Name: gender, dtype: float64
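None of these synthetic marginals match the national targets, which is exactly the gap IPF will close. Here is a small sketch (not part of the original workflow) that tabulates observed versus target proportions side by side:

# observed synthetic marginals vs the national target proportions (a sketch)
national = {'race': {'white': 0.58, 'other': 0.42},
            'age': {'minor': 0.28, 'adult': 0.72},
            'gender': {'male': 0.49, 'female': 0.51}}
for col, t in national.items():
    obs = df[col].value_counts(normalize=True).to_dict()
    print(col, {k: (round(obs[k], 3), v) for k, v in t.items()})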
18.2. Contingency table
Now we will create the contingency table \(X\) from the synthetic data.

- \(X\) is the contingency table, a 3-dimensional matrix
- \(u\) is the target marginals, which directly reflect the target percentages
- \(f\) is a helper vector that keeps track of how the dimensions of the matrix map back to the demographic variables
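For intuition, the marginal of an n-dimensional table along dimension \(i\) is just the sum over all the other axes. A toy numpy sketch (the factor names in the comments are illustrative):

# toy 2x2x2 table: each marginal collapses the other two axes
T = np.arange(8).reshape(2, 2, 2)
print(T.sum(axis=(1, 2)))  # marginal along dimension 0 (e.g. race)
print(T.sum(axis=(0, 2)))  # marginal along dimension 1 (e.g. age)
print(T.sum(axis=(0, 1)))  # marginal along dimension 2 (e.g. gender)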
[5]:
def get_target_marginals(d):
    """Converts a {factor: {level: count}} dict into factor names and a
    matrix of target marginals (levels sorted alphabetically per factor)."""
    factors = list(d.keys())
    targets = [sorted([(k2, v2) for k2, v2 in v.items()]) for k, v in d.items()]
    targets = np.array([[v for _, v in item] for item in targets])
    return factors, targets

def get_table(df, targets):
    """Builds the n-dimensional contingency table X plus the target marginals."""
    factors, target_marginals = get_target_marginals(targets)
    cross_tab = pd.crosstab(df[factors[0]], [df[c] for c in factors[1:]])
    shape = tuple([df[c].unique().shape[0] for c in factors])
    table = cross_tab.values.reshape(shape)
    return factors, target_marginals, table

f, u, X = get_table(df, {
    'race': {'white': 5800, 'other': 4200},
    'age': {'minor': 2800, 'adult': 7200},
    'gender': {'male': 4900, 'female': 5100}
})
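Before fitting, it is worth sanity-checking the table. The axis order of \(X\) follows \(f\) (race, age, gender), so each observed marginal can be compared against its target; a minimal sketch:

# current marginals of X along each dimension vs the targets u (a sketch)
for i, name in enumerate(f):
    axes = tuple(j for j in range(X.ndim) if j != i)
    print(name, X.sum(axis=axes), 'target:', u[i])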
18.3. IPF algorithm
Using the IPF algorithm, we will learn the weights.
[6]:
def get_coordinates(M):
    """Enumerates every cell coordinate of the n-dimensional table M."""
    return list(itertools.product(*[list(range(n)) for n in M.shape]))

def get_marginals(M, i):
    """Computes the marginal totals of M along dimension i."""
    coordinates = get_coordinates(M)
    key = lambda tup: tup[0]
    counts = [(c[i], M[c]) for c in coordinates]
    counts = sorted(counts, key=key)
    counts = itertools.groupby(counts, key=key)
    counts = {k: sum([v[1] for v in g]) for k, g in counts}
    return counts

def get_all_marginals(M):
    """Stacks the marginals of every dimension into one array."""
    return np.array([[v for _, v in get_marginals(M, i).items()]
                     for i in range(len(M.shape))])

def get_counts(M, i):
    """Groups every cell (value, coordinate) by its index along dimension i."""
    coordinates = get_coordinates(M)
    key = lambda tup: tup[0]
    counts = [(c[i], M[c], c) for c in coordinates]
    counts = sorted(counts, key=key)
    counts = itertools.groupby(counts, key=key)
    counts = {k: [(tup[1], tup[2]) for tup in g] for k, g in counts}
    return counts

def update_values(M, i, u):
    """Rescales every cell so the marginals along dimension i hit the target u."""
    marg = get_marginals(M, i)
    vals = get_counts(M, i)
    d = [[(c, n * u[k] / marg[k]) for n, c in v] for k, v in vals.items()]
    d = itertools.chain(*d)
    d = list(d)
    return d

def get_deltas(o, t):
    """L2 distance between observed marginals o and targets t, per dimension."""
    return np.array([np.linalg.norm(o[r] - t[r], 2) for r in range(o.shape[0])])

def ipf_update(M, u):
    """One full IPF sweep: rescale along each dimension in turn."""
    for i in range(len(M.shape)):
        values = update_values(M, i, u[i])
        for idx, v in values:
            M[idx] = v
    o = get_all_marginals(M)
    d = get_deltas(o, u)
    return M, d

def get_weights(X, u, max_iters=50, zero_threshold=0.0001, convergence_threshold=3, debug=True):
    """Runs IPF until the change in marginal deltas stays near zero."""
    # note: X holds integer counts, so the in-place updates are truncated
    # to integers; this is why the per-dimension deltas below plateau at a
    # small non-zero value instead of reaching exactly zero
    M = X.copy()
    d_prev = np.zeros(len(M.shape))
    count_zero = 0
    for _ in range(max_iters):
        M, d_next = ipf_update(M, u)
        d = np.linalg.norm(d_prev - d_next, 2)
        if d < zero_threshold:
            count_zero += 1
        if debug:
            print(','.join([f'{v:.5f}' for v in d_next]), d)
        d_prev = d_next
        if count_zero >= convergence_threshold:
            break
    w = M / M.sum()
    return w
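For reference, the same fit can be expressed far more compactly with numpy broadcasting. The sketch below (ipf_numpy is a hypothetical name, not part of the code above) works on floats, so it avoids the integer truncation noted in get_weights:

def ipf_numpy(X, u, n_iters=50):
    # scale the table along each axis in turn until its marginals match u
    M = X.astype(float).copy()
    for _ in range(n_iters):
        for i, target in enumerate(u):
            axes = tuple(j for j in range(M.ndim) if j != i)
            marg = M.sum(axis=axes)                 # current marginals along axis i
            shape = [1] * M.ndim
            shape[i] = -1
            M = M * (target / marg).reshape(shape)  # proportional rescaling
    return M / M.sum()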
[7]:
w = get_weights(X, u)
758.02375,123.06909,3.16228 767.9557278906121
75.74299,7.28011,3.60555 692.0363565714724
6.70820,2.23607,2.23607 69.23235546955618
2.23607,2.23607,2.23607 4.47213595499958
2.23607,2.23607,2.23607 0.0
2.23607,2.23607,2.23607 0.0
2.23607,2.23607,2.23607 0.0
[8]:
w
[8]:
array([[[0.113334 , 0.13604081],
[0.10423127, 0.0663199 ]],
[[0.21406422, 0.256677 ],
[0.0783235 , 0.0310093 ]]])
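As a sanity check (a sketch), the marginals of the learned weights should sit close to the normalized targets, up to the small integer-truncation residual noted above:

# marginals of w along each dimension vs the normalized targets (a sketch)
for i, name in enumerate(f):
    axes = tuple(j for j in range(w.ndim) if j != i)
    print(name, w.sum(axis=axes), 'target:', u[i] / u[i].sum())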
18.4. Sampling
Now that we have learned the weights, we can use them to sample (with replacement) from the synthetic data.
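pandas normalizes sampling weights internally, so each row's selection probability is proportional to its weight. A minimal illustration on a toy frame (not part of the workflow below):

# rows with weight 3.0 are drawn about three times as often as weight 1.0
toy = pd.DataFrame({'x': ['a', 'b'], 'w': [3.0, 1.0]})
print(toy.sample(n=1000, replace=True, weights='w')['x'].value_counts(normalize=True))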
[9]:
import functools

def get_sampling_weights(df, f, w):
    """Maps each demographic combination to a per-row sampling weight.

    Each cell weight from IPF is divided by that cell's row count so the
    weight is spread evenly across the rows belonging to that cell."""
    get_filters = lambda df, fields, values: [df[c] == v for c, v in zip(fields, values)]
    get_total = lambda df, fields, values: df[functools.reduce(
        lambda a, b: a & b, get_filters(df, fields, values))].shape[0]
    combos = itertools.product(*[sorted(df[c].unique()) for c in f])
    return {k: v / get_total(df, f, k) for k, v in zip(list(combos), np.ravel(w))}

def get_samples(df, f, w, n=10_000):
    """Samples n rows with replacement according to the IPF weights."""
    weights = get_sampling_weights(df, f, w)
    s = df.apply(lambda r: weights[tuple([r[c] for c in f])], axis=1)
    return df.sample(n=n, replace=True, weights=s)

sample_df = get_samples(df, f, w)
sample_df
[9]:
       race    age  gender    height
1461  other  minor  female  5.337720
2034  other  adult  female  6.243500
1715  other  adult    male  5.261089
310   white  adult    male  5.682793
1884  other  adult    male  5.192252
...     ...    ...     ...       ...
2140  other  adult  female  6.026459
732   white  adult  female  5.155105
34    white  minor    male  5.273937
117   white  minor  female  5.599598
649   white  adult  female  6.546797

10000 rows × 4 columns
Here is the cross-tabulation of the resulting sample.
[10]:
ct = pd.crosstab(sample_df.race, [sample_df.age, sample_df.gender])
ct
[10]:
age      adult         minor
gender  female  male  female  male
race
other     1165  1344    1041   628
white     2225  2492     791   314
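The joint sample proportions should likewise track the learned weights; a quick sketch, relying on the fact that ct uses the same sorted level order that w was built with:

# joint sample proportions vs the learned weights w (a sketch)
print(ct.values.reshape(w.shape) / ct.values.sum())
print(w)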
Let’s check and verify the marginals.
[11]:
sample_df.race.value_counts().sort_index() / sample_df.shape[0]
[11]:
other 0.4178
white 0.5822
Name: race, dtype: float64
[12]:
sample_df.age.value_counts().sort_index() / sample_df.shape[0]
[12]:
adult 0.7226
minor 0.2774
Name: age, dtype: float64
[13]:
sample_df.gender.value_counts().sort_index() / sample_df.shape[0]
[13]:
female 0.5222
male 0.4778
Name: gender, dtype: float64
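As a final programmatic check (a sketch with a hypothetical 2-percentage-point tolerance), the sampled marginals sit near the 58/42, 72/28 and 49/51 targets:

# the tolerance of 2 percentage points is an assumption for a 10,000-row sample
assert abs(sample_df.race.value_counts(normalize=True)['white'] - 0.58) < 0.02
assert abs(sample_df.age.value_counts(normalize=True)['adult'] - 0.72) < 0.02
assert abs(sample_df.gender.value_counts(normalize=True)['male'] - 0.49) < 0.02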