17. Iterative Proportional Fitting
Iterative Proportional Fitting (IPF) is a technique to find a matrix \(X\) that is closest to another matrix \(Z\), subject to the constraint that the row and column marginals of \(X\) are (nearly) identical to those of a target matrix \(Y\). One way to cast the problem is as follows.

\(X = PZQ\)

Here, \(P\) and \(Q\) are diagonal matrices transforming \(Z\) into \(X\) such that the row and column marginals of \(X\) match those of \(Y\). IPF goes by many different names depending on the field:
- biproportional fitting in statistics
- RAS algorithm in economics
- raking in survey statistics
- matrix scaling in computer science
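To make the \(X = PZQ\) formulation concrete, here is a minimal NumPy sketch (with made-up values for \(Z\), \(P\), and \(Q\)) showing that pre-multiplying by a diagonal matrix rescales the rows and post-multiplying by a diagonal matrix rescales the columns; IPF amounts to finding the diagonal entries of \(P\) and \(Q\) that produce the target marginals.

import numpy as np

Z = np.array([[1., 2.], [3., 4.]])
P = np.diag([2.0, 0.5])  # row scaling factors (made-up values)
Q = np.diag([1.0, 3.0])  # column scaling factors (made-up values)

X = P @ Z @ Q  # row i of Z is scaled by P[i, i], column j by Q[j, j]
print(X)       # [[2., 12.], [1.5, 6.]]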
In survey statistics, IPF may be used to weight data (or observations) and then apply weighted sampling (using these weights) so that the resulting sampled data reflects the desired proportions. For example, the percentage of White people in the US is 58% and the percentage of males is 49%. Let’s say we did a survey and the percentage of White people in our data was 47% and the percentage of males was 60%. Now, we want to make a population inference based on our sample, but we cannot do so on the raw sample since the sample percentages do not match the population ones. We can use IPF to determine the weight of each observation, and then do weighted sampling (with replacement) to get our data to reflect the population percentages.
Let’s take a look at how IPF works. There are two well-known algorithms:

- classical IPF
- factor estimation
17.1. Data
This data is taken from Wikipedia.
- \(X\) is a contingency matrix of counts for 2 dimensions
- \(u\) is the vector of target row marginals
- \(v\) is the vector of target column marginals
[1]:
import numpy as np
X = np.array([
    [40, 30, 20, 10],
    [35, 50, 100, 75],
    [30, 80, 70, 120],
    [20, 30, 40, 50]
])
u = np.array([150, 300, 400, 150]) # row target
v = np.array([200, 300, 400, 100]) # col target
These are the observed and target row marginals.
[2]:
print(f'observed: {X.sum(axis=1)}')
print(f'target: {u}')
observed: [100 260 300 140]
target: [150 300 400 150]
These are the observed and target column marginals.
[3]:
print(f'observed: {X.sum(axis=0)}')
print(f'target: {v}')
observed: [125 190 230 255]
target: [200 300 400 100]
17.2. Classical IPF algorithm
The classical IPF algorithm alternates between adjusting the elements to satisfy the target row marginals and then the target column marginals. Initially,

\(m_{ij}^0 = x_{ij}\)

Meaning, \(M\) is set to \(X\) at the start. Then, we adjust \(m_{ij}\) to satisfy the target row marginals. We are essentially doing a weighted normalization.

\(m_{ij}^1 = \dfrac{m_{ij}^0 u_i}{\sum_J m_{ij}^0}\)

Next, we adjust \(m_{ij}\) to satisfy the target column marginals.

\(m_{ij}^2 = \dfrac{m_{ij}^1 v_j}{\sum_I m_{ij}^1}\)

Then \(m_{ij}^0\) is set to \(m_{ij}^2\) and we repeat the procedure until the marginals of \(M\) are (nearly) identical to the target marginals \(u\) and \(v\). In the running example, after 6 iterations, the marginals of \(M\) are nearly identical to the target ones (we are using the L2-norm to compute the distance).
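To make the row step concrete, here is a small sketch of the very first row adjustment: the first row of \(X\) sums to 100 while its target marginal is 150, so every element in that row gets scaled by 150/100.

row = X[0]                         # [40, 30, 20, 10], sums to 100
adjusted = row * u[0] / row.sum()  # target row marginal is 150
print(adjusted)                    # [60. 45. 30. 15.]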
[4]:
def ipf_update(M, u, v):
    # adjust rows to match the target row marginals u
    r_sums = M.sum(axis=1)
    N = np.array([[M[r, c] * u[r] / r_sums[r] for c in range(M.shape[1])]
                  for r in range(M.shape[0])])

    # adjust columns to match the target column marginals v
    c_sums = N.sum(axis=0)
    O = np.array([[N[r, c] * v[c] / c_sums[c] for c in range(N.shape[1])]
                  for r in range(N.shape[0])])

    # L2 distances between the current and target marginals
    d_u = np.linalg.norm(u - O.sum(axis=1), 2)
    d_v = np.linalg.norm(v - O.sum(axis=0), 2)

    return O, d_u, d_v
M = X.copy()

for _ in range(10):
    M, d_u, d_v = ipf_update(M, u, v)
    print(f'd_u = {d_u:.5f}, d_v = {d_v:.5f}')

    if d_u <= 0.0001 and d_v <= 0.0001:
        break
d_u = 46.44167, d_v = 0.00000
d_u = 2.43362, d_v = 0.00000
d_u = 0.15757, d_v = 0.00000
d_u = 0.01076, d_v = 0.00000
d_u = 0.00074, d_v = 0.00000
d_u = 0.00005, d_v = 0.00000
\(M\) ends up looking like the following.
[5]:
M
[5]:
array([[ 64.55852549, 46.23247384, 35.38430991, 3.82473358],
[ 49.96791318, 68.15934981, 156.4985403 , 25.37417955],
[ 56.72193658, 144.42821667, 145.0824759 , 53.76734831],
[ 28.75162474, 41.17995969, 63.03467389, 17.03373857]])
Here are the column and row marginals of \(M\).
[6]:
M.sum(axis=0)
[6]:
array([200., 300., 400., 100.])
[7]:
M.sum(axis=1)
[7]:
array([150.00004282, 299.99998284, 399.99997746, 149.99999688])
17.3. Factor estimation
The factor estimation algorithm is more efficient than the classical IPF algorithm. It does not need to compute \(M\) directly, but we do so here to trace the intermediate outputs. In factor estimation, we have

- \(a\) as the row factor (size \(I\)), and
- \(b\) as the column factor (size \(J\)).

The idea is to find \(a\) and \(b\) such that

\(m_{ij} = a_i b_j x_{ij}\)

and the resulting matrix \(M\) will have its marginals (nearly) identical to the target row \(u\) and column \(v\) marginals. The algorithm starts by initializing all the elements of \(b\) to 1. Then we iterate between estimating \(a\) and \(b\) until convergence.

\(a_i^n = \dfrac{u_i}{\sum_J x_{ij} b_j^{n-1}}\)

\(b_j^n = \dfrac{v_j}{\sum_I x_{ij} a_i^n}\)
[8]:
def factor_update(X, u, v, b):
    # estimate the row factors a from the previous column factors b,
    # then the column factors b from the new row factors a
    a = [u[i] / sum([X[i, j] * b[j] for j in range(X.shape[1])]) for i in range(X.shape[0])]
    b = [v[j] / sum([X[i, j] * a[i] for i in range(X.shape[0])]) for j in range(X.shape[1])]

    # M is computed only to trace the intermediate outputs
    M = np.array([[X[i, j] * a[i] * b[j] for j in range(X.shape[1])] for i in range(X.shape[0])])

    # L2 distances between the current and target marginals
    d_u = np.linalg.norm(u - M.sum(axis=1), 2)
    d_v = np.linalg.norm(v - M.sum(axis=0), 2)
    return M, a, b, d_u, d_v
b = np.ones(X.shape[1])

for _ in range(10):
    M, a, b, d_u, d_v = factor_update(X, u, v, b)
    print(f'd_u = {d_u:.5f}, d_v = {d_v:.5f}')

    if d_u <= 0.0001 and d_v <= 0.0001:
        break
d_u = 46.44167, d_v = 0.00000
d_u = 2.43362, d_v = 0.00000
d_u = 0.15757, d_v = 0.00000
d_u = 0.01076, d_v = 0.00000
d_u = 0.00074, d_v = 0.00000
d_u = 0.00005, d_v = 0.00000
The row and column factors are as follows.
[9]:
a
[9]:
[1.251391838473094, 1.106936925262328, 1.4659849174901225, 1.1146335288885565]
[10]:
b
[10]:
[1.2897344282495622,
1.231494735600841,
1.4137981734141623,
0.30563836678112183]
\(M\) may be reproduced from \(X\), \(a\) and \(b\).
[11]:
M
[11]:
array([[ 64.55852549, 46.23247384, 35.38430991, 3.82473358],
[ 49.96791318, 68.15934981, 156.4985403 , 25.37417955],
[ 56.72193658, 144.42821667, 145.0824759 , 53.76734831],
[ 28.75162474, 41.17995969, 63.03467389, 17.03373857]])
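As a quick check (a small sketch using np.diag and np.allclose), applying \(a\) and \(b\) as diagonal scalings of \(X\) reproduces \(M\); this is exactly the \(X = PZQ\) formulation from the introduction, with \(P = \mathrm{diag}(a)\) and \(Q = \mathrm{diag}(b)\).

M_check = np.diag(a) @ X @ np.diag(b)  # m_ij = a_i * x_ij * b_j
print(np.allclose(M_check, M))         # True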
As you can see, the row and column marginals are nearly identical to the target ones.
[12]:
M.sum(axis=0)
[12]:
array([200., 300., 400., 100.])
[13]:
M.sum(axis=1)
[13]:
array([150.00004282, 299.99998284, 399.99997746, 149.99999688])
17.4. Weighting
After we estimate \(M\), we can normalize the values of all the elements so that they sum to 1.0. The normalized elements of \(M\) are the weights corresponding to the \(i\) and \(j\) values.
[14]:
M / M.sum()
[14]:
array([[0.06455853, 0.04623247, 0.03538431, 0.00382473],
[0.04996791, 0.06815935, 0.15649854, 0.02537418],
[0.05672194, 0.14442822, 0.14508248, 0.05376735],
[0.02875162, 0.04117996, 0.06303467, 0.01703374]])
17.5. Example
Let’s simulate a data set that has 3 fields.

- race: {white, other}
- gender: {male, female}
- weight (the weight of a person, or how heavy a person is)

We will sample the weights for each individual as follows.

- \(\mathcal{N}_{wm}(150, 1.5)\) for white (\(w\)), male (\(m\))
- \(\mathcal{N}_{om}(160, 1.3)\) for other (\(o\)), male (\(m\))
- \(\mathcal{N}_{wf}(125, 1.1)\) for white (\(w\)), female (\(f\))
- \(\mathcal{N}_{of}(130, 1.1)\) for other (\(o\)), female (\(f\))
[15]:
import pandas as pd
wm = np.random.normal(150, 1.5, 100)
om = np.random.normal(160, 1.3, 80)
wf = np.random.normal(125, 1.1, 200)
of = np.random.normal(130, 1.1, 150)
data = [{'race': 'white', 'gender': 'male', 'weight': w} for w in wm]
data += [{'race': 'other', 'gender': 'male', 'weight': w} for w in om]
data += [{'race': 'white', 'gender': 'female', 'weight': w} for w in wf]
data += [{'race': 'other', 'gender': 'female', 'weight': w} for w in of]
df = pd.DataFrame(data)
df.head()
[15]:
 | race | gender | weight
---|---|---|---
0 | white | male | 151.141693
1 | white | male | 148.045344
2 | white | male | 151.683133
3 | white | male | 149.063184
4 | white | male | 150.203894
A contingency table of gender and race for the synthetic data is as follows.
[16]:
pd.crosstab(df.race, df.gender)
[16]:
race \ gender | female | male
---|---|---
other | 150 | 80
white | 200 | 100
Here is the contingency table represented as percentages.
[17]:
pd.crosstab(df.race, df.gender) / df.shape[0]
[17]:
race \ gender | female | male
---|---|---
other | 0.283019 | 0.150943
white | 0.377358 | 0.188679
We want to upsample the synthetic data so that the sampled data has the following percentages.

- race: white (58%), other (42%)
- gender: male (49%), female (51%)
Let’s apply the IPF algorithm to see what the sampling weights will be.
[18]:
X = pd.crosstab(df.race, df.gender).values
M = X.copy()
M
[18]:
array([[150, 80],
[200, 100]], dtype=int64)
[19]:
u = np.array([1 - 0.58, 0.58]) * 1_000  # target race counts: other (42%), white (58%)
v = np.array([0.51, 0.49]) * 1_000      # target gender counts: female (51%), male (49%)
u, v
[19]:
(array([420., 580.]), array([510., 490.]))
[20]:
for _ in range(10):
    M, d_u, d_v = ipf_update(M, u, v)
    print(f'd_u = {d_u:.5f}, d_v = {d_v:.5f}')

    if d_u <= 0.0001 and d_v <= 0.0001:
        break
d_u = 3.35310, d_v = 0.00000
d_u = 0.00085, d_v = 0.00000
d_u = 0.00000, d_v = 0.00000
[21]:
M
[21]:
array([[210.271138 , 209.72886215],
[299.728862 , 280.27113785]])
[22]:
M / M.sum()
[22]:
array([[0.21027114, 0.20972886],
[0.29972886, 0.28027114]])
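As a quick sanity check (sketch), the row and column sums of the normalized matrix recover the target race and gender proportions.

print((M / M.sum()).sum(axis=1))  # race proportions, approximately [0.42, 0.58]
print((M / M.sum()).sum(axis=0))  # gender proportions, approximately [0.51, 0.49]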
Let’s use the factor estimation algorithm to estimate the weights.
[23]:
b = np.ones(X.shape[1])

for _ in range(10):
    M, a, b, d_u, d_v = factor_update(X, u, v, b)
    print(f'd_u = {d_u:.5f}, d_v = {d_v:.5f}')

    if d_u <= 0.0001 and d_v <= 0.0001:
        break
d_u = 3.35310, d_v = 0.00000
d_u = 0.00085, d_v = 0.00000
d_u = 0.00000, d_v = 0.00000
[24]:
a
[24]:
[1.8158335284361056, 1.9412711210103644]
[25]:
b
[25]:
[0.7719912452001316, 1.4437506168933019]
[26]:
M
[26]:
array([[210.271138 , 209.72886215],
[299.728862 , 280.27113785]])
[27]:
M / M.sum()
[27]:
array([[0.21027114, 0.20972886],
[0.29972886, 0.28027114]])
[28]:
import itertools

# map each cell of the normalized M to its (race, gender) pair;
# cells follow the crosstab ordering: rows (other, white) by columns (female, male)
list(zip(np.ravel(M / M.sum()), itertools.product(*[['other', 'white'], ['female', 'male']])))
[28]:
[(0.21027113800403066, ('other', 'female')),
(0.20972886214841346, ('other', 'male')),
(0.29972886199596926, ('white', 'female')),
(0.2802711378515865, ('white', 'male'))]
Now that we have the weights of each cell in \(M\), we can divide them by the number of observations in the corresponding cell and assign these observation weights to the synthetic data.
[29]:
weights = {k: w / df[(df.race==k[0]) & (df.gender==k[1])].shape[0]
           for w, k in zip(np.ravel(M / M.sum()), itertools.product(*[['other', 'white'], ['female', 'male']]))}
weights
[29]:
{('other', 'female'): 0.0014018075866935376,
('other', 'male'): 0.002621610776855168,
('white', 'female'): 0.0014986443099798464,
('white', 'male'): 0.002802711378515865}
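As a quick check (sketch), multiplying each cell’s per-observation weight by the number of observations in that cell recovers the normalized cell values of \(M\) shown above.

counts = pd.crosstab(df.race, df.gender)

# per-observation weight times the cell count should equal the normalized cell value
{k: w * counts.loc[k[0], k[1]] for k, w in weights.items()}
# approximately {('other', 'female'): 0.2103, ('other', 'male'): 0.2097,
#                ('white', 'female'): 0.2997, ('white', 'male'): 0.2803}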
[30]:
w = df.apply(lambda r: weights[(r.race, r.gender)], axis=1)
The sampled data is created from the synthetic data by weighted sampling with replacement. We will choose the number of samples to be 1,000.
[31]:
sampled_df = df.sample(n=1_000, replace=True, weights=w, random_state=37)
sampled_df.shape
[31]:
(1000, 3)
Now let’s check the percentages.
[32]:
pd.crosstab(sampled_df.race, sampled_df.gender) / sampled_df.shape[0]
[32]:
race \ gender | female | male
---|---|---
other | 0.205 | 0.219
white | 0.304 | 0.272
The gender percentages of the sampled data (sampled from the synthetic data) match the national percentages.
[33]:
(pd.crosstab(sampled_df.race, sampled_df.gender) / sampled_df.shape[0]).sum(axis=0)
[33]:
gender
female 0.509
male 0.491
dtype: float64
The race percentages of the sampled data (sampled from the synthetic data) match the national percentages.
[34]:
(pd.crosstab(sampled_df.race, sampled_df.gender) / sampled_df.shape[0]).sum(axis=1)
[34]:
race
other 0.424
white 0.576
dtype: float64
Finally, we can compute the average weight by gender and race.
[35]:
sampled_df.groupby(['race', 'gender']).agg('mean')
[35]:
race | gender | weight
---|---|---
other | female | 129.969962
other | male | 159.863323
white | female | 125.064400
white | male | 150.371116