17. Iterative Proportional Fitting

Iterative Proportional Fitting (IPF) is a technique for finding a matrix \(X\) that is closest to another matrix \(Z\) subject to the constraint that the row and column marginals of \(X\) are (nearly) identical to those of a target matrix \(Y\). One way to cast the problem is as follows.

\(X = PZQ\)

Here, \(P\) and \(Q\) are diagonal matrices that transform \(Z\) into \(X\) so that the row and column marginals of \(X\) match those of \(Y\). IPF goes by many different names depending on the field.

  • biproportional fitting in statistics

  • RAS algorithm in economics

  • raking in survey statistics

  • matrix scaling in computer science

In survey statistics, IPF may be used to weight the observations and then apply weighted sampling (using these weights) so that the resulting sample reflects the desired proportions. For example, the percentage of White people in the US is 58% and the percentage of males is 49%. Let’s say we conducted a survey in which the percentage of White people in our data was 47% and the percentage of males was 60%. We want to make a population inference based on our sample, but we cannot do so with the raw sample since the sample percentages do not match the population ones. We can use IPF to determine the weight of each observation, and then do weighted sampling (with replacement) to get data that reflects the population percentages; we work through such an example at the end of this chapter.

Let’s take a look at how IPF works. There are two well-known algorithms.

  • classical IPF

  • factor estimation

17.1. Data

This data is taken from Wikipedia.

  • \(X\) is a contingency matrix of counts for two dimensions

  • \(u\) is the vector of target row marginals

  • \(v\) is the vector of target column marginals

[1]:
import numpy as np

X = np.array([
    [40, 30, 20, 10],
    [35, 50, 100, 75],
    [30, 80, 70, 120],
    [20, 30, 40, 50]
])

u = np.array([150, 300, 400, 150]) # row target
v = np.array([200, 300, 400, 100]) # col target

These are the observed and target row marginals.

[2]:
print(f'observed: {X.sum(axis=1)}')
print(f'target: {u}')
observed: [100 260 300 140]
target: [150 300 400 150]

These are the observed and target column marginals.

[3]:
print(f'observed: {X.sum(axis=0)}')
print(f'target: {v}')
observed: [125 190 230 255]
target: [200 300 400 100]

17.2. Classical IPF algorithm

The classical IPF algorithm alternates between adjusting the elements to satisfy the target row marginals and then the target column marginals. Initially,

\(m_{ij}^0 = x_{ij}\)

Meaning, \(M\) is set to \(X\) at the start. Then, we adjust \(m_{ij}\) to satisfy the target row marginals. This is essentially a weighted normalization.

\(m_{ij}^1 = \dfrac{m_{ij}^0 u_i}{\sum_J m_{ij}^0}\)

Next, we adjust \(m_{ij}\) to satisfy the target column marginals.

\(m_{ij}^2 = \dfrac{m_{ij}^1 v_j}{\sum_I m_{ij}^1}\)

Then \(m_{ij}^0\) is set to \(m_{ij}^2\) and we repeat the procedure until the marginals of \(M\) are (nearly) identical to the target marginals \(u\) and \(v\). In the running example, after 6 iterations, the marginals of \(M\) are nearly identical to the target ones (we use the L2 norm to measure the distance).

[4]:
def ipf_update(M, u, v):
    # adjust each row of M so that the row sums match the target row marginals u
    r_sums = M.sum(axis=1)
    N = np.array([[M[r, c] * u[r] / r_sums[r] for c in range(M.shape[1])]
                  for r in range(M.shape[0])])

    # adjust each column of N so that the column sums match the target column marginals v
    c_sums = N.sum(axis=0)
    O = np.array([[N[r, c] * v[c] / c_sums[c] for c in range(N.shape[1])]
                  for r in range(N.shape[0])])

    # L2 distances between the achieved and target marginals
    d_u = np.linalg.norm(u - O.sum(axis=1), 2)
    d_v = np.linalg.norm(v - O.sum(axis=0), 2)

    return O, d_u, d_v

M = X.copy()

for _ in range(10):
    M, d_u, d_v = ipf_update(M, u, v)
    print(f'd_u = {d_u:.5f}, d_v = {d_v:.5f}')
    if d_u <= 0.0001 and d_v <= 0.0001:
        break
d_u = 46.44167, d_v = 0.00000
d_u = 2.43362, d_v = 0.00000
d_u = 0.15757, d_v = 0.00000
d_u = 0.01076, d_v = 0.00000
d_u = 0.00074, d_v = 0.00000
d_u = 0.00005, d_v = 0.00000

\(M\) ends up looking like the following.

[5]:
M
[5]:
array([[ 64.55852549,  46.23247384,  35.38430991,   3.82473358],
       [ 49.96791318,  68.15934981, 156.4985403 ,  25.37417955],
       [ 56.72193658, 144.42821667, 145.0824759 ,  53.76734831],
       [ 28.75162474,  41.17995969,  63.03467389,  17.03373857]])

Here are the column and row marginals of \(M\).

[6]:
M.sum(axis=0)
[6]:
array([200., 300., 400., 100.])
[7]:
M.sum(axis=1)
[7]:
array([150.00004282, 299.99998284, 399.99997746, 149.99999688])
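
The loops in ipf_update spell the two adjustments out element by element, but the same update can be written compactly with numpy broadcasting. Below is a minimal vectorized sketch (the name ipf_update_vec is introduced here for illustration); dropping it into the loop above in place of ipf_update should produce the same trace.

def ipf_update_vec(M, u, v):
    # scale each row so the row sums match u, then each column so the column sums match v
    M = M * (u / M.sum(axis=1))[:, None]
    M = M * (v / M.sum(axis=0))

    # L2 distances between the achieved and target marginals
    d_u = np.linalg.norm(u - M.sum(axis=1), 2)
    d_v = np.linalg.norm(v - M.sum(axis=0), 2)
    return M, d_u, d_v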

17.3. Factor estimation

The factor estimation algorithm is more efficient than the classical IPF algorithm because it does not need to compute \(M\) directly; we compute \(M\) here only to trace the intermediate outputs. In factor estimation, we have

  • \(a\) as the row factor (size \(I\)), and

  • \(b\) as the column factor (size \(J\)).

The idea is to find \(a\) and \(b\) such that

\(m_{ij} = a_i b_j x_{ij}\)

and the resulting matrix \(M\) will have its marginals (nearly) identical to the target row \(u\) and column \(v\) marginals. The algorithm starts by initializing all the elements of \(b\) to 1. Then we alternate between estimating \(a\) and \(b\) until convergence.

  • \(a_i^n = \dfrac{u_i}{\sum_J x_{ij} b_j^{n-1}}\)

  • \(b_j^n = \dfrac{v_j}{\sum_I x_{ij} a_i^n}\)

[8]:
def factor_update(X, u, v, b):
    # update the row factors a holding the column factors b fixed
    a = [u[i] / sum(X[i, j] * b[j] for j in range(X.shape[1])) for i in range(X.shape[0])]
    # update the column factors b using the new row factors a
    b = [v[j] / sum(X[i, j] * a[i] for i in range(X.shape[0])) for j in range(X.shape[1])]

    # M is not required by the algorithm itself; we compute it only to trace the fit
    M = np.array([[X[i, j] * a[i] * b[j] for j in range(X.shape[1])] for i in range(X.shape[0])])

    # L2 distances between the achieved and target marginals
    d_u = np.linalg.norm(u - M.sum(axis=1), 2)
    d_v = np.linalg.norm(v - M.sum(axis=0), 2)

    return M, a, b, d_u, d_v

b = np.ones(X.shape[1])
for _ in range(10):
    M, a, b, d_u, d_v = factor_update(X, u, v, b)
    print(f'd_u = {d_u:.5f}, d_v = {d_v:.5f}')
    if d_u <= 0.0001 and d_v <= 0.0001:
        break
d_u = 46.44167, d_v = 0.00000
d_u = 2.43362, d_v = 0.00000
d_u = 0.15757, d_v = 0.00000
d_u = 0.01076, d_v = 0.00000
d_u = 0.00074, d_v = 0.00000
d_u = 0.00005, d_v = 0.00000

The row and column factors are as follows.

[9]:
a
[9]:
[1.251391838473094, 1.106936925262328, 1.4659849174901225, 1.1146335288885565]
[10]:
b
[10]:
[1.2897344282495622,
 1.231494735600841,
 1.4137981734141623,
 0.30563836678112183]

\(M\) may be reproduced from \(X\), \(a\) and \(b\).

[11]:
M
[11]:
array([[ 64.55852549,  46.23247384,  35.38430991,   3.82473358],
       [ 49.96791318,  68.15934981, 156.4985403 ,  25.37417955],
       [ 56.72193658, 144.42821667, 145.0824759 ,  53.76734831],
       [ 28.75162474,  41.17995969,  63.03467389,  17.03373857]])
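
This reconstruction is exactly the \(X = PZQ\) formulation from the start of the chapter: \(P = \mathrm{diag}(a)\) and \(Q = \mathrm{diag}(b)\) are the diagonal matrices that transform the original matrix into the fitted one. A quick check of that claim:

P, Q = np.diag(a), np.diag(b)
np.allclose(P @ X @ Q, M) # should be True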

As you can see, the row and column marginals are nearly identical to the target ones.

[12]:
M.sum(axis=0)
[12]:
array([200., 300., 400., 100.])
[13]:
M.sum(axis=1)
[13]:
array([150.00004282, 299.99998284, 399.99997746, 149.99999688])
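
Note that the factor updates are just matrix-vector products, which is where the efficiency of this algorithm comes from. Below is a minimal vectorized sketch of the update (the name factor_update_vec is introduced here for illustration); it mirrors factor_update above.

def factor_update_vec(X, u, v, b):
    # a_i = u_i / sum_j x_ij b_j, then b_j = v_j / sum_i x_ij a_i
    a = u / (X @ b)
    b = v / (X.T @ a)

    # M is only needed to trace the fit
    M = np.outer(a, b) * X
    d_u = np.linalg.norm(u - M.sum(axis=1), 2)
    d_v = np.linalg.norm(v - M.sum(axis=0), 2)
    return M, a, b, d_u, d_v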

17.4. Weighting

After we estimate \(M\), we can normalize its elements so that they sum to 1.0. The normalized elements of \(M\) are the weights corresponding to each \((i, j)\) cell.

[14]:
M / M.sum()
[14]:
array([[0.06455853, 0.04623247, 0.03538431, 0.00382473],
       [0.04996791, 0.06815935, 0.15649854, 0.02537418],
       [0.05672194, 0.14442822, 0.14508248, 0.05376735],
       [0.02875162, 0.04117996, 0.06303467, 0.01703374]])
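
The normalized matrix is a proper joint distribution: its elements sum to 1.0, and its row and column marginals are the targets \(u\) and \(v\) rescaled to proportions. A quick sanity check:

W = M / M.sum()
print(W.sum())       # 1.0
print(W.sum(axis=1)) # ~ u / u.sum() = [0.15, 0.3, 0.4, 0.15]
print(W.sum(axis=0)) # ~ v / v.sum() = [0.2, 0.3, 0.4, 0.1]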

17.5. Example

Let’s simulate a data set that has 3 fields.

  • race: {white, other}

  • gender: {male, female}

  • weight (a person’s body weight, not to be confused with the sampling weights)

We will sample the body weight of each individual as follows.

  • \(\mathcal{N}_{wm}(150, 1.5)\) for white \(w\) male \(m\)

  • \(\mathcal{N}_{om}(160, 1.3)\) for other \(o\) male \(m\)

  • \(\mathcal{N}_{wf}(125, 1.1)\) for white \(w\) female \(f\)

  • \(\mathcal{N}_{of}(130, 1.1)\) for other \(o\) female \(f\)

[15]:
import pandas as pd

wm = np.random.normal(150, 1.5, 100)
om = np.random.normal(160, 1.3, 80)
wf = np.random.normal(125, 1.1, 200)
of = np.random.normal(130, 1.1, 150)

data = [{'race': 'white', 'gender': 'male', 'weight': w} for w in wm]
data += [{'race': 'other', 'gender': 'male', 'weight': w} for w in om]
data += [{'race': 'white', 'gender': 'female', 'weight': w} for w in wf]
data += [{'race': 'other', 'gender': 'female', 'weight': w} for w in of]

df = pd.DataFrame(data)
df.head()
[15]:
race gender weight
0 white male 151.141693
1 white male 148.045344
2 white male 151.683133
3 white male 149.063184
4 white male 150.203894

A contingency table of gender and race for the synthetic data is as follows.

[16]:
pd.crosstab(df.race, df.gender)
[16]:
gender female male
race
other 150 80
white 200 100

Here is the contingency table represented as percentages.

[17]:
pd.crosstab(df.race, df.gender) / df.shape[0]
[17]:
gender female male
race
other 0.283019 0.150943
white 0.377358 0.188679

We want to upsample the synthetic data so that the sampled data has the following percentages.

  • race: white (58%), other (42%)

  • gender: male (49%), female (51%)

Let’s apply the IPF algorithm to see what the sampling weights will be.

[18]:
X = pd.crosstab(df.race, df.gender).values
M = X.copy()

M
[18]:
array([[150,  80],
       [200, 100]], dtype=int64)
[19]:
u = np.array([1 - 0.58, 0.58]) * 1_000 # row target: race (other, white)
v = np.array([0.51, 0.49]) * 1_000 # col target: gender (female, male)

u, v
[19]:
(array([420., 580.]), array([510., 490.]))
[20]:
for _ in range(10):
    M, d_u, d_v = ipf_update(M, u, v)
    print(f'd_u = {d_u:.5f}, d_v = {d_v:.5f}')
    if d_u <= 0.0001 and d_v <= 0.0001:
        break
d_u = 3.35310, d_v = 0.00000
d_u = 0.00085, d_v = 0.00000
d_u = 0.00000, d_v = 0.00000
[21]:
M
[21]:
array([[210.271138  , 209.72886215],
       [299.728862  , 280.27113785]])
[22]:
M / M.sum()
[22]:
array([[0.21027114, 0.20972886],
       [0.29972886, 0.28027114]])

Let’s use the factor estimation algorithm to estimate the weights.

[23]:
b = np.ones(X.shape[1])

for _ in range(10):
    M, a, b, d_u, d_v = factor_update(X, u, v, b)
    print(f'd_u = {d_u:.5f}, d_v = {d_v:.5f}')
    if d_u <= 0.0001 and d_v <= 0.0001:
        break
d_u = 3.35310, d_v = 0.00000
d_u = 0.00085, d_v = 0.00000
d_u = 0.00000, d_v = 0.00000
[24]:
a
[24]:
[1.8158335284361056, 1.9412711210103644]
[25]:
b
[25]:
[0.7719912452001316, 1.4437506168933019]
[26]:
M
[26]:
array([[210.271138  , 209.72886215],
       [299.728862  , 280.27113785]])
[27]:
M / M.sum()
[27]:
array([[0.21027114, 0.20972886],
       [0.29972886, 0.28027114]])
[28]:
import itertools

# pair each cell weight with its (race, gender) label; pd.crosstab sorts rows
# and columns alphabetically, so the order is (other, white) x (female, male)
list(zip(np.ravel(M / M.sum()), itertools.product(['other', 'white'], ['female', 'male'])))
[28]:
[(0.21027113800403066, ('other', 'female')),
 (0.20972886214841346, ('other', 'male')),
 (0.29972886199596926, ('white', 'female')),
 (0.2802711378515865, ('white', 'male'))]

Now that we have the weight of each cell in \(M\), we can divide each cell’s weight by the number of observations in that cell and assign these per-observation weights to the synthetic data.

[29]:
# spread each cell's weight evenly over the observations in that cell
weights = {k: w / df[(df.race==k[0]) & (df.gender==k[1])].shape[0]
           for w, k in zip(np.ravel(M / M.sum()), itertools.product(['other', 'white'], ['female', 'male']))}
weights
[29]:
{('other', 'female'): 0.0014018075866935376,
 ('other', 'male'): 0.002621610776855168,
 ('white', 'female'): 0.0014986443099798464,
 ('white', 'male'): 0.002802711378515865}
[30]:
w = df.apply(lambda r: weights[(r.race, r.gender)], axis=1)
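
Since each cell’s total weight is spread evenly over its observations, the per-observation weights should sum to 1.0 over the whole data frame. A quick sanity check:

print(w.sum()) # should be ~1.0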

The sampled data is created from the synthetic data. We will choose the number of samples to be 1,000.

[31]:
sampled_df = df.sample(n=1_000, replace=True, weights=w, random_state=37)
sampled_df.shape
[31]:
(1000, 3)

Now let’s check the percentages.

[32]:
pd.crosstab(sampled_df.race, sampled_df.gender) / sampled_df.shape[0]
[32]:
gender female male
race
other 0.205 0.219
white 0.304 0.272

The gender percentages of the sampled data (sampled from the synthetic data) closely match the national percentages.

[33]:
(pd.crosstab(sampled_df.race, sampled_df.gender) / sampled_df.shape[0]).sum(axis=0)
[33]:
gender
female    0.509
male      0.491
dtype: float64

The race percentages of the sampled data (sampled from the synthetic data) closely match the national percentages.

[34]:
(pd.crosstab(sampled_df.race, sampled_df.gender) / sampled_df.shape[0]).sum(axis=1)
[34]:
race
other    0.424
white    0.576
dtype: float64

Finally, we can compute the average weight by gender and race.

[35]:
sampled_df.groupby(['race', 'gender']).agg('mean')
[35]:
                  weight
race  gender
other female 129.969962
      male   159.863323
white female 125.064400
      male   150.371116
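
As an aside, resampling is not the only way to use these weights; population-level quantities may also be estimated directly from the synthetic data via a weighted average. For example, here is a sketch of estimating the population’s average body weight with the per-observation weights w from before.

print(np.average(df.weight, weights=w)) # weighted estimate of the average body weight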