17. Iterative Proportional Fitting

Iterative Proportional Fitting (IPF) is a technique for finding a matrix \(X\) that is closest to another matrix \(Z\) subject to the constraint that the row and column marginals of \(X\) are (nearly) identical to those of a target matrix \(Y\). One way to cast the problem is as follows.

\(X = PZQ\)

Here, \(P\) and \(Q\) are diagonal matrices that transform \(Z\) into \(X\) so that the row and column marginals of \(X\) match those of \(Y\). IPF goes by many different names depending on the field.

  • biproportional fitting in statistics

  • RAS algorithm in economics

  • raking in survey statistics

  • matrix scaling in computer science

In survey statistics, IPF may be used to weight the observations and then apply weighted sampling (using these weights) so that the resulting sample reflects the desired proportions. For example, the percentage of White people in the US is 58% and the percentage of males is 49%. Let’s say we conducted a survey in which the percentage of White people in our data was 47% and the percentage of males was 60%. We want to make a population inference based on our sample, but we cannot do so with the raw sample since the sample percentages do not match the population ones. We can use IPF to determine the weight of each observation, and then do weighted sampling (with replacement) to get data that reflects the population percentages; we work through such an example at the end of this chapter.

Let’s take a look at how IPF works. There are two well-known algorithms.

  • classical IPF

  • factor estimation

17.1. Data

This data is taken from Wikipedia.

  • \(X\) is a contingency matrix of counts for two dimensions

  • \(u\) is the vector of target row marginals

  • \(v\) is the vector of target column marginals

[1]:
import numpy as np

X = np.array([
    [40, 30, 20, 10],
    [35, 50, 100, 75],
    [30, 80, 70, 120],
    [20, 30, 40, 50]
])

u = np.array([150, 300, 400, 150]) # row target
v = np.array([200, 300, 400, 100]) # col target

These are the observed and target row marginals.

[2]:
print(f'observed: {X.sum(axis=1)}')
print(f'target: {u}')
observed: [100 260 300 140]
target: [150 300 400 150]

These are the observed and target column marginals.

[3]:
print(f'observed: {X.sum(axis=0)}')
print(f'target: {v}')
observed: [125 190 230 255]
target: [200 300 400 100]

17.2. Classical IPF algorithm

The classical IPF algorithm alternates between adjusting the elements to satisfy the target row marginals and then the target column marginals. Initially,

\(m_{ij}^0 = x_{ij}\)

Meaning, \(M\) is set to \(X\) at the start. Then, we adjust \(m_{ij}\) to satisfy the target row marginals. This is essentially a weighted normalization.

\(m_{ij}^1 = \dfrac{m_{ij}^0 u_i}{\sum_J m_{ij}^0}\)

Next, we adjust \(m_{ij}\) to satisfy the target column marginals.

\(m_{ij}^2 = \dfrac{m_{ij}^1 v_j}{\sum_I m_{ij}^1}\)

Then \(m_{ij}^0\) is set to \(m_{ij}^2\) and we repeat the procedure until the marginals of \(M\) are (nearly) identical to the target marginals \(u\) and \(v\). In the running example, after 6 iterations, the marginals of \(M\) are nearly identical to the target ones (we use the L2 norm to measure the distance).

[4]:
def ipf_update(M, u, v):
    # adjust each row of M so that the row sums match the target row marginals u
    r_sums = M.sum(axis=1)
    N = np.array([[M[r, c] * u[r] / r_sums[r] for c in range(M.shape[1])]
                  for r in range(M.shape[0])])

    # adjust each column of N so that the column sums match the target column marginals v
    c_sums = N.sum(axis=0)
    O = np.array([[N[r, c] * v[c] / c_sums[c] for c in range(N.shape[1])]
                  for r in range(N.shape[0])])

    # L2 distances between the achieved and target marginals
    d_u = np.linalg.norm(u - O.sum(axis=1), 2)
    d_v = np.linalg.norm(v - O.sum(axis=0), 2)

    return O, d_u, d_v

M = X.copy()

for _ in range(10):
    M, d_u, d_v = ipf_update(M, u, v)
    print(f'd_u = {d_u:.5f}, d_v = {d_v:.5f}')
    if d_u <= 0.0001 and d_v <= 0.0001:
        break
d_u = 46.44167, d_v = 0.00000
d_u = 2.43362, d_v = 0.00000
d_u = 0.15757, d_v = 0.00000
d_u = 0.01076, d_v = 0.00000
d_u = 0.00074, d_v = 0.00000
d_u = 0.00005, d_v = 0.00000

\(M\) ends up looking like the following.

[5]:
M
[5]:
array([[ 64.55852549,  46.23247384,  35.38430991,   3.82473358],
       [ 49.96791318,  68.15934981, 156.4985403 ,  25.37417955],
       [ 56.72193658, 144.42821667, 145.0824759 ,  53.76734831],
       [ 28.75162474,  41.17995969,  63.03467389,  17.03373857]])

Here are the column and row marginals of \(M\).

[6]:
M.sum(axis=0)
[6]:
array([200., 300., 400., 100.])
[7]:
M.sum(axis=1)
[7]:
array([150.00004282, 299.99998284, 399.99997746, 149.99999688])
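
The loops in ipf_update spell the two adjustments out element by element, but the same update can be written compactly with numpy broadcasting. Below is a minimal vectorized sketch (the name ipf_update_vec is introduced here for illustration); dropping it into the loop above in place of ipf_update should produce the same trace.

def ipf_update_vec(M, u, v):
    # scale each row so the row sums match u, then each column so the column sums match v
    M = M * (u / M.sum(axis=1))[:, None]
    M = M * (v / M.sum(axis=0))

    # L2 distances between the achieved and target marginals
    d_u = np.linalg.norm(u - M.sum(axis=1), 2)
    d_v = np.linalg.norm(v - M.sum(axis=0), 2)
    return M, d_u, d_v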

17.3. Factor estimation

The factor estimation algorithm is more efficient than the classical IPF algorithm because it does not need to compute \(M\) directly; we compute \(M\) here only to trace the intermediate outputs. In factor estimation, we have

  • \(a\) as the row factor (size \(I\)), and

  • \(b\) as the column factor (size \(J\)).

The idea is to find \(a\) and \(b\) such that

\(m_{ij} = a_i b_j x_{ij}\)

and the resulting matrix \(M\) will have its marginals (nearly) identical to the target row \(u\) and column \(v\) marginals. The algorithm starts by initializing all the elements of \(b\) to 1. Then we alternate between estimating \(a\) and \(b\) until convergence.

  • \(a_i^n = \dfrac{u_i}{\sum_J x_{ij} b_j^{n-1}}\)

  • \(b_j^n = \dfrac{v_j}{\sum_I x_{ij} a_i^n}\)

[8]:
def factor_update(X, u, v, b):
    # update the row factors a holding the column factors b fixed
    a = [u[i] / sum(X[i, j] * b[j] for j in range(X.shape[1])) for i in range(X.shape[0])]
    # update the column factors b using the new row factors a
    b = [v[j] / sum(X[i, j] * a[i] for i in range(X.shape[0])) for j in range(X.shape[1])]

    # M is not required by the algorithm itself; we compute it only to trace the fit
    M = np.array([[X[i, j] * a[i] * b[j] for j in range(X.shape[1])] for i in range(X.shape[0])])

    # L2 distances between the achieved and target marginals
    d_u = np.linalg.norm(u - M.sum(axis=1), 2)
    d_v = np.linalg.norm(v - M.sum(axis=0), 2)

    return M, a, b, d_u, d_v

b = np.ones(X.shape[1])
for _ in range(10):
    M, a, b, d_u, d_v = factor_update(X, u, v, b)
    print(f'd_u = {d_u:.5f}, d_v = {d_v:.5f}')
    if d_u <= 0.0001 and d_v <= 0.0001:
        break
d_u = 46.44167, d_v = 0.00000
d_u = 2.43362, d_v = 0.00000
d_u = 0.15757, d_v = 0.00000
d_u = 0.01076, d_v = 0.00000
d_u = 0.00074, d_v = 0.00000
d_u = 0.00005, d_v = 0.00000

The row and column factors are as follows.

[9]:
a
[9]:
[1.251391838473094, 1.106936925262328, 1.4659849174901225, 1.1146335288885565]
[10]:
b
[10]:
[1.2897344282495622,
 1.231494735600841,
 1.4137981734141623,
 0.30563836678112183]

\(M\) may be reproduced from \(X\), \(a\) and \(b\).

[11]:
M
[11]:
array([[ 64.55852549,  46.23247384,  35.38430991,   3.82473358],
       [ 49.96791318,  68.15934981, 156.4985403 ,  25.37417955],
       [ 56.72193658, 144.42821667, 145.0824759 ,  53.76734831],
       [ 28.75162474,  41.17995969,  63.03467389,  17.03373857]])
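
This reconstruction is exactly the \(X = PZQ\) formulation from the start of the chapter: \(P = \mathrm{diag}(a)\) and \(Q = \mathrm{diag}(b)\) are the diagonal matrices that transform the original matrix into the fitted one. A quick check of that claim:

P, Q = np.diag(a), np.diag(b)
np.allclose(P @ X @ Q, M) # should be True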

As you can see, the row and column marginals are nearly identical to the target ones.

[12]:
M.sum(axis=0)
[12]:
array([200., 300., 400., 100.])
[13]:
M.sum(axis=1)
[13]:
array([150.00004282, 299.99998284, 399.99997746, 149.99999688])
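
Note that the factor updates are just matrix-vector products, which is where the efficiency of this algorithm comes from. Below is a minimal vectorized sketch of the update (the name factor_update_vec is introduced here for illustration); it mirrors factor_update above.

def factor_update_vec(X, u, v, b):
    # a_i = u_i / sum_j x_ij b_j, then b_j = v_j / sum_i x_ij a_i
    a = u / (X @ b)
    b = v / (X.T @ a)

    # M is only needed to trace the fit
    M = np.outer(a, b) * X
    d_u = np.linalg.norm(u - M.sum(axis=1), 2)
    d_v = np.linalg.norm(v - M.sum(axis=0), 2)
    return M, a, b, d_u, d_v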

17.4. Weighting

After we estimate \(M\), we can normalize its elements so that they sum to 1.0. The normalized elements of \(M\) are the weights corresponding to each \((i, j)\) cell.

[14]:
M / M.sum()
[14]:
array([[0.06455853, 0.04623247, 0.03538431, 0.00382473],
       [0.04996791, 0.06815935, 0.15649854, 0.02537418],
       [0.05672194, 0.14442822, 0.14508248, 0.05376735],
       [0.02875162, 0.04117996, 0.06303467, 0.01703374]])
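
The normalized matrix is a proper joint distribution: its elements sum to 1.0, and its row and column marginals are the targets \(u\) and \(v\) rescaled to proportions. A quick sanity check:

W = M / M.sum()
print(W.sum())       # 1.0
print(W.sum(axis=1)) # ~ u / u.sum() = [0.15, 0.3, 0.4, 0.15]
print(W.sum(axis=0)) # ~ v / v.sum() = [0.2, 0.3, 0.4, 0.1]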

17.5. Example

Let’s simulate a data set that has 3 fields.

  • race: {white, other}

  • gender: {male, female}

  • weight (a person’s body weight, not to be confused with the sampling weights)

We will sample the body weight of each individual as follows.

  • \(\mathcal{N}_{wm}(150, 1.5)\) for white \(w\) male \(m\)

  • \(\mathcal{N}_{om}(160, 1.3)\) for other \(o\) male \(m\)

  • \(\mathcal{N}_{wf}(125, 1.1)\) for white \(w\) female \(f\)

  • \(\mathcal{N}_{of}(130, 1.1)\) for other \(o\) female \(f\)

[15]:
import pandas as pd

wm = np.random.normal(150, 1.5, 100)
om = np.random.normal(160, 1.3, 80)
wf = np.random.normal(125, 1.1, 200)
of = np.random.normal(130, 1.1, 150)

data = [{'race': 'white', 'gender': 'male', 'weight': w} for w in wm]
data += [{'race': 'other', 'gender': 'male', 'weight': w} for w in om]
data += [{'race': 'white', 'gender': 'female', 'weight': w} for w in wf]
data += [{'race': 'other', 'gender': 'female', 'weight': w} for w in of]

df = pd.DataFrame(data)
df.head()
[15]:
race gender weight
0 white male 151.141693
1 white male 148.045344
2 white male 151.683133
3 white male 149.063184
4 white male 150.203894

A contingency table of gender and race for the synthetic data is as follows.

[16]:
pd.crosstab(df.race, df.gender)
[16]:
gender female male
race
other 150 80
white 200 100

Here is the contingency table represented as percentages.

[17]:
pd.crosstab(df.race, df.gender) / df.shape[0]
[17]:
gender female male
race
other 0.283019 0.150943
white 0.377358 0.188679

We want to upsample the synthetic data so that the sampled data has the following percentages.

  • race: white (58%), other (42%)

  • gender: male (49%), female (51%)

Let’s apply the IPF algorithm to see what the sampling weights will be.

[18]:
X = pd.crosstab(df.race, df.gender).values
M = X.copy()

M
[18]:
array([[150,  80],
       [200, 100]], dtype=int64)
[19]:
u = np.array([1 - 0.58, 0.58]) * 1_000 # row target: race (other, white)
v = np.array([0.51, 0.49]) * 1_000 # col target: gender (female, male)

u, v
[19]:
(array([420., 580.]), array([510., 490.]))
[20]:
for _ in range(10):
    M, d_u, d_v = ipf_update(M, u, v)
    print(f'd_u = {d_u:.5f}, d_v = {d_v:.5f}')
    if d_u <= 0.0001 and d_v <= 0.0001:
        break
d_u = 3.35310, d_v = 0.00000
d_u = 0.00085, d_v = 0.00000
d_u = 0.00000, d_v = 0.00000
[21]:
M
[21]:
array([[210.271138  , 209.72886215],
       [299.728862  , 280.27113785]])
[22]:
M / M.sum()
[22]:
array([[0.21027114, 0.20972886],
       [0.29972886, 0.28027114]])

Let’s use the factor estimation algorithm to estimate the weights.

[23]:
b = np.ones(X.shape[1])

for _ in range(10):
    M, a, b, d_u, d_v = factor_update(X, u, v, b)
    print(f'd_u = {d_u:.5f}, d_v = {d_v:.5f}')
    if d_u <= 0.0001 and d_v <= 0.0001:
        break
d_u = 3.35310, d_v = 0.00000
d_u = 0.00085, d_v = 0.00000
d_u = 0.00000, d_v = 0.00000
[24]:
a
[24]:
[1.8158335284361056, 1.9412711210103644]
[25]:
b
[25]:
[0.7719912452001316, 1.4437506168933019]
[26]:
M
[26]:
array([[210.271138  , 209.72886215],
       [299.728862  , 280.27113785]])
[27]:
M / M.sum()
[27]:
array([[0.21027114, 0.20972886],
       [0.29972886, 0.28027114]])
[28]:
import itertools

# pair each cell weight with its (race, gender) label; pd.crosstab sorts rows
# and columns alphabetically, so the order is (other, white) x (female, male)
list(zip(np.ravel(M / M.sum()), itertools.product(['other', 'white'], ['female', 'male'])))
[28]:
[(0.21027113800403066, ('other', 'female')),
 (0.20972886214841346, ('other', 'male')),
 (0.29972886199596926, ('white', 'female')),
 (0.2802711378515865, ('white', 'male'))]

Now that we have the weight of each cell in \(M\), we can divide each cell’s weight by the number of observations in that cell and assign these per-observation weights to the synthetic data.

[29]:
# spread each cell's weight evenly over the observations in that cell
weights = {k: w / df[(df.race==k[0]) & (df.gender==k[1])].shape[0]
           for w, k in zip(np.ravel(M / M.sum()), itertools.product(['other', 'white'], ['female', 'male']))}
weights
[29]:
{('other', 'female'): 0.0014018075866935376,
 ('other', 'male'): 0.002621610776855168,
 ('white', 'female'): 0.0014986443099798464,
 ('white', 'male'): 0.002802711378515865}
[30]:
w = df.apply(lambda r: weights[(r.race, r.gender)], axis=1)
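
Since each cell’s total weight is spread evenly over its observations, the per-observation weights should sum to 1.0 over the whole data frame. A quick sanity check:

print(w.sum()) # should be ~1.0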

The sampled data is created from the synthetic data. We will choose the number of samples to be 1,000.

[31]:
sampled_df = df.sample(n=1_000, replace=True, weights=w, random_state=37)
sampled_df.shape
[31]:
(1000, 3)

Now let’s check the percentages.

[32]:
pd.crosstab(sampled_df.race, sampled_df.gender) / sampled_df.shape[0]
[32]:
gender female male
race
other 0.205 0.219
white 0.304 0.272

The gender percentages of the sampled data (sampled from the synthetic data) closely match the national percentages.

[33]:
(pd.crosstab(sampled_df.race, sampled_df.gender) / sampled_df.shape[0]).sum(axis=0)
[33]:
gender
female    0.509
male      0.491
dtype: float64

The race percentages of the sampled data (sampled from the synthetic data) closely match the national percentages.

[34]:
(pd.crosstab(sampled_df.race, sampled_df.gender) / sampled_df.shape[0]).sum(axis=1)
[34]:
race
other    0.424
white    0.576
dtype: float64

Finally, we can compute the average weight by gender and race.

[35]:
sampled_df.groupby(['race', 'gender']).agg('mean')
[35]:
                  weight
race  gender
other female 129.969962
      male   159.863323
white female 125.064400
      male   150.371116
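
As an aside, resampling is not the only way to use these weights; population-level quantities may also be estimated directly from the synthetic data via a weighted average. For example, here is a sketch of estimating the population’s average body weight with the per-observation weights w from before.

print(np.average(df.weight, weights=w)) # weighted estimate of the average body weight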