1. Simpson’s Paradox

Simpson’s Paradox is a statistical phenomenon in which the association between two variables appears, disappears or reverses when the population is divided into sub-populations. For example, admissions data from UC-Berkeley showed that males were more likely to be admitted than females. However, when the admissions data were broken down by department, males were generally less likely to be admitted than females.
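In ratio form, the paradox is that it is possible to have \(a_1/b_1 < a_2/b_2\) and \(c_1/d_1 < c_2/d_2\), yet \((a_1 + c_1)/(b_1 + d_1) > (a_2 + c_2)/(b_2 + d_2)\) after aggregation; the overall male admission rate can exceed the overall female rate even though, department by department, males mostly fare worse.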

Here is the data capturing admissions of males and females by department.

[1]:
import pandas as pd

df = pd.read_csv('./data/berkeley-admission.csv')
df.shape
[1]:
(24, 4)
[2]:
df
[2]:
department sex admission n
0 A Male Admitted 512
1 A Male Rejected 313
2 A Female Admitted 89
3 A Female Rejected 19
4 B Male Admitted 353
5 B Male Rejected 207
6 B Female Admitted 17
7 B Female Rejected 8
8 C Male Admitted 120
9 C Male Rejected 205
10 C Female Admitted 202
11 C Female Rejected 391
12 D Male Admitted 139
13 D Male Rejected 279
14 D Female Admitted 131
15 D Female Rejected 244
16 E Male Admitted 53
17 E Male Rejected 138
18 E Female Admitted 94
19 E Female Rejected 299
20 F Male Admitted 22
21 F Male Rejected 351
22 F Female Admitted 24
23 F Female Rejected 317

We can first look at the overall percentage of admitted and rejected students (regardless of department or sex).

[3]:
df \
    .groupby(['admission'])[['n']] \
    .sum() \
    .assign(p=lambda d: d['n'] / d['n'].sum())
[3]:
n p
admission
Admitted 1756 0.387895
Rejected 2771 0.612105

When admissions were broken down by sex, females were admitted at about 30% and males at about 45%. Viewed from this angle, the data suggested that admission was biased in favor of males.

[4]:
df \
    .groupby(['sex', 'admission'])[['n']] \
    .sum() \
    .reset_index() \
    .assign(p=lambda d: d.apply(lambda r: r['n'] / d[d['sex']==r['sex']]['n'].sum(), axis=1))
[4]:
sex admission n p
0 Female Admitted 557 0.303542
1 Female Rejected 1278 0.696458
2 Male Admitted 1199 0.445394
3 Male Rejected 1493 0.554606

If we apply a \(\chi^2\) test of independence, we will see that sex and admission are significantly associated.

[5]:
from scipy.stats import chi2_contingency

o = df.groupby(['sex', 'admission'])[['n']].sum().values.reshape(2, 2)

chi2, p, dof, e = chi2_contingency(o)

print(f'X^2 = {chi2}')
print(f'p = {p}')
print(f'dof = {dof}')
X^2 = 91.87943233270713
p = 9.212151878377876e-22
dof = 1

Here are the observed cell counts.

[6]:
pd.DataFrame(o,
             columns=pd.MultiIndex.from_tuples([('Admission', 'Admitted'), ('Admission', 'Rejected')]),
             index=pd.MultiIndex.from_tuples([('Sex', 'Female'), ('Sex', 'Male')])
)
[6]:
Admission
Admitted Rejected
Sex Female 557 1278
Male 1199 1493

These are the expected cell counts.

[7]:
pd.DataFrame(e,
             columns=pd.MultiIndex.from_tuples([('Admission', 'Admitted'), ('Admission', 'Rejected')]),
             index=pd.MultiIndex.from_tuples([('Sex', 'Female'), ('Sex', 'Male')])
)
[7]:
Admission
Admitted Rejected
Sex Female 711.787055 1123.212945
Male 1044.212945 1647.787055
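
As a quick sanity check (this cell is not part of the original analysis), the expected counts under independence are just products of the marginal totals divided by the grand total; for example, for admitted females, \(1835 \times 1756 / 4527 \approx 711.79\).

[ ]:
# expected counts under independence: (row total x column total) / grand total
row_totals = o.sum(axis=1, keepdims=True)  # totals by sex
col_totals = o.sum(axis=0, keepdims=True)  # totals by admission status
row_totals * col_totals / o.sum()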

Here comes Simpson’s Paradox. What happens when we partition admission by both sex and department? Notice that in only two departments do males have higher admission rates than females. Whereas before it seemed admission was biased in favor of males, once the data are additionally broken down by department, the apparent bias reverses in favor of females.

[8]:
def highlight(s):
    # highlight whichever sex has the larger 'Admitted' value in this department's row;
    # the column order is Male Admitted, Male Rejected, Female Admitted, Female Rejected
    color = 'background-color: rgba(0, 255, 0, 0.18)'
    if s.iloc[0] > s.iloc[2]:
        return [color, '', '', '']
    else:
        return ['', '', color, '']

# counts per department, normalized within each sex-by-admission column
pivot_df = df.pivot(index='department', columns=['sex', 'admission'], values='n')

(pivot_df / pivot_df.sum()) \
    .style \
    .apply(highlight, axis=1)
[8]:
sex Male Female
admission Admitted Rejected Admitted Rejected
department
A 0.427023 0.209645 0.159785 0.014867
B 0.294412 0.138647 0.030521 0.006260
C 0.100083 0.137307 0.362657 0.305947
D 0.115930 0.186872 0.235189 0.190923
E 0.044204 0.092431 0.168761 0.233959
F 0.018349 0.235097 0.043088 0.248044
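
Note that the table above divides each sex-by-admission column by its total, so the cells are shares across departments rather than admission rates. As a cross-check (a sketch that is not part of the original notebook), we can compute the per-department admission rates directly; by that measure, males have the higher rate only in departments C and E, and only by a few percentage points.

[ ]:
# per-department admission rate for each sex: Admitted / (Admitted + Rejected)
counts = df.pivot(index='department', columns=['sex', 'admission'], values='n')
pd.DataFrame({
    'Male': counts[('Male', 'Admitted')] / (counts[('Male', 'Admitted')] + counts[('Male', 'Rejected')]),
    'Female': counts[('Female', 'Admitted')] / (counts[('Female', 'Admitted')] + counts[('Female', 'Rejected')])
})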

We can also use log-linear models to see what the relationships between admission, sex and department really are. The deviance \(G^2 = 2\sum_i o_i \ln(o_i / e_i)\) measures how far each model’s fitted cell counts \(e_i\) are from the observed counts \(o_i\). Below, with A denoting admission, B denoting sex and C denoting department, we have

  • 1 independence model,

  • 3 joint independence models,

  • 3 conditional independence models and

  • 1 homogeneous association model.

The homogeneous association model (AB, AC, BC) has the lowest deviance \(G^2\) at 20.2.

[9]:
from patsy import dmatrices
import statsmodels.api as sm

dept_levels = ['F', 'A', 'B', 'C', 'D', 'E']

def get_deviance(m, f):
    # fit a Poisson log-linear model to the cell counts and record its
    # residual degrees of freedom and deviance (G^2)
    y, X = dmatrices(f, df, return_type='dataframe')
    r = sm.GLM(y, X, family=sm.families.Poisson()).fit()

    return {'model': m, 'df': r.df_resid, 'deviance': r.deviance}

formulas = {
    '(A, B, C)': 'n ~ admission + sex + C(department, levels=dept_levels)',
    '(A, BC)': 'n ~ admission + sex + C(department, levels=dept_levels) + sex * C(department, levels=dept_levels)',
    '(B, AC)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * C(department, levels=dept_levels)',
    '(C, AB)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * sex',
    '(AC, BC)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * C(department, levels=dept_levels) + sex * C(department, levels=dept_levels)',
    '(AB, BC)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * sex + sex * C(department, levels=dept_levels)',
    '(AB, AC)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * sex + admission * C(department, levels=dept_levels)',
    '(AB, AC, BC)': 'n ~ (admission + sex + C(department, levels=dept_levels))**2'
}

result_df = pd.DataFrame([get_deviance(m, f) for m, f in formulas.items()]).set_index('model')
result_df
[9]:
df deviance
model
(A, B, C) 16 2097.116655
(A, BC) 11 876.743972
(B, AC) 11 1242.058607
(C, AB) 15 2003.390913
(AC, BC) 6 21.685924
(AB, BC) 10 783.018230
(AB, AC) 10 1148.332865
(AB, AC, BC) 5 20.217417

If we perform a likelihood ratio test of the homogeneous association model against every other model, we see that the conditional independence model (AC, BC) does not differ significantly from the homogeneous one (the deviances differ by \(21.69 - 20.22 \approx 1.47\) on 1 degree of freedom, p = 0.23). This conditional independence model may be viewed as “simpler” than the homogeneous one, and so it is favored. So what? Well, (AC, BC) means that admission and sex are independent given the department, and whatever association we saw between sex and admission was spurious.

[10]:
from scipy.stats import chi2

def do_likelihood_test(model1, model2='(AB, AC, BC)'):
    # likelihood ratio test: the difference in deviances is chi-square distributed,
    # with degrees of freedom equal to the difference in residual df
    chi_sq = result_df.loc[model1].deviance - result_df.loc[model2].deviance
    dof = result_df.loc[model1].df - result_df.loc[model2].df

    return 1 - chi2.cdf(chi_sq, dof)

pd.DataFrame(
    [(m, do_likelihood_test(m)) for m in result_df.index.drop('(AB, AC, BC)')],
    columns=['model', 'p-value']
)
[10]:
model p-value
0 (A, B, C) 0.000000
1 (A, BC) 0.000000
2 (B, AC) 0.000000
3 (C, AB) 0.000000
4 (AC, BC) 0.225581
5 (AB, BC) 0.000000
6 (AB, AC) 0.000000

How about using propensity scores? Here we will use inverse probability of treatment weighting rather than matching. We can treat sex as the treatment T and admission as the outcome O, with department as the covariate X.

[11]:
import itertools

def to_records(r):
    # expand an aggregated row into one (department, sex, admission) record per applicant
    return [(r['department'], r['sex'], r['admission']) for _ in range(r['n'])]

raw_df = pd.DataFrame(
            itertools.chain(*df.apply(to_records, axis=1)),
            columns=['department', 'sex', 'admission']) \
        .assign(
            sex=lambda d: d['sex'].apply(lambda s: 1 if s == 'Male' else 0),
            admission=lambda d: d['admission'].apply(lambda s: 1 if s == 'Admitted' else 0)
        )

raw_df.shape
[11]:
(4527, 3)
[12]:
raw_df.head()
[12]:
department sex admission
0 A 1 1
1 A 1 1
2 A 1 1
3 A 1 1
4 A 1 1
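
As a sanity check (not part of the original notebook), regrouping the expanded records should reproduce the counts in the original aggregated data.

[ ]:
# one row per applicant, so regrouping should give back the original n values
raw_df.groupby(['department', 'sex', 'admission']).size()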

Let’s estimate the inverse probability of treatment weights (IPTW). For each applicant, the weight is \(1/\hat{p}\) if treated (male) and \(1/(1 - \hat{p})\) otherwise, where \(\hat{p} = P(T=1 \mid X)\) is the propensity score.

[13]:
import numpy as np
from sklearn.linear_model import LogisticRegression

dept_levels = ['F', 'A', 'B', 'C', 'D', 'E']

# propensity model: P(T = male | department)
y, X = dmatrices('sex ~ C(department, levels=dept_levels)', raw_df, return_type='dataframe')
X = X.iloc[:,1:]  # drop patsy's intercept column; sklearn adds its own
y = np.ravel(y.values)

model = LogisticRegression(random_state=37)
model.fit(X, y)
[13]:
LogisticRegression(random_state=37)
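
It may help to see what the propensity model has learned (this check is not in the original notebook): since the only covariate is department, the propensity score is simply \(P(\text{male} \mid \text{department})\), so the fitted scores should track the empirical share of male applicants in each department.

[ ]:
# fitted propensity scores vs. the empirical share of male applicants per department
pd.DataFrame({
    'department': raw_df['department'],
    'propensity_score': model.predict_proba(X)[:, 1],
    'male': y
}).groupby('department').mean()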
[14]:
def get_inverse_weight(r):
    # inverse probability of treatment weight: 1/p for the treated (males), 1/(1-p) for the rest
    return 1 / r['propensity_score'] if r['T'] == 1 else 1 / (1 - r['propensity_score'])

psm_df = pd.DataFrame({
            'T': y,
            'propensity_score': model.predict_proba(X)[:,1]
        }) \
        .assign(
            inverse_weight=lambda d: d.apply(get_inverse_weight, axis=1),
            sampling_weight=lambda d: d['inverse_weight'] / d['inverse_weight'].sum()
        )
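
As a quick check (again, not in the original notebook) that the weights do their job: after weighting, the department totals for males and females should be roughly equal, meaning the two treatment groups are balanced on X.

[ ]:
# weighted department totals by sex; the two rows should be roughly equal after IPTW
raw_df \
    .assign(w=psm_df['inverse_weight'].values) \
    .groupby(['sex', 'department'])['w'] \
    .sum() \
    .unstack()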

And then we can use the IPTW to do weighted logistic regression.

[15]:
# outcome model: admission ~ sex + department, weighted by the IPTW
y, X = dmatrices('admission ~ sex + C(department, levels=dept_levels) - 1', raw_df, return_type='dataframe')
X = X.iloc[:,1:]  # drop the first column (the department F dummy) so sklearn's intercept is the baseline
y = np.ravel(y.values)
w = psm_df['inverse_weight']

model = LogisticRegression()
model.fit(X, y, sample_weight=w)

coef = pd.concat([
    pd.Series(model.intercept_, ['intercept']),
    pd.Series(model.coef_[0], X.columns)
])

As you can see, sex (being male) has a negative average treatment effect (ATE) on admission: it decreases the log odds of admission by about 0.22.

[16]:
coef
[16]:
intercept                              -2.436449
C(department, levels=dept_levels)[A]    3.491369
C(department, levels=dept_levels)[B]    3.177035
C(department, levels=dept_levels)[C]    1.941869
C(department, levels=dept_levels)[D]    1.879617
C(department, levels=dept_levels)[E]    1.481090
sex                                    -0.218998
dtype: float64

Equivalently, being male decreases the odds of admission by about 20%, since \(e^{-0.219} \approx 0.80\).

[17]:
np.exp(coef)
[17]:
intercept                                0.087471
C(department, levels=dept_levels)[A]    32.830876
C(department, levels=dept_levels)[B]    23.975554
C(department, levels=dept_levels)[C]     6.971766
C(department, levels=dept_levels)[D]     6.550998
C(department, levels=dept_levels)[E]     4.397738
sex                                      0.803323
dtype: float64