1. Simpson’s Paradox
Simpson’s Paradox is a statistical phenomenon in which the association between two variables appears, disappears, or reverses when the population is divided into sub-populations. For example, admissions data from UC Berkeley showed that males were more likely to be admitted than females overall. However, when the admissions data were broken down by department, males were less likely to be admitted than females in most departments.
Here’s the data capturing admissions of males and females by department.
[1]:
import pandas as pd
df = pd.read_csv('./data/berkeley-admission.csv')
df.shape
[1]:
(24, 4)
[2]:
df
[2]:
department | sex | admission | n | |
---|---|---|---|---|
0 | A | Male | Admitted | 512 |
1 | A | Male | Rejected | 313 |
2 | A | Female | Admitted | 89 |
3 | A | Female | Rejected | 19 |
4 | B | Male | Admitted | 353 |
5 | B | Male | Rejected | 207 |
6 | B | Female | Admitted | 17 |
7 | B | Female | Rejected | 8 |
8 | C | Male | Admitted | 120 |
9 | C | Male | Rejected | 205 |
10 | C | Female | Admitted | 202 |
11 | C | Female | Rejected | 391 |
12 | D | Male | Admitted | 139 |
13 | D | Male | Rejected | 279 |
14 | D | Female | Admitted | 131 |
15 | D | Female | Rejected | 244 |
16 | E | Male | Admitted | 53 |
17 | E | Male | Rejected | 138 |
18 | E | Female | Admitted | 94 |
19 | E | Female | Rejected | 299 |
20 | F | Male | Admitted | 22 |
21 | F | Male | Rejected | 351 |
22 | F | Female | Admitted | 24 |
23 | F | Female | Rejected | 317 |
We can just look at the percentage of admitted and rejected students (regardless of department or sex).
[3]:
df \
    .groupby(['admission'])[['n']] \
    .agg('sum') \
    .assign(p=lambda d: d['n'] / d['n'].sum())
[3]:
n | p | |
---|---|---|
admission | ||
Admitted | 1756 | 0.387895 |
Rejected | 2771 | 0.612105 |
When admission was broken down by sex, females were admitted at about 30% and males at about 45%. The data, viewed from this angle, suggested that admission was biased in favor of males.
[4]:
df \
    .groupby(['sex', 'admission'])[['n']] \
    .agg('sum') \
    .reset_index() \
    .assign(p=lambda d: d.apply(lambda r: r['n'] / d[d['sex']==r['sex']]['n'].sum(), axis=1))
[4]:
sex | admission | n | p | |
---|---|---|---|---|
0 | Female | Admitted | 557 | 0.303542 |
1 | Female | Rejected | 1278 | 0.696458 |
2 | Male | Admitted | 1199 | 0.445394 |
3 | Male | Rejected | 1493 | 0.554606 |
If we apply a \(\chi^2\) test of independence, we see that sex and admission are significantly associated.
[5]:
from scipy.stats import chi2_contingency
# 2x2 contingency table of observed counts: rows are (Female, Male), columns are (Admitted, Rejected)
o = df.groupby(['sex', 'admission'])[['n']].agg('sum').values.reshape(2, 2)
chi2, p, dof, e = chi2_contingency(o)
print(f'X^2 = {chi2}')
print(f'p = {p}')
print(f'dof = {dof}')
X^2 = 91.87943233270713
p = 9.212151878377876e-22
dof = 1
Here are the observed cell counts.
[6]:
pd.DataFrame(o,
columns=pd.MultiIndex.from_tuples([('Admission', 'Admitted'), ('Admission', 'Rejected')]),
index=pd.MultiIndex.from_tuples([('Sex', 'Female'), ('Sex', 'Male')])
)
[6]:
Sex | Admitted | Rejected |
---|---|---|
Female | 557 | 1278 |
Male | 1199 | 1493 |
These are the expected cell counts.
[7]:
pd.DataFrame(e,
columns=pd.MultiIndex.from_tuples([('Admission', 'Admitted'), ('Admission', 'Rejected')]),
index=pd.MultiIndex.from_tuples([('Sex', 'Female'), ('Sex', 'Male')])
)
[7]:
Sex | Admitted | Rejected |
---|---|---|
Female | 711.787055 | 1123.212945 |
Male | 1044.212945 | 1647.787055 |
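As a quick sanity check, each expected count under independence is the row total times the column total divided by the grand total. A minimal check for the admitted-female cell, using the observed counts above:

# Expected count under independence: (row total * column total) / grand total.
female_total = 557 + 1278    # all female applicants
admitted_total = 557 + 1199  # all admitted applicants
grand_total = 557 + 1278 + 1199 + 1493  # 4527 applicants in total
female_total * admitted_total / grand_total  # ~711.79, matching the table above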
Here comes Simpson’s Paradox. What happens when we partition admission by sex and department? Whereas before it seemed admission was biased in favor of males, once the data are additionally broken down by department, the bias reverses to favor females: males have higher admission rates than females in only two of the six departments (see the rate computation after the table below).
[8]:
def highlight(s):
    # Highlight the larger of the male vs. female admitted proportions in each department row.
    color = 'background-color: rgba(0, 255, 0, 0.18)'
    if s.iloc[0] > s.iloc[2]:
        return [color, '', '', '']
    else:
        return ['', '', color, '']

# Normalize each (sex, admission) column so it sums to 1 across departments.
pivoted = df.pivot(index='department', columns=['sex', 'admission'], values='n')
(pivoted / pivoted.sum()) \
    .style \
    .apply(highlight, axis=1)
[8]:
department | Male, Admitted | Male, Rejected | Female, Admitted | Female, Rejected |
---|---|---|---|---|
A | 0.427023 | 0.209645 | 0.159785 | 0.014867 |
B | 0.294412 | 0.138647 | 0.030521 | 0.006260 |
C | 0.100083 | 0.137307 | 0.362657 | 0.305947 |
D | 0.115930 | 0.186872 | 0.235189 | 0.190923 |
E | 0.044204 | 0.092431 | 0.168761 | 0.233959 |
F | 0.018349 | 0.235097 | 0.043088 | 0.248044 |
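The proportions above are normalized within each (sex, admission) column, so they show how each group’s applicants are spread across departments rather than admission rates per se. To see the per-department admission rates directly, we can compute the fraction admitted within each department-sex group; a minimal sketch using only the data already loaded:

# Admission rate within each (department, sex) group: admitted / (admitted + rejected).
counts = df.pivot_table(index=['department', 'sex'], columns='admission', values='n')
rates = (counts['Admitted'] / (counts['Admitted'] + counts['Rejected'])).unstack('sex')
rates

Only departments C and E should show a higher rate for males; in the other four, females have the higher admission rate.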
We can also use log-linear models to see what the relationship between admission, sex and department really is. Below, with A denoting admission, B denoting sex and C denoting department, we fit
1 independence model,
3 joint independence models,
3 conditional independence models and
1 homogeneous association model.
The homogeneous model (AB, AC, BC) has the lowest deviance \(G^2\) at 20.2.
[9]:
from patsy import dmatrices
import statsmodels.api as sm
dept_levels = ['F', 'A', 'B', 'C', 'D', 'E']
def get_deviance(m, f):
    # Fit a Poisson (log-linear) model to the cell counts and record its residual df and deviance.
    y, X = dmatrices(f, df, return_type='dataframe')
    r = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    return {'model': m, 'df': r.df_resid, 'deviance': r.deviance}
formulas = {
'(A, B, C)': 'n ~ admission + sex + C(department, levels=dept_levels)',
'(A, BC)': 'n ~ admission + sex + C(department, levels=dept_levels) + sex * C(department, levels=dept_levels)',
'(B, AC)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * C(department, levels=dept_levels)',
'(C, AB)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * sex',
'(AC, BC)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * C(department, levels=dept_levels) + sex * C(department, levels=dept_levels)',
'(AB, BC)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * sex + sex * C(department, levels=dept_levels)',
'(AB, AC)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * sex + admission * C(department, levels=dept_levels)',
'(AB, AC, BC)': 'n ~ (admission + sex + C(department, levels=dept_levels))**2'
}
result_df = pd.DataFrame([get_deviance(m, f) for m, f in formulas.items()]).set_index('model')
result_df
[9]:
df | deviance | |
---|---|---|
model | ||
(A, B, C) | 16 | 2097.116655 |
(A, BC) | 11 | 876.743972 |
(B, AC) | 11 | 1242.058607 |
(C, AB) | 15 | 2003.390913 |
(AC, BC) | 6 | 21.685924 |
(AB, BC) | 10 | 783.018230 |
(AB, AC) | 10 | 1148.332865 |
(AB, AC, BC) | 5 | 20.217417 |
If we perform a likelihood ratio test of the homogeneous model against all the other ones, we see that the conditional independence model (AC, BC) does not differ significantly from the homogeneous one (p = 0.23). This conditional independence model may be viewed as “simpler” than the homogeneous one, and so is favored. So what? Well, (AC, BC) means that admission and sex are independent given the department, and the association we saw between sex and admission was spurious.
[10]:
from scipy.stats import chi2
def do_likelihood_test(model1, model2='(AB, AC, BC)'):
    # The deviance difference is asymptotically chi-square with df equal to the difference in residual df.
    chi_sq = result_df.loc[model1].deviance - result_df.loc[model2].deviance
    dof = result_df.loc[model1].df - result_df.loc[model2].df
    return 1 - chi2.cdf(chi_sq, dof)
pd.DataFrame(
[(m, do_likelihood_test(m)) for m in result_df.index.drop('(AB, AC, BC)')],
columns=['model', 'p-value']
)
[10]:
model | p-value | |
---|---|---|
0 | (A, B, C) | 0.000000 |
1 | (A, BC) | 0.000000 |
2 | (B, AC) | 0.000000 |
3 | (C, AB) | 0.000000 |
4 | (AC, BC) | 0.225581 |
5 | (AB, BC) | 0.000000 |
6 | (AB, AC) | 0.000000 |
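To make the key comparison explicit: for (AC, BC) against (AB, AC, BC), the deviance difference is \(G^2 = 21.69 - 20.22 \approx 1.47\) on \(6 - 5 = 1\) degree of freedom, which gives the p-value of about 0.23 shown above.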
How about propensity scores? We can treat sex as the treatment T and admission as the outcome O, with department as the feature X.
[11]:
import itertools
def to_records(r):
    # Expand one aggregated row into n individual (department, sex, admission) records.
    return [(r['department'], r['sex'], r['admission']) for _ in range(r['n'])]
raw_df = pd.DataFrame(
itertools.chain(*df.apply(to_records, axis=1)),
columns=['department', 'sex', 'admission']) \
.assign(
sex=lambda d: d['sex'].apply(lambda s: 1 if s == 'Male' else 0),
admission=lambda d: d['admission'].apply(lambda s: 1 if s == 'Admitted' else 0)
)
raw_df.shape
[11]:
(4527, 3)
[12]:
raw_df.head()
[12]:
department | sex | admission | |
---|---|---|---|
0 | A | 1 | 1 |
1 | A | 1 | 1 |
2 | A | 1 | 1 |
3 | A | 1 | 1 |
4 | A | 1 | 1 |
Let’s fit a propensity model (the probability of being male given the department) and compute the inverse probability of treatment weights (IPTW).
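For reference, the standard definition of the weight for observation \(i\), which the code below implements, is \(w_i = \frac{T_i}{e(x_i)} + \frac{1 - T_i}{1 - e(x_i)}\), where \(e(x) = P(T = 1 \mid X = x)\) is the propensity score: treated units (males) get weight \(1 / e(x_i)\) and untreated units (females) get \(1 / (1 - e(x_i))\).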
[13]:
import numpy as np
from sklearn.linear_model import LogisticRegression
dept_levels = ['F', 'A', 'B', 'C', 'D', 'E']
# Propensity model: estimate P(sex = male | department) with logistic regression.
y, X = dmatrices('sex ~ C(department, levels=dept_levels)', raw_df, return_type='dataframe')
X = X.iloc[:,1:]  # drop patsy's intercept column; sklearn fits its own
y = np.ravel(y.values)
model = LogisticRegression(random_state=37)
model.fit(X, y)
[13]:
LogisticRegression(random_state=37)
[14]:
def get_inverse_weight(r):
    # IPTW: 1 / e(x) for treated (male) units, 1 / (1 - e(x)) for untreated (female) units.
    return 1 / r['propensity_score'] if r['T'] == 1 else 1 / (1 - r['propensity_score'])

psm_df = pd.DataFrame({
    'T': y,
    'propensity_score': model.predict_proba(X)[:,1]
}) \
    .assign(
        inverse_weight=lambda d: d.apply(get_inverse_weight, axis=1),
        sampling_weight=lambda d: d['inverse_weight'] / d['inverse_weight'].sum()
    )
We can then use the IPTW as sample weights in a logistic regression of admission on sex and department.
[15]:
y, X = dmatrices('admission ~ sex + C(department, levels=dept_levels) - 1', raw_df, return_type='dataframe')
X = X.iloc[:,1:]  # drop the first department dummy; it is absorbed into sklearn's intercept
y = np.ravel(y.values)
w = psm_df['inverse_weight']

# Weight each observation by its IPTW when fitting the outcome model.
model = LogisticRegression()
model.fit(X, y, sample_weight=w)

coef = pd.concat([
    pd.Series(model.intercept_, ['intercept']),
    pd.Series(model.coef_[0], X.columns)
])
As you can see, sex (being male) has a negative average treatment effect (ATE) on admission: it decreases the log odds of admission by about 0.22.
[16]:
coef
[16]:
intercept -2.436449
C(department, levels=dept_levels)[A] 3.491369
C(department, levels=dept_levels)[B] 3.177035
C(department, levels=dept_levels)[C] 1.941869
C(department, levels=dept_levels)[D] 1.879617
C(department, levels=dept_levels)[E] 1.481090
sex -0.218998
dtype: float64
Or, equivalently, being male decreases the odds of admission by about 20%.
[17]:
np.exp(coef)
[17]:
intercept 0.087471
C(department, levels=dept_levels)[A] 32.830876
C(department, levels=dept_levels)[B] 23.975554
C(department, levels=dept_levels)[C] 6.971766
C(department, levels=dept_levels)[D] 6.550998
C(department, levels=dept_levels)[E] 4.397738
sex 0.803323
dtype: float64