1. Simpson’s Paradox
Simpson’s Paradox is a statistical phenomenon in which the association between two variables appears, disappears, or reverses when the population is divided into sub-populations. For example, admissions data from UC Berkeley showed that males were more likely to be admitted than females overall. However, when the admissions data were broken down by department, males were less likely to be admitted than females in most departments.
Here’s the data capturing admissions of males and females by department.
[1]:
import pandas as pd
df = pd.read_csv('./data/berkeley-admission.csv')
df.shape
[1]:
(24, 4)
[2]:
df
[2]:
department | sex | admission | n | |
---|---|---|---|---|
0 | A | Male | Admitted | 512 |
1 | A | Male | Rejected | 313 |
2 | A | Female | Admitted | 89 |
3 | A | Female | Rejected | 19 |
4 | B | Male | Admitted | 353 |
5 | B | Male | Rejected | 207 |
6 | B | Female | Admitted | 17 |
7 | B | Female | Rejected | 8 |
8 | C | Male | Admitted | 120 |
9 | C | Male | Rejected | 205 |
10 | C | Female | Admitted | 202 |
11 | C | Female | Rejected | 391 |
12 | D | Male | Admitted | 139 |
13 | D | Male | Rejected | 279 |
14 | D | Female | Admitted | 131 |
15 | D | Female | Rejected | 244 |
16 | E | Male | Admitted | 53 |
17 | E | Male | Rejected | 138 |
18 | E | Female | Admitted | 94 |
19 | E | Female | Rejected | 299 |
20 | F | Male | Admitted | 22 |
21 | F | Male | Rejected | 351 |
22 | F | Female | Admitted | 24 |
23 | F | Female | Rejected | 317 |
We can just look at the percentage of admitted and rejected students (regardless of department or sex).
[3]:
df \
    .groupby(['admission'])[['n']] \
    .agg('sum') \
    .assign(p=lambda d: d['n'] / d['n'].sum())
[3]:
n | p | |
---|---|---|
admission | ||
Admitted | 1756 | 0.387895 |
Rejected | 2771 | 0.612105 |
When admission was broken down by sex, females were admitted at about 30% and males at about 45%. The data, viewed from this angle, suggested that admission was biased in favor of males.
[4]:
df \
    .groupby(['sex', 'admission'])[['n']] \
    .agg('sum') \
    .reset_index() \
    .assign(p=lambda d: d.apply(lambda r: r['n'] / d[d['sex']==r['sex']]['n'].sum(), axis=1))
[4]:
sex | admission | n | p | |
---|---|---|---|---|
0 | Female | Admitted | 557 | 0.303542 |
1 | Female | Rejected | 1278 | 0.696458 |
2 | Male | Admitted | 1199 | 0.445394 |
3 | Male | Rejected | 1493 | 0.554606 |
If we apply a \(\chi^2\) test of independence, we see that sex and admission are significantly associated.
[5]:
from scipy.stats import chi2_contingency
# 2x2 contingency table of observed counts: rows are (Female, Male), columns are (Admitted, Rejected)
o = df.groupby(['sex', 'admission'])[['n']].agg('sum').values.reshape(2, 2)
chi2, p, dof, e = chi2_contingency(o)
print(f'X^2 = {chi2}')
print(f'p = {p}')
print(f'dof = {dof}')
X^2 = 91.87943233270713
p = 9.212151878377876e-22
dof = 1
Here are the observed cell counts.
[6]:
pd.DataFrame(o,
columns=pd.MultiIndex.from_tuples([('Admission', 'Admitted'), ('Admission', 'Rejected')]),
index=pd.MultiIndex.from_tuples([('Sex', 'Female'), ('Sex', 'Male')])
)
[6]:
Sex | Admitted | Rejected |
---|---|---|
Female | 557 | 1278 |
Male | 1199 | 1493 |
These are the expected cell counts.
[7]:
pd.DataFrame(e,
columns=pd.MultiIndex.from_tuples([('Admission', 'Admitted'), ('Admission', 'Rejected')]),
index=pd.MultiIndex.from_tuples([('Sex', 'Female'), ('Sex', 'Male')])
)
[7]:
Sex | Admitted | Rejected |
---|---|---|
Female | 711.787055 | 1123.212945 |
Male | 1044.212945 | 1647.787055 |
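As a quick sanity check, each expected count under independence is the row total times the column total divided by the grand total. A minimal check for the admitted-female cell, using the observed counts above:

# Expected count under independence: (row total * column total) / grand total.
female_total = 557 + 1278    # all female applicants
admitted_total = 557 + 1199  # all admitted applicants
grand_total = 557 + 1278 + 1199 + 1493  # 4527 applicants in total
female_total * admitted_total / grand_total  # ~711.79, matching the table above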
Here comes Simpson’s Paradox. What happens when we partition admission by sex and department? Whereas before it seemed admission was biased in favor of males, once the data are additionally broken down by department, the bias reverses to favor females: males have higher admission rates than females in only two of the six departments (see the rate computation after the table below).
[8]:
def highlight(s):
    # Highlight the larger of the male vs. female admitted proportions in each department row.
    color = 'background-color: rgba(0, 255, 0, 0.18)'
    if s.iloc[0] > s.iloc[2]:
        return [color, '', '', '']
    else:
        return ['', '', color, '']

# Normalize each (sex, admission) column so it sums to 1 across departments.
pivoted = df.pivot(index='department', columns=['sex', 'admission'], values='n')
(pivoted / pivoted.sum()) \
    .style \
    .apply(highlight, axis=1)
[8]:
department | Male, Admitted | Male, Rejected | Female, Admitted | Female, Rejected |
---|---|---|---|---|
A | 0.427023 | 0.209645 | 0.159785 | 0.014867 |
B | 0.294412 | 0.138647 | 0.030521 | 0.006260 |
C | 0.100083 | 0.137307 | 0.362657 | 0.305947 |
D | 0.115930 | 0.186872 | 0.235189 | 0.190923 |
E | 0.044204 | 0.092431 | 0.168761 | 0.233959 |
F | 0.018349 | 0.235097 | 0.043088 | 0.248044 |
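The proportions above are normalized within each (sex, admission) column, so they show how each group’s applicants are spread across departments rather than admission rates per se. To see the per-department admission rates directly, we can compute the fraction admitted within each department-sex group; a minimal sketch using only the data already loaded:

# Admission rate within each (department, sex) group: admitted / (admitted + rejected).
counts = df.pivot_table(index=['department', 'sex'], columns='admission', values='n')
rates = (counts['Admitted'] / (counts['Admitted'] + counts['Rejected'])).unstack('sex')
rates

Only departments C and E should show a higher rate for males; in the other four, females have the higher admission rate.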
We can also use log-linear models to see what the relationship between admission, sex and department really is. Below, with A denoting admission, B denoting sex and C denoting department, we fit
1 independence model,
3 joint independence models,
3 conditional independence models and
1 homogeneous association model.
The homogeneous model (AB, AC, BC) has the lowest deviance \(G^2\) at 20.2.
[9]:
from patsy import dmatrices
import statsmodels.api as sm
dept_levels = ['F', 'A', 'B', 'C', 'D', 'E']
def get_deviance(m, f):
    # Fit a Poisson (log-linear) model to the cell counts and record its residual df and deviance.
    y, X = dmatrices(f, df, return_type='dataframe')
    r = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    return {'model': m, 'df': r.df_resid, 'deviance': r.deviance}
formulas = {
'(A, B, C)': 'n ~ admission + sex + C(department, levels=dept_levels)',
'(A, BC)': 'n ~ admission + sex + C(department, levels=dept_levels) + sex * C(department, levels=dept_levels)',
'(B, AC)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * C(department, levels=dept_levels)',
'(C, AB)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * sex',
'(AC, BC)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * C(department, levels=dept_levels) + sex * C(department, levels=dept_levels)',
'(AB, BC)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * sex + sex * C(department, levels=dept_levels)',
'(AB, AC)': 'n ~ admission + sex + C(department, levels=dept_levels) + admission * sex + admission * C(department, levels=dept_levels)',
'(AB, AC, BC)': 'n ~ (admission + sex + C(department, levels=dept_levels))**2'
}
result_df = pd.DataFrame([get_deviance(m, f) for m, f in formulas.items()]).set_index('model')
result_df
[9]:
df | deviance | |
---|---|---|
model | ||
(A, B, C) | 16 | 2097.116655 |
(A, BC) | 11 | 876.743972 |
(B, AC) | 11 | 1242.058607 |
(C, AB) | 15 | 2003.390913 |
(AC, BC) | 6 | 21.685924 |
(AB, BC) | 10 | 783.018230 |
(AB, AC) | 10 | 1148.332865 |
(AB, AC, BC) | 5 | 20.217417 |
If we perform a likelihood ratio test of the homogeneous model against all the other ones, we see that the conditional independence model (AC, BC) does not differ significantly from the homogeneous one (p = 0.23). This conditional independence model may be viewed as “simpler” than the homogeneous one, and so is favored. So what? Well, (AC, BC) means that admission and sex are independent given the department, and the association we saw between sex and admission was spurious.
[10]:
from scipy.stats import chi2
def do_likelihood_test(model1, model2='(AB, AC, BC)'):
    # The deviance difference is asymptotically chi-square with df equal to the difference in residual df.
    chi_sq = result_df.loc[model1].deviance - result_df.loc[model2].deviance
    dof = result_df.loc[model1].df - result_df.loc[model2].df
    return 1 - chi2.cdf(chi_sq, dof)
pd.DataFrame(
[(m, do_likelihood_test(m)) for m in result_df.index.drop('(AB, AC, BC)')],
columns=['model', 'p-value']
)
[10]:
model | p-value | |
---|---|---|
0 | (A, B, C) | 0.000000 |
1 | (A, BC) | 0.000000 |
2 | (B, AC) | 0.000000 |
3 | (C, AB) | 0.000000 |
4 | (AC, BC) | 0.225581 |
5 | (AB, BC) | 0.000000 |
6 | (AB, AC) | 0.000000 |
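To make the key comparison explicit: for (AC, BC) against (AB, AC, BC), the deviance difference is \(G^2 = 21.69 - 20.22 \approx 1.47\) on \(6 - 5 = 1\) degree of freedom, which gives the p-value of about 0.23 shown above.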
How about propensity scores? We can treat sex as the treatment T and admission as the outcome O, with department as the feature X.
[11]:
import itertools
def to_records(r):
    # Expand one aggregated row into n individual (department, sex, admission) records.
    return [(r['department'], r['sex'], r['admission']) for _ in range(r['n'])]
raw_df = pd.DataFrame(
itertools.chain(*df.apply(to_records, axis=1)),
columns=['department', 'sex', 'admission']) \
.assign(
sex=lambda d: d['sex'].apply(lambda s: 1 if s == 'Male' else 0),
admission=lambda d: d['admission'].apply(lambda s: 1 if s == 'Admitted' else 0)
)
raw_df.shape
[11]:
(4527, 3)
[12]:
raw_df.head()
[12]:
department | sex | admission | |
---|---|---|---|
0 | A | 1 | 1 |
1 | A | 1 | 1 |
2 | A | 1 | 1 |
3 | A | 1 | 1 |
4 | A | 1 | 1 |
Let’s fit a propensity model (the probability of being male given the department) and compute the inverse probability of treatment weights (IPTW).
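For reference, the standard definition of the weight for observation \(i\), which the code below implements, is \(w_i = \frac{T_i}{e(x_i)} + \frac{1 - T_i}{1 - e(x_i)}\), where \(e(x) = P(T = 1 \mid X = x)\) is the propensity score: treated units (males) get weight \(1 / e(x_i)\) and untreated units (females) get \(1 / (1 - e(x_i))\).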
[13]:
import numpy as np
from sklearn.linear_model import LogisticRegression
dept_levels = ['F', 'A', 'B', 'C', 'D', 'E']
# Propensity model: estimate P(sex = male | department) with logistic regression.
y, X = dmatrices('sex ~ C(department, levels=dept_levels)', raw_df, return_type='dataframe')
X = X.iloc[:,1:]  # drop patsy's intercept column; sklearn fits its own
y = np.ravel(y.values)
model = LogisticRegression(random_state=37)
model.fit(X, y)
[13]:
LogisticRegression(random_state=37)
[14]:
def get_inverse_weight(r):
    # IPTW: 1 / e(x) for treated (male) units, 1 / (1 - e(x)) for untreated (female) units.
    return 1 / r['propensity_score'] if r['T'] == 1 else 1 / (1 - r['propensity_score'])

psm_df = pd.DataFrame({
    'T': y,
    'propensity_score': model.predict_proba(X)[:,1]
}) \
    .assign(
        inverse_weight=lambda d: d.apply(get_inverse_weight, axis=1),
        sampling_weight=lambda d: d['inverse_weight'] / d['inverse_weight'].sum()
    )
We can then use the IPTW as sample weights in a logistic regression of admission on sex and department.
[15]:
y, X = dmatrices('admission ~ sex + C(department, levels=dept_levels) - 1', raw_df, return_type='dataframe')
X = X.iloc[:,1:]  # drop the first department dummy; it is absorbed into sklearn's intercept
y = np.ravel(y.values)
w = psm_df['inverse_weight']

# Weight each observation by its IPTW when fitting the outcome model.
model = LogisticRegression()
model.fit(X, y, sample_weight=w)

coef = pd.concat([
    pd.Series(model.intercept_, ['intercept']),
    pd.Series(model.coef_[0], X.columns)
])
As you can see, sex (being male) has a negative average treatment effect (ATE) on admission: it decreases the log odds of admission by about 0.22.
[16]:
coef
[16]:
intercept -2.436449
C(department, levels=dept_levels)[A] 3.491369
C(department, levels=dept_levels)[B] 3.177035
C(department, levels=dept_levels)[C] 1.941869
C(department, levels=dept_levels)[D] 1.879617
C(department, levels=dept_levels)[E] 1.481090
sex -0.218998
dtype: float64
Or, equivalently, being male decreases the odds of admission by about 20%.
[17]:
np.exp(coef)
[17]:
intercept 0.087471
C(department, levels=dept_levels)[A] 32.830876
C(department, levels=dept_levels)[B] 23.975554
C(department, levels=dept_levels)[C] 6.971766
C(department, levels=dept_levels)[D] 6.550998
C(department, levels=dept_levels)[E] 4.397738
sex 0.803323
dtype: float64