12. Propensity Score Matching
Propensity Score Matching (PSM) is an approach for causal inference. In particular, PSM is used to estimate the Average Treatment Effect (ATE) of one variable, termed the treatment variable \(T\), on another, termed the outcome variable \(O\). In general, \(T\) and \(O\) are pairs that are assumed to be in a causal relationship, where manipulating \(T\) changes \(O\). However, how \(T\) impacts \(O\) in magnitude and direction is usually confounded by many variables \(X = \{X_1, X_2, \ldots, X_n\}\). For example, if \(T\) represents the application of a novel surgical procedure used for a certain disease, then \(T=1\) denotes the use of such a procedure and \(T=0\) denotes its non-use. The outcome \(O\) could be that the patient has been cured (\(O=1\)) or not (\(O=0\)). What we want to do is to estimate how \(T\) impacts \(O\), and one way to do so is to simply take the difference of the expected values of the outcome when treatment is applied and when it is not.

\(\mathrm{ATE} = E[O|T=1] - E[O|T=0]\)

The problem is with confounding variables; it could be that healthier patients are systematically selected into treatment (\(T=1\)) and less healthy patients into non-treatment (\(T=0\)). If we were to estimate the ATE this way, the result would be biased. If there are confounding variables (bias) that we have not removed, then \(E[O|T=1] - E[O|T=0]\) is the associational effect, not the causal effect, of \(T\) on \(O\).
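As a quick illustration, here is a minimal sketch of this naive (associational) estimator on a made-up toy dataset; the columns `T` and `O` mirror the notation above.

```python
import pandas as pd

# toy data: T = 1 treated, T = 0 untreated; O = 1 cured, O = 0 not cured
toy_df = pd.DataFrame({
    'T': [1, 1, 1, 0, 0, 0],
    'O': [1, 1, 0, 1, 0, 0]
})

# naive estimate: E[O|T=1] - E[O|T=0]
naive_ate = toy_df[toy_df['T'] == 1]['O'].mean() - toy_df[toy_df['T'] == 0]['O'].mean()
print(naive_ate)  # 0.33; only the causal effect if treatment assignment is unconfounded
```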
Before we compute the ATE, we should remove confounding variables, and PSM is one way to do so by introducing a propensity score that balances the data/covariates (balancing means removing confounding variables). The propensity score of an item (e.g. a patient) is simply its probability of being assigned to treatment \(T=1\). Commonly, the propensity score of an item is produced through logistic regression, where \(T\) is the dependent variable and \(X\) are the independent variables. Once every item has a propensity score, pairs of samples from the \(T=0\) and \(T=1\) populations can be created based on their propensity scores (you want to create a pair of samples such that their propensity scores are nearly, if not entirely, identical). This approach makes sense since we are essentially pairing up items that are identical in every way except for whether or not they were treated; that the items in a pair are identical in every way is the removal of confounding variables/factors. This approach addresses the Fundamental Problem of Causal Inference (FPCI): for any single item, we can never observe both its treated and untreated outcomes, but a matched pair approximates those two counterfactual observations. Without going into depth, after we have a list of pairings of individuals (a treated item is matched and paired with a nearly/entirely identical untreated item), we can compute the ATE.
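For intuition, here is a minimal sketch of greedy one-to-one nearest-neighbor matching on propensity scores; the scores below are made up, and refinements such as a caliper (a maximum allowed score distance) are omitted.

```python
import pandas as pd

# hypothetical propensity scores; in practice they come from logistic regression T ~ X
scores_df = pd.DataFrame({
    'T':     [1, 1, 0, 0, 0, 0],
    'score': [0.70, 0.31, 0.69, 0.33, 0.90, 0.10]
})

treated = scores_df[scores_df['T'] == 1]
control = scores_df[scores_df['T'] == 0].copy()

pairs = []
for i, row in treated.iterrows():
    # greedily pair each treated item with the closest unmatched untreated item
    j = (control['score'] - row['score']).abs().idxmin()
    pairs.append((i, j))
    control = control.drop(index=j)  # match without replacement

print(pairs)  # [(0, 2), (1, 3)]
```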
There are many ways to conduct PSM, but in this notebook, we use double regression to estimate the ATE. Denote the following.

- \(T\) is the treatment variable
- \(O\) is the outcome variable
- \(X\) is the set of covariates (possibly confounding variables)

Then the procedure is as follows (a quick numeric check of the weight formula is sketched after the list).

1. Use logistic regression, T ~ X, to assign a propensity score \(\pi_i\) to each item.
2. Calculate the Inverse Probability Treatment Weight (IPTW) of each item from its propensity score and \(T\): \(w_i = \dfrac{T_i}{\pi_i} + \dfrac{1 - T_i}{1 - \pi_i}\)
3. Conduct weighted regression, O ~ T + X, using the weights \(w\).

The coefficient associated with \(T\) will be the ATE.
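As a sanity check on the weight formula, note that it up-weights surprising assignments: a treated item with a low propensity score (treated despite being unlikely to be) gets a large weight, as does an untreated item with a high propensity score. The propensity values below are made up.

```python
# w_i = T_i / pi_i + (1 - T_i) / (1 - pi_i)
def iptw(t, pi):
    return t / pi + (1 - t) / (1 - pi)

print(iptw(1, 0.9))  # treated, likely to be treated: ~1.11
print(iptw(1, 0.1))  # treated, unlikely to be treated: 10.0 (up-weighted)
print(iptw(0, 0.1))  # untreated, likely to be untreated: ~1.11
print(iptw(0, 0.9))  # untreated, unlikely to be untreated: 10.0 (up-weighted)
```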
12.1. Data
The data is taken from the Causal Analysis and Comparative Effectiveness Course. There are 6,657 patients and 9 covariates, each indicating the presence (1) or absence (0) of a disease:

- hypertension
- anemia
- diabetes
- hiv
- stomach cancer
- lung cancer
- myocardial infarction
- heart failure
- metastatic cancer
The treatment variable \(T\) denotes whether a patient was treated by Dr. Smith.

- 1 indicates a patient was treated by Dr. Smith
- 0 indicates a patient was not treated by Dr. Smith

The outcome variable \(O\) denotes the length of stay (LOS) of each patient. We want to estimate the average treatment effect (ATE) of Dr. Smith on the LOS.
[1]:
import pandas as pd
df = pd.read_csv('./psm-data.csv')
df.shape
[1]:
(6657, 11)
[2]:
df.head()
[2]:
| | hypertension | anemia | diabetes | hiv | stomach_cancer | lung_cancer | myocardial_infarction | heart_failure | metastetic_cancer | T | O |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 6 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 6 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
12.2. Visualization
12.2.1. Overall disease proportions
[3]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
X_cols = [c for c in df.columns if c not in {'T', 'O'}]
s = df[X_cols].sum() / df.shape[0] * 100.0
_ = s.plot(kind='bar', title='Proportion of Patients with Each Disease', ylabel='percentage')
12.2.2. Treatment percentage
[4]:
s = df['T'].value_counts().sort_index() / df['T'].value_counts().sum() * 100.0
s.index = [f'T={i} ({v:.1f}%)' for i, v in zip(s.index, s.values)]
_ = s.plot(kind='pie', title='Percentage of Patients Treated by Dr. Smith', ylabel='')
[5]:
_ = df['O'].value_counts().sort_index().plot(kind='bar', title='Distribution of LOS')
[6]:
import seaborn as sns
s = df[['T', 'O', 'hiv']].groupby(['T', 'O']).agg('count').reset_index().rename(columns={'hiv': 'n'})
ax = sns.barplot(x='O', y='n', hue='T', data=s)
_ = ax.set_title('Distribution of LOS by Treatment')
12.2.3. Proportions by disease and treatment
[7]:
s0 = df[df['T'] == 0][X_cols].sum() / df[df['T'] == 0][X_cols].shape[0] * 100.0
s1 = df[df['T'] == 1][X_cols].sum() / df[df['T'] == 1][X_cols].shape[0] * 100.0
ax = pd.DataFrame({'T=0': s0, 'T=1': s1}).plot(kind='bar', title='Proportions of Diseases by Treatment', ylabel='percentage')
_ = ax.legend(bbox_to_anchor=(1.0, 1.05))
12.2.4. Two-proportion z-test
We perform a two-proportion z-test to see if there is any difference in the proportions by treatment.
[8]:
from statsmodels.stats.proportion import proportions_ztest
count_df = pd.DataFrame({'T=0': df[df['T'] == 0][X_cols].sum(), 'T=1': df[df['T'] == 1][X_cols].sum()})
n = {
    0: df[df['T'] == 0][X_cols].shape[0],
    1: df[df['T'] == 1][X_cols].shape[0]
}
ztest_df = pd.DataFrame([(i,) + proportions_ztest(count=[r['T=0'], r['T=1']], nobs=[n[0], n[1]], alternative='two-sided')
                         for i, r in count_df.iterrows()], columns=['variable', 'stats', 'p_value'])
ztest_df['p < 0.01'] = ztest_df['p_value'].apply(lambda p: p < 0.01)
ztest_df
[8]:
| | variable | stats | p_value | p < 0.01 |
|---|---|---|---|---|
| 0 | hypertension | -0.109640 | 9.126952e-01 | False |
| 1 | anemia | 0.238975 | 8.111251e-01 | False |
| 2 | diabetes | 0.019694 | 9.842877e-01 | False |
| 3 | hiv | 0.158755 | 8.738619e-01 | False |
| 4 | stomach_cancer | 0.158755 | 8.738619e-01 | False |
| 5 | lung_cancer | 0.794437 | 4.269413e-01 | False |
| 6 | myocardial_infarction | 2.890758 | 3.843136e-03 | True |
| 7 | heart_failure | 12.233395 | 2.061393e-34 | True |
| 8 | metastetic_cancer | 12.233395 | 2.061393e-34 | True |
12.3. Propensity Score
[9]:
from sklearn.linear_model import LogisticRegression

# logistic regression T ~ X; the predicted probabilities are the propensity scores
X = df[df.columns.drop(['O', 'T'])]
y = df['T']

model = LogisticRegression(random_state=37)
model.fit(X, y)
[9]:
LogisticRegression(random_state=37)
[10]:
psm_df = pd.DataFrame({
    'T': y,
    'propensity_score': model.predict_proba(X)[:,1]
})

# IPTW: w_i = T_i / pi_i + (1 - T_i) / (1 - pi_i)
psm_df['inverse_weight'] = psm_df.apply(lambda r: 1 / r['propensity_score'] if r['T'] == 1 else 1 / (1 - r['propensity_score']), axis=1)
psm_df['sampling_weight'] = psm_df['inverse_weight'] / psm_df['inverse_weight'].sum()
[11]:
psm_df.head()
[11]:
| | T | propensity_score | inverse_weight | sampling_weight |
|---|---|---|---|---|
| 0 | 0 | 0.706267 | 3.404448 | 0.000256 |
| 1 | 0 | 0.707120 | 3.414373 | 0.000257 |
| 2 | 1 | 0.706267 | 1.415896 | 0.000107 |
| 3 | 1 | 0.707120 | 1.414186 | 0.000106 |
| 4 | 0 | 0.704355 | 3.382430 | 0.000254 |
Check that the weights can reconstruct the original population.
[12]:
psm_df.shape[0], psm_df[psm_df['T'] == 0].inverse_weight.sum(), psm_df[psm_df['T'] == 1].inverse_weight.sum()
[12]:
(6657, 6614.186072184293, 6678.808997206876)
Check for evidence of confounding.
[13]:
ax = sns.boxplot(x='T', y='inverse_weight', data=psm_df)
_ = ax.set_title('Confounding Evidence')
The positivity check ensures that there is overlap between the treatment and non-treatment groups.
[14]:
ax = sns.histplot(data=psm_df, x='inverse_weight', hue='T', kde=True)
_ = ax.set_title('Positivity Check')
We can check whether the covariates are now balanced by conducting a two-proportion z-test on each variable.
[15]:
# resample according to the inverse weights to create a balanced pseudo-population
weighted_df = df.sample(frac=1, weights=psm_df.inverse_weight, replace=True).drop(columns=['O'])
weighted_df.shape
[15]:
(6657, 10)
[16]:
cols = weighted_df.columns.drop(['T'])
count_df = pd.DataFrame({'T=0': weighted_df[weighted_df['T'] == 0][cols].sum(), 'T=1': weighted_df[weighted_df['T'] == 1][cols].sum()})
n = {
    0: weighted_df[weighted_df['T'] == 0][cols].shape[0],
    1: weighted_df[weighted_df['T'] == 1][cols].shape[0]
}
ztest_df = pd.DataFrame([(i,) + proportions_ztest(count=[r['T=0'], r['T=1']], nobs=[n[0], n[1]], alternative='two-sided')
                         for i, r in count_df.iterrows()], columns=['variable', 'stats', 'p_value'])
ztest_df['p < 0.01'] = ztest_df['p_value'].apply(lambda p: p < 0.01)
ztest_df
[16]:
| | variable | stats | p_value | p < 0.01 |
|---|---|---|---|---|
| 0 | hypertension | 1.130927 | 0.258086 | False |
| 1 | anemia | 0.900596 | 0.367803 | False |
| 2 | diabetes | -1.017216 | 0.309051 | False |
| 3 | hiv | 0.263115 | 0.792462 | False |
| 4 | stomach_cancer | 1.495739 | 0.134722 | False |
| 5 | lung_cancer | 0.156585 | 0.875572 | False |
| 6 | myocardial_infarction | -0.548840 | 0.583115 | False |
| 7 | heart_failure | 0.473499 | 0.635857 | False |
| 8 | metastetic_cancer | 0.027691 | 0.977908 | False |
12.4. Average Treatment Effect
The ATE of Dr. Smith on LOS is a reduction of 0.06 days.
[17]:
from sklearn.linear_model import LinearRegression

# weighted regression O ~ T + X using the IPTW weights;
# the coefficient on T is the ATE estimate
X = df[df.columns.drop(['O'])]
y = df['O']
w = psm_df['inverse_weight']

model = LinearRegression()
model.fit(X, y, sample_weight=w)

pd.concat([
    pd.Series(model.intercept_, ['intercept']),
    pd.Series(model.coef_, df.columns.drop(['O']))
])
[17]:
intercept 3.838081
hypertension 0.082137
anemia -0.066576
diabetes 0.122406
hiv -0.003851
stomach_cancer 0.018383
lung_cancer -0.029819
myocardial_infarction 0.033302
heart_failure 0.015607
metastetic_cancer 0.109735
T -0.062517
dtype: float64
12.5. ATE with confidence intervals
If we want to place a confidence interval around the ATE, we can use bootstrap sampling. The procedure above is repeated many times, each time resampling items from the population with replacement.
[18]:
def estimate_ate(df):
    # bootstrap resample the population with replacement
    sample_df = df.sample(frac=1, replace=True)

    # propensity scores via logistic regression T ~ X
    X = sample_df[sample_df.columns.drop(['O', 'T'])]
    y = sample_df['T']

    model = LogisticRegression(random_state=37)
    model.fit(X, y)

    psm_df = pd.DataFrame({
        'T': y,
        'propensity_score': model.predict_proba(X)[:,1]
    })
    psm_df['inverse_weight'] = psm_df.apply(lambda r: 1 / r['propensity_score'] if r['T'] == 1 else 1 / (1 - r['propensity_score']), axis=1)
    psm_df['sampling_weight'] = psm_df['inverse_weight'] / psm_df['inverse_weight'].sum()

    # weighted regression O ~ T + X; the coefficient on T is the ATE
    X = sample_df[sample_df.columns.drop(['O'])]
    y = sample_df['O']
    w = psm_df['inverse_weight']

    model = LinearRegression()
    model.fit(X, y, sample_weight=w)

    s = pd.concat([
        pd.Series(model.intercept_, ['intercept']),
        pd.Series(model.coef_, df.columns.drop(['O']))
    ])
    return s

bootstrap_df = pd.DataFrame([estimate_ate(df) for _ in range(1_000)])
[20]:
import numpy as np
ate_mean = bootstrap_df['T'].mean()
lower_ci, upper_ci = np.percentile(bootstrap_df['T'], 2.5), np.percentile(bootstrap_df['T'], 97.5)
lower_ci, upper_ci
[20]:
(-0.1513225823433984, 0.027079589368043645)
[21]:
fg = sns.displot(bootstrap_df['T'], kde=False)
_ = fg.ax.vlines(lower_ci, 0, 30, linestyles='dotted', color='blue')
_ = fg.ax.vlines(upper_ci, 0, 30, linestyles='dotted', color='blue')
_ = fg.ax.set_title(rf'ATE 95% CI $[{lower_ci:.2f}, {upper_ci:.2f}]$, $\mu={ate_mean:.2f}$')