17. Pearl Causal Hierarchy

Dr. Judea Pearl is credited with organizing causal inference into a hierarchy, which is called the Pearl Causal Hierarchy (PCH) in his honor. The PCH is often pictured as a ladder of causal inference. In the PCH, the types of inference, from “easiest” to “hardest”, are as follows.

  • Associational Inference (Rung 1): This type of inference includes marginal, correlational or conditional queries.

    • For example, what is the probability of cancer given that a person smokes?

    • This type of inference is a “mere filtering of the world.”

  • Interventional Inference (Rung 2): This type of inference involves manipulating the system/model.

    • For example, what is the effect of smoking on having cancer versus the effect of not smoking on having cancer?

    • This type of inference involves “changing the world.”

  • Counterfactual Inference (Rung 3): This type of inference asks about outcomes in a world that did not happen, given information about the world that did happen.

    • For example, given that a person smoked and had cancer, would that person have cancer had they not smoked?

    • This type of inference involves sharing information between a “factual world” and a “counterfactual world” and estimating outcomes in the hypothetical world. It’s interesting to note that both worlds are irrefutable (the factual world did indeed happen, and there’s no way to refute the counterfactual world). (On another note, is counterfactual inference pseudo-science through the lens of Karl Popper’s scientific philosophy?)
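
In Pearl’s notation, these three rungs correspond to three canonical query types.

  • Rung 1 (association, “seeing”): \(P(y \mid x)\)

  • Rung 2 (intervention, “doing”): \(P(y \mid do(x))\)

  • Rung 3 (counterfactual, “imagining”): \(P(y_{x'} \mid x, y)\)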

Let’s generate some data to demonstrate PCH.

17.1. Generate data

We will generate data as follows.

  • \(C \sim \mathcal{N}(5, 1)\)

  • \(X \sim \mathcal{N}(10 + C, 1)\)

  • \(Y \sim \mathcal{N}(4 + 2 C + 1.5 X, 1)\)

This data-generating process is a causal model where

  • \(C\) is a confounder of \(X\) and \(Y\), and

  • \(X\) is a cause of \(Y\).

[1]:
import pandas as pd
import numpy as np

np.random.seed(37)

N = 1_000

# simulate C, then X given C, then Y given C and X, per the equations above
Xy = pd.DataFrame({'C': np.random.normal(5, 1, N)}) \
    .assign(X=lambda d: np.random.normal(10 + d['C'], 1)) \
    .assign(Y=lambda d: np.random.normal(4 + 2 * d['C'] + 1.5 * d['X'], 1)) \
    [['X', 'C', 'Y']]
Xy
[1]:
X C Y
0 14.285582 4.945536 35.373967
1 15.601920 5.674308 37.632544
2 14.358371 5.346647 36.012563
3 14.210090 3.699654 30.766832
4 16.961011 6.518512 43.044552
... ... ... ...
995 14.177629 4.401151 34.393250
996 13.753148 4.748119 32.859676
997 15.704895 5.371860 37.839380
998 14.997756 6.022343 38.169453
999 15.674885 5.429555 38.106253

1000 rows × 3 columns

17.2. Associational inference

Associational inference is common in data science (machine learning, artificial intelligence). A “predictive model,” like a regression model, is associational in nature: given some \(X\), what is \(Y\)? Below, a random forest regressor is used as an example of a model for associational/predictive inference.

[2]:
from sklearn.ensemble import RandomForestRegressor

X = Xy[['X', 'C']]
y = Xy['Y']

model = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=37) \
    .fit(X, y)
[3]:
from sklearn.metrics import mean_absolute_error, explained_variance_score

y_pred = model.predict(X)

pd.Series({
    'mae': mean_absolute_error(y, y_pred),
    'evs': explained_variance_score(y, y_pred)
})
[3]:
mae    0.323627
evs    0.987774
dtype: float64

17.3. Interventional inference

Interventional inference is formalized by Pearl’s do-calculus. In interventional inference, we are interested in the causal impact of doing something. If we estimate that impact by purely associational means, we are prone to biased estimates due to confounding variables. The example below does not use Pearl’s do-calculus to estimate the causal effect, but rather a technique called DoubleML. In Pearl’s do-calculus, we identify confounders and block the non-causal pathways between the variable we manipulate and the outcome variable to estimate the Average Causal Effect (ACE). In DoubleML, a Partially Linear Regression (PLR) model is used to estimate the ACE. The PLR model is specified as follows.

\(Y = D \theta_0 + g_0(X) + \epsilon_Y\)

Where,

  • \(Y\) is the outcome variable (effect),

  • \(D\) is the policy variable (cause),

    • \(D = m_0(X) + \epsilon_D\)

  • \(X\) is the set of confounding variables,

  • \(\theta_0\) is the causal impact (ACE) of \(D\) on \(Y\),

  • \(g_0(X)\) is a “machine learning” model (mapping \(X\) to \(Y\)),

  • \(m_0(X)\) is a “machine learning” model (mapping \(X\) to \(D\)), and

  • \(\epsilon_Y\) and \(\epsilon_D\) are error terms in the corresponding models.

Note that we do not care about predicting \(Y\) per se; we only care about \(\theta_0\), which is the causal impact of \(D\) on \(Y\).
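
To build intuition for what the PLR estimator does, here is a minimal “partialling out” sketch, assuming linear nuisance models and skipping the cross-fitting that DoubleML performs (the names ell_hat, m_hat, and theta_hat are illustrative): residualize \(Y\) and \(D\) on the confounders, then regress residual on residual to recover \(\theta_0\).

from sklearn.linear_model import LinearRegression

# residualize Y and D (= X) on the confounder C, then regress residual on residual
ell_hat = LinearRegression().fit(Xy[['C']], Xy['Y'])    # estimates E[Y | C]
m_hat = LinearRegression().fit(Xy[['C']], Xy['X'])      # estimates E[D | C] = m_0(C)

res_y = Xy['Y'] - ell_hat.predict(Xy[['C']])            # Y with the effect of C removed
res_d = Xy['X'] - m_hat.predict(Xy[['C']])              # D with the effect of C removed

theta_hat = (res_d * res_y).sum() / (res_d ** 2).sum()  # residual-on-residual slope
theta_hat                                               # should land near the true ACE of 1.5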

[4]:
from doubleml import DoubleMLData

data = DoubleMLData(Xy, y_col='Y', d_cols='X', x_cols=['C'])
[5]:
from sklearn.base import clone
from doubleml import DoubleMLPLR

learner = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=37)
ml_l = clone(learner)
ml_m = clone(learner)

model = DoubleMLPLR(data, ml_l, ml_m)
model.fit()
[5]:
<doubleml.double_ml_plr.DoubleMLPLR at 0x309c80490>

Using DoubleML, the estimated causal impact of \(X\) on \(Y\), controlling for confounding, is about 1.44. In other words, the estimated ACE of \(X\) on \(Y\) is 1.44, which is close to the true value of 1.5.

[6]:
model.summary
[6]:
coef std err t P>|t| 2.5 % 97.5 %
X 1.444011 0.030134 47.919877 0.0 1.38495 1.503073
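
As a quick sanity check on the confounding story (a hedged sketch, not part of the DoubleML workflow above), compare a naive regression of \(Y\) on \(X\) alone with one that adjusts for \(C\). Because \(C\) drives both \(X\) and \(Y\), the naive coefficient should come out well above 1.5, while adjusting for \(C\) should recover something close to the true effect.

from sklearn.linear_model import LinearRegression

naive = LinearRegression().fit(Xy[['X']], Xy['Y'])          # ignores the confounder C
adjusted = LinearRegression().fit(Xy[['X', 'C']], Xy['Y'])  # adjusts for C

pd.Series({
    'naive_X_coef': naive.coef_[0],       # inflated by confounding
    'adjusted_X_coef': adjusted.coef_[0]  # should be close to the true value of 1.5
})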

17.4. Counterfactual inference

Now we are ready for counterfactual inference. Consider the following structural causal models (SCMs).

  • \(C := U_C\)

  • \(X := f_X(C) + U_X\)

  • \(Y := f_Y(X, C) + U_Y\)

Where,

  • \(C\) is the confounder,

  • \(X\) is the cause,

  • \(Y\) is the effect,

  • \(f_X(C)\) is some function that maps \(C\) to \(X\),

  • \(f_Y(X, C)\) is some function that maps \(X\) and \(C\) to \(Y\), and

  • \(U=\{U_C, U_X, U_Y\}\) are the (hidden) error terms.

Then we can invert the SCMs to define \(U\) as follows.

  • \(U_C = C\)

  • \(U_X = X - f_X(C)\)

  • \(U_Y = Y - f_Y(X, C)\)
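
The steps below only need \(U_Y\), since \(C\) keeps its observed value and \(X\) is set by the intervention, but every equation can be inverted the same way. Here is a minimal sketch for the first observation; learning \(f_X\) with a random forest is an illustrative choice, not a prescribed one.

from sklearn.ensemble import RandomForestRegressor

# learn f_X: C -> X, in the same spirit as f_Y below
f_X = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=37) \
    .fit(Xy[['C']], Xy['X'])

row = Xy.iloc[[0]]                                   # the first observation as evidence
U_C = row['C'].iloc[0]                               # U_C = C
U_X = row['X'].iloc[0] - f_X.predict(row[['C']])[0]  # U_X = X - f_X(C)
# U_Y = Y - f_Y(X, C) is computed in the abduction step below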

Counterfactual inference is done in 3 steps.

  1. Abduction: Given \(E=e\) (the evidence or the facts), compute \(U\).

  2. Intervention: Set the variables in \(V = \{X, C\}\) to their values under the hypothetical; here, \(C\) keeps its observed value and \(X\) is set to the counterfactual value \(x'\).

  3. Prediction: Estimate \(Y := f_Y(X, C) + U_Y\) using the results from abduction and intervention.

In counterfactual inference, everything in the factual world is observed. That observed “everything” is the evidence \(E=e\). For example, the factual/evidence that is observed is

\(X=x, C=c, Y=y\).

The counterfactual is the hypothetical that we imagine to have happened, \(X=x'\). We want to know, given the factual, what \(Y\) would have been had the hypothetical/counterfactual happened. That is, we want to estimate the following.

\(Y_{X=x'}(U_C, U_X, U_Y)\)

We cannot stress enough that the factual/evidence \(X=x\) is different from the counterfactual/hypothetical \(X=x'\).

Let’s see how counterfactual reasoning works. Let’s say we observe the following (it’s just the first example from the simulated data).

\(X=14.285582, C=4.945536, Y=35.373967\)

[7]:
Xy.iloc[[0]]
[7]:
X C Y
0 14.285582 4.945536 35.373967

In this example, we ask the following counterfactual question.

  • Given, \(X=14.285582, C=4.945536, Y=35.373967\), what would the value of \(Y\) be had \(X=0\)?

Let’s see how we can use SCMs to answer this counterfactual question. First, we need to estimate \(U_Y\). Before we can estimate \(U_Y\), we need to estimate \(f_Y\). We can learn \(f_Y\) as follows.

[8]:
X = Xy[['X', 'C']]
y = Xy['Y']

f_Y = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=37) \
    .fit(X, y)

17.4.1. Abduction

Remember, \(U_Y\) is just the difference between what we observed and what is predicted. It is the error.

[9]:
y_true = Xy.iloc[0]['Y']
y_pred = f_Y.predict(Xy.iloc[[0]].drop(columns=['Y']))[0]
U_Y = y_true - y_pred

pd.Series({
    'y_true': y_true,
    'y_pred': y_pred,
    'U_Y': U_Y
})
[9]:
y_true    35.373967
y_pred    35.556930
U_Y       -0.182963
dtype: float64

17.4.2. Intervention

Intervention is formally described as modifying the (causal) graph by removing the edges coming from the parents of the nodes being intervened on. But we can also view it as overriding the evidence with the counterfactual/hypothetical. Below,

  • the factual \(X=x=14.285582\), and

  • the counterfactual is \(X=x'=0\).

[10]:
I = pd.DataFrame([[0, 4.945536]], columns=['X', 'C'])
I
[10]:
X C
0 0 4.945536
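
The graph-surgery view of this same intervention can also be sketched explicitly. The snippet below uses networkx (an illustrative choice; the library is not used elsewhere in this guide) to cut the edges pointing into \(X\).

import networkx as nx

# the causal graph: C -> X, C -> Y, X -> Y
g = nx.DiGraph([('C', 'X'), ('C', 'Y'), ('X', 'Y')])

# do(X = x'): remove the edges pointing into X (here, C -> X)
g_do = g.copy()
g_do.remove_edges_from(list(g.in_edges('X')))

list(g_do.edges())  # [('C', 'Y'), ('X', 'Y')]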

17.4.3. Prediction

Now we can return to the SCM, \(Y := f_Y(X, C) + U_Y\), and conduct the counterfactual inference.

[11]:
f_Y.predict(I) + U_Y
[11]:
array([30.82695271])

Let’s loop back to the counterfactual question.

  • Given, \(X=14.285582, C=4.945536, Y=35.373967\), what would the value of \(Y\) be had \(X=0\)?

The answer is as follows.

  • Given, \(X=14.285582, C=4.945536, Y=35.373967\), \(Y\) would have been 30.82695271 had \(X=0\).

The key to SCMs and counterfactual inference lies in the error terms \(U\). The error terms are the carriers of information from the factual world to the counterfactual world.

Dr. Pearl says “counterfactual is easy”. Below, we compute 41 counterfactuals “easily”.

  • X’ are the counterfactual values

  • C stays the same

  • Y_H are the values of Y had \(X=x'\) happened

[12]:
# 40 counterfactual values of X, plus the factual value, with C held at its observed value
I = [[x, 4.945536] for x in np.linspace(10, 20, 40)]
I = I + [[14.285582, 4.945536]]
I = pd.DataFrame(I, columns=['X', 'C']) \
    .sort_values(['X']) \
    .reset_index(drop=True)

I.assign(Y=lambda d: f_Y.predict(d) + U_Y) \
    .rename(columns={'X': "X'", 'Y': 'Y_H'})
[12]:
X' C Y_H
0 10.000000 4.945536 30.826953
1 10.256410 4.945536 30.826953
2 10.512821 4.945536 30.826953
3 10.769231 4.945536 30.826953
4 11.025641 4.945536 30.826953
5 11.282051 4.945536 30.803332
6 11.538462 4.945536 30.917718
7 11.794872 4.945536 31.154380
8 12.051282 4.945536 31.324719
9 12.307692 4.945536 31.542255
10 12.564103 4.945536 31.845835
11 12.820513 4.945536 32.753854
12 13.076923 4.945536 34.193431
13 13.333333 4.945536 34.111905
14 13.589744 4.945536 34.367379
15 13.846154 4.945536 34.759914
16 14.102564 4.945536 35.247893
17 14.285582 4.945536 35.373967
18 14.358974 4.945536 35.187050
19 14.615385 4.945536 35.411903
20 14.871795 4.945536 36.303457
21 15.128205 4.945536 36.356752
22 15.384615 4.945536 36.288716
23 15.641026 4.945536 37.023865
24 15.897436 4.945536 37.648353
25 16.153846 4.945536 37.844800
26 16.410256 4.945536 38.028266
27 16.666667 4.945536 38.577302
28 16.923077 4.945536 38.981302
29 17.179487 4.945536 39.178049
30 17.435897 4.945536 39.277327
31 17.692308 4.945536 39.297267
32 17.948718 4.945536 39.433114
33 18.205128 4.945536 39.433114
34 18.461538 4.945536 39.578524
35 18.717949 4.945536 39.578524
36 18.974359 4.945536 39.578524
37 19.230769 4.945536 39.578524
38 19.487179 4.945536 39.578524
39 19.743590 4.945536 39.578524
40 20.000000 4.945536 39.578524

17.5. Summary

It’s easy to do causal inference in the way described above when

  1. you have the causal structure, and

  2. you are operating over continuous variables.

After reading this guide, you may be wondering: how do I know that \(X\) causes \(Y\) and that \(C\) is a confounder in my data? The short answer is to either consult a subject matter expert or use a structure learning algorithm. You do not need the causal structure for associational inference, but you do for interventional and counterfactual inference.

Additionally, what happens when we have categorical variables and/or operate in probability space instead of raw value space? In this guide, we framed the SCMs in terms of raw values (e.g., \(X=10\), \(X=20\), etc.). What if we framed them in terms of probabilities instead (e.g., \(P(X=10)\), \(P(X=20)\), etc.)?