17. Pearl Causal Hierarchy

Dr. Judea Pearl is credited with organizing causal inference into a hierarchy, which is called the Pearl Causal Hierarchy (PCH) in his honor. The PCH is often pictured as a ladder of causal inference. In the PCH, the types of inference, from “easiest” to “hardest”, are as follows.

  • Associational Inference (Rung 1): This type of inference includes marginal, correlational or conditional queries.

    • For example, what is the probability of cancer given that a person smokes?

    • This type of inference is a “mere filtering of the world.”

  • Interventional Inference (Rung 2): This type of inference involves manipulating the system/model.

    • For example, what is the effect of smoking on having cancer versus the effect of not smoking on having cancer?

    • This type of inference involves “changing the world.”

  • Counterfactual Inference (Rung 3): This type of inference asks about outcomes in a world that did not happen, given information about the world that did happen.

    • For example, given that a person smoked and had cancer, would that person have cancer had they not smoked?

    • This type of inference involves sharing information between a “factual world” and a “counterfactual world” and estimating outcomes in the hypothetical world. It’s interesting to note that both worlds are irrefutable (the factual world did indeed happen, and there’s no way to refute the counterfactual world). (On another note, is counterfactual inference pseudo-science through the lens of Karl Popper’s scientific philosophy?)
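
In Pearl’s notation, these three rungs correspond to three canonical query types.

  • Rung 1 (association, “seeing”): \(P(y \mid x)\)

  • Rung 2 (intervention, “doing”): \(P(y \mid do(x))\)

  • Rung 3 (counterfactual, “imagining”): \(P(y_{x'} \mid x, y)\)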

Let’s generate some data to demonstrate PCH.

17.1. Generate data

We will generate data as follows.

  • \(C \sim \mathcal{N}(5, 1)\)

  • \(X \sim \mathcal{N}(10 + C, 1)\)

  • \(Y \sim \mathcal{N}(4 + 2 C + 1.5 X, 1)\)

This data-generating process is a causal model where

  • \(C\) is a confounder of \(X\) and \(Y\), and

  • \(X\) is a cause of \(Y\).

[1]:
import pandas as pd
import numpy as np

np.random.seed(37)

N = 1_000

# simulate C, then X given C, then Y given C and X, per the equations above
Xy = pd.DataFrame({'C': np.random.normal(5, 1, N)}) \
    .assign(X=lambda d: np.random.normal(10 + d['C'], 1)) \
    .assign(Y=lambda d: np.random.normal(4 + 2 * d['C'] + 1.5 * d['X'], 1)) \
    [['X', 'C', 'Y']]
Xy
[1]:
X C Y
0 14.285582 4.945536 35.373967
1 15.601920 5.674308 37.632544
2 14.358371 5.346647 36.012563
3 14.210090 3.699654 30.766832
4 16.961011 6.518512 43.044552
... ... ... ...
995 14.177629 4.401151 34.393250
996 13.753148 4.748119 32.859676
997 15.704895 5.371860 37.839380
998 14.997756 6.022343 38.169453
999 15.674885 5.429555 38.106253

1000 rows × 3 columns

17.2. Associational inference

Associational inference is common in data science (machine learning, artificial intelligence). A “predictive model,” like a regression model, is associational in nature: given some \(X\), what is \(Y\)? Below, a random forest regressor is used as an example of a model for associational/predictive inference.

[2]:
from sklearn.ensemble import RandomForestRegressor

X = Xy[['X', 'C']]
y = Xy['Y']

model = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=37) \
    .fit(X, y)
[3]:
from sklearn.metrics import mean_absolute_error, explained_variance_score

y_pred = model.predict(X)

pd.Series({
    'mae': mean_absolute_error(y, y_pred),
    'evs': explained_variance_score(y, y_pred)
})
[3]:
mae    0.323627
evs    0.987774
dtype: float64

17.3. Interventional inference

Interventional inference is formalized by Pearl’s do-calculus. In interventional inference, we are interested in the causal impact of doing something. If we estimate that impact by purely associational means, we are prone to biased estimates due to confounding variables. The example below does not use Pearl’s do-calculus to estimate the causal effect, but rather a technique called DoubleML. In Pearl’s do-calculus, we identify confounders and block the non-causal pathways between the variable we manipulate and the outcome variable to estimate the Average Causal Effect (ACE). In DoubleML, a Partially Linear Regression (PLR) model is used to estimate the ACE. The PLR model is specified as follows.

\(Y = D \theta_0 + g_0(X) + \epsilon_Y\)

Where,

  • \(Y\) is the outcome variable (effect),

  • \(D\) is the policy variable (cause),

    • \(D = m_0(X) + \epsilon_D\)

  • \(X\) is the set of confounding variables,

  • \(\theta_0\) is the causal impact (ACE) of \(D\) on \(Y\),

  • \(g_0(X)\) is a “machine learning” model (mapping \(X\) to \(Y\)),

  • \(m_0(X)\) is a “machine learning” model (mapping \(X\) to \(D\)), and

  • \(\epsilon_Y\) and \(\epsilon_D\) are error terms in the corresponding models.

Note that we do not care about predicting \(Y\) per se; we only care about \(\theta_0\), which is the causal impact of \(D\) on \(Y\).
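
To build intuition for what the PLR estimator does, here is a minimal “partialling out” sketch, assuming linear nuisance models and skipping the cross-fitting that DoubleML performs (the names ell_hat, m_hat, and theta_hat are illustrative): residualize \(Y\) and \(D\) on the confounders, then regress residual on residual to recover \(\theta_0\).

from sklearn.linear_model import LinearRegression

# residualize Y and D (= X) on the confounder C, then regress residual on residual
ell_hat = LinearRegression().fit(Xy[['C']], Xy['Y'])    # estimates E[Y | C]
m_hat = LinearRegression().fit(Xy[['C']], Xy['X'])      # estimates E[D | C] = m_0(C)

res_y = Xy['Y'] - ell_hat.predict(Xy[['C']])            # Y with the effect of C removed
res_d = Xy['X'] - m_hat.predict(Xy[['C']])              # D with the effect of C removed

theta_hat = (res_d * res_y).sum() / (res_d ** 2).sum()  # residual-on-residual slope
theta_hat                                               # should land near the true ACE of 1.5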

[4]:
from doubleml import DoubleMLData

data = DoubleMLData(Xy, y_col='Y', d_cols='X', x_cols=['C'])
[5]:
from sklearn.base import clone
from doubleml import DoubleMLPLR

learner = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=37)
ml_l = clone(learner)
ml_m = clone(learner)

model = DoubleMLPLR(data, ml_l, ml_m)
model.fit()
[5]:
<doubleml.double_ml_plr.DoubleMLPLR at 0x309c80490>

Using DoubleML, the estimated causal impact of \(X\) on \(Y\), controlling for confounding, is about 1.44. In other words, the estimated ACE of \(X\) on \(Y\) is 1.44, which is close to the true value of 1.5.

[6]:
model.summary
[6]:
coef std err t P>|t| 2.5 % 97.5 %
X 1.444011 0.030134 47.919877 0.0 1.38495 1.503073
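
As a quick sanity check on the confounding story (a hedged sketch, not part of the DoubleML workflow above), compare a naive regression of \(Y\) on \(X\) alone with one that adjusts for \(C\). Because \(C\) drives both \(X\) and \(Y\), the naive coefficient should come out well above 1.5, while adjusting for \(C\) should recover something close to the true effect.

from sklearn.linear_model import LinearRegression

naive = LinearRegression().fit(Xy[['X']], Xy['Y'])          # ignores the confounder C
adjusted = LinearRegression().fit(Xy[['X', 'C']], Xy['Y'])  # adjusts for C

pd.Series({
    'naive_X_coef': naive.coef_[0],       # inflated by confounding
    'adjusted_X_coef': adjusted.coef_[0]  # should be close to the true value of 1.5
})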

17.4. Counterfactual inference

Now we are ready for counterfactual inference. Consider the following structural causal models (SCMs).

  • \(C := U_C\)

  • \(X := f_X(C) + U_X\)

  • \(Y := f_Y(X, C) + U_Y\)

Where,

  • \(C\) is the confounder,

  • \(X\) is the cause,

  • \(Y\) is the effect,

  • \(f_X(C)\) is some function that maps \(C\) to \(X\),

  • \(f_Y(X, C)\) is some function that maps \(X\) and \(C\) to \(Y\), and

  • \(U=\{U_C, U_X, U_Y\}\) are the (hidden) error terms.

Then we can invert the SCMs to define \(U\) as follows.

  • \(U_C = C\)

  • \(U_X = X - f_X(C)\)

  • \(U_Y = Y - f_Y(X, C)\)
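
The steps below only need \(U_Y\), since \(C\) keeps its observed value and \(X\) is set by the intervention, but every equation can be inverted the same way. Here is a minimal sketch for the first observation; learning \(f_X\) with a random forest is an illustrative choice, not a prescribed one.

from sklearn.ensemble import RandomForestRegressor

# learn f_X: C -> X, in the same spirit as f_Y below
f_X = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=37) \
    .fit(Xy[['C']], Xy['X'])

row = Xy.iloc[[0]]                                   # the first observation as evidence
U_C = row['C'].iloc[0]                               # U_C = C
U_X = row['X'].iloc[0] - f_X.predict(row[['C']])[0]  # U_X = X - f_X(C)
# U_Y = Y - f_Y(X, C) is computed in the abduction step below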

Counterfactual inference is done in 3 steps.

  1. Abduction: Given \(E=e\) (the evidence or the facts), compute \(U\).

  2. Intervention: Set the variables in \(V = \{X, C\}\) to their values under the hypothetical; here, \(C\) keeps its observed value and \(X\) is set to the counterfactual value \(x'\).

  3. Prediction: Estimate \(Y := f_Y(X, C) + U_Y\) using the results from abduction and intervention.

In counterfactual inference, everything in the factual world is observed. That observed “everything” is the evidence \(E=e\). For example, the factual/evidence that is observed is

\(X=x, C=c, Y=y\).

The counterfactual is the hypothetical that we imagine to have happened, \(X=x'\). We want to know, given the factual, what \(Y\) would have been had the hypothetical/counterfactual happened. That is, we want to estimate the following.

\(Y_{X=x'}(U_C, U_X, U_Y)\)

We cannot stress enough that the factual/evidence \(X=x\) is different from the counterfactual/hypothetical \(X=x'\).

Let’s see how counterfactual reasoning works. Let’s say we observe the following (it’s just the first example from the simulated data).

\(X=14.285582, C=4.945536, Y=35.373967\)

[7]:
Xy.iloc[[0]]
[7]:
X C Y
0 14.285582 4.945536 35.373967

In this example, we ask the following counterfactual question.

  • Given, \(X=14.285582, C=4.945536, Y=35.373967\), what would the value of \(Y\) be had \(X=0\)?

Let’s see how we can use SCMs to answer this counterfactual question. First, we need to estimate \(U_Y\). Before we can estimate \(U_Y\), we need to estimate \(f_Y\). We can learn \(f_Y\) as follows.

[8]:
X = Xy[['X', 'C']]
y = Xy['Y']

f_Y = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=37) \
    .fit(X, y)

17.4.1. Abduction

Remember, \(U_Y\) is just the difference between what we observed and what is predicted. It is the error.

[9]:
y_true = Xy.iloc[0]['Y']
y_pred = f_Y.predict(Xy.iloc[[0]].drop(columns=['Y']))[0]
U_Y = y_true - y_pred

pd.Series({
    'y_true': y_true,
    'y_pred': y_pred,
    'U_Y': U_Y
})
[9]:
y_true    35.373967
y_pred    35.556930
U_Y       -0.182963
dtype: float64

17.4.2. Intervention

Intervention is formally described as modifying the (causal) graph by removing the edges coming from the parents of the nodes being intervened on. But we can also view it as overriding the evidence with the counterfactual/hypothetical. Below,

  • the factual \(X=x=14.285582\), and

  • the counterfactual is \(X=x'=0\).

[10]:
I = pd.DataFrame([[0, 4.945536]], columns=['X', 'C'])
I
[10]:
X C
0 0 4.945536
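
The graph-surgery view of this same intervention can also be sketched explicitly. The snippet below uses networkx (an illustrative choice; the library is not used elsewhere in this guide) to cut the edges pointing into \(X\).

import networkx as nx

# the causal graph: C -> X, C -> Y, X -> Y
g = nx.DiGraph([('C', 'X'), ('C', 'Y'), ('X', 'Y')])

# do(X = x'): remove the edges pointing into X (here, C -> X)
g_do = g.copy()
g_do.remove_edges_from(list(g.in_edges('X')))

list(g_do.edges())  # [('C', 'Y'), ('X', 'Y')]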

17.4.3. Prediction

Now we can return to the SCM, \(Y := f_Y(X, C) + U_Y\), and conduct the counterfactual inference.

[11]:
f_Y.predict(I) + U_Y
[11]:
array([30.82695271])

Let’s loop back to the counterfactual question.

  • Given, \(X=14.285582, C=4.945536, Y=35.373967\), what would the value of \(Y\) be had \(X=0\)?

The answer is as follows.

  • Given, \(X=14.285582, C=4.945536, Y=35.373967\), \(Y\) would have been 30.82695271 had \(X=0\).

The key to SCMs and counterfactual inference lies in the error terms \(U\). The error terms are the carriers of information from the factual world to the counterfactual world.

Dr. Pearl says “counterfactual is easy”. Below, we compute 41 counterfactuals “easily”.

  • X’ are the counterfactual values

  • C stays the same

  • Y_H are the values of Y had \(X=x'\) happened

[12]:
# 40 counterfactual values of X, plus the factual value, with C held at its observed value
I = [[x, 4.945536] for x in np.linspace(10, 20, 40)]
I = I + [[14.285582, 4.945536]]
I = pd.DataFrame(I, columns=['X', 'C']) \
    .sort_values(['X']) \
    .reset_index(drop=True)

I.assign(Y=lambda d: f_Y.predict(d) + U_Y) \
    .rename(columns={'X': "X'", 'Y': 'Y_H'})
[12]:
X' C Y_H
0 10.000000 4.945536 30.826953
1 10.256410 4.945536 30.826953
2 10.512821 4.945536 30.826953
3 10.769231 4.945536 30.826953
4 11.025641 4.945536 30.826953
5 11.282051 4.945536 30.803332
6 11.538462 4.945536 30.917718
7 11.794872 4.945536 31.154380
8 12.051282 4.945536 31.324719
9 12.307692 4.945536 31.542255
10 12.564103 4.945536 31.845835
11 12.820513 4.945536 32.753854
12 13.076923 4.945536 34.193431
13 13.333333 4.945536 34.111905
14 13.589744 4.945536 34.367379
15 13.846154 4.945536 34.759914
16 14.102564 4.945536 35.247893
17 14.285582 4.945536 35.373967
18 14.358974 4.945536 35.187050
19 14.615385 4.945536 35.411903
20 14.871795 4.945536 36.303457
21 15.128205 4.945536 36.356752
22 15.384615 4.945536 36.288716
23 15.641026 4.945536 37.023865
24 15.897436 4.945536 37.648353
25 16.153846 4.945536 37.844800
26 16.410256 4.945536 38.028266
27 16.666667 4.945536 38.577302
28 16.923077 4.945536 38.981302
29 17.179487 4.945536 39.178049
30 17.435897 4.945536 39.277327
31 17.692308 4.945536 39.297267
32 17.948718 4.945536 39.433114
33 18.205128 4.945536 39.433114
34 18.461538 4.945536 39.578524
35 18.717949 4.945536 39.578524
36 18.974359 4.945536 39.578524
37 19.230769 4.945536 39.578524
38 19.487179 4.945536 39.578524
39 19.743590 4.945536 39.578524
40 20.000000 4.945536 39.578524

17.5. Summary

It’s easy to do causal inference in the way described above when

  1. you have the causal structure, and

  2. you are operating over continuous variables.

After reading this guide, you may be wondering: how do I know that \(X\) causes \(Y\) and that \(C\) is a confounder in my data? The short answer is to either consult a subject matter expert or use a structure learning algorithm. You do not need the causal structure for associational inference, but you do for interventional and counterfactual inference.

Additionally, what happens when we have categorical variables and/or operate in probability space instead of raw value space? In this guide, we framed the SCMs in terms of raw values (e.g., \(X=10\), \(X=20\), etc.). What if we framed them in terms of probabilities instead (e.g., \(P(X=10)\), \(P(X=20)\), etc.)?