# 17. Pearl Causal Hierarchy

Dr. Judea Pearl is credited with organizing causal inference into a hierarchy, called the Pearl Causal Hierarchy (PCH) in his honor. The PCH is sometimes viewed as a ladder of causal inference. In the PCH, the types of inference, from “easiest” to “hardest,” are as follows.

Associational Inference (Rung 1): This type of inference includes marginal, correlational or conditional queries.

For example, what is the probability of cancer given that a person smokes?

This type of inference is a “mere filtering of the world.”

Interventional Inference (Rung 2): This type of inference involves manipulating the system/model.

For example, what is the effect of smoking on having cancer versus the effect of not smoking on having cancer?

This type of inference involves “changing the world.”

Counterfactual Inference (Rung 3): This type of inference asks about outcomes in a world that did not happen, given information about the world that did happen.

For example, given that a person smoked and had cancer, would that person have cancer had they not smoked?

This type of inference involves sharing information between a “factual world” and a “counterfactual world” and estimating outcomes in the hypothetical world. It’s interesting to note that both worlds are irrefutable (the factual world did indeed happen, and there’s no way to refute the counterfactual world). (On another note, is counterfactual inference pseudo-science through the lens of Karl Popper’s scientific philosophy?)

Let’s generate some data to demonstrate PCH.

## 17.1. Generate data

We will generate data as follows.

\(C \sim \mathcal{N}(5, 1)\)

\(X \sim \mathcal{N}(10 + C, 1)\)

\(Y \sim \mathcal{N}(4 + 2 C + 1.5 X, 1)\)

Clearly, this data generation is a causal model with

\(C\) as a confounder of \(X\) and \(Y\) and

\(X\) is also a cause of \(Y\).

```
[1]:
```

```
import pandas as pd
import numpy as np

np.random.seed(37)

N = 1_000

# C is the confounder; X depends on C; Y depends on both C and X
Xy = pd.DataFrame({'C': np.random.normal(5, 1, N)}) \
    .assign(X=lambda d: np.random.normal(10 + d['C'], 1)) \
    .assign(Y=lambda d: np.random.normal(4 + 2 * d['C'] + 1.5 * d['X'])) \
    [['X', 'C', 'Y']]

Xy
```

```
[1]:
```

|  | X | C | Y |
|---|---|---|---|
| 0 | 14.285582 | 4.945536 | 35.373967 |
| 1 | 15.601920 | 5.674308 | 37.632544 |
| 2 | 14.358371 | 5.346647 | 36.012563 |
| 3 | 14.210090 | 3.699654 | 30.766832 |
| 4 | 16.961011 | 6.518512 | 43.044552 |
| ... | ... | ... | ... |
| 995 | 14.177629 | 4.401151 | 34.393250 |
| 996 | 13.753148 | 4.748119 | 32.859676 |
| 997 | 15.704895 | 5.371860 | 37.839380 |
| 998 | 14.997756 | 6.022343 | 38.169453 |
| 999 | 15.674885 | 5.429555 | 38.106253 |

1000 rows × 3 columns
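
As a quick sanity check (an addition to the walkthrough), the sample means should sit near the theoretical means implied by the generating model: \(E[C] = 5\), \(E[X] = 10 + E[C] = 15\), and \(E[Y] = 4 + 2(5) + 1.5(15) = 36.5\).

```
# sanity check: the sample means should be close to the theoretical means
# implied by the generating model: E[C] = 5, E[X] = 15, E[Y] = 36.5
Xy.mean()
```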

## 17.2. Associational inference

Associational inference is the most common type of inference in data science (machine learning, artificial intelligence). A “predictive model,” such as a regression model, is associational in nature: given some \(X\), what is \(Y\)? Below, a random forest regressor serves as an example of a model used for associational/predictive inference.

```
[2]:
```

```
from sklearn.ensemble import RandomForestRegressor

# fit an associational (predictive) model of Y given X and C
X = Xy[['X', 'C']]
y = Xy['Y']

model = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=37) \
    .fit(X, y)
```

```
[3]:
```

```
from sklearn.metrics import mean_absolute_error, explained_variance_score

# evaluate the predictive performance on the training data
y_pred = model.predict(X)

pd.Series({
    'mae': mean_absolute_error(y, y_pred),
    'evs': explained_variance_score(y, y_pred)
})
```

```
[3]:
```

```
mae 0.323627
evs 0.987774
dtype: float64
```

## 17.3. Interventional inference

Interventional inference is formalized by Pearl’s `do-calculus`. In interventional inference, we are interested in the causal impact of doing something. If we estimate impact by associational means, we are prone to biased estimates due to confounding variables; the short check below illustrates this.
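
The sketch below is an addition for illustration: it compares a naive linear regression of \(Y\) on \(X\) alone against one that also adjusts for \(C\). Under this data-generating process, the naive slope is inflated (to roughly 2.5) by the confounder, while the adjusted slope recovers the true coefficient of 1.5.

```
# Illustration (not part of the original analysis): a naive regression of
# Y on X ignores the confounder C and overstates the effect of X on Y.
from sklearn.linear_model import LinearRegression

naive = LinearRegression().fit(Xy[['X']], Xy['Y'])          # ignores C
adjusted = LinearRegression().fit(Xy[['X', 'C']], Xy['Y'])  # controls for C

print(naive.coef_[0])     # roughly 2.5; biased upward by confounding
print(adjusted.coef_[0])  # roughly 1.5; near the true causal coefficient
```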

The example below does not use Pearl’s do-calculus to estimate the causal effect, but rather a technique called `DoubleML`. In Pearl’s do-calculus, we identify confounders and block pathways between what we are manipulating and the outcome variable to estimate the Average Causal Effect (ACE). In DoubleML, Partial Linear Regression (PLR) is used to estimate the ACE. The PLR is denoted as follows.

\(Y = D \theta_0 + g_0(X) + \epsilon_Y\)

Where,

\(Y\) is the outcome variable (effect),

\(D\) is the policy variable (cause),

\(D = m_0(X) + \epsilon_D\)

\(X\) is the set of confounding variables,

\(\theta_0\) is the causal impact (ACE) of \(D\) on \(Y\),

\(g_0(X)\) is a “machine learning” model (mapping \(X\) to \(Y\)),

\(m_0(X)\) is a “machine learning” model (mapping \(X\) to \(D\)), and

\(\epsilon_Y\) and \(\epsilon_D\) are error terms in the corresponding models.

Note that we do not care about predicting \(Y\) per se, but only about \(\theta_0\), which is the causal impact of \(D\) on \(Y\).
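
Before handing things over to the `doubleml` package, it may help to see the partialling-out idea behind PLR in miniature. The sketch below is an addition, not the package’s implementation: it residualizes the outcome and the policy variable on the confounder with cross-fitted random forests, then takes the residual-on-residual slope as an estimate of \(\theta_0\). Note the naming collision: in the PLR notation, \(X\) is the confounder set, which in our data is the column `C`, while the policy variable \(D\) is the column `X`.

```
# A minimal sketch of the partialling-out idea behind PLR (an illustration,
# not the DoubleML implementation). Here the policy variable D is the data
# column 'X' and the confounder set is the data column 'C'.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rf = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=37)

g_hat = cross_val_predict(rf, Xy[['C']], Xy['Y'], cv=5)  # estimate E[Y | C]
m_hat = cross_val_predict(rf, Xy[['C']], Xy['X'], cv=5)  # estimate E[D | C]

eps_Y = Xy['Y'] - g_hat  # part of Y not explained by the confounder
eps_D = Xy['X'] - m_hat  # part of D not explained by the confounder

# residual-on-residual slope; should land near the DoubleML estimate below
theta_hat = (eps_D * eps_Y).sum() / (eps_D ** 2).sum()
theta_hat
```

DoubleML implements this recipe with proper cross-fitting and valid standard errors, as shown next.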

```
[4]:
```

```
from doubleml import DoubleMLData

# declare variable roles: Y is the outcome, X is the policy/cause, C is the confounder
data = DoubleMLData(Xy, y_col='Y', d_cols='X', x_cols=['C'])
```

```
[5]:
```

```
from sklearn.base import clone
from doubleml import DoubleMLPLR

# ml_l learns the outcome given the confounder, E[Y | C];
# ml_m learns the policy variable given the confounder, E[X | C]
learner = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=37)
ml_l = clone(learner)
ml_m = clone(learner)

model = DoubleMLPLR(data, ml_l, ml_m)
model.fit()
```

```
[5]:
```

```
<doubleml.double_ml_plr.DoubleMLPLR at 0x309c80490>
```

Using DoubleML, the causal impact of \(X\) on \(Y\), controlling for confounding, is 1.44. In other words, the ACE of \(X\) on \(Y\) is 1.44, which is pretty close to the true value (1.5).

```
[6]:
```

```
model.summary
```

```
[6]:
```

|  | coef | std err | t | P>\|t\| | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|
| X | 1.444011 | 0.030134 | 47.919877 | 0.0 | 1.38495 | 1.503073 |

## 17.4. Counterfactual inference

Now we are ready for counterfactual inference. Consider the following structural causal models (SCMs).

\(C := U_C\)

\(X := f_X(C) + U_X\)

\(Y := f_Y(X, C) + U_Y\)

Where,

\(C\) is the confounder,

\(X\) is the cause,

\(Y\) is the effect,

\(f_X(C)\) is some function that maps \(C\) to \(X\),

\(f_Y(X, C)\) is some function that maps \(X\) and \(C\) to \(Y\), and

\(U=\{U_C, U_X, U_Y\}\) are the (hidden) error terms.

Then we can invert the SCMs to define \(U\) as follows.

\(U_C = C\)

\(U_X = X - f_X(C)\)

\(U_Y = Y - f_Y(X, C)\)
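
Since we simulated the data ourselves, we know the true functions, \(f_X(C) = 10 + C\) and \(f_Y(X, C) = 4 + 2C + 1.5X\). As a quick illustration (an addition to the walkthrough), we can invert the SCMs with these known functions to recover the noise terms for the first observation.

```
# Illustration using the known, true simulation functions: invert the SCMs
# to recover the noise terms U for the first observation.
row = Xy.iloc[0]

U_C = row['C']                                        # C := U_C
U_X = row['X'] - (10 + row['C'])                      # U_X = X - f_X(C)
U_Y = row['Y'] - (4 + 2 * row['C'] + 1.5 * row['X'])  # U_Y = Y - f_Y(X, C)

pd.Series({'U_C': U_C, 'U_X': U_X, 'U_Y': U_Y})
```

In practice, we do not know \(f_X\) and \(f_Y\), so below we learn \(f_Y\) from the data instead.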

Counterfactual inference is done in 3 steps.

Abduction: Given \(E=e\) (the evidence or the facts), compute \(U\).

Intervention: Set each \(V \in \{X, C\}\) to its value under the hypothetical; in particular, replace the factual \(X=x\) with the counterfactual \(X=x'\).

Prediction: Estimate \(Y := f_Y(X, C) + U_Y\) using the results from abduction and intervention.

In counterfactual inference, everything is observed (the factual). That “everything is observed” is the evidence \(E=e\). For example, the observed factual evidence is

\(X=x, C=c, Y=y\).

The counterfactual is the hypothetical alternative, \(X=x'\). We want to know, given the factual, what \(Y\) would have been had the hypothetical/counterfactual happened. That is, we want to estimate the following.

\(Y_{X=x'}(U_C, U_X, U_Y)\)

We cannot stress enough that the factual/evidence \(X=x\) is different from the counterfactual/hypothetical \(X=x'\).

Let’s see how counterfactual reasoning works. Let’s say we observe the following (it’s just the first example from the simulated data).

\(X=14.285582, C=4.945536, Y=35.373967\)

```
[7]:
```

```
Xy.iloc[[0]]
```

```
[7]:
```

|  | X | C | Y |
|---|---|---|---|
| 0 | 14.285582 | 4.945536 | 35.373967 |

In this example, we ask the following counterfactual question.

Given, \(X=14.285582, C=4.945536, Y=35.373967\), what would the value of \(Y\) be had \(X=0\)?

Let’s see how we can use SCMs to answer this counterfactual question. First, we need to estimate \(U_Y\). Before we can estimate \(U_Y\), we need to estimate \(f_Y\). We can learn \(f_Y\) as follows.

```
[8]:
```

```
# learn f_Y, the function mapping X and C to Y
X = Xy[['X', 'C']]
y = Xy['Y']

f_Y = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=37) \
    .fit(X, y)
```

### 17.4.1. Abduction

Remember, \(U_Y\) is just the difference between what we observed and what is predicted. It is the error.

```
[9]:
```

```
# abduction: U_Y is the gap between the observed Y and the model's prediction
y_true = Xy.iloc[0]['Y']
y_pred = f_Y.predict(Xy.iloc[[0]].drop(columns=['Y']))[0]

U_Y = y_true - y_pred

pd.Series({
    'y_true': y_true,
    'y_pred': y_pred,
    'U_Y': U_Y
})
```

```
[9]:
```

```
y_true 35.373967
y_pred 35.556930
U_Y -0.182963
dtype: float64
```

### 17.4.2. Intervention

Intervention is formally described as modifying the (causal) graph (removing the edges from the parents of the nodes we intervene on). But we can also view it as overriding the evidence with the counterfactual/hypothetical. Below,

the factual is \(X=x=14.285582\), and

the counterfactual is \(X=x'=0\).

```
[10]:
```

```
# override the factual X with the counterfactual X = 0; C keeps its factual value
I = pd.DataFrame([[0, 4.945536]], columns=['X', 'C'])
I
```

```
[10]:
```

|  | X | C |
|---|---|---|
| 0 | 0 | 4.945536 |

### 17.4.3. Prediction

Now we can return to the SCM, \(Y := f_Y(X, C) + U_Y\), and conduct the counterfactual inference.

```
[11]:
```

```
# prediction: apply the SCM, Y := f_Y(X, C) + U_Y, under the intervention
f_Y.predict(I) + U_Y
```

```
[11]:
```

```
array([30.82695271])
```

Let’s loop back to the counterfactual question.

Given, \(X=14.285582, C=4.945536, Y=35.373967\), what would the value of \(Y\) be had \(X=0\)?

The answer is as follows.

Given, \(X=14.285582, C=4.945536, Y=35.373967\), \(Y\) would have been 30.82695271 had \(X=0\).

The key to SCMs and counterfactual inference is in the error terms \(U\). The error terms are the carriers of information from the factual world to the counterfactual world.

Dr. Pearl says “counterfactual is easy”. Below, we compute 41 counterfactuals “easily”.

X’ holds the counterfactual values,

C stays the same, and

Y_H holds the values of \(Y\) had \(X=x'\) happened.

```
[12]:
```

```
# 40 counterfactual values of X on a grid from 10 to 20, plus the factual X
I = [[x, 4.945536] for x in np.linspace(10, 20, 40)]
I = I + [[14.285582, 4.945536]]

I = pd.DataFrame(I, columns=['X', 'C']) \
    .sort_values(['X']) \
    .reset_index(drop=True)

# counterfactual predictions: Y_H = f_Y(X', C) + U_Y
I.assign(Y=lambda d: f_Y.predict(d) + U_Y) \
    .rename(columns={'X': "X'", 'Y': 'Y_H'})
```

```
[12]:
```

|  | X' | C | Y_H |
|---|---|---|---|
| 0 | 10.000000 | 4.945536 | 30.826953 |
| 1 | 10.256410 | 4.945536 | 30.826953 |
| 2 | 10.512821 | 4.945536 | 30.826953 |
| 3 | 10.769231 | 4.945536 | 30.826953 |
| 4 | 11.025641 | 4.945536 | 30.826953 |
| 5 | 11.282051 | 4.945536 | 30.803332 |
| 6 | 11.538462 | 4.945536 | 30.917718 |
| 7 | 11.794872 | 4.945536 | 31.154380 |
| 8 | 12.051282 | 4.945536 | 31.324719 |
| 9 | 12.307692 | 4.945536 | 31.542255 |
| 10 | 12.564103 | 4.945536 | 31.845835 |
| 11 | 12.820513 | 4.945536 | 32.753854 |
| 12 | 13.076923 | 4.945536 | 34.193431 |
| 13 | 13.333333 | 4.945536 | 34.111905 |
| 14 | 13.589744 | 4.945536 | 34.367379 |
| 15 | 13.846154 | 4.945536 | 34.759914 |
| 16 | 14.102564 | 4.945536 | 35.247893 |
| 17 | 14.285582 | 4.945536 | 35.373967 |
| 18 | 14.358974 | 4.945536 | 35.187050 |
| 19 | 14.615385 | 4.945536 | 35.411903 |
| 20 | 14.871795 | 4.945536 | 36.303457 |
| 21 | 15.128205 | 4.945536 | 36.356752 |
| 22 | 15.384615 | 4.945536 | 36.288716 |
| 23 | 15.641026 | 4.945536 | 37.023865 |
| 24 | 15.897436 | 4.945536 | 37.648353 |
| 25 | 16.153846 | 4.945536 | 37.844800 |
| 26 | 16.410256 | 4.945536 | 38.028266 |
| 27 | 16.666667 | 4.945536 | 38.577302 |
| 28 | 16.923077 | 4.945536 | 38.981302 |
| 29 | 17.179487 | 4.945536 | 39.178049 |
| 30 | 17.435897 | 4.945536 | 39.277327 |
| 31 | 17.692308 | 4.945536 | 39.297267 |
| 32 | 17.948718 | 4.945536 | 39.433114 |
| 33 | 18.205128 | 4.945536 | 39.433114 |
| 34 | 18.461538 | 4.945536 | 39.578524 |
| 35 | 18.717949 | 4.945536 | 39.578524 |
| 36 | 18.974359 | 4.945536 | 39.578524 |
| 37 | 19.230769 | 4.945536 | 39.578524 |
| 38 | 19.487179 | 4.945536 | 39.578524 |
| 39 | 19.743590 | 4.945536 | 39.578524 |
| 40 | 20.000000 | 4.945536 | 39.578524 |

## 17.5. Summary

It’s easy to do causal inference in the way described above when

you have the causal structure, and

you are operating over continuous variables.

After reading this guide, you should be wondering: how do I know that \(X\) causes \(Y\) and that \(C\) is a confounder for my data? The short answer is to either consult a subject matter expert or use a structure learning algorithm (a sketch of the latter is shown below). You do not need the causal structure for associational inference, but for interventional and counterfactual inference, you do.
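
For instance, assuming the `causal-learn` package is installed (this sketch is an addition, not part of the original walkthrough), the PC algorithm attempts to recover the causal structure from the data alone.

```
# A hedged sketch, assuming the causal-learn package (pip install causal-learn):
# the PC algorithm tries to recover the causal graph from observational data.
from causallearn.search.ConstraintBased.PC import pc

cg = pc(Xy.to_numpy())  # columns are X, C, Y, in that order
print(cg.G)             # the estimated graph over the three variables
```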

Additionally, what happens when we have categorical variables and/or operate in probability space instead of raw-value space? Here, we framed the SCMs in terms of raw values (e.g., \(X=10\), \(X=20\)). What if we framed them in terms of probabilities instead (e.g., \(P(X=10)\), \(P(X=20)\))?