16. Frisch-Waugh-Lovell Theorem

Suppose we have a regression equation defined as follows.

\(y \sim \beta_1 x_1 + \beta_2 x_2 + \epsilon\)

The Frisch-Waugh-Lovell Theorem (FWLT) states that \(\beta_1\) can be estimated equivalently by any of the following estimators.

  • \(y \sim \beta_1 x_1 + \beta_2 x_2\)

  • \(y \sim \beta_1 r_1\) where \(m_1 = x_1 \sim x_2\) and \(r_1 = x_1 - m_1(x_2)\)

  • \(r_2 \sim \beta_1 r_1\) where \(m_2 = y \sim x_2\) and \(r_2 = y - m_2(x_2)\)

In plain words, corresponding to each of the estimators above:

  • \(\beta_1\) is estimated as usual from a regression model

  • \(\beta_1\) is estimated from regression of \(y\) on the residuals of \(x_1 \sim x_2\)

  • \(\beta_1\) is estimated from regressing the residuals of \(y \sim x_2\) on the residuals of \(x_1 \sim x_2\)

The approach of the second and third estimators is called partialling out, orthogonalization, or residualization. Partialling out removes the information in \(x_1\) that is explained by \(x_2\); equivalently, it keeps the information in \(x_1\) that is not explained by \(x_2\).
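Before verifying on real data, here is a minimal sketch of the theorem on synthetic data (the variable names and coefficients below are made up purely for illustration): the full regression and the partialled-out regression recover the same \(\beta_1\).

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(37)
n = 1_000

x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)                # x1 is correlated with x2
y_sim = 3.0 * x1 - 2.0 * x2 + rng.normal(size=n)  # true beta_1 is 3.0

# full regression: y ~ b1 * x1 + b2 * x2
full = LinearRegression().fit(np.column_stack([x1, x2]), y_sim)

# partialling out: residualize x1 on x2, then regress y on the residual
m1 = LinearRegression().fit(x2.reshape(-1, 1), x1)
r1 = x1 - m1.predict(x2.reshape(-1, 1))
partial = LinearRegression().fit(r1.reshape(-1, 1), y_sim)

full.coef_[0], partial.coef_[0]  # identical up to floating point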

Let’s now verify FWLT on real data.

16.1. Load data

We will load the diabetes data.

[22]:
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True, as_frame=True)
X.shape, y.shape
[22]:
((442, 10), (442,))

Since we will focus on bmi (body mass index), age and disease progression (target/y), let’s visualize their distributions.

[112]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(10, 3.5))

X[['bmi', 'age']].plot(kind='box', ax=ax[0])
y.plot(kind='box', ax=ax[1])

fig.tight_layout()
_images/fwl-theorem_4_0.png

Below is a heatmap of the pairwise correlations between these variables.

[116]:
import seaborn as sns

sns.heatmap(X.assign(y=y)[['bmi', 'age', 'y']].corr())
[116]:
<Axes: >
_images/fwl-theorem_6_1.png

16.2. Normal estimation

Let’s try to estimate the parameters for the following equation.

\(y \sim \beta_1 \mathrm{bmi} + \beta_2 \mathrm{age}\)

[117]:
from sklearn.linear_model import LinearRegression

m = LinearRegression(n_jobs=-1)
m.fit(X[['bmi', 'age']], y)
m.intercept_, m.coef_
[117]:
(152.13348416289614, array([924.81645876, 133.01372901]))

As you can see, for every unit increase in bmi, there is about a 925 unit increase in y, holding age constant.
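We can read this directly off the fitted model. The two rows below are hypothetical feature values (not real observations) that differ by exactly one unit of bmi.

import pandas as pd

# two hypothetical rows differing only by one unit of bmi
a = pd.DataFrame({'bmi': [0.0], 'age': [0.0]})
b = pd.DataFrame({'bmi': [1.0], 'age': [0.0]})

m.predict(b) - m.predict(a)  # equals the bmi coefficient, about 924.8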

The mean absolute error (MAE) and root mean squared error (RMSE) are shown below.

[124]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_pred = m.predict(X[['bmi', 'age']])
mean_absolute_error(y, y_pred), mean_squared_error(y, y_pred, squared=False)
[124]:
(51.96753230222509, 62.06286486063625)
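For context, it helps to compare these errors against a naive baseline that predicts the mean of \(y\) for every sample; a reasonable model should beat this baseline comfortably. A quick sketch:

import numpy as np

# naive baseline: predict the mean of y for every sample
baseline = np.full(len(y), y.mean())
mean_absolute_error(y, baseline), mean_squared_error(y, baseline, squared=False)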

The plots below give a quick visual check of the fit: predicted versus actual values on the left, and residuals against predictions on the right. Note that no hold-out validation is done here.

[132]:
import pandas as pd

fig, ax = plt.subplots(1, 2, figsize=(10, 3.5))

pd.DataFrame({'y_true': y, 'y_pred': y_pred}) \
    .sort_values(['y_true']) \
    .plot(kind='scatter', x='y_true', y='y_pred', ax=ax[0])

pd.DataFrame({'y_true': y, 'y_pred': y_pred}) \
    .assign(residual=lambda d: d['y_true'] - d['y_pred']) \
    .plot(kind='scatter', x='y_pred', y='residual', ax=ax[1])

fig.tight_layout()
_images/fwl-theorem_12_0.png

Finally, let’s look at some partial dependence plots. Individually and jointly, increases in bmi and age tend to increase disease progression.

[137]:
from sklearn.inspection import PartialDependenceDisplay

fig, ax = plt.subplots(1, 3, figsize=(10, 3.5))

PartialDependenceDisplay.from_estimator(m, X[['bmi', 'age']], ['bmi'], ax=ax[0])
PartialDependenceDisplay.from_estimator(m, X[['bmi', 'age']], ['age'], ax=ax[1])
PartialDependenceDisplay.from_estimator(m, X[['bmi', 'age']], [('bmi', 'age')], ax=ax[2])

fig.tight_layout()
_images/fwl-theorem_14_0.png

16.3. Partialling-out, orthogonalization, residualization

Back to FWLT. Denote the following.

  • \(f = \mathrm{bmi} \sim \mathrm{age}\) (\(f\) is the regression of bmi on age)

  • \(r = \mathrm{bmi} - f(\mathrm{age})\) (\(r\) is the residual)

  • \(g = y \sim r\) (\(g\) is the regression of \(y\) on the residual)

The coefficient associated with \(r\) in the regression \(g\) should equal the bmi coefficient from the full regression.

[138]:
f = LinearRegression(n_jobs=-1)
f.fit(X[['age']], X['bmi'])

r = X['bmi'] - f.predict(X[['age']])
[139]:
g = LinearRegression(n_jobs=-1)
g.fit(r.to_frame(), y)
g.coef_
[139]:
array([924.81645876])

The last way to estimate \(\beta_1\) is as follows.

  • \(h = y \sim \mathrm{age}\) (\(h\) is the regression of \(y\) on age)

  • \(s = y - h(\mathrm{age})\) (\(s\) is the residual)

  • \(i = s \sim r\) (\(i\) is the regression of the residuals \(s\) on the residuals \(r\))

Again, the coefficient associated with \(r\) in this last regression equals the bmi coefficient from the full regression.

[140]:
h = LinearRegression(n_jobs=-1)
h.fit(X[['age']], y)

s = y - h.predict(X[['age']])
[141]:
i = LinearRegression(n_jobs=-1)
i.fit(r.to_frame(), s)
i.coef_
[141]:
array([924.81645876])
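As a programmatic check, the three estimates of \(\beta_1\) agree (reusing m, g and i from above).

import numpy as np

# the full regression, y ~ r and s ~ r all yield the same bmi coefficient
np.isclose(m.coef_[0], g.coef_[0]), np.isclose(m.coef_[0], i.coef_[0])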

Out of curiosity, let’s look at a scatter plot of the residuals \(s\) and \(r\).

[45]:
pd.DataFrame({'r': r, 's': s}) \
    .plot(kind='scatter', x='r', y='s')
[45]:
<Axes: xlabel='r', ylabel='s'>
_images/fwl-theorem_22_1.png

The term partialling out refers to the removal of level and trend (the linear fit) from the variable of interest. The term orthogonalization reflects the geometric interpretation of what’s happening: we are reprojecting onto an axis orthogonal to the control variable(s). The residuals hold all the information about the dependent variable that is not explained by the independent variable(s). Residualization simply refers to the use of residuals in the estimation.
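One way to see the “orthogonal” in orthogonalization: by construction, the residual \(r\) is uncorrelated with age. A quick check, reusing \(r\) and X from above:

import numpy as np

# OLS residuals are orthogonal to the regressors, so this is ~0
np.corrcoef(r, X['age'])[0, 1]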

16.4. Multiple controls

FWLT extends to multiple controls. Denote the following.

  • \(m = y \sim \mathrm{bmi} + \mathrm{age} + \mathrm{sex} + \mathrm{bp} + \mathrm{s1} + \mathrm{s2} + \mathrm{s3} + \mathrm{s4} + \mathrm{s5} + \mathrm{s6}\)

  • \(m_1 = \mathrm{bmi} \sim \mathrm{age} + \mathrm{sex} + \mathrm{bp} + \mathrm{s1} + \mathrm{s2} + \mathrm{s3} + \mathrm{s4} + \mathrm{s5} + \mathrm{s6}\)

  • \(m_2 = y \sim \mathrm{age} + \mathrm{sex} + \mathrm{bp} + \mathrm{s1} + \mathrm{s2} + \mathrm{s3} + \mathrm{s4} + \mathrm{s5} + \mathrm{s6}\)

  • \(r_1 = \mathrm{bmi} - m_1(\mathrm{age}, \mathrm{sex}, \mathrm{bp}, \mathrm{s1}, \mathrm{s2}, \mathrm{s3}, \mathrm{s4}, \mathrm{s5}, \mathrm{s6})\)

  • \(r_2 = y - m_2(\mathrm{age}, \mathrm{sex}, \mathrm{bp}, \mathrm{s1}, \mathrm{s2}, \mathrm{s3}, \mathrm{s4}, \mathrm{s5}, \mathrm{s6})\)

  • \(m_3 = r_2 \sim r_1\)

The coefficient associated with bmi in \(m\) is equivalent to the coefficient associated with \(r_1\) in \(m_3\).

[142]:
m1 = LinearRegression()
m2 = LinearRegression()
m3 = LinearRegression()

m1.fit(X.drop(columns=['bmi']), X['bmi'])
m2.fit(X.drop(columns=['bmi']), y)

r1 = X['bmi'] - m1.predict(X.drop(columns=['bmi']))
r2 = y - m2.predict(X.drop(columns=['bmi']))

m3.fit(r1.to_frame(), r2)
m3.coef_
[142]:
array([519.84592005])
[143]:
m = LinearRegression()
m.fit(X, y)
pd.Series(m.coef_, X.columns)
[143]:
age    -10.009866
sex   -239.815644
bmi    519.845920
bp     324.384646
s1    -792.175639
s2     476.739021
s3     101.043268
s4     177.063238
s5     751.273700
s6      67.626692
dtype: float64
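A quick check that the two estimates agree:

import numpy as np

# the partialled-out estimate equals the full model's bmi coefficient
np.isclose(m3.coef_[0], pd.Series(m.coef_, index=X.columns)['bmi'])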

16.5. Non-linear estimators

We do not have to use linear estimators for \(m_1\) and \(m_2\); below, we use random forest models instead. The final estimate from the non-linear estimators is slightly lower than the one from the linear estimators (508.3 versus 519.8).

[144]:
from sklearn.ensemble import RandomForestRegressor

m1 = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
m2 = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
m3 = LinearRegression()

m1.fit(X.drop(columns=['bmi']), X['bmi'])
m2.fit(X.drop(columns=['bmi']), y)

r1 = X['bmi'] - m1.predict(X.drop(columns=['bmi']))
r2 = y - m2.predict(X.drop(columns=['bmi']))

m3.fit(r1.to_frame(), r2)
m3.coef_
[144]:
array([508.34043333])

16.6. Sample-splitting

When \(m_1\) and \(m_2\) overfit, the estimate will be biased. To combat this bias, sample-splitting is applied. In sample-splitting, we repeat the following many times:

  • we split the data into training and testing,

  • learn the estimators \(m_1\) and \(m_2\) from the training data,

  • generate the residuals off the testing data, and

  • regress the residuals to find the estimation (coefficient).

The sample-splitting estimates are then averaged to produce the final estimate. Below, we do this for all variables.

[255]:
import numpy as np

from sklearn.model_selection import train_test_split

def get_estimate(d_col='bmi', random_state=37):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=True, test_size=0.5, random_state=random_state)

    X_tr, d_tr = X_tr.drop(columns=[d_col]), X_tr[d_col]
    X_te, d_te = X_te.drop(columns=[d_col]), X_te[d_col]

    m1 = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
    m2 = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
    m3 = LinearRegression()

    m1.fit(X_tr, d_tr)
    m2.fit(X_tr, y_tr)

    r1 = d_te - m1.predict(X_te)
    r2 = y_te - m2.predict(X_te)

    m3.fit(r1.to_frame(), r2)

    return m3.coef_[0]

def estimate(d_col='bmi', max_iter=200):
    estimates = [get_estimate(d_col, i) for i in range(max_iter)]
    return {
        'variable': d_col,
        'estimate': np.mean(estimates),
        'std': np.std(estimates)
    }

pd.DataFrame([estimate(c) for c in X.columns])
[255]:
   variable     estimate         std
0       age     2.107160   69.651271
1       sex  -205.918395   73.585760
2       bmi   495.455222   81.473690
3        bp   265.241904   71.549371
4        s1  -134.198358  313.943890
5        s2   -52.306987  286.652365
6        s3  -236.446231  163.779218
7        s4    97.210133  257.698259
8        s5   515.953055  113.054241
9        s6    83.226874   75.599235

16.7. Sample-splitting and cross-fitting

We can combine sample-splitting with cross-fitting to produce an unbiased estimate. In cross-fitting, we split the data into two folds, \(a\) and \(b\). We train a model on \(a\) and estimate the coefficient on \(b\); we also train a model on \(b\) and estimate the coefficient on \(a\). The average of these two coefficients is the final estimate.

[257]:
def cross_fit(X_tr, d_tr, y_tr, X_te, d_te, y_te):
    m1 = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
    m2 = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
    m3 = LinearRegression()

    m1.fit(X_tr, d_tr)
    m2.fit(X_tr, y_tr)

    r1 = d_te - m1.predict(X_te)
    r2 = y_te - m2.predict(X_te)

    m3.fit(r1.to_frame(), r2)

    return m3.coef_[0]

def split_fit(d_col='bmi', random_state=37):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=True, test_size=0.5, random_state=random_state)

    X_tr, d_tr = X_tr.drop(columns=[d_col]), X_tr[d_col]
    X_te, d_te = X_te.drop(columns=[d_col]), X_te[d_col]

    lhs = cross_fit(X_tr, d_tr, y_tr, X_te, d_te, y_te)
    rhs = cross_fit(X_te, d_te, y_te, X_tr, d_tr, y_tr)

    return np.mean([lhs, rhs])

def estimate(d_col='bmi', max_iter=100):
    estimates = [split_fit(d_col, i) for i in range(max_iter)]
    return {
        'variable': d_col,
        'estimate': np.mean(estimates),
        'std': np.std(estimates)
    }

pd.DataFrame([estimate(c) for c in X.columns])
[257]:
   variable     estimate         std
0       age     6.990086   40.040529
1       sex  -208.788784   44.207837
2       bmi   491.990834   45.080000
3        bp   262.765973   37.226123
4        s1  -110.913374  154.213316
5        s2   -78.543704  126.827543
6        s3  -233.844736   87.138051
7        s4   117.667220  128.956083
8        s5   519.283738   67.865800
9        s6    75.133127   41.054708

16.8. Double machine learning

FWLT has implications for causal inference in terms of estimating average treatment effect (ATE). Denote the following.

  • \(y\) the outcome variable

  • \(d\) the treatment variable (the variable that we want to estimate ATE for)

  • \(X\) the set of confounding variables

Then, denote the following models.

  • \(m_1 = d \sim X\)

  • \(m_2 = y \sim X\)

Denote the corresponding residuals to the models.

  • \(r_1 = d - m_1(X)\)

  • \(r_2 = y - m_2(X)\)

Finally, define a linear model as follows.

  • \(r_2 \sim \beta_1 r_1\)

\(\beta_1\) is the ATE; this is exactly the partialling-out estimator we computed manually in the previous two sections.

Double machine learning (DML) generalizes the FWLT approach with many benefits. A key benefit is correcting for regularization and overfitting biases; the former is addressed through orthogonalization (FWLT) and the latter through sample-splitting and cross-fitting. Let’s see how DML estimates the ATE for each of the variables.

16.8.1. DoubleML

Here are the ATE estimates using DoubleML.

[259]:
from doubleml import DoubleMLData
from doubleml import DoubleMLPLR
from sklearn.base import clone

def get_model(d_col):
    np.random.seed(3141)
    data = DoubleMLData(
        X.assign(y=y),
        y_col='y',
        d_cols=d_col,
        x_cols=list(X.columns.drop([d_col]))
    )

    learner = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
    ml_l = clone(learner)
    ml_m = clone(learner)

    m = DoubleMLPLR(data, ml_l, ml_m, n_folds=5, n_rep=1, score='partialling out', dml_procedure='dml2')
    m.fit(store_models=True)

    return m

def get_estimate(d_col):
    m = get_model(d_col)
    return m.summary

pd.concat([get_estimate(c) for c in X.columns])
[259]:
            coef     std err          t         P>|t|        2.5 %      97.5 %
age    -8.725229   59.042732  -0.147778  8.825178e-01  -124.446858  106.996400
sex  -198.485390   59.895497  -3.313862  9.201703e-04  -315.878408  -81.092373
bmi   448.273300   75.330901   5.950723  2.669611e-09   300.627447  595.919152
bp    271.593295   64.953301   4.181362  2.897675e-05   144.287163  398.899426
s1   -237.891157  237.396920  -1.002082  3.163040e-01  -703.180571  227.398256
s2   -115.889975  245.066520  -0.472892  6.362903e-01  -596.211528  364.431579
s3   -123.176182  150.923436  -0.816150  4.144143e-01  -418.980681  172.628317
s4     54.571481  218.437700   0.249826  8.027217e-01  -373.558543  482.701505
s5    507.240573  101.580684   4.993475  5.930256e-07   308.146091  706.335055
s6    101.971064   65.382567   1.559606  1.188529e-01   -26.176413  230.118540

16.8.2. EconML

Here are the ATE estimates using EconML. Note that the final model here is a LassoCV, whose L1 penalty can shrink small effects exactly to zero, which is why several coefficients below are zero (or numerically zero).

[262]:
from econml.dml import DML
from sklearn.linear_model import LassoCV

def get_model():
    return DML(
        model_y=RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10),
        model_t=RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10),
        model_final=LassoCV(fit_intercept=False, n_jobs=-1, random_state=37),
        random_state=37
    )

def get_estimate(T):
    m = get_model()
    m.fit(Y=y, T=X[T], X=None, W=X.drop(columns=[T]))

    return {
        'variable': T,
        'coef': m.effect()[0]
    }

pd.DataFrame([get_estimate(c) for c in X.columns])
[262]:
   variable           coef
0       age   0.000000e+00
1       sex  -1.090364e+02
2       bmi   5.110700e+02
3        bp   1.867952e+02
4        s1  -9.176047e-15
5        s2   0.000000e+00
6        s3  -3.120313e+02
7        s4   2.140634e+02
8        s5   5.167653e+02
9        s6   0.000000e+00