16. Frisch-Waugh-Lovell Theorem
Suppose we have a regression equation defined as follows.
\(y \sim \beta_1 x_1 + \beta_2 x_2 + \epsilon\)
The Frisch-Waugh-Lovell Theorem (FWLT) states that \(\beta_1\) can be estimated equivalently by any of the following estimators.
\(y \sim \beta_1 x_1 + \beta_2 x_2\)
\(y \sim \beta_1 r_1\) where \(m_1 = x_1 \sim x_2\) and \(r_1 = x_1 - m_1(x_2)\)
\(r_2 \sim \beta_1 r_1\) where \(m_2 = y \sim x_2\) and \(r_2 = y - m_2(x_2)\)
In plain words, corresponding to each of the estimators above:
\(\beta_1\) is estimated as usual from a regression model
\(\beta_1\) is estimated from regression of \(y\) on the residuals of \(x_1 \sim x_2\)
\(\beta_1\) is estimated from regressing the residuals of \(y \sim x_2\) on the residuals of \(x_1 \sim x_2\)
The approach of the second estimator is called partialling-out, orthogonalization or residualization. Partialling-out removes the information in \(x_1\) that is explained by \(x_2\); or, equivalently, partialling-out keeps the information in \(x_1\) that is not explained by \(x_2\).
Let’s verify FWLT.
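Before turning to real data, here is a minimal synthetic sketch of the three estimators using plain numpy. The data-generating process and the helper functions fit and resid below are illustrative assumptions, not part of the diabetes example that follows.
import numpy as np
rng = np.random.default_rng(37)
n = 1_000
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)                 # x1 is correlated with x2
y_sim = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)   # true beta_1 = 2
def fit(X, y):
    # OLS coefficients (intercept first) via least squares
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]
def resid(X, y):
    # residuals of y regressed on X (with an intercept)
    b = fit(X, y)
    return y - b[0] - np.reshape(X, (len(y), -1)) @ b[1:]
r1 = resid(x2, x1)      # part of x1 not explained by x2
r2 = resid(x2, y_sim)   # part of y not explained by x2
b_full = fit(np.column_stack([x1, x2]), y_sim)[1]  # full regression
b_part = fit(r1, y_sim)[1]                         # y ~ r1
b_fwl = fit(r1, r2)[1]                             # r2 ~ r1
b_full, b_part, b_fwl  # all three agree (up to floating point)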
16.1. Load data
We will load the diabetes data.
[22]:
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True, as_frame=True)
X.shape, y.shape
[22]:
((442, 10), (442,))
Since we will focus on bmi (body mass index), age and disease progression (the target, y), let’s visualize their distributions.
[112]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(10, 3.5))
X[['bmi', 'age']].plot(kind='box', ax=ax[0])
y.plot(kind='box', ax=ax[1])
fig.tight_layout()
Below is a heatmap of the pairwise correlations between these variables.
[116]:
import seaborn as sns
sns.heatmap(X.assign(y=y)[['bmi', 'age', 'y']].corr())
[116]:
<Axes: >
16.2. Normal estimation
Let’s try to estimate the parameters for the following equation.
\(y \sim \beta_1 \mathrm{bmi} + \beta_2 \mathrm{age}\)
[117]:
from sklearn.linear_model import LinearRegression
m = LinearRegression(n_jobs=-1)
m.fit(X[['bmi', 'age']], y)
m.intercept_, m.coef_
[117]:
(152.13348416289614, array([924.81645876, 133.01372901]))
As you can see, for every unit increase in bmi, there is about a 925 unit increase in y (keep in mind that the scikit-learn diabetes features are mean-centered and scaled, so a unit here is not a raw BMI point).
The mean absolute error (MAE) and root mean squared error (RMSE) are shown below.
[124]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
y_pred = m.predict(X[['bmi', 'age']])
mean_absolute_error(y, y_pred), mean_squared_error(y, y_pred, squared=False)
[124]:
(51.96753230222509, 62.06286486063625)
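As a side note, newer versions of scikit-learn (1.4 and later) deprecate the squared= argument of mean_squared_error in favor of a dedicated function; if your installed version has it, the call below is equivalent (this is an assumption about your environment, not a requirement of the example).
from sklearn.metrics import root_mean_squared_error  # available in scikit-learn >= 1.4
root_mean_squared_error(y, y_pred)  # same value as mean_squared_error(y, y_pred, squared=False)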
Without a validation split, let’s inspect the fit by plotting the predicted values against the actual values, and the residuals against the predictions.
[132]:
import pandas as pd
fig, ax = plt.subplots(1, 2, figsize=(10, 3.5))
# predicted vs actual
pd.DataFrame({'y_true': y, 'y_pred': y_pred}) \
    .sort_values(['y_true']) \
    .plot(kind='scatter', x='y_true', y='y_pred', ax=ax[0])
# residuals vs predictions
pd.DataFrame({'y_true': y, 'y_pred': y_pred}) \
    .assign(residual=lambda d: d['y_true'] - d['y_pred']) \
    .plot(kind='scatter', x='y_pred', y='residual', ax=ax[1])
fig.tight_layout()
Finally, let’s look at some partial dependence plots. Individually and together, increases in bmi and age tend to increase disease progression.
[137]:
from sklearn.inspection import PartialDependenceDisplay
fig, ax = plt.subplots(1, 3, figsize=(10, 3.5))
PartialDependenceDisplay.from_estimator(m, X[['bmi', 'age']], ['bmi'], ax=ax[0])
PartialDependenceDisplay.from_estimator(m, X[['bmi', 'age']], ['age'], ax=ax[1])
PartialDependenceDisplay.from_estimator(m, X[['bmi', 'age']], [('bmi', 'age')], ax=ax[2])
fig.tight_layout()
16.3. Partialling-out, orthogonalization, residualization
Back to FWLT. Denote the following.
\(f = \mathrm{bmi} \sim \mathrm{age}\) (\(f\) is the regression of bmi on age)
\(r = \mathrm{bmi} - f(\mathrm{age})\) (\(r\) is the residual)
\(g = y \sim r\) (\(g\) is the regression of \(y\) on the residual)
The coefficient associated with \(r\) in the last regression should have the same value as the bmi coefficient in the full regression.
[138]:
f = LinearRegression(n_jobs=-1)
f.fit(X[['age']], X['bmi'])
r = X['bmi'] - f.predict(X[['age']])
[139]:
g = LinearRegression(n_jobs=-1)
g.fit(r.to_frame(), y)
g.coef_
[139]:
array([924.81645876])
The last way to estimate \(\beta_1\) is as follows.
\(h = y \sim \mathrm{age}\) (\(h\) is the regression of \(y\) on age)
\(s = y - h(\mathrm{age})\) (\(s\) is the residual)
\(i = s \sim r\) (\(i\) is the regression of the residual \(s\) on the residual \(r\))
Again, the coefficient associated with \(r\) in the last regression has the same value as the bmi coefficient in the full regression.
[140]:
h = LinearRegression(n_jobs=-1)
h.fit(X[['age']], y)
s = y - h.predict(X[['age']])
[141]:
i = LinearRegression(n_jobs=-1)
i.fit(r.to_frame(), s)
i.coef_
[141]:
array([924.81645876])
Out of curiosity, let’s look at a scatter plot of the residuals \(s\) and \(r\).
[45]:
pd.DataFrame({'r': r, 's': s}) \
.plot(kind='scatter', x='r', y='s')
[45]:
<Axes: xlabel='r', ylabel='s'>
The term partialling out refers to removing the part of the variable of interest (its level and linear trend) that is explained by the controls. The term orthogonalization reflects the geometric interpretation of what’s happening: we are projecting the variable of interest onto the subspace orthogonal to the control variable(s), and the residuals hold all the information in the variable of interest that is not explained by those controls. Residualization simply refers to the use of these residuals in the estimation.
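As a quick sanity check on the orthogonalization interpretation (assuming r from the cells above is still in scope), the residual has zero mean and is uncorrelated with the control.
import numpy as np
np.corrcoef(r, X['age'])[0, 1], r.mean()  # both are ~0 up to floating point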
16.4. Multiple controls
FWLT extends to multiple controls. Denote the following.
\(m = y \sim \mathrm{bmi} + \mathrm{age} + \mathrm{sex} + \mathrm{bp} + \mathrm{s1} + \mathrm{s2} + \mathrm{s3} + \mathrm{s4} + \mathrm{s5} + \mathrm{s6}\)
\(m_1 = \mathrm{bmi} \sim \mathrm{age} + \mathrm{sex} + \mathrm{bp} + \mathrm{s1} + \mathrm{s2} + \mathrm{s3} + \mathrm{s4} + \mathrm{s5} + \mathrm{s6}\)
\(m_2 = y \sim \mathrm{age} + \mathrm{sex} + \mathrm{bp} + \mathrm{s1} + \mathrm{s2} + \mathrm{s3} + \mathrm{s4} + \mathrm{s5} + \mathrm{s6}\)
\(r_1 = \mathrm{bmi} - m_1(\mathrm{age}, \mathrm{sex}, \mathrm{bp}, \mathrm{s1}, \mathrm{s2}, \mathrm{s3}, \mathrm{s4}, \mathrm{s5}, \mathrm{s6})\)
\(r_2 = y - m_2(\mathrm{age}, \mathrm{sex}, \mathrm{bp}, \mathrm{s1}, \mathrm{s2}, \mathrm{s3}, \mathrm{s4}, \mathrm{s5}, \mathrm{s6})\)
\(m_3 = r_2 \sim r_1\)
The coefficient associated with bmi in \(m\) is equivalent to the coefficient associated with \(r_1\) in \(m_3\).
[142]:
m1 = LinearRegression()
m2 = LinearRegression()
m3 = LinearRegression()
m1.fit(X.drop(columns=['bmi']), X['bmi'])
m2.fit(X.drop(columns=['bmi']), y)
r1 = X['bmi'] - m1.predict(X.drop(columns=['bmi']))
r2 = y - m2.predict(X.drop(columns=['bmi']))
m3.fit(r1.to_frame(), r2)
m3.coef_
[142]:
array([519.84592005])
[143]:
m = LinearRegression()
m.fit(X, y)
pd.Series(m.coef_, X.columns)
[143]:
age -10.009866
sex -239.815644
bmi 519.845920
bp 324.384646
s1 -792.175639
s2 476.739021
s3 101.043268
s4 177.063238
s5 751.273700
s6 67.626692
dtype: float64
16.5. Non-linear estimators
We do not have to use linear estimators for \(m_1\) and \(m_2\); below, we use random forest models instead. The final estimate from the non-linear estimators is slightly lower than the one from the linear estimators.
[144]:
from sklearn.ensemble import RandomForestRegressor
m1 = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
m2 = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
m3 = LinearRegression()
m1.fit(X.drop(columns=['bmi']), X['bmi'])
m2.fit(X.drop(columns=['bmi']), y)
r1 = X['bmi'] - m1.predict(X.drop(columns=['bmi']))
r2 = y - m2.predict(X.drop(columns=['bmi']))
m3.fit(r1.to_frame(), r2)
m3.coef_
[144]:
array([508.34043333])
16.6. Sample-splitting
When \(m_1\) and \(m_2\) overfit, the estimate will be biased. To combat this bias, sample-splitting is applied. In sample-splitting, we repeat the following many times:
split the data into training and testing sets,
learn the estimators \(m_1\) and \(m_2\) from the training data,
generate the residuals from the testing data, and
regress the residuals to find the estimate (coefficient).
The sample-splitting estimates are then averaged to produce the final estimate. Below, we do this for all variables.
[255]:
import numpy as np
from sklearn.model_selection import train_test_split

def get_estimate(d_col='bmi', random_state=37):
    # split into training and testing halves
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=True, test_size=0.5, random_state=random_state)

    # separate the variable of interest from the controls
    X_tr, d_tr = X_tr.drop(columns=[d_col]), X_tr[d_col]
    X_te, d_te = X_te.drop(columns=[d_col]), X_te[d_col]

    m1 = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
    m2 = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
    m3 = LinearRegression()

    # learn m1 and m2 on the training half, residualize on the testing half
    m1.fit(X_tr, d_tr)
    m2.fit(X_tr, y_tr)

    r1 = d_te - m1.predict(X_te)
    r2 = y_te - m2.predict(X_te)

    # regress residual on residual to get the coefficient
    m3.fit(r1.to_frame(), r2)
    return m3.coef_[0]

def estimate(d_col='bmi', max_iter=200):
    estimates = [get_estimate(d_col, i) for i in range(max_iter)]
    return {
        'variable': d_col,
        'estimate': np.mean(estimates),
        'std': np.std(estimates)
    }
pd.DataFrame([estimate(c) for c in X.columns])
[255]:
| | variable | estimate | std |
|---|---|---|---|
| 0 | age | 2.107160 | 69.651271 |
| 1 | sex | -205.918395 | 73.585760 |
| 2 | bmi | 495.455222 | 81.473690 |
| 3 | bp | 265.241904 | 71.549371 |
| 4 | s1 | -134.198358 | 313.943890 |
| 5 | s2 | -52.306987 | 286.652365 |
| 6 | s3 | -236.446231 | 163.779218 |
| 7 | s4 | 97.210133 | 257.698259 |
| 8 | s5 | 515.953055 | 113.054241 |
| 9 | s6 | 83.226874 | 75.599235 |
16.7. Sample-splitting and cross-fitting
We can combine sample-splitting with cross-fitting to provide an unbiased estimate. In cross-fitting, we split the data into two folds, \(a\) and \(b\). We train a model on \(a\) and estimate the coefficient on \(b\); we also train a model on \(b\) and estimate the coefficient on \(a\). The average of these two coefficients is the final estimate.
[257]:
def cross_fit(X_tr, d_tr, y_tr, X_te, d_te, y_te):
    m1 = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
    m2 = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
    m3 = LinearRegression()

    # learn m1 and m2 on one fold, residualize and regress on the other
    m1.fit(X_tr, d_tr)
    m2.fit(X_tr, y_tr)

    r1 = d_te - m1.predict(X_te)
    r2 = y_te - m2.predict(X_te)

    m3.fit(r1.to_frame(), r2)
    return m3.coef_[0]

def split_fit(d_col='bmi', random_state=37):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=True, test_size=0.5, random_state=random_state)
    X_tr, d_tr = X_tr.drop(columns=[d_col]), X_tr[d_col]
    X_te, d_te = X_te.drop(columns=[d_col]), X_te[d_col]

    # cross-fit: swap the roles of the two folds and average
    lhs = cross_fit(X_tr, d_tr, y_tr, X_te, d_te, y_te)
    rhs = cross_fit(X_te, d_te, y_te, X_tr, d_tr, y_tr)
    return np.mean([lhs, rhs])

def estimate(d_col='bmi', max_iter=100):
    estimates = [split_fit(d_col, i) for i in range(max_iter)]
    return {
        'variable': d_col,
        'estimate': np.mean(estimates),
        'std': np.std(estimates)
    }
pd.DataFrame([estimate(c) for c in X.columns])
[257]:
| | variable | estimate | std |
|---|---|---|---|
| 0 | age | 6.990086 | 40.040529 |
| 1 | sex | -208.788784 | 44.207837 |
| 2 | bmi | 491.990834 | 45.080000 |
| 3 | bp | 262.765973 | 37.226123 |
| 4 | s1 | -110.913374 | 154.213316 |
| 5 | s2 | -78.543704 | 126.827543 |
| 6 | s3 | -233.844736 | 87.138051 |
| 7 | s4 | 117.667220 | 128.956083 |
| 8 | s5 | 519.283738 | 67.865800 |
| 9 | s6 | 75.133127 | 41.054708 |
16.8. Double machine learning
FWLT has implications for causal inference in terms of estimating the average treatment effect (ATE). Denote the following.
\(y\) the outcome variable
\(d\) the treatment variable (the variable for which we want to estimate the ATE)
\(X\) the set of confounding variables
Then, denote the following models.
\(m_1 = d \sim X\)
\(m_2 = y \sim X\)
Denote the residuals corresponding to these models.
\(r_1 = d - m_1(X)\)
\(r_2 = y - m_2(X)\)
Finally, define a linear model as follows.
\(r_2 \sim \beta_1 r_1\)
\(\beta_1\) is the ATE.
Double machine learning (DML) generalizes the FWLT approach with many benefits. A key benefit is correcting for regularization and overfitting biases; the former is addressed through orthogonalization (FWLT) and the latter through sample-splitting and cross-fitting. Let’s see how DML estimates the ATE for each of the variables.
16.8.1. DoubleML
Here are the ATE estimates using DoubleML.
[259]:
from doubleml import DoubleMLData
from doubleml import DoubleMLPLR
from sklearn.base import clone
def get_model(d_col):
    np.random.seed(3141)

    # treat d_col as the treatment and the remaining columns as controls
    data = DoubleMLData(
        X.assign(y=y),
        y_col='y',
        d_cols=d_col,
        x_cols=list(X.columns.drop([d_col]))
    )

    learner = RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10)
    ml_l = clone(learner)
    ml_m = clone(learner)

    m = DoubleMLPLR(data, ml_l, ml_m, n_folds=5, n_rep=1, score='partialling out', dml_procedure='dml2')
    m.fit(store_models=True)
    return m

def get_estimate(d_col):
    m = get_model(d_col)
    return m.summary
pd.concat([get_estimate(c) for c in X.columns])
[259]:
| | coef | std err | t | P>\|t\| | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|
| age | -8.725229 | 59.042732 | -0.147778 | 8.825178e-01 | -124.446858 | 106.996400 |
| sex | -198.485390 | 59.895497 | -3.313862 | 9.201703e-04 | -315.878408 | -81.092373 |
| bmi | 448.273300 | 75.330901 | 5.950723 | 2.669611e-09 | 300.627447 | 595.919152 |
| bp | 271.593295 | 64.953301 | 4.181362 | 2.897675e-05 | 144.287163 | 398.899426 |
| s1 | -237.891157 | 237.396920 | -1.002082 | 3.163040e-01 | -703.180571 | 227.398256 |
| s2 | -115.889975 | 245.066520 | -0.472892 | 6.362903e-01 | -596.211528 | 364.431579 |
| s3 | -123.176182 | 150.923436 | -0.816150 | 4.144143e-01 | -418.980681 | 172.628317 |
| s4 | 54.571481 | 218.437700 | 0.249826 | 8.027217e-01 | -373.558543 | 482.701505 |
| s5 | 507.240573 | 101.580684 | 4.993475 | 5.930256e-07 | 308.146091 | 706.335055 |
| s6 | 101.971064 | 65.382567 | 1.559606 | 1.188529e-01 | -26.176413 | 230.118540 |
16.8.2. EconML
Here are the ATE estimates using EconML.
[262]:
from econml.dml import DML
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor
def get_model():
    # nuisance models for the outcome (model_y) and the treatment (model_t),
    # with a lasso as the final-stage model
    return DML(
        model_y=RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10),
        model_t=RandomForestRegressor(random_state=37, n_jobs=-1, n_estimators=10),
        model_final=LassoCV(fit_intercept=False, n_jobs=-1, random_state=37),
        random_state=37
    )

def get_estimate(T):
    m = get_model()
    m.fit(Y=y, T=X[T], X=None, W=X.drop(columns=[T]))
    return {
        'variable': T,
        'coef': m.effect()[0]
    }
pd.DataFrame([get_estimate(c) for c in X.columns])
[262]:
| | variable | coef |
|---|---|---|
| 0 | age | 0.000000e+00 |
| 1 | sex | -1.090364e+02 |
| 2 | bmi | 5.110700e+02 |
| 3 | bp | 1.867952e+02 |
| 4 | s1 | -9.176047e-15 |
| 5 | s2 | 0.000000e+00 |
| 6 | s3 | -3.120313e+02 |
| 7 | s4 | 2.140634e+02 |
| 8 | s5 | 5.167653e+02 |
| 9 | s6 | 0.000000e+00 |