11. Linear Mixed Model
When we have data that are grouped or otherwise correlated and we want to model them with regression, we might attempt one of the following:
run one regression model (e.g. what we would do anyway); this approach violates the IID assumption and underestimates the variance (standard errors) of the parameter estimates
run multiple regression models (e.g. one regression per cluster of grouped/non-IID data); this approach increases the chance of Type I errors (multiple comparisons) and uses a reduced data set for each regression
Linear mixed modeling (LMM) is a regression approach that accounts for non-IID (independently and identically distributed) data and strikes a balance between the two approaches above. The problem with non-IID data is that ordinary regression may underestimate the variance (standard errors) of the parameter estimates, leading to incorrect hypothesis tests and a model that is no longer the Best Linear Unbiased Estimator (BLUE). Additionally, LMMs are useful for understanding within-group and between-group differences.
There are many ways to deal with non-IID data:
get a better model (e.g. use multilevel or linear mixed models)
get a better variance estimator (e.g. the Huber sandwich estimator; see the sketch after this list)
reweight or resample the data (e.g. Iteratively Reweighted Least Squares)
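To make the second option concrete, here is a minimal sketch, assuming statsmodels and the ingot data frame df that is loaded in the Data section below; it swaps the usual IID covariance for a Huber-White (HC1) or cluster-robust sandwich estimator, which changes only the standard errors, not the coefficients.

import statsmodels.formula.api as smf

# assumes `df` (the ingot data) created in the Data section below
ols = smf.ols('pres ~ metal', df).fit()               # IID standard errors
hc = smf.ols('pres ~ metal', df).fit(cov_type='HC1')  # Huber-White sandwich estimator
cr = smf.ols('pres ~ metal', df).fit(cov_type='cluster',
                                     cov_kwds={'groups': df['ingot']})  # cluster-robust by ingot
print(ols.bse, hc.bse, cr.bse, sep='\n')              # compare the standard errors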
In this notebook, we will focus on LMMs and two ways to estimate the variance components: maximum likelihood (ML) and restricted maximum likelihood (REML). It should be noted that multilevel models are a type of LMM, but not all LMMs are multilevel. An LMM is denoted as follows,
\(Y = X \beta + Z u + e\),
where
\(\beta\) are the coefficients (intercept and slope),
\(X\) is the design matrix (independent variables) of dimension \(N\) x \(M\) (\(N\) rows and \(M\) columns),
\(Z\) is a sparse indicator matrix of the group (of each observation in \(X\)) of dimension \(N\) x \(J\) (\(N\) rows and \(J\) columns),
\(u\) is the vector of group-level random effects associated with \(Z\), \(e\) is the vector of observation-level residual errors, and
\(Y\) is the dependent variable.
Note that \(u\) and \(e\) are column vectors assumed to be normally distributed with zero means and covariance \(G\) and \(R\) as follows,
\(\begin{pmatrix} u \\ e \end{pmatrix} \sim \mathcal{N} \begin{pmatrix} \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix} \end{pmatrix}\)
where
\(G_{J \times J} = I \sigma^2_{\mathrm{group}}\), and
\(R_{N \times N} = I \sigma^2_{\mathrm{residual}}\).
For clarity,
\(u \sim \mathcal{N}(0, G)\), and
\(e \sim \mathcal{N}(0, R)\).
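To make the roles of \(Z\), \(G\), and \(R\) concrete, here is a toy numpy sketch (five observations, two groups, made-up numbers) showing that \(Z\) simply copies each group's random effect to every observation in that group, and that \(V = Z G Z^T + R\) is block-diagonal by group.

import numpy as np

# toy indicator matrix Z: 5 observations, first 3 in group 1, last 2 in group 2
Z = np.array([
    [1, 0],
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
])
u = np.array([0.5, -1.2])                 # made-up group-level random effects
print(Z @ u)                              # each observation picks up its group's effect

sigma2_group, sigma2_residual = 2.0, 1.0  # made-up variance components
G = sigma2_group * np.eye(2)              # J x J covariance of u
R = sigma2_residual * np.eye(5)           # N x N covariance of e
V = Z @ G @ Z.T + R                       # marginal covariance of Y, block-diagonal by group
print(V)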
The variables in \(X\) are said to be fixed effects and the grouping variables in \(Z\) are said to be the random effects. Grouping variables are categorical and can be quite complex in their relationship to one another. Grouping variables can be
nested (when groups exist only within other groups, e.g. patients are observed only within a single hospital),
partially crossed (when some groups are seen with others, e.g. some patients are observed across some hospitals), or
crossed (when all groups are seen with all other groups, e.g. all patients are observed across all hospitals).
In mature LMM packages (e.g. lme4), these types of grouping relationships are easily expressed. The mixture of fixed and random effects is the "mixed" in linear mixed models (LMM).
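For reference, here is a minimal sketch of how these grouping structures are typically written in lme4-style formulas (the same syntax pymer4 accepts at the end of this notebook); the hospital and patient names are hypothetical.

# lme4-style random-effect terms (hypothetical hospital/patient example)
nested = 'y ~ x + (1 | hospital/patient)'           # patients nested within hospitals
crossed = 'y ~ x + (1 | hospital) + (1 | patient)'  # (partially) crossed patients and hospitals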
In probability form, an LMM can be written as follows.
\(Y \mid \beta, u \sim \mathcal{N}(X \beta + Z u, R)\)
In a multilevel model, when we allow the intercept to vary by group, an LMM can be written as follows.
L1: \(Y_{ij} = \beta_{0j} + \beta_{1j} X_{1ij} + \ldots + \beta_{mj} X_{mij} + e_{ij}\),
L2: \(\beta_{0j} = \gamma_{00} + u_{0j}\),
L2: \(\beta_{1j} = \gamma_{10}\),
\(\dots\),
L2: \(\beta_{mj} = \gamma_{m0}\),
where
L1 is the Level 1 model (micro-model) of the individual units (e.g. student, patient),
L2 is a Level 2 model (macro-model) of the grouped/clustered units (e.g. teacher, hospital),
\(Y_{ij}\) is the value of the dependent variable for the i-th individual in the j-th group,
\(\beta_{0j}\) is the intercept for the j-th group,
\(\beta_{1j}, \ldots, \beta_{mj}\) are the coefficients for the j-th group, and
\(e_{ij}\) is the error for the i-th individual in the j-th group.
Notice how the intercept \(\beta_{0j}\) is the only parameter we allow to vary at L2 (it is the only L2 parameter that has an error term); this is a random-intercept model. If we allow the other parameters (the slopes) to vary at L2 as well (i.e. give them error terms), then the multilevel model is said to be a random-intercept, random-slope multilevel model.
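In lme4-style notation (used later with pymer4), the difference between the two shows up only in the random-effect term; the y, x, and group names below are hypothetical placeholders.

# random intercept only: beta_0j varies by group, the slopes do not
random_intercept = 'y ~ x + (1 | group)'
# random intercept and random slope: both beta_0j and beta_1j vary by group
random_intercept_slope = 'y ~ x + (1 + x | group)'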
11.1. Data
This data is taken from Littell (2006). In this data, we have 7 ingots, 3 metal types, and 1 dependent variable, pres.
[1]:
import pandas as pd
import itertools
# 7 ingots, each measured with 3 metals (n, i, c); pres is the dependent variable
df = pd.DataFrame({
    'ingot': list(itertools.chain(*[[i + 1 for _ in range(3)] for i in range(7)])),
    'metal': ['n', 'i', 'c'] * 7,
    'pres': [67,71.9,72.2,67.5,68.8,66.4,76,82.6,74.5,72.7,78.1,67.3,73.1,74.2,73.2,65.8,70.8,68.7,75.6,84.9,69]
}).assign(
    # indicator (dummy) column for each ingot, used later to build Z
    ingot_1=lambda d: d['ingot'].apply(lambda i: 1 if i == 1 else 0),
    ingot_2=lambda d: d['ingot'].apply(lambda i: 1 if i == 2 else 0),
    ingot_3=lambda d: d['ingot'].apply(lambda i: 1 if i == 3 else 0),
    ingot_4=lambda d: d['ingot'].apply(lambda i: 1 if i == 4 else 0),
    ingot_5=lambda d: d['ingot'].apply(lambda i: 1 if i == 5 else 0),
    ingot_6=lambda d: d['ingot'].apply(lambda i: 1 if i == 6 else 0),
    ingot_7=lambda d: d['ingot'].apply(lambda i: 1 if i == 7 else 0)
)
df
[1]:
 | ingot | metal | pres | ingot_1 | ingot_2 | ingot_3 | ingot_4 | ingot_5 | ingot_6 | ingot_7 |
---|---|---|---|---|---|---|---|---|---|---
0 | 1 | n | 67.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | i | 71.9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | c | 72.2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 2 | n | 67.5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 2 | i | 68.8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
5 | 2 | c | 66.4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
6 | 3 | n | 76.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
7 | 3 | i | 82.6 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
8 | 3 | c | 74.5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
9 | 4 | n | 72.7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
10 | 4 | i | 78.1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
11 | 4 | c | 67.3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
12 | 5 | n | 73.1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
13 | 5 | i | 74.2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
14 | 5 | c | 73.2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
15 | 6 | n | 65.8 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
16 | 6 | i | 70.8 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
17 | 6 | c | 68.7 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
18 | 7 | n | 75.6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
19 | 7 | i | 84.9 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
20 | 7 | c | 69.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
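As an aside, the seven indicator columns above could also be created with pandas' get_dummies; a minimal, equivalent sketch (the dummies and df_alt names are just for illustration).

# equivalent, more compact construction of the ingot indicator columns
dummies = pd.get_dummies(df['ingot'], prefix='ingot').astype(int)
df_alt = pd.concat([df[['ingot', 'metal', 'pres']], dummies], axis=1)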
11.2. Linear regression
[2]:
from patsy import dmatrices, dmatrix
import numpy as np
from sklearn.linear_model import LinearRegression
f = 'pres ~ metal'
y, X = dmatrices(f, df, return_type='dataframe')
X = X.iloc[:,1:]
y = np.ravel(y)
X.shape, y.shape
model = LinearRegression().fit(X, y)
model.intercept_, model.coef_
[2]:
(70.1857142857143, array([5.71428571, 0.91428571]))
[3]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
_ = pd.DataFrame({
    'y': y,
    'y_pred': model.predict(X)
}).assign(residual=lambda d: np.abs(d['y'] - d['y_pred'])) \
  .plot(kind='scatter', x='y_pred', y='residual', title='Residual vs Fitted')
[4]:
import seaborn as sns
_ = sns.boxplot(x='ingot', y='pres', data=df)
Just for fun, let’s fit a multilevel model using the intercept method. In the intercept method, at L1 we regress the dependent variable on the independent variables as well as the dummy-encoded grouping variable (we have 7 dummy variables; we drop the first and use it as the reference). We then use the coefficients associated with the dummy variables as the dependent variable to be regressed on at L2.
The L1 model is given as follows. Note that the intercept and first two coefficients are nearly identical to those of the OLS model above.
L1: \(Y_{ij} = \beta_{0j} + \beta_{1j} \mathrm{metal}_{i1} + \beta_{2j} \mathrm{ingot}_{i2} + \beta_{3j} \mathrm{ingot}_{i3} + \beta_{4j} \mathrm{ingot}_{i4} + \beta_{5j} \mathrm{ingot}_{i5} + \beta_{6j} \mathrm{ingot}_{i6} + \beta_{7j} \mathrm{ingot}_{i7} + e_{ij}\)
[5]:
f = 'pres ~ metal + C(ingot)'
y, X = dmatrices(f, df, return_type='dataframe')
X = X.iloc[:,1:]
y = np.ravel(y)
model = LinearRegression().fit(X, y)
model.intercept_, model.coef_
[5]:
(68.15714285714286,
array([ 5.71428571, 0.91428571, -2.8 , 7.33333333, 2.33333333,
3.13333333, -1.93333333, 6.13333333]))
The L2 model is given as follows.
L2: \(\beta_{0j} = \gamma_{00} + u_{0j}\),
L2: \(\beta_{2j} = \gamma_{20} + u_{2j}\),
\(\ldots\)
L2: \(\beta_{7j} = \gamma_{70} + u_{7j}\).
[6]:
y = np.ravel(model.coef_[2:])
y = np.array([0] + list(model.intercept_ - y))
X = dmatrix('+'.join([f'ingot_{i}' for i in range(2, 8)]), df, return_type='dataframe')
X = X.iloc[:,1:].drop_duplicates().values
X, y, X.shape, y.shape
model = LinearRegression().fit(X, y)
model.intercept_, model.coef_
[6]:
(0.0,
array([70.95714286, 60.82380952, 65.82380952, 65.02380952, 70.09047619,
62.02380952]))
11.3. LMM via ML
Let’s estimate the variance components (the group and residual variances) using the ML approach. Denote the following.
\(V = Z G Z^T + R\)
\(B = (X^T V^{-1} X)^{-1} X^T V^{-1} Y\)
Then, the likelihood function to be optimized is written as follows.
\(-2 L_{\mathrm{ML}}(\theta; Y) = \log |V| + (Y - X B_{\theta})^T V^{-1} (Y - X B_{\theta})\)
[7]:
from patsy import dmatrices, dmatrix

# fixed-effects design matrix X (intercept + metal dummies) and response y
f = 'pres ~ metal'
y, X = dmatrices(f, df, return_type='dataframe')

# random-effects design matrix Z: one indicator column per ingot, no intercept
f = ' + '.join([f'ingot_{i+1}' for i in range(7)]) + ' - 1'
Z = dmatrix(f, df, return_type='dataframe')

y = y.values
X = X.values
Z = Z.values
The group variance and residual variance are estimated by minimizing \(-2 L_{\mathrm{ML}}\) as follows.
[8]:
from scipy.optimize import minimize
import numpy as np
def log_ml(x):
    # x[0] = group variance, x[1] = residual variance
    G = x[0] * np.diag(np.ones(7))
    R = x[1] * np.diag(np.ones(21))
    V = Z.dot(G).dot(Z.T) + R  # marginal covariance V = Z G Z^T + R
    V_inv = np.linalg.inv(V)
    X_T = X.T
    # GLS estimate of the fixed effects given V
    B = np.linalg.inv(X_T.dot(V_inv).dot(X)).dot(X_T).dot(V_inv).dot(y)
    # -2 log-likelihood (up to an additive constant)
    LL = np.log(np.linalg.det(V)) + (y - (X.dot(B))).T.dot(V_inv).dot(y - (X.dot(B)))
    return np.ravel(LL)[0]

r = minimize(fun=log_ml, x0=np.array([1, 1]), method='Nelder-Mead',
             tol=0.0000000000001, options={'maxiter': 500})
r
[8]:
final_simplex: (array([[9.81238077, 8.88993185],
[9.81238077, 8.88993185],
[9.81238077, 8.88993185]]), array([77.11196749, 77.11196749, 77.11196749]))
fun: 77.11196749090402
message: 'Optimization terminated successfully.'
nfev: 253
nit: 109
status: 0
success: True
x: array([9.81238077, 8.88993185])
The coefficients are estimated as follows.
[9]:
def hat_beta(x):
    # GLS estimate of the fixed effects at the given variance components
    G = x[0] * np.diag(np.ones(7))
    R = x[1] * np.diag(np.ones(21))
    V = Z.dot(G).dot(Z.T) + R
    V_inv = np.linalg.inv(V)
    X_T = X.T
    B = np.linalg.inv(X_T.dot(V_inv).dot(X)).dot(X_T).dot(V_inv).dot(y)
    return np.ravel(B)
hat_beta(r.x)
[9]:
array([70.18571429, 5.71428571, 0.91428571])
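As a cross-check on the hand-rolled ML fit, statsmodels' MixedLM can also be estimated by ML instead of REML by passing reml=False; its Group Var and Scale should match the two values in r.x above (up to optimizer tolerance). A minimal sketch:

import statsmodels.formula.api as smf

# ML (not REML) fit of the same random-intercept model
m_ml = smf.mixedlm('pres ~ metal', df, groups=df['ingot']).fit(reml=False)
print(m_ml.summary())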
11.4. LMM via REML
The likelihood function to be optimized with REML is as follows.
\(-2 L_{\mathrm{REML}}(\theta; Y) = \log |V| + \log |X^T V^{-1} X| + (Y - X B_{\theta})^T V^{-1} (Y - X B_{\theta})\)
Note that the estimated group and residual variances are larger than with the ML approach.
[10]:
def log_reml(x):
    # x[0] = group variance, x[1] = residual variance
    G = x[0] * np.diag(np.ones(7))
    R = x[1] * np.diag(np.ones(21))
    V = Z.dot(G).dot(Z.T) + R
    V_inv = np.linalg.inv(V)
    X_T = X.T
    B = np.linalg.inv(X_T.dot(V_inv).dot(X)).dot(X_T).dot(V_inv).dot(y)
    # -2 restricted log-likelihood: the ML criterion plus the log|X^T V^-1 X| term
    LL = np.log(np.linalg.det(V)) + np.log(np.linalg.det(X.T.dot(V_inv).dot(X))) \
         + (y - (X.dot(B))).T.dot(V_inv).dot(y - (X.dot(B)))
    return np.ravel(LL)[0]

r = minimize(fun=log_reml, x0=np.array([1, 1]), method='Nelder-Mead',
             tol=0.0000000000001, options={'maxiter': 500})
r
[10]:
final_simplex: (array([[11.44777794, 10.37158691],
[11.44777794, 10.37158691],
[11.44777794, 10.37158691]]), array([74.70841482, 74.70841482, 74.70841482]))
fun: 74.7084148191172
message: 'Optimization terminated successfully.'
nfev: 244
nit: 107
status: 0
success: True
x: array([11.44777794, 10.37158691])
[11]:
hat_beta(r.x)
[11]:
array([70.18571429, 5.71428571, 0.91428571])
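To see where the standard errors reported by the packages below come from, here is a minimal sketch (reusing X, Z, and the REML estimates r.x from above; hat_beta_se is a hypothetical helper): the model-based covariance of the fixed effects is \((X^T V^{-1} X)^{-1}\), and the square roots of its diagonal should line up with the Std.Err. column in the statsmodels summary below.

def hat_beta_se(x):
    # model-based standard errors of the fixed effects: sqrt(diag((X^T V^-1 X)^-1))
    G = x[0] * np.eye(7)
    R = x[1] * np.eye(21)
    V = Z.dot(G).dot(Z.T) + R
    V_inv = np.linalg.inv(V)
    return np.sqrt(np.diag(np.linalg.inv(X.T.dot(V_inv).dot(X))))

hat_beta_se(r.x)  # standard errors for the intercept, metal[T.i], metal[T.n]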
11.5. LMM via statsmodels
We can verify the REML approach with statsmodels.
[12]:
import statsmodels.formula.api as smf
m = smf.mixedlm('pres ~ metal', df, groups=df['ingot'])
m = m.fit(method=['lbfgs'])
print(m.summary())
Mixed Linear Model Regression Results
======================================================
Model: MixedLM Dependent Variable: pres
No. Observations: 21 Method: REML
No. Groups: 7 Scale: 10.3716
Min. group size: 3 Log-Likelihood: -53.8951
Max. group size: 3 Converged: Yes
Mean group size: 3.0
------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
------------------------------------------------------
Intercept 70.186 1.766 39.754 0.000 66.725 73.646
metal[T.i] 5.714 1.721 3.320 0.001 2.340 9.088
metal[T.n] 0.914 1.721 0.531 0.595 -2.460 4.288
Group Var 11.448 3.273
======================================================
11.6. LMM via pymer4
You will need to have R installed along with the lme4 and lmerTest packages, as pymer4 uses rpy2 to call R to do the estimation.
[13]:
from pymer4.models import Lmer
m = Lmer('pres ~ metal + (1 | ingot)', data=df)
[14]:
m.fit()
Formula: pres~metal+(1|ingot)
Family: gaussian Inference: parametric
Number of observations: 21 Groups: {'ingot': 7.0}
Log-likelihood: -53.895 AIC: 107.790
Random effects:
Name Var Std
ingot (Intercept) 11.448 3.383
Residual 10.372 3.220
No random effect correlations specified
Fixed effects:
[14]:
 | Estimate | 2.5_ci | 97.5_ci | SE | DF | T-stat | P-val | Sig |
---|---|---|---|---|---|---|---|---
(Intercept) | 70.186 | 66.725 | 73.646 | 1.766 | 11.609 | 39.754 | 0.000 | *** |
metali | 5.714 | 2.340 | 9.088 | 1.721 | 12.000 | 3.320 | 0.006 | ** |
metaln | 0.914 | -2.460 | 4.288 | 1.721 | 12.000 | 0.531 | 0.605 |
[15]:
_ = m.plot_summary()