12. Iteratively Reweighted Least Squares Regression
Ordinary Least Squares (OLS) regression assumes that observations are independent and identically distributed (IID). However, quite often, our data violates this assumption. Two common sources of this violation are the following.
serial autocorrelation: observations from longitudinal data (e.g. time-series data) are more similar the closer together they are in time, and so their residuals are correlated.
repeated observations: observations are related to one another (e.g. students can be clustered by teachers), and so their residuals are correlated within clusters.
When the IID assumption is violated, the variance of the residuals is not constant but changes based on something in your model. This non-constant variance of the residuals is called heteroscedasticity. A diagnostic plot of the residuals \(r\) against the predicted values \(\hat{y}\) might show that as \(\hat{y}\) increases, the variance of \(r\) increases as well.
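As a quick, self-contained illustration (simulated data, not the trauma data used below; the names here are introduced only for this sketch), the fan shape is easy to see when the noise variance grows with the predictor.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(37)

# simulate data whose noise standard deviation grows with x (heteroscedastic)
x = np.random.uniform(0, 10, 200)
y_sim = 2.0 + 0.5 * x + np.random.normal(0.0, 0.1 + 0.2 * x, 200)

# fit a simple OLS line and plot residuals against predicted values
b1, b0 = np.polyfit(x, y_sim, 1)
y_hat = b0 + b1 * x
r = y_sim - y_hat

plt.scatter(y_hat, r)
plt.xlabel(r'$\hat{y}$')
plt.ylabel('r')
plt.title('Heteroscedastic residuals fan out')
plt.show()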
Interestingly, violating the IID assumption does not bias the estimates of the weights (intercept and coefficients). Rather, it biases the estimates of the standard errors of the weights. Additionally, because correlation in the residuals makes the standard errors difficult to estimate, the model is no longer considered the Best Linear Unbiased Estimator (BLUE). To deal with the biased estimation of the standard errors, there are a few things we can do.
multilevel modeling: change the model; get a better model
robust standard errors: change the standard error estimator; get a better estimator (a brief sketch follows this list)
Iteratively Reweighted Least Squares (IRWLS) regression: weight the data and refit the regression until the weights (coefficients) converge
In this notebook, we will go over how to do multilevel modeling and how to implement IRWLS.
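Robust standard errors are not demonstrated elsewhere in this notebook, so here is a minimal sketch (not the original author's code) using statsmodels. It assumes the trauma data frame df from the next section has already been loaded; the names y_r, X_r, and robust are introduced only for illustration.

from patsy import dmatrices
import statsmodels.api as sm

# assumes df is the trauma data frame loaded in the Data section below
y_r, X_r = dmatrices('prob_survival ~ severe_burn + head_injury + is_senior + male + hosp',
                     df, return_type='dataframe')

# heteroscedasticity-consistent (HC3) standard errors; statsmodels also supports
# cluster-robust errors via cov_type='cluster' with cov_kwds={'groups': ...}
robust = sm.OLS(y_r, X_r).fit(cov_type='HC3')
print(robust.summary())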
12.1. Data
This data is taken from the GMU Comparative Effectiveness course. It is composed of patient-level data across five hospitals (A, B, C, D, E) concerning each patient's probability of survival. Clearly, this data has observations that are related to one another, as patients are grouped under hospitals (like students being grouped by teachers). The observations are not IID and there is risk of heteroscedasticity.
[1]:
import pandas as pd
df = pd.read_excel('./data/TraumaData.xlsx').drop(columns=['Unnamed: 0'])
df = df.rename(columns={c: c.lower().replace(' ', '_') for c in df.columns})
df = df.rename(columns={'65+_years': 'is_senior'})
df.shape
[1]:
(50, 6)
[2]:
df.head()
[2]:
 | prob_survival | severe_burn | head_injury | is_senior | male | hosp |
---|---|---|---|---|---|---|
0 | 0.694551 | 1 | 1 | 1 | 1 | A |
1 | 0.733619 | 1 | 1 | 1 | 0 | A |
2 | 0.785537 | 1 | 1 | 0 | 1 | A |
3 | 0.818770 | 1 | 0 | 1 | 1 | A |
4 | 0.868275 | 1 | 0 | 0 | 1 | A |
12.2. Multilevel modeling, Intercept method
One way to do multilevel modeling is to build two regression models: first, a micro-level model (e.g. patients, students), and afterwards, a macro-level model (e.g. hospitals, teachers).
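As a rough sketch (the notation here is introduced for illustration and is not from the original): the micro-level model regresses each patient's outcome on patient-level covariates \(\mathbf{x}_i\) and hospital dummy variables \(d_{ij}\),

\(y_i = \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_i + \sum_{j \neq A} \gamma_j d_{ij} + \epsilon_i,\)

and the macro-level model regresses the adjusted hospital effects \(\tilde{y}_j = \beta_0 - \gamma_j\) (the adjustment is described in the macro-level section below) on hospital-level covariates \(\mathbf{z}_j\),

\(\tilde{y}_j = \alpha_0 + \boldsymbol{\alpha}^{\top}\mathbf{z}_j + u_j.\)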
[3]:
from patsy import dmatrices
import statsmodels.api as sm
import numpy as np
np.random.seed(37)
y, X = dmatrices('prob_survival ~ severe_burn + head_injury + is_senior + male + hosp', df, return_type='dataframe')
y = np.ravel(y)
X = X.iloc[:,1:]
X.shape, y.shape
[3]:
((50, 8), (50,))
12.2.1. Micro-level (patients)
In the micro-level model, the dependent variable (probability of survival, in this case) is regressed on all the variables. The grouping variable denoting hospital is one-hot encoded (with hospital A dropped as the reference) to produce dummy variables. The coefficients associated with the dummy variables are then used, after an adjustment, as the dependent variable in the macro-level model.
[4]:
from sklearn.linear_model import LinearRegression
def get_coef(m, X=None):
    # return the intercept followed by the coefficients; when X is given,
    # label them as a Series by 'intercept' and the column names of X
    if X is None:
        return np.array([m.intercept_] + list(m.coef_))
    return pd.concat([
        pd.Series(m.intercept_, ['intercept']),
        pd.Series(m.coef_, X.columns)
    ])
m1 = LinearRegression()
m1.fit(X, y)
c1 = get_coef(m1, X)
c1
[4]:
intercept 1.162184
hosp[T.B] -0.117932
hosp[T.C] -0.226131
hosp[T.D] -0.276450
hosp[T.E] -0.341136
severe_burn -0.245425
head_injury -0.181782
is_senior -0.109293
male -0.026703
dtype: float64
[5]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
ax = pd.DataFrame({'y': m1.predict(X), 'r': y - m1.predict(X)}).plot(kind='scatter', x='y', y='r', xlabel=r'$\hat{y}$')
_ = ax.set_title('Micro-level model residuals')
12.2.2. Macro-level (hospitals)
Before we can use the coefficients of the dummy variables (from the grouping variable), we need to adjust them using the intercept (remember, we dropped one of the dummy variables and considered it the reference). The adjustment is simply the intercept minus the coefficient, and this adjusted value becomes the dependent variable of the macro-level model. We can use the macro-level model to understand how macro-level variables influence the probability of survival.
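For example, hospital B's adjusted value is \(1.162184 - (-0.117932) = 1.280116\), which appears as the y value for B in the table below; the reference hospital A is simply assigned 0.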
[6]:
hosp_df = pd.DataFrame({
'tertiary_center': [1, 1, 0, 0, 0],
'burn_center': [0, 1, 0, 0, 0],
'y': [0] + list(m1.intercept_ - c1[1:5])
}, index=['A', 'B', 'C', 'D', 'E'])
hosp_df
[6]:
 | tertiary_center | burn_center | y |
---|---|---|---|
A | 1 | 0 | 0.000000 |
B | 1 | 1 | 1.280116 |
C | 0 | 0 | 1.388314 |
D | 0 | 0 | 1.438634 |
E | 0 | 0 | 1.503320 |
[7]:
y, X = dmatrices('y ~ tertiary_center + burn_center', hosp_df, return_type='dataframe')
y = np.ravel(y)
X = X.iloc[:,1:]
X.shape, y.shape
[7]:
((5, 2), (5,))
[8]:
m2 = LinearRegression()
m2.fit(X, y)
c2 = get_coef(m2, X)
c2
[8]:
intercept 1.443422
tertiary_center -1.443422
burn_center 1.280116
dtype: float64
[9]:
ax = pd.DataFrame({'y': m2.predict(X), 'r': y - m2.predict(X)}).plot(kind='scatter', x='y', y='r', xlabel=r'$\hat{y}$')
_ = ax.set_title('Macro-level model residuals')
12.3. Regression, y ~ X
We can also apply OLS directly on the data.
[10]:
df['tertiary_center'] = df['hosp'].apply(lambda h: 1 if h in {'A', 'B'} else 0)
df['burn_center'] = df['hosp'].apply(lambda h: 1 if h in {'B'} else 0)
df.head()
[10]:
 | prob_survival | severe_burn | head_injury | is_senior | male | hosp | tertiary_center | burn_center |
---|---|---|---|---|---|---|---|---|
0 | 0.694551 | 1 | 1 | 1 | 1 | A | 1 | 0 |
1 | 0.733619 | 1 | 1 | 1 | 0 | A | 1 | 0 |
2 | 0.785537 | 1 | 1 | 0 | 1 | A | 1 | 0 |
3 | 0.818770 | 1 | 0 | 1 | 1 | A | 1 | 0 |
4 | 0.868275 | 1 | 0 | 0 | 1 | A | 1 | 0 |
[11]:
y, X = dmatrices('prob_survival ~ severe_burn + head_injury + is_senior + male + hosp + tertiary_center + burn_center', df, return_type='dataframe')
y = np.ravel(y)
X = X.iloc[:,1:]
X.shape, y.shape
[11]:
((50, 10), (50,))
[12]:
m3 = LinearRegression()
m3.fit(X, y)
c3 = get_coef(m3, X)
c3
[12]:
intercept 0.951254
hosp[T.B] -0.058966
hosp[T.C] -0.015202
hosp[T.D] -0.065521
hosp[T.E] -0.130207
severe_burn -0.245425
head_injury -0.181782
is_senior -0.109293
male -0.026703
tertiary_center 0.210929
burn_center -0.058966
dtype: float64
[13]:
ax = pd.DataFrame({'y': m3.predict(X), 'r': y - m3.predict(X)}).plot(kind='scatter', x='y', y='r', xlabel=r'$\hat{y}$')
_ = ax.set_title('OLS: y ~ X residuals')
12.4. Regression, log(y) ~ X
We can try to control the residual heteroscedasticity by taking the log of the dependent variable.
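Note that predictions \(\hat{y}\) from this model are on the log scale, so \(\exp(\hat{y})\) maps them back to the probability-of-survival scale.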
[14]:
m4 = LinearRegression()
m4.fit(X, np.log(y))
c4 = get_coef(m4, X)
c4
[14]:
intercept -0.015569
hosp[T.B] -0.092775
hosp[T.C] -0.018212
hosp[T.D] -0.084913
hosp[T.E] -0.240481
severe_burn -0.404063
head_injury -0.269846
is_senior -0.153650
male -0.055098
tertiary_center 0.343607
burn_center -0.092775
dtype: float64
[15]:
ax = pd.DataFrame({'y': m4.predict(X), 'r': np.log(y) - m4.predict(X)}).plot(kind='scatter', x='y', y='r', xlabel=r'$\hat{y}$')
_ = ax.set_title('OLS: log(y) ~ X residuals')
12.5. IRWLS, y ~ X
In IRWLS, we weight each observation by the inverse of the magnitude of its residual and apply weighted least squares repeatedly until the weights/coefficients converge.
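Concretely, in the do_irwls implementation below, observation \(i\) receives the weight \(w_i = 1 / \max(\delta, |y_i - \hat{y}_i|)\), where \(\delta\) is a small floor that prevents division by zero, and weighted least squares is refit with these weights until the total absolute change in the coefficients falls below the tolerance.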
[16]:
def do_irwls(X, y, max_iter=100, delta=0.001, tol=0.001):
    # start with equal observation weights and an initial least squares fit
    w = np.ones(y.shape[0])
    m = LinearRegression()
    m.fit(X, y, w)
    trace = []
    for i in range(max_iter):
        _B = get_coef(m)
        # weight each observation by the inverse of its absolute residual,
        # floored at delta to avoid division by zero
        _w = 1 / np.maximum(delta, np.abs(y - m.predict(X)))
        m = LinearRegression()
        m.fit(X, y, _w)
        B = get_coef(m)
        w = 1 / np.maximum(delta, np.abs(y - m.predict(X)))
        # track how much the coefficients (B) and weights (w) changed this iteration
        B_dist = np.linalg.norm(B - _B)
        w_dist = np.linalg.norm(w - _w)
        B_delta = np.sum(np.abs(B - _B))
        w_delta = np.sum(np.abs(w - _w))
        trace.append({'i': i, 'w_dist': w_dist, 'w_delta': w_delta, 'B_dist': B_dist, 'B_delta': B_delta})
        # stop when the total absolute change in the coefficients is below the tolerance
        if B_delta < tol:
            break
    return m, pd.DataFrame(trace)
m5, trace_df = do_irwls(X, y, delta=0.0001, tol=0.0001)
c5 = get_coef(m5, X)
c5
[16]:
intercept 0.941122
hosp[T.B] -0.056463
hosp[T.C] -0.012746
hosp[T.D] -0.070542
hosp[T.E] -0.128975
severe_burn -0.251367
head_injury -0.175856
is_senior -0.096670
male -0.028194
tertiary_center 0.212263
burn_center -0.056463
dtype: float64
Here is the trace of how the weights of the observations (w_dist, w_delta) and coefficients (B_dist, B_delta) change over the iterations.
[17]:
trace_df
[17]:
 | i | w_dist | w_delta | B_dist | B_delta |
---|---|---|---|---|---|
0 | 0 | 2920.326019 | 5159.839926 | 0.010612 | 0.028483 |
1 | 1 | 6115.516835 | 8481.150242 | 0.003541 | 0.009355 |
2 | 2 | 762.006315 | 2037.056610 | 0.003275 | 0.008592 |
3 | 3 | 1598.830939 | 3631.665480 | 0.002668 | 0.007217 |
4 | 4 | 3480.849759 | 6502.227517 | 0.001484 | 0.004030 |
5 | 5 | 4448.407014 | 7498.270822 | 0.000537 | 0.001357 |
6 | 6 | 6805.783694 | 10404.790335 | 0.000318 | 0.000765 |
7 | 7 | 1256.220294 | 1794.883870 | 0.000108 | 0.000260 |
8 | 8 | 47.144100 | 100.626159 | 0.000017 | 0.000040 |
[18]:
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
_ = trace_df[['B_dist', 'B_delta']].plot(kind='line', ax=ax[0], xlabel='iteration')
_ = trace_df[['w_dist', 'w_delta']].plot(kind='line', ax=ax[1], xlabel='iteration')
_ = ax[0].set_title('Trace of coefficients')
_ = ax[1].set_title('Trace of weights')
plt.tight_layout()
12.6. Comparing coefficients
Let’s compare the coefficients of the models.
ML-1: the multilevel model (micro-level model)
OLS: OLS with y ~ X
LOG_OLS: OLS with log(y) ~ X
IRWLS: simply IRWLS with y ~ X
[19]:
pd.DataFrame([c1, c3, c4, c5], index=['ML-1', 'OLS', 'LOG_OLS', 'IRWLS']).T
[19]:
 | ML-1 | OLS | LOG_OLS | IRWLS |
---|---|---|---|---|
intercept | 1.162184 | 0.951254 | -0.015569 | 0.941122 |
hosp[T.B] | -0.117932 | -0.058966 | -0.092775 | -0.056463 |
hosp[T.C] | -0.226131 | -0.015202 | -0.018212 | -0.012746 |
hosp[T.D] | -0.276450 | -0.065521 | -0.084913 | -0.070542 |
hosp[T.E] | -0.341136 | -0.130207 | -0.240481 | -0.128975 |
severe_burn | -0.245425 | -0.245425 | -0.404063 | -0.251367 |
head_injury | -0.181782 | -0.181782 | -0.269846 | -0.175856 |
is_senior | -0.109293 | -0.109293 | -0.153650 | -0.096670 |
male | -0.026703 | -0.026703 | -0.055098 | -0.028194 |
tertiary_center | NaN | 0.210929 | 0.343607 | 0.212263 |
burn_center | NaN | -0.058966 | -0.092775 | -0.056463 |
12.7. Comparing the residuals visually
[20]:
fig, ax = plt.subplots(1, 4, figsize=(20, 4))
_ = pd.DataFrame({'y': m1.predict(X.iloc[:,:8]), 'r': y - m1.predict(X.iloc[:,:8])}).plot(kind='scatter', x='y', y='r', ax=ax[0], color='red', xlabel=r'$\hat{y}$')
_ = pd.DataFrame({'y': m3.predict(X), 'r': y - m3.predict(X)}).plot(kind='scatter', x='y', y='r', ax=ax[1], color='red', xlabel=r'$\hat{y}$')
_ = pd.DataFrame({'y': m4.predict(X), 'r': np.log(y) - m4.predict(X)}).plot(kind='scatter', x='y', y='r', ax=ax[2], xlabel=r'$\hat{y}$')
_ = pd.DataFrame({'y': m3.predict(X), 'r': y - m3.predict(X)}).plot(kind='scatter', x='y', y='r', ax=ax[3], color='red', label='OLS', xlabel=r'$\hat{y}$')
_ = pd.DataFrame({'y': m5.predict(X), 'r': y - m5.predict(X)}).plot(kind='scatter', x='y', y='r', ax=ax[3], color='green', label='IRWLS', xlabel=r'$\hat{y}$')
_ = ax[0].set_title('ML-1')
_ = ax[1].set_title('OLS')
_ = ax[2].set_title('LOG_OLS')
_ = ax[3].set_title('OLS vs IRWLS')
plt.tight_layout()
12.8. Comparing standard errors
Now we can compare the standard error estimation using bootstrap sampling. We will compare OLS with IRWLS.
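Here, the standard error of each weight is the standard deviation of that weight across \(B = 100\) bootstrap fits, \(\widehat{se}(\hat{\beta}) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}(\hat{\beta}^{(b)} - \bar{\beta})^2}\); the reported interval is \(\hat{\beta} \pm \widehat{se}\) and the p-value is a one-tailed probability from the t distribution.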
[21]:
import scipy.stats
def get_sample(df):
sample = df.sample(df.shape[0], replace=True)
y, X = dmatrices('prob_survival ~ severe_burn + head_injury + is_senior + male + hosp + tertiary_center + burn_center', sample, return_type='dataframe')
y = np.ravel(y)
X = X.iloc[:,1:]
return X, y
def do_reg(df, is_ols=True):
X, y = get_sample(df)
if is_ols:
model = LinearRegression()
model.fit(X, y)
else:
model, _ = do_irwls(X, y, delta=0.0001, tol=0.0001)
params = get_coef(model, X)
return params
def get_se(df, is_ols=True):
    # fit the model once on the full data to get the point estimates of the weights
    y, X = dmatrices('prob_survival ~ severe_burn + head_injury + is_senior + male + hosp + tertiary_center + burn_center', df, return_type='dataframe')
    y = np.ravel(y)
    X = X.iloc[:,1:]
    if is_ols:
        model = LinearRegression()
        model.fit(X, y)
    else:
        model, _ = do_irwls(X, y, delta=0.0001, tol=0.0001)
    w = get_coef(model, X)
    # refit on 100 bootstrap resamples; the standard error of each weight is the
    # standard deviation of that weight across the bootstrap fits
    r_df = pd.DataFrame([do_reg(df, is_ols) for _ in range(100)])
    se = r_df.std()
    dof = X.shape[0] - X.shape[1] - 1
    summary = pd.DataFrame({
        'w': w,
        'se': se,
        'z': w / se,
        '.025': w - se,  # the interval reported here is w minus/plus one standard error
        '.975': w + se,
        'df': [dof for _ in range(len(w))]
    })
    # one-tailed p-value from the t distribution
    summary['P>|z|'] = scipy.stats.t.sf(abs(summary.z), df=summary.df)
    return summary
[22]:
ols_df = get_se(df, is_ols=True)
ols_df \
.style \
.applymap(lambda v: 'background-color: rgb(255, 0, 0, 0.18)' if v < 0.05 else '', subset=['P>|z|'])
[22]:
 | w | se | z | .025 | .975 | df | P>|z| |
---|---|---|---|---|---|---|---|
intercept | 0.951254 | 0.027222 | 34.944054 | 0.924032 | 0.978477 | 39 | 0.000000 |
hosp[T.B] | -0.058966 | 0.016798 | -3.510404 | -0.075764 | -0.042169 | 39 | 0.000573 |
hosp[T.C] | -0.015202 | 0.014735 | -1.031645 | -0.029937 | -0.000466 | 39 | 0.154297 |
hosp[T.D] | -0.065521 | 0.018918 | -3.463351 | -0.084439 | -0.046603 | 39 | 0.000655 |
hosp[T.E] | -0.130207 | 0.020051 | -6.493915 | -0.150257 | -0.110156 | 39 | 0.000000 |
severe_burn | -0.245425 | 0.018974 | -12.935064 | -0.264398 | -0.226451 | 39 | 0.000000 |
head_injury | -0.181782 | 0.020039 | -9.071220 | -0.201821 | -0.161742 | 39 | 0.000000 |
is_senior | -0.109293 | 0.018184 | -6.010370 | -0.127477 | -0.091109 | 39 | 0.000000 |
male | -0.026703 | 0.020152 | -1.325102 | -0.046855 | -0.006551 | 39 | 0.096424 |
tertiary_center | 0.210929 | 0.024780 | 8.512106 | 0.186149 | 0.235709 | 39 | 0.000000 |
burn_center | -0.058966 | 0.016798 | -3.510404 | -0.075764 | -0.042169 | 39 | 0.000573 |
[23]:
irwls_df = get_se(df, is_ols=False)
irwls_df \
.style \
.applymap(lambda v: 'background-color: rgb(255, 0, 0, 0.18)' if v < 0.05 else '', subset=['P>|z|'])
[23]:
 | w | se | z | .025 | .975 | df | P>|z| |
---|---|---|---|---|---|---|---|
intercept | 0.941122 | 0.035170 | 26.759178 | 0.905952 | 0.976293 | 39 | 0.000000 |
hosp[T.B] | -0.056463 | 0.023825 | -2.369904 | -0.080288 | -0.032638 | 39 | 0.011418 |
hosp[T.C] | -0.012746 | 0.019916 | -0.640003 | -0.032662 | 0.007170 | 39 | 0.262957 |
hosp[T.D] | -0.070542 | 0.020053 | -3.517723 | -0.090596 | -0.050489 | 39 | 0.000561 |
hosp[T.E] | -0.128975 | 0.022705 | -5.680358 | -0.151680 | -0.106269 | 39 | 0.000001 |
severe_burn | -0.251367 | 0.023692 | -10.609974 | -0.275058 | -0.227675 | 39 | 0.000000 |
head_injury | -0.175856 | 0.030017 | -5.858583 | -0.205873 | -0.145840 | 39 | 0.000000 |
is_senior | -0.096670 | 0.025729 | -3.757309 | -0.122399 | -0.070942 | 39 | 0.000281 |
male | -0.028194 | 0.021105 | -1.335924 | -0.049299 | -0.007090 | 39 | 0.094659 |
tertiary_center | 0.212263 | 0.035038 | 6.058030 | 0.177225 | 0.247301 | 39 | 0.000000 |
burn_center | -0.056463 | 0.023825 | -2.369904 | -0.080288 | -0.032638 | 39 | 0.011418 |
In this data, OLS tends to result in standard errors that are smaller than those from IRWLS.
[24]:
ax = ols_df['se'].plot(kind='kde', label='OLS')
_ = irwls_df['se'].plot(kind='kde', label='IRWLS', ax=ax)
_ = ax.set_title('Distribution of standard errors')
_ = ax.legend()