3. Estimating Standard Error and Significance of Regression Coefficients

Scikit-Learn is an awesome API loaded with machine learning algorithms. However, Scikit-Learn seems to lack behind R when it comes to providing additional information for some models. In particular, for regression models such as logistic regression and Ordinary Least Square (OLS) regression, Scikit-Learn does not provide standard errors (SEs) and significance (p-values) of coefficients. There’s a few options to use when researchers want these estimates. One is to use R, but using R just to get estimates may be problematic:

  • the user may not know R

  • the user may want to stay within the Python realm

There is the statsmodel API that does provide these estimates for logistic and OLS regressions and it is a Python library. However, it is not easy to pull out these individual values as they are inside non-obvious internal data structures.

Below, we show how to estimate SE and p-value for logistic and OLS regression coefficients. The approach is to sample with replacement the data and perform many regressions. The estimates of the coefficients then may be used to compute SE and p-value for each coefficient.

3.1. Logistic Regression

Let’s use Scikit-Learn to create classification data.

[1]:
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
import random

np.random.seed(37)
random.seed(37)

X, y = make_classification(**{
    'n_samples': 2000,
    'n_features': 5,
    'n_informative': 5,
    'n_redundant': 0,
    'n_repeated': 0,
    'n_classes': 2,
    'n_clusters_per_class': 1,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape {y.shape}')

X shape = (2000, 5), y shape (2000,)

statsmodel can provide us not only with the coefficients of the model, but also the SEs and p-values.

[2]:
import statsmodels.api as sm

binonmial_model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
res = binonmial_model.fit()

res.summary()
[2]:
Generalized Linear Model Regression Results
Dep. Variable: y No. Observations: 2000
Model: GLM Df Residuals: 1994
Model Family: Binomial Df Model: 5
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -554.00
Date: Tue, 30 Nov 2021 Deviance: 1108.0
Time: 19:02:46 Pearson chi2: 8.48e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -1.5277 0.194 -7.893 0.000 -1.907 -1.148
x1 -0.8484 0.079 -10.673 0.000 -1.004 -0.693
x2 0.1734 0.091 1.898 0.058 -0.006 0.352
x3 -0.8857 0.093 -9.492 0.000 -1.069 -0.703
x4 1.5011 0.084 17.972 0.000 1.337 1.665
x5 0.4876 0.071 6.867 0.000 0.348 0.627

We can sample (with replacement) from the data many times (in this case, 100 times) and peform many logistic regressions on the sampled data. We will trace or log the coefficients each time. The SE of each coefficient is just its standard deviation over these coefficients learned from the samples. The t-value is the coefficient (the coefficient learned from the complete data set that is NOT sampled) divided by the SE. The 95% confidence interval is created by subtracting and adding the SE to the learned coefficient (the coefficient learned from the complete data set that is NOT sampled). The p-value is then computed from the t-value.

[3]:
from sklearn.linear_model import LogisticRegression
import scipy.stats

def do_lreg(df):
    sample = df.sample(df.shape[0], replace=True)
    X_tr = sample[[c for c in sample.columns if c != 'y']]
    y_tr = sample.y

    lr = LogisticRegression(penalty='l2', solver='liblinear')
    lr.fit(X_tr, y_tr)

    params = [lr.intercept_[0]] +  list(lr.coef_[0])

    return params

def get_se(X, y):
    lr = LogisticRegression(penalty='l2', solver='liblinear')
    lr.fit(X, y)

    df = pd.DataFrame(X)
    df['y'] = y

    r_df = pd.DataFrame([do_lreg(df) for _ in range(100)])

    w = [lr.intercept_[0]] + list(lr.coef_[0])
    se = r_df.std()

    dof = X.shape[0] - X.shape[1] - 1

    summary = pd.DataFrame({
        'w': w,
        'se': se,
        'z': w / se,
        '.025': w - se,
        '.975': w + se,
        'df': [dof for _ in range(len(w))]
    })

    summary['P>|z|'] = scipy.stats.t.sf(abs(summary.z), df=summary.df)

    return summary

get_se(X, y)
[3]:
w se z .025 .975 df P>|z|
0 -1.450358 0.195782 -7.408036 -1.646140 -1.254576 1994 9.412147e-14
1 -0.830592 0.073803 -11.254226 -0.904395 -0.756789 1994 7.841322e-29
2 0.161455 0.081370 1.984219 0.080086 0.242825 1994 2.368414e-02
3 -0.853932 0.114541 -7.455270 -0.968473 -0.739392 1994 6.648994e-14
4 1.481402 0.093463 15.850211 1.387940 1.574865 1994 1.106522e-53
5 0.483337 0.066506 7.267597 0.416832 0.549843 1994 2.613835e-13

You will notice that the coefficients and their SEs and p-values are different using this approach as compared with statsmodel. That is ok as statsmodel uses a different way to estimate not only the coefficients but also the SEs. The values are different, but only slightly so; they are effectively the same.

3.2. OLS Regression

Now we generate regression data.

[4]:
from sklearn.datasets import make_regression

X, y = make_regression(**{
    'n_samples': 1000,
    'n_features': 5,
    'n_informative': 5,
    'n_targets': 1,
    'bias': 5.3,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (1000, 5), y shape (1000,)

Here’s the output from statsmodel.

[5]:
mod = sm.OLS(y, sm.add_constant(X))
res = mod.fit()

res.summary()
[5]:
OLS Regression Results
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 2.665e+32
Date: Tue, 30 Nov 2021 Prob (F-statistic): 0.00
Time: 19:02:46 Log-Likelihood: 29147.
No. Observations: 1000 AIC: -5.828e+04
Df Residuals: 994 BIC: -5.825e+04
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 5.3000 1.69e-15 3.14e+15 0.000 5.300 5.300
x1 41.8134 1.7e-15 2.46e+16 0.000 41.813 41.813
x2 25.0388 1.79e-15 1.4e+16 0.000 25.039 25.039
x3 3.2108 1.71e-15 1.88e+15 0.000 3.211 3.211
x4 29.9942 1.68e-15 1.79e+16 0.000 29.994 29.994
x5 25.3606 1.67e-15 1.52e+16 0.000 25.361 25.361
Omnibus: 0.308 Durbin-Watson: 1.665
Prob(Omnibus): 0.857 Jarque-Bera (JB): 0.394
Skew: 0.021 Prob(JB): 0.821
Kurtosis: 2.913 Cond. No. 1.16


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We estimate SEs and p-values using sampling again.

[6]:
from sklearn.linear_model import LinearRegression

def do_ols(df):
    sample = df.sample(df.shape[0], replace=True)
    X_tr = sample[[c for c in sample.columns if c != 'y']]
    y_tr = sample.y

    lr = LinearRegression()
    lr.fit(X_tr, y_tr)

    params = [lr.intercept_] +  list(lr.coef_)

    return params

def get_se(X, y):
    lr = LinearRegression()
    lr.fit(X, y)

    df = pd.DataFrame(X)
    df['y'] = y

    r_df = pd.DataFrame([do_ols(df) for _ in range(100)])

    w = [lr.intercept_] + list(lr.coef_)
    se = r_df.std()

    dof = X.shape[0] - X.shape[1] - 1

    summary = pd.DataFrame({
        'w': w,
        'se': se,
        't': w / se,
        '.025': w - se,
        '.975': w + se,
        'df': [dof for _ in range(len(w))]
    })

    summary['P>|t|'] = scipy.stats.t.sf(abs(summary.t), df=summary.df)

    return summary


get_se(X, y)
[6]:
w se t .025 .975 df P>|t|
0 5.300000 6.924248e-15 7.654261e+14 5.300000 5.300000 994 0.0
1 41.813417 7.709545e-14 5.423591e+14 41.813417 41.813417 994 0.0
2 25.038808 5.120436e-14 4.889975e+14 25.038808 25.038808 994 0.0
3 3.210812 3.342118e-14 9.607117e+13 3.210812 3.210812 994 0.0
4 29.994205 4.581110e-14 6.547366e+14 29.994205 29.994205 994 0.0
5 25.360629 4.109462e-14 6.171277e+14 25.360629 25.360629 994 0.0