# 3. Estimating Standard Error and Significance of Regression Coefficients

Scikit-Learn is an awesome API loaded with machine learning algorithms. However, Scikit-Learn lags behind R when it comes to providing additional information for some models. In particular, for regression models such as logistic regression and Ordinary Least Squares (OLS) regression, Scikit-Learn does not provide standard errors (SEs) or significance (p-values) for coefficients. There are a few options when researchers want these estimates. One is to use R, but switching to R just to get estimates may be problematic:

• the user may not know R

• the user may want to stay within the Python realm

There is the statsmodels API, which does provide these estimates for logistic and OLS regression, and it is a Python library. However, it is not easy to pull out the individual values, as they are buried inside non-obvious internal data structures.

Below, we show how to estimate the SE and p-value of logistic and OLS regression coefficients. The approach is to sample the data with replacement (the bootstrap) and perform many regressions. The coefficient estimates from these resampled regressions may then be used to compute the SE and p-value of each coefficient.
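As a warm-up, here is the same bootstrap idea applied to a simpler statistic, the mean of a sample. The data and sizes below are made up for illustration; for the mean we can compare against the analytic SE, `s / sqrt(n)`, to see that the bootstrap recovers it.

```python
import numpy as np

np.random.seed(37)

# toy data: 500 draws from a normal distribution
x = np.random.normal(loc=10.0, scale=2.0, size=500)

# bootstrap: resample with replacement many times and recompute the statistic
boot_means = np.array([
    np.mean(np.random.choice(x, size=x.shape[0], replace=True))
    for _ in range(1000)
])

# the bootstrap SE is the standard deviation of the resampled statistics
se_boot = boot_means.std()

# analytic SE of the mean for comparison: s / sqrt(n)
se_analytic = x.std(ddof=1) / np.sqrt(x.shape[0])

print(f'bootstrap SE = {se_boot:.4f}, analytic SE = {se_analytic:.4f}')
```

For regression coefficients there is no such simple closed form available from Scikit-Learn, which is why we bootstrap them below.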

## 3.1. Logistic Regression

Let’s use Scikit-Learn to create classification data.

[1]:

from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
import random

np.random.seed(37)
random.seed(37)

X, y = make_classification(**{
    'n_samples': 2000,
    'n_features': 5,
    'n_informative': 5,
    'n_redundant': 0,
    'n_repeated': 0,
    'n_classes': 2,
    'n_clusters_per_class': 1,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape {y.shape}')

X shape = (2000, 5), y shape (2000,)


statsmodels can provide us not only with the coefficients of the model, but also with their SEs and p-values.

[2]:

import statsmodels.api as sm

binomial_model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
res = binomial_model.fit()

res.summary()

[2]:

Dep. Variable:                  y   No. Observations:      2000
Model:                        GLM   Df Residuals:          1994
Model Family:            Binomial   Df Model:                 5
Link Function:              logit   Scale:               1.0000
Method:                      IRLS   Log-Likelihood:     -554.00
Date:            Tue, 30 Nov 2021   Deviance:            1108.0
Time:                    19:02:46   Pearson chi2:      8.48e+03
No. Iterations:                 7   Covariance Type:  nonrobust

          coef   std err         z    P>|z|   [0.025   0.975]
const  -1.5277     0.194    -7.893    0.000   -1.907   -1.148
x1     -0.8484     0.079   -10.673    0.000   -1.004   -0.693
x2      0.1734     0.091     1.898    0.058   -0.006    0.352
x3     -0.8857     0.093    -9.492    0.000   -1.069   -0.703
x4      1.5011     0.084    17.972    0.000    1.337    1.665
x5      0.4876     0.071     6.867    0.000    0.348    0.627

We can sample from the data with replacement many times (in this case, 100 times) and perform a logistic regression on each sample, tracing or logging the coefficients each time. The SE of each coefficient is simply the standard deviation of that coefficient across the sampled fits. The t-value is the coefficient learned from the complete (unsampled) data set divided by its SE, and the 95% confidence interval is formed by subtracting and adding the SE to that same coefficient. The p-value is then computed from the t-value.
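A minimal sketch of that arithmetic for a single hypothetical coefficient (the numbers below are made up; `scipy.stats.t.sf` gives the one-sided tail probability, which is what the code in this notebook uses, while statsmodels' `P>|z|` is the two-sided version, `2 * sf`):

```python
import numpy as np
import scipy.stats

# suppose these are one feature's coefficients across 100 bootstrap fits
np.random.seed(37)
boot_coefs = np.random.normal(loc=0.5, scale=0.07, size=100)

w = 0.49                       # hypothetical coefficient from the full data set
se = boot_coefs.std(ddof=1)    # bootstrap SE
t = w / se                     # t-value
dof = 2000 - 5 - 1             # n - k - 1 degrees of freedom

# survival function gives P(T > |t|); small values indicate significance
p = scipy.stats.t.sf(abs(t), df=dof)

print(f'se = {se:.4f}, t = {t:.2f}, p = {p:.2e}')
```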

[3]:

from sklearn.linear_model import LogisticRegression
import scipy.stats

def do_lreg(df):
    # bootstrap: resample the rows with replacement and refit
    sample = df.sample(df.shape[0], replace=True)
    X_tr = sample[[c for c in sample.columns if c != 'y']]
    y_tr = sample.y

    lr = LogisticRegression(penalty='l2', solver='liblinear')
    lr.fit(X_tr, y_tr)

    params = [lr.intercept_[0]] + list(lr.coef_[0])

    return params

def get_se(X, y):
    # fit once on the complete (unsampled) data for the point estimates
    lr = LogisticRegression(penalty='l2', solver='liblinear')
    lr.fit(X, y)

    df = pd.DataFrame(X)
    df['y'] = y

    # 100 bootstrap fits; one row of coefficients per fit
    r_df = pd.DataFrame([do_lreg(df) for _ in range(100)])

    w = [lr.intercept_[0]] + list(lr.coef_[0])
    se = r_df.std()

    dof = X.shape[0] - X.shape[1] - 1

    summary = pd.DataFrame({
        'w': w,
        'se': se,
        'z': w / se,
        '.025': w - se,
        '.975': w + se,
        'df': [dof for _ in range(len(w))]
    })

    summary['P>|z|'] = scipy.stats.t.sf(abs(summary.z), df=summary.df)

    return summary

get_se(X, y)

[3]:

w se z .025 .975 df P>|z|
0 -1.450358 0.195782 -7.408036 -1.646140 -1.254576 1994 9.412147e-14
1 -0.830592 0.073803 -11.254226 -0.904395 -0.756789 1994 7.841322e-29
2 0.161455 0.081370 1.984219 0.080086 0.242825 1994 2.368414e-02
3 -0.853932 0.114541 -7.455270 -0.968473 -0.739392 1994 6.648994e-14
4 1.481402 0.093463 15.850211 1.387940 1.574865 1994 1.106522e-53
5 0.483337 0.066506 7.267597 0.416832 0.549843 1994 2.613835e-13

You will notice that the coefficients, SEs, and p-values from this approach differ from those produced by statsmodels. That is okay, as statsmodels uses a different method to estimate not only the coefficients but also the SEs. The values differ only slightly; they are effectively the same.

## 3.2. OLS Regression

Now we generate regression data.

[4]:

from sklearn.datasets import make_regression

X, y = make_regression(**{
    'n_samples': 1000,
    'n_features': 5,
    'n_informative': 5,
    'n_targets': 1,
    'bias': 5.3,
    'random_state': 37
})

print(f'X shape = {X.shape}, y shape {y.shape}')

X shape = (1000, 5), y shape (1000,)


Here’s the output from statsmodels.

[5]:

mod = sm.OLS(y, sm.add_constant(X))
res = mod.fit()

res.summary()

[5]:

Dep. Variable:                  y   R-squared:            1.000
Model:                        OLS   Adj. R-squared:       1.000
Method:             Least Squares   F-statistic:      2.665e+32
Date:            Tue, 30 Nov 2021   Prob (F-statistic):    0.00
Time:                    19:02:46   Log-Likelihood:      29147.
No. Observations:            1000   AIC:             -5.828e+04
Df Residuals:                 994   BIC:             -5.825e+04
Df Model:                       5
Covariance Type:        nonrobust

          coef   std err          t    P>|t|   [0.025   0.975]
const   5.3000  1.69e-15   3.14e+15    0.000    5.300    5.300
x1     41.8134   1.7e-15   2.46e+16    0.000   41.813   41.813
x2     25.0388  1.79e-15    1.4e+16    0.000   25.039   25.039
x3      3.2108  1.71e-15   1.88e+15    0.000    3.211    3.211
x4     29.9942  1.68e-15   1.79e+16    0.000   29.994   29.994
x5     25.3606  1.67e-15   1.52e+16    0.000   25.361   25.361

Omnibus:          0.308   Durbin-Watson:       1.665
Prob(Omnibus):    0.857   Jarque-Bera (JB):    0.394
Skew:             0.021   Prob(JB):            0.821
Kurtosis:         2.913   Cond. No.             1.16

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We estimate SEs and p-values using sampling again.

[6]:

from sklearn.linear_model import LinearRegression

def do_ols(df):
    # bootstrap: resample the rows with replacement and refit
    sample = df.sample(df.shape[0], replace=True)
    X_tr = sample[[c for c in sample.columns if c != 'y']]
    y_tr = sample.y

    lr = LinearRegression()
    lr.fit(X_tr, y_tr)

    params = [lr.intercept_] + list(lr.coef_)

    return params

def get_se(X, y):
    # fit once on the complete (unsampled) data for the point estimates
    lr = LinearRegression()
    lr.fit(X, y)

    df = pd.DataFrame(X)
    df['y'] = y

    # 100 bootstrap fits; one row of coefficients per fit
    r_df = pd.DataFrame([do_ols(df) for _ in range(100)])

    w = [lr.intercept_] + list(lr.coef_)
    se = r_df.std()

    dof = X.shape[0] - X.shape[1] - 1

    summary = pd.DataFrame({
        'w': w,
        'se': se,
        't': w / se,
        '.025': w - se,
        '.975': w + se,
        'df': [dof for _ in range(len(w))]
    })

    summary['P>|t|'] = scipy.stats.t.sf(abs(summary.t), df=summary.df)

    return summary

get_se(X, y)

[6]:

w se t .025 .975 df P>|t|
0 5.300000 6.924248e-15 7.654261e+14 5.300000 5.300000 994 0.0
1 41.813417 7.709545e-14 5.423591e+14 41.813417 41.813417 994 0.0
2 25.038808 5.120436e-14 4.889975e+14 25.038808 25.038808 994 0.0
3 3.210812 3.342118e-14 9.607117e+13 3.210812 3.210812 994 0.0
4 29.994205 4.581110e-14 6.547366e+14 29.994205 29.994205 994 0.0
5 25.360629 4.109462e-14 6.171277e+14 25.360629 25.360629 994 0.0
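The SEs here are vanishingly small (on the order of 1e-14) because `make_regression` generates noiseless data by default (`noise=0.0`), so every resample recovers essentially the same perfect fit. With noise added, the bootstrap SEs become non-trivial. A sketch, reusing the same bootstrap loop on noisy data:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

np.random.seed(37)

# same generator as above, but with observation noise added
X, y = make_regression(n_samples=1000, n_features=5, n_informative=5,
                       bias=5.3, noise=10.0, random_state=37)

df = pd.DataFrame(X)
df['y'] = y

def boot_coefs(df):
    # bootstrap: resample the rows with replacement and refit
    sample = df.sample(df.shape[0], replace=True)
    lr = LinearRegression().fit(sample.drop(columns='y'), sample.y)
    return [lr.intercept_] + list(lr.coef_)

r_df = pd.DataFrame([boot_coefs(df) for _ in range(100)])
se = r_df.std()

print(se.round(3))
```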