6. Estimating Standard Error and Significance of Regression Coefficients
Scikit-Learn is an awesome API loaded with machine learning algorithms. However, Scikit-Learn seems to lag behind R when it comes to providing additional information for some models. In particular, for regression models such as logistic regression and Ordinary Least Squares (OLS) regression, Scikit-Learn does not provide standard errors (SEs) and significance (p-values) of coefficients. There are a few options when researchers want these estimates. One is to use R, but using R just to get estimates may be problematic:

- the user may not know R
- the user may want to stay within the Python realm

The statsmodels API is a Python library that does provide these estimates for logistic and OLS regressions. However, it is not easy to pull out the individual values, as they are inside non-obvious internal data structures.
Below, we show how to estimate the SE and p-value for logistic and OLS regression coefficients. The approach is to sample the data with replacement and perform many regressions. The coefficients estimated from these resamples may then be used to compute the SE and p-value of each coefficient.
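The general recipe, independent of the model type, looks something like the following minimal sketch (bootstrap_se, fit_coefs, and n_boot are placeholder names for illustration; fit_coefs stands for any function that fits a model to a DataFrame with a y column and returns its coefficients).

import pandas as pd

def bootstrap_se(df, fit_coefs, n_boot=100):
    # fit the model on n_boot resampled copies of the data (rows drawn with replacement);
    # each row of coefs holds the coefficients from one resample
    coefs = pd.DataFrame([
        fit_coefs(df.sample(df.shape[0], replace=True))
        for _ in range(n_boot)
    ])
    # the SE of each coefficient is its standard deviation across the resamples
    return coefs.std()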
6.1. Logistic Regression
Let’s use Scikit-Learn to create classification data.
[1]:
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
import random
np.random.seed(37)
random.seed(37)
X, y = make_classification(**{
    'n_samples': 2000,
    'n_features': 5,
    'n_informative': 5,
    'n_redundant': 0,
    'n_repeated': 0,
    'n_classes': 2,
    'n_clusters_per_class': 1,
    'random_state': 37
})
print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (2000, 5), y shape (2000,)
statsmodels can provide us not only with the coefficients of the model, but also the SEs and p-values.
[2]:
import statsmodels.api as sm
binomial_model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
res = binomial_model.fit()
res.summary()
[2]:
Dep. Variable: | y | No. Observations: | 2000 |
---|---|---|---|
Model: | GLM | Df Residuals: | 1994 |
Model Family: | Binomial | Df Model: | 5 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -554.00 |
Date: | Tue, 30 Nov 2021 | Deviance: | 1108.0 |
Time: | 19:02:46 | Pearson chi2: | 8.48e+03 |
No. Iterations: | 7 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | -1.5277 | 0.194 | -7.893 | 0.000 | -1.907 | -1.148 |
x1 | -0.8484 | 0.079 | -10.673 | 0.000 | -1.004 | -0.693 |
x2 | 0.1734 | 0.091 | 1.898 | 0.058 | -0.006 | 0.352 |
x3 | -0.8857 | 0.093 | -9.492 | 0.000 | -1.069 | -0.703 |
x4 | 1.5011 | 0.084 | 17.972 | 0.000 | 1.337 | 1.665 |
x5 | 0.4876 | 0.071 | 6.867 | 0.000 | 0.348 | 0.627 |
We can sample (with replacement) from the data many times (in this case, 100 times) and perform a logistic regression on each resampled data set, logging the coefficients each time. The SE of each coefficient is simply the standard deviation of that coefficient across the resamples. The z-value is the coefficient learned from the complete, unsampled data set divided by the SE. An approximate confidence interval is created by subtracting and adding one SE to that learned coefficient. The p-value is then computed from the z-value using the t-distribution.
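As a quick illustration of that last step, here is how a z- or t-statistic is converted to a tail probability with scipy (the statistic and degrees of freedom below are made-up numbers; the code in this section keeps the one-sided tail, whereas a conventional two-sided p-value would double it).

import scipy.stats

# made-up statistic and degrees of freedom, purely for illustration
z_stat, dof = 1.9, 1994

# one-sided tail probability P(T > |z|)
p_one_sided = scipy.stats.t.sf(abs(z_stat), df=dof)

# conventional two-sided p-value
p_two_sided = 2 * p_one_sided
print(p_one_sided, p_two_sided)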
[3]:
from sklearn.linear_model import LogisticRegression
import scipy.stats

def do_lreg(df):
    # fit a logistic regression on one bootstrap resample and return its coefficients
    sample = df.sample(df.shape[0], replace=True)
    X_tr = sample[[c for c in sample.columns if c != 'y']]
    y_tr = sample.y

    lr = LogisticRegression(penalty='l2', solver='liblinear')
    lr.fit(X_tr, y_tr)

    params = [lr.intercept_[0]] + list(lr.coef_[0])
    return params

def get_se(X, y):
    # fit on the complete (unsampled) data to get the point estimates
    lr = LogisticRegression(penalty='l2', solver='liblinear')
    lr.fit(X, y)

    df = pd.DataFrame(X)
    df['y'] = y

    # 100 bootstrap fits; each row holds the intercept and coefficients from one resample
    r_df = pd.DataFrame([do_lreg(df) for _ in range(100)])

    w = [lr.intercept_[0]] + list(lr.coef_[0])
    # the SE of each coefficient is its standard deviation across the bootstrap fits
    se = r_df.std()
    dof = X.shape[0] - X.shape[1] - 1

    summary = pd.DataFrame({
        'w': w,
        'se': se,
        'z': w / se,
        '.025': w - se,
        '.975': w + se,
        'df': [dof for _ in range(len(w))]
    })
    # one-sided tail probability of the z statistic under a t-distribution
    summary['P>|z|'] = scipy.stats.t.sf(abs(summary.z), df=summary.df)
    return summary

get_se(X, y)
[3]:
| | w | se | z | .025 | .975 | df | P>\|z\| |
---|---|---|---|---|---|---|---|
0 | -1.450358 | 0.195782 | -7.408036 | -1.646140 | -1.254576 | 1994 | 9.412147e-14 |
1 | -0.830592 | 0.073803 | -11.254226 | -0.904395 | -0.756789 | 1994 | 7.841322e-29 |
2 | 0.161455 | 0.081370 | 1.984219 | 0.080086 | 0.242825 | 1994 | 2.368414e-02 |
3 | -0.853932 | 0.114541 | -7.455270 | -0.968473 | -0.739392 | 1994 | 6.648994e-14 |
4 | 1.481402 | 0.093463 | 15.850211 | 1.387940 | 1.574865 | 1994 | 1.106522e-53 |
5 | 0.483337 | 0.066506 | 7.267597 | 0.416832 | 0.549843 | 1994 | 2.613835e-13 |
You will notice that the coefficients, SEs, and p-values from this approach differ from those of statsmodels. That is okay, as statsmodels uses a different way to estimate not only the coefficients but also the SEs. The values differ only slightly; they are effectively the same.
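If you want to see the two sets of estimates side by side, a sketch like the one below works; it assumes the fitted statsmodels GLM result from above is still available as res (whose params and bse attributes hold the coefficients and SEs) and simply reruns get_se to get the bootstrap summary.

# side-by-side comparison of statsmodels and bootstrap estimates
summary = get_se(X, y)

comparison = pd.DataFrame({
    'sm_coef': res.params,             # statsmodels coefficients
    'boot_coef': summary['w'].values,  # coefficients from the unsampled scikit-learn fit
    'sm_se': res.bse,                  # statsmodels standard errors
    'boot_se': summary['se'].values    # bootstrap standard errors
})
print(comparison)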
6.2. OLS Regression
Now we generate regression data.
[4]:
from sklearn.datasets import make_regression
X, y = make_regression(**{
    'n_samples': 1000,
    'n_features': 5,
    'n_informative': 5,
    'n_targets': 1,
    'bias': 5.3,
    'random_state': 37
})
print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (1000, 5), y shape (1000,)
Here’s the output from statsmodels.
[5]:
mod = sm.OLS(y, sm.add_constant(X))
res = mod.fit()
res.summary()
[5]:
Dep. Variable: | y | R-squared: | 1.000 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 1.000 |
Method: | Least Squares | F-statistic: | 2.665e+32 |
Date: | Tue, 30 Nov 2021 | Prob (F-statistic): | 0.00 |
Time: | 19:02:46 | Log-Likelihood: | 29147. |
No. Observations: | 1000 | AIC: | -5.828e+04 |
Df Residuals: | 994 | BIC: | -5.825e+04 |
Df Model: | 5 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 5.3000 | 1.69e-15 | 3.14e+15 | 0.000 | 5.300 | 5.300 |
x1 | 41.8134 | 1.7e-15 | 2.46e+16 | 0.000 | 41.813 | 41.813 |
x2 | 25.0388 | 1.79e-15 | 1.4e+16 | 0.000 | 25.039 | 25.039 |
x3 | 3.2108 | 1.71e-15 | 1.88e+15 | 0.000 | 3.211 | 3.211 |
x4 | 29.9942 | 1.68e-15 | 1.79e+16 | 0.000 | 29.994 | 29.994 |
x5 | 25.3606 | 1.67e-15 | 1.52e+16 | 0.000 | 25.361 | 25.361 |
Omnibus: | 0.308 | Durbin-Watson: | 1.665 |
---|---|---|---|
Prob(Omnibus): | 0.857 | Jarque-Bera (JB): | 0.394 |
Skew: | 0.021 | Prob(JB): | 0.821 |
Kurtosis: | 2.913 | Cond. No. | 1.16 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We estimate SEs and p-values using sampling again.
[6]:
from sklearn.linear_model import LinearRegression

def do_ols(df):
    # fit an OLS regression on one bootstrap resample and return its coefficients
    sample = df.sample(df.shape[0], replace=True)
    X_tr = sample[[c for c in sample.columns if c != 'y']]
    y_tr = sample.y

    lr = LinearRegression()
    lr.fit(X_tr, y_tr)

    params = [lr.intercept_] + list(lr.coef_)
    return params

def get_se(X, y):
    # fit on the complete (unsampled) data to get the point estimates
    lr = LinearRegression()
    lr.fit(X, y)

    df = pd.DataFrame(X)
    df['y'] = y

    # 100 bootstrap fits; each row holds the intercept and coefficients from one resample
    r_df = pd.DataFrame([do_ols(df) for _ in range(100)])

    w = [lr.intercept_] + list(lr.coef_)
    # the SE of each coefficient is its standard deviation across the bootstrap fits
    se = r_df.std()
    dof = X.shape[0] - X.shape[1] - 1

    summary = pd.DataFrame({
        'w': w,
        'se': se,
        't': w / se,
        '.025': w - se,
        '.975': w + se,
        'df': [dof for _ in range(len(w))]
    })
    # one-sided tail probability of the t statistic under a t-distribution
    summary['P>|t|'] = scipy.stats.t.sf(abs(summary.t), df=summary.df)
    return summary

get_se(X, y)
[6]:
| | w | se | t | .025 | .975 | df | P>\|t\| |
---|---|---|---|---|---|---|---|
0 | 5.300000 | 6.924248e-15 | 7.654261e+14 | 5.300000 | 5.300000 | 994 | 0.0 |
1 | 41.813417 | 7.709545e-14 | 5.423591e+14 | 41.813417 | 41.813417 | 994 | 0.0 |
2 | 25.038808 | 5.120436e-14 | 4.889975e+14 | 25.038808 | 25.038808 | 994 | 0.0 |
3 | 3.210812 | 3.342118e-14 | 9.607117e+13 | 3.210812 | 3.210812 | 994 | 0.0 |
4 | 29.994205 | 4.581110e-14 | 6.547366e+14 | 29.994205 | 29.994205 | 994 | 0.0 |
5 | 25.360629 | 4.109462e-14 | 6.171277e+14 | 25.360629 | 25.360629 | 994 | 0.0 |
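The SEs here are vanishingly small (near machine precision) because make_regression generates noiseless targets by default (its noise parameter defaults to 0.0), so every resample recovers essentially the same coefficients; this matches the statsmodels output above, where the standard errors are on the order of 1e-15. To get SEs that behave more like those from real data, you could regenerate the data with some noise, for example as in the sketch below (the noise level of 10.0 is an arbitrary choice).

# regenerate the regression data with gaussian noise added to the targets,
# so the bootstrap SEs are no longer vanishingly small
X, y = make_regression(**{
    'n_samples': 1000,
    'n_features': 5,
    'n_informative': 5,
    'n_targets': 1,
    'bias': 5.3,
    'noise': 10.0,
    'random_state': 37
})

get_se(X, y)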