6. Estimating Standard Error and Significance of Regression Coefficients
Scikit-Learn is an awesome API loaded with machine learning algorithms. However, Scikit-Learn seems to lag behind R when it comes to providing additional information for some models. In particular, for regression models such as logistic regression and Ordinary Least Squares (OLS) regression, Scikit-Learn does not provide standard errors (SEs) and significance (p-values) of coefficients. There are a few options when researchers want these estimates. One is to use R, but using R just to get estimates may be problematic:

- the user may not know R
- the user may want to stay within the Python realm

The statsmodels API is a Python library that does provide these estimates for logistic and OLS regressions. However, it is not easy to pull out the individual values, as they are inside non-obvious internal data structures.
Below, we show how to estimate the SE and p-value for logistic and OLS regression coefficients. The approach is to sample the data with replacement and perform many regressions. The coefficients estimated from these resamples may then be used to compute the SE and p-value of each coefficient.
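The general recipe, independent of the model type, looks something like the following minimal sketch (bootstrap_se, fit_coefs, and n_boot are placeholder names for illustration; fit_coefs stands for any function that fits a model to a DataFrame with a y column and returns its coefficients).

import pandas as pd

def bootstrap_se(df, fit_coefs, n_boot=100):
    # fit the model on n_boot resampled copies of the data (rows drawn with replacement);
    # each row of coefs holds the coefficients from one resample
    coefs = pd.DataFrame([
        fit_coefs(df.sample(df.shape[0], replace=True))
        for _ in range(n_boot)
    ])
    # the SE of each coefficient is its standard deviation across the resamples
    return coefs.std()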
6.1. Logistic Regression
Let’s use Scikit-Learn to create classification data.
[1]:
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
import random
np.random.seed(37)
random.seed(37)
X, y = make_classification(**{
    'n_samples': 2000,
    'n_features': 5,
    'n_informative': 5,
    'n_redundant': 0,
    'n_repeated': 0,
    'n_classes': 2,
    'n_clusters_per_class': 1,
    'random_state': 37
})
print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (2000, 5), y shape (2000,)
statsmodels can provide us not only with the coefficients of the model, but also the SEs and p-values.
[2]:
import statsmodels.api as sm
binomial_model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
res = binomial_model.fit()
res.summary()
[2]:
Dep. Variable: | y | No. Observations: | 2000 |
---|---|---|---|
Model: | GLM | Df Residuals: | 1994 |
Model Family: | Binomial | Df Model: | 5 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -554.00 |
Date: | Tue, 30 Nov 2021 | Deviance: | 1108.0 |
Time: | 19:02:46 | Pearson chi2: | 8.48e+03 |
No. Iterations: | 7 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | -1.5277 | 0.194 | -7.893 | 0.000 | -1.907 | -1.148 |
x1 | -0.8484 | 0.079 | -10.673 | 0.000 | -1.004 | -0.693 |
x2 | 0.1734 | 0.091 | 1.898 | 0.058 | -0.006 | 0.352 |
x3 | -0.8857 | 0.093 | -9.492 | 0.000 | -1.069 | -0.703 |
x4 | 1.5011 | 0.084 | 17.972 | 0.000 | 1.337 | 1.665 |
x5 | 0.4876 | 0.071 | 6.867 | 0.000 | 0.348 | 0.627 |
We can sample (with replacement) from the data many times (in this case, 100 times) and perform a logistic regression on each resampled data set, logging the coefficients each time. The SE of each coefficient is simply the standard deviation of that coefficient across the resamples. The z-value is the coefficient learned from the complete, unsampled data set divided by the SE. An approximate confidence interval is created by subtracting and adding one SE to that learned coefficient. The p-value is then computed from the z-value using the t-distribution.
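As a quick illustration of that last step, here is how a z- or t-statistic is converted to a tail probability with scipy (the statistic and degrees of freedom below are made-up numbers; the code in this section keeps the one-sided tail, whereas a conventional two-sided p-value would double it).

import scipy.stats

# made-up statistic and degrees of freedom, purely for illustration
z_stat, dof = 1.9, 1994

# one-sided tail probability P(T > |z|)
p_one_sided = scipy.stats.t.sf(abs(z_stat), df=dof)

# conventional two-sided p-value
p_two_sided = 2 * p_one_sided
print(p_one_sided, p_two_sided)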
[3]:
from sklearn.linear_model import LogisticRegression
import scipy.stats

def do_lreg(df):
    # fit a logistic regression on one bootstrap resample and return its coefficients
    sample = df.sample(df.shape[0], replace=True)
    X_tr = sample[[c for c in sample.columns if c != 'y']]
    y_tr = sample.y

    lr = LogisticRegression(penalty='l2', solver='liblinear')
    lr.fit(X_tr, y_tr)

    params = [lr.intercept_[0]] + list(lr.coef_[0])
    return params

def get_se(X, y):
    # fit on the complete (unsampled) data to get the point estimates
    lr = LogisticRegression(penalty='l2', solver='liblinear')
    lr.fit(X, y)

    df = pd.DataFrame(X)
    df['y'] = y

    # 100 bootstrap fits; each row holds the intercept and coefficients from one resample
    r_df = pd.DataFrame([do_lreg(df) for _ in range(100)])

    w = [lr.intercept_[0]] + list(lr.coef_[0])
    # the SE of each coefficient is its standard deviation across the bootstrap fits
    se = r_df.std()
    dof = X.shape[0] - X.shape[1] - 1

    summary = pd.DataFrame({
        'w': w,
        'se': se,
        'z': w / se,
        '.025': w - se,
        '.975': w + se,
        'df': [dof for _ in range(len(w))]
    })
    # one-sided tail probability of the z statistic under a t-distribution
    summary['P>|z|'] = scipy.stats.t.sf(abs(summary.z), df=summary.df)
    return summary

get_se(X, y)
[3]:
| | w | se | z | .025 | .975 | df | P>\|z\| |
---|---|---|---|---|---|---|---|
0 | -1.450358 | 0.195782 | -7.408036 | -1.646140 | -1.254576 | 1994 | 9.412147e-14 |
1 | -0.830592 | 0.073803 | -11.254226 | -0.904395 | -0.756789 | 1994 | 7.841322e-29 |
2 | 0.161455 | 0.081370 | 1.984219 | 0.080086 | 0.242825 | 1994 | 2.368414e-02 |
3 | -0.853932 | 0.114541 | -7.455270 | -0.968473 | -0.739392 | 1994 | 6.648994e-14 |
4 | 1.481402 | 0.093463 | 15.850211 | 1.387940 | 1.574865 | 1994 | 1.106522e-53 |
5 | 0.483337 | 0.066506 | 7.267597 | 0.416832 | 0.549843 | 1994 | 2.613835e-13 |
You will notice that the coefficients, SEs, and p-values from this approach differ from those of statsmodels. That is okay, as statsmodels uses a different way to estimate not only the coefficients but also the SEs. The values differ only slightly; they are effectively the same.
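If you want to see the two sets of estimates side by side, a sketch like the one below works; it assumes the fitted statsmodels GLM result from above is still available as res (whose params and bse attributes hold the coefficients and SEs) and simply reruns get_se to get the bootstrap summary.

# side-by-side comparison of statsmodels and bootstrap estimates
summary = get_se(X, y)

comparison = pd.DataFrame({
    'sm_coef': res.params,             # statsmodels coefficients
    'boot_coef': summary['w'].values,  # coefficients from the unsampled scikit-learn fit
    'sm_se': res.bse,                  # statsmodels standard errors
    'boot_se': summary['se'].values    # bootstrap standard errors
})
print(comparison)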
6.2. OLS Regression
Now we generate regression data.
[4]:
from sklearn.datasets import make_regression
X, y = make_regression(**{
    'n_samples': 1000,
    'n_features': 5,
    'n_informative': 5,
    'n_targets': 1,
    'bias': 5.3,
    'random_state': 37
})
print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (1000, 5), y shape (1000,)
Here’s the output from statsmodels.
[5]:
mod = sm.OLS(y, sm.add_constant(X))
res = mod.fit()
res.summary()
[5]:
Dep. Variable: | y | R-squared: | 1.000 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 1.000 |
Method: | Least Squares | F-statistic: | 2.665e+32 |
Date: | Tue, 30 Nov 2021 | Prob (F-statistic): | 0.00 |
Time: | 19:02:46 | Log-Likelihood: | 29147. |
No. Observations: | 1000 | AIC: | -5.828e+04 |
Df Residuals: | 994 | BIC: | -5.825e+04 |
Df Model: | 5 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 5.3000 | 1.69e-15 | 3.14e+15 | 0.000 | 5.300 | 5.300 |
x1 | 41.8134 | 1.7e-15 | 2.46e+16 | 0.000 | 41.813 | 41.813 |
x2 | 25.0388 | 1.79e-15 | 1.4e+16 | 0.000 | 25.039 | 25.039 |
x3 | 3.2108 | 1.71e-15 | 1.88e+15 | 0.000 | 3.211 | 3.211 |
x4 | 29.9942 | 1.68e-15 | 1.79e+16 | 0.000 | 29.994 | 29.994 |
x5 | 25.3606 | 1.67e-15 | 1.52e+16 | 0.000 | 25.361 | 25.361 |
Omnibus: | 0.308 | Durbin-Watson: | 1.665 |
---|---|---|---|
Prob(Omnibus): | 0.857 | Jarque-Bera (JB): | 0.394 |
Skew: | 0.021 | Prob(JB): | 0.821 |
Kurtosis: | 2.913 | Cond. No. | 1.16 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We estimate SEs and p-values using sampling again.
[6]:
from sklearn.linear_model import LinearRegression

def do_ols(df):
    # fit an OLS regression on one bootstrap resample and return its coefficients
    sample = df.sample(df.shape[0], replace=True)
    X_tr = sample[[c for c in sample.columns if c != 'y']]
    y_tr = sample.y

    lr = LinearRegression()
    lr.fit(X_tr, y_tr)

    params = [lr.intercept_] + list(lr.coef_)
    return params

def get_se(X, y):
    # fit on the complete (unsampled) data to get the point estimates
    lr = LinearRegression()
    lr.fit(X, y)

    df = pd.DataFrame(X)
    df['y'] = y

    # 100 bootstrap fits; each row holds the intercept and coefficients from one resample
    r_df = pd.DataFrame([do_ols(df) for _ in range(100)])

    w = [lr.intercept_] + list(lr.coef_)
    # the SE of each coefficient is its standard deviation across the bootstrap fits
    se = r_df.std()
    dof = X.shape[0] - X.shape[1] - 1

    summary = pd.DataFrame({
        'w': w,
        'se': se,
        't': w / se,
        '.025': w - se,
        '.975': w + se,
        'df': [dof for _ in range(len(w))]
    })
    # one-sided tail probability of the t statistic under a t-distribution
    summary['P>|t|'] = scipy.stats.t.sf(abs(summary.t), df=summary.df)
    return summary

get_se(X, y)
[6]:
| | w | se | t | .025 | .975 | df | P>\|t\| |
---|---|---|---|---|---|---|---|
0 | 5.300000 | 6.924248e-15 | 7.654261e+14 | 5.300000 | 5.300000 | 994 | 0.0 |
1 | 41.813417 | 7.709545e-14 | 5.423591e+14 | 41.813417 | 41.813417 | 994 | 0.0 |
2 | 25.038808 | 5.120436e-14 | 4.889975e+14 | 25.038808 | 25.038808 | 994 | 0.0 |
3 | 3.210812 | 3.342118e-14 | 9.607117e+13 | 3.210812 | 3.210812 | 994 | 0.0 |
4 | 29.994205 | 4.581110e-14 | 6.547366e+14 | 29.994205 | 29.994205 | 994 | 0.0 |
5 | 25.360629 | 4.109462e-14 | 6.171277e+14 | 25.360629 | 25.360629 | 994 | 0.0 |
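The SEs here are vanishingly small (near machine precision) because make_regression generates noiseless targets by default (its noise parameter defaults to 0.0), so every resample recovers essentially the same coefficients; this matches the statsmodels output above, where the standard errors are on the order of 1e-15. To get SEs that behave more like those from real data, you could regenerate the data with some noise, for example as in the sketch below (the noise level of 10.0 is an arbitrary choice).

# regenerate the regression data with gaussian noise added to the targets,
# so the bootstrap SEs are no longer vanishingly small
X, y = make_regression(**{
    'n_samples': 1000,
    'n_features': 5,
    'n_informative': 5,
    'n_targets': 1,
    'bias': 5.3,
    'noise': 10.0,
    'random_state': 37
})

get_se(X, y)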