15. Explanation vs Prediction, Imbalanced Data
Let’s continue our exploration of explanation vs prediction. We argue that the goals of explanation and prediction are very different and, when optimized for, may point in different directions for model selection. In this example, we generate data for an imbalanced classification problem and observe the validated prediction and explanation performance measures. For prediction performance, we look at
Area Under the Curve for the Receiver Operating Characteristic (AUC-ROC) curve, and
Area Under the Curve for the Precision-Recall (AUC-PR) curve.
For explanation performance, we look at the following (their defining formulas are summarized right after this list).
Brier score
Log loss score
Brier Skill Score
Efron’s pseudo R-squared
McFadden’s pseudo R-squared
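These explanation measures are computed later in the get_performance function; as a reference, here is a summary of their standard definitions, written to match how the code below computes them, where \(\hat{\pi}_i\) is the predicted probability that \(y_i = 1\), \(p_1 = P(y=1)\) is the base rate that the null model always predicts, and \(n\) is the number of observations.

\[
\text{BS} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{\pi}_i)^2, \qquad
\text{LL} = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \ln \hat{\pi}_i + (1 - y_i) \ln (1 - \hat{\pi}_i) \right]
\]

\[
\text{BSS} = 1 - \frac{\text{BS}_{\text{alt}}}{\text{BS}_{\text{nil}}}, \qquad
R^2_{\text{Efron}} = 1 - \frac{\sum_i (y_i - \hat{\pi}_i)^2}{\sum_i (y_i - p_1)^2}, \qquad
R^2_{\text{McFadden}} = 1 - \frac{\ln L_{\text{alt}}}{\ln L_{\text{nil}}}
\]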
15.1. Data
Here, we generate data with 2 classes and then downsample the minority class (y=1) to 10% of its original size to induce data imbalance (the minority class ends up at roughly 9% of the data).
[1]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import random
import numpy as np
import pandas as pd

random.seed(37)
np.random.seed(37)

def get_Xy(n_samples=4_000, p_imbalance=0.1):
    X, y = make_classification(**{
        'n_samples': n_samples,
        'n_features': 10,
        'n_informative': 2,
        'n_redundant': 0,
        'n_repeated': 0,
        'n_classes': 2,
        'n_clusters_per_class': 1,
        'flip_y': 0.1,
        'class_sep': 0.5,
        'random_state': 37
    })

    Xy = pd.DataFrame(X, columns=[f'x{i}' for i in range(X.shape[1])]).assign(y=y)
    Xy_0 = Xy[Xy['y']==0]
    Xy_1 = Xy[Xy['y']==1].sample(frac=p_imbalance)

    return pd.concat([Xy_0, Xy_1]) \
        .sample(frac=1.0) \
        .reset_index(drop=True)

Xy = get_Xy()
X, y = Xy.drop(columns=['y']), Xy['y']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=37, shuffle=True, stratify=Xy['y'])
[2]:
X.shape, y.shape
[2]:
((2212, 10), (2212,))
[3]:
y_tr.value_counts() / y_tr.shape[0], y_te.value_counts() / y_te.shape[0]
[3]:
(0    0.91005
 1    0.08995
 Name: y, dtype: float64,
 0    0.90991
 1    0.09009
 Name: y, dtype: float64)
15.2. Logistic Regression
The first classifier algorithm we will use is logistic regression. In Scikit-Learn, the class_weight parameter can be specified to combat data imbalance. There will be two models, m1 and m2, identical in every way except that the first model, m1, will not have class_weight specified. We want to see how the two models differ in terms of explanation and prediction performance.
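As a quick sanity check on what class_weight='balanced' actually does (this snippet is only illustrative and is not used anywhere below), scikit-learn's compute_class_weight utility shows the per-class weights, which are inversely proportional to the class frequencies in the training data.

from sklearn.utils.class_weight import compute_class_weight

# each class gets weight n_samples / (n_classes * n_samples_in_class),
# so the minority class (y=1) is weighted roughly 1 / (2 x 0.09), about 5.6
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_tr)
dict(zip([0, 1], weights))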
[4]:
from sklearn.linear_model import LogisticRegression
m1 = LogisticRegression(solver='saga', random_state=37, n_jobs=-1, max_iter=5_000)
m2 = LogisticRegression(solver='saga', random_state=37, n_jobs=-1, max_iter=5_000, class_weight='balanced')
m1.fit(X_tr, y_tr)
m2.fit(X_tr, y_tr)
[4]:
LogisticRegression(class_weight='balanced', max_iter=5000, n_jobs=-1,
random_state=37, solver='saga')
[5]:
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss, log_loss

def get_performance(m, X, y_true):
    # baseline prevalence P(y=1), computed from the full dataset;
    # the null model always predicts this constant probability
    p1 = (y.value_counts() / y.shape[0]).sort_index().iloc[1]

    y_pred = m.predict_proba(X)[:,1]
    y_null = np.full(y_true.shape, p1)

    roc = roc_auc_score(y_true, y_pred)
    aps = average_precision_score(y_true, y_pred)

    roc_lift = roc / p1
    aps_lift = aps / p1

    bs_alt = brier_score_loss(y_true, y_pred)
    bs_nil = brier_score_loss(y_true, y_null)

    ll_alt = log_loss(y_true, y_pred)
    ll_nil = log_loss(y_true, y_null)

    efron_alt = np.sum(np.power(y_true - y_pred, 2.0))
    efron_nil = np.sum(np.power(y_true - y_null, 2.0))
    efron = 1 - (efron_alt / efron_nil)

    mcfadden_alt = ll_alt
    mcfadden_nil = ll_nil
    mcfadden = 1 - (mcfadden_alt / mcfadden_nil)

    return pd.Series({
        'p1': p1,
        'roc': roc,
        'aps': aps,
        'roc_lift': roc_lift,
        'aps_lift': aps_lift,
        'bs_alt': bs_alt,
        'bs_nil': bs_nil,
        'bss': 1 - (bs_alt / bs_nil),
        'll_alt': ll_alt,
        'll_nil': ll_nil,
        'lss': 1 - (ll_alt / ll_nil),
        'efron_alt': efron_alt,
        'efron_nil': efron_nil,
        'efron': efron,
        'mcfadden_alt': mcfadden_alt,
        'mcfadden_nil': mcfadden_nil,
        'mcfadden': mcfadden
    })
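One small note on the McFadden calculation above: the code uses log_loss, which is the mean negative log-likelihood, rather than the total log-likelihood. Because McFadden’s pseudo R-squared is a ratio, the factor of \(n\) cancels and the result is unchanged.

\[
R^2_{\text{McFadden}} = 1 - \frac{\ln L_{\text{alt}}}{\ln L_{\text{nil}}} = 1 - \frac{-n \cdot \text{LL}_{\text{alt}}}{-n \cdot \text{LL}_{\text{nil}}} = 1 - \frac{\text{LL}_{\text{alt}}}{\text{LL}_{\text{nil}}}
\]

where \(\text{LL}\) denotes the mean log loss returned by log_loss.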
Here is a key to the performance measures.
p1: the proportion of data with y=1, i.e., \(P(y=1)\)
roc: AUC-ROC
aps: AUC-PR
roc_lift: AUC-ROC divided by p1 (the conventional baseline for AUC-ROC is 0.5, random guessing, but the lift here is taken against p1 for consistency with aps_lift)
aps_lift: AUC-PR divided by p1 (p1 is the baseline)
bs_alt: the Brier score (lower is better) for the alternative model (e.g. m1 or m2)
bs_nil: the Brier score (lower is better) for the null model; the null model always predicts the climatological average p1 (see the sketch after this key)
bss: the Brier Skill Score (higher is better)
ll_alt: the log loss (lower is better) for the alternative model (e.g. m1 or m2)
ll_nil: the log loss (lower is better) for the null model
lss: the percentage improvement over the null model based on the log loss (higher is better)
efron_alt: the sum of squared differences (SSD) (lower is better) for the alternative model (e.g. m1 or m2)
efron_nil: the SSD (lower is better) for the null model
efron: Efron’s pseudo R-squared (higher is better)
mcfadden_alt: the mean negative log-likelihood (log loss) of the alternative model (lower is better)
mcfadden_nil: the mean negative log-likelihood (log loss) of the null model
mcfadden: McFadden’s pseudo R-squared (higher is better)
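As an aside, the null model referenced above is essentially scikit-learn's DummyClassifier with strategy='prior'; the sketch below is only for reference (the code above simply uses np.full), and here the prior is estimated from the training split rather than the full dataset.

from sklearn.dummy import DummyClassifier

# the 'prior' strategy ignores X and always predicts the empirical class priors,
# so predict_proba returns a constant probability of roughly 0.09 for y=1
null_model = DummyClassifier(strategy='prior')
null_model.fit(X_tr, y_tr)
y_null = null_model.predict_proba(X_te)[:, 1]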
Below, we have results for the following.
m1_tr: model 1 (no class_weight) with the training data
m1_te: model 1 with the testing data (validated results)
m2_tr: model 2 (balanced class_weight to combat data imbalance) with the training data
m2_te: model 2 with the testing data
[6]:
pd.DataFrame({
'm1_tr': get_performance(m1, X_tr, y_tr),
'm2_tr': get_performance(m2, X_tr, y_tr),
'm1_te': get_performance(m1, X_te, y_te),
'm2_te': get_performance(m2, X_te, y_te)
})
[6]:
|  | m1_tr | m2_tr | m1_te | m2_te |
|---|---|---|---|---|
| p1 | 0.089964 | 0.089964 | 0.089964 | 0.089964 |
| roc | 0.947133 | 0.949313 | 0.986634 | 0.988119 |
| aps | 0.657234 | 0.654689 | 0.835388 | 0.861084 |
| roc_lift | 10.527925 | 10.552168 | 10.967003 | 10.983512 |
| aps_lift | 7.305533 | 7.277244 | 9.285822 | 9.571450 |
| bs_alt | 0.047177 | 0.060593 | 0.032510 | 0.040741 |
| bs_nil | 0.081859 | 0.081859 | 0.081974 | 0.081974 |
| bss | 0.423676 | 0.259786 | 0.603406 | 0.502994 |
| ll_alt | 0.160113 | 0.304208 | 0.110582 | 0.211115 |
| ll_nil | 0.302422 | 0.302422 | 0.302746 | 0.302746 |
| lss | 0.470563 | -0.005906 | 0.634737 | 0.302665 |
| efron_alt | 93.882567 | 120.580071 | 7.217293 | 9.044612 |
| efron_nil | 162.898995 | 162.898995 | 18.198202 | 18.198202 |
| efron | 0.423676 | 0.259786 | 0.603406 | 0.502994 |
| mcfadden_alt | 0.160113 | 0.304208 | 0.110582 | 0.211115 |
| mcfadden_nil | 0.302422 | 0.302422 | 0.302746 | 0.302746 |
| mcfadden | 0.470563 | -0.005906 | 0.634737 | 0.302665 |
In general, you can see from the table above that the logistic regression model with class_weight=balanced does better on the prediction performance measures. On the other hand, the logistic regression model with class_weight=None does better on the explanation performance measures. Which model we would use depends on what we are trying to do (build a predictive model or an explanatory one).
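As a minimal sketch of that decision in code (using only the data and functions already defined above; the metric choices here are illustrative, not a recommendation), we can rank the validated results by a prediction-oriented measure versus an explanation-oriented one.

# validated performance of the two logistic regression models
perf = pd.DataFrame({
    'm1_te': get_performance(m1, X_te, y_te),
    'm2_te': get_performance(m2, X_te, y_te)
})

# prediction-oriented pick: highest AUC-PR; explanation-oriented pick: highest Brier Skill Score
prediction_pick = perf.loc['aps'].idxmax()
explanation_pick = perf.loc['bss'].idxmax()
prediction_pick, explanation_pick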
Below is a table summarizing the coefficients of these two models. Interestingly, all the coefficients are identical; the only difference is with the intercepts!
[7]:
def get_params(m, X):
    # use the passed-in model m for both the intercept and the coefficients
    return pd.concat([
        pd.Series(m.intercept_, ['intercept']),
        pd.Series(m.coef_[0], X.columns)
    ])

pd.DataFrame({
    'm1': get_params(m1, X),
    'm2': get_params(m2, X)
})
[7]:
|  | m1 | m2 |
|---|---|---|
| intercept | -1.857950 | 0.699405 |
| x0 | 0.161702 | 0.161702 |
| x1 | 0.023234 | 0.023234 |
| x2 | -0.114094 | -0.114094 |
| x3 | -0.206260 | -0.206260 |
| x4 | -2.341687 | -2.341687 |
| x5 | -0.072094 | -0.072094 |
| x6 | 0.142848 | 0.142848 |
| x7 | -0.052862 | -0.052862 |
| x8 | -0.059727 | -0.059727 |
| x9 | 1.532503 | 1.532503 |
Here, we show the Spearman rank correlation coefficient between the two sets of coefficients.
[8]:
pd.DataFrame({
    'm1': get_params(m1, X),
    'm2': get_params(m2, X)
}).corr(method='spearman')
[8]:
|  | m1 | m2 |
|---|---|---|
| m1 | 1.000000 | 0.672727 |
| m2 | 0.672727 | 1.000000 |
15.3. Random forest
In this section, we will duplicate what we have done before with the logistic regression classifier models but for random forest classifiers. Note that there is an additional model m3
where class_weight=balanced_subsample
.
[9]:
from sklearn.ensemble import RandomForestClassifier
m1 = RandomForestClassifier(random_state=37, n_jobs=-1, n_estimators=5)
m2 = RandomForestClassifier(random_state=37, n_jobs=-1, n_estimators=5, class_weight='balanced')
m3 = RandomForestClassifier(random_state=37, n_jobs=-1, n_estimators=5, class_weight='balanced_subsample')
m1.fit(X_tr, y_tr)
m2.fit(X_tr, y_tr)
m3.fit(X_tr, y_tr)
[9]:
RandomForestClassifier(class_weight='balanced_subsample', n_estimators=5,
n_jobs=-1, random_state=37)
Based on the table below, the performance measures are aligned in the sense that the third model tends to be better in most aspects (prediction and explanation). So, is it the case that non-linear models tend to be more consistent at being better as both an explanatory and a predictive model?
[10]:
pd.DataFrame({
    'm1_tr': get_performance(m1, X_tr, y_tr),
    'm2_tr': get_performance(m2, X_tr, y_tr),
    'm3_tr': get_performance(m3, X_tr, y_tr),
    'm1_te': get_performance(m1, X_te, y_te),
    'm2_te': get_performance(m2, X_te, y_te),
    'm3_te': get_performance(m3, X_te, y_te)
})
[10]:
|  | m1_tr | m2_tr | m3_tr | m1_te | m2_te | m3_te |
|---|---|---|---|---|---|---|
| p1 | 0.089964 | 0.089964 | 0.089964 | 0.089964 | 0.089964 | 0.089964 |
| roc | 0.998103 | 0.998394 | 0.998049 | 0.988366 | 0.988119 | 0.989851 |
| aps | 0.971345 | 0.976300 | 0.971664 | 0.839299 | 0.862206 | 0.852963 |
| roc_lift | 11.094490 | 11.097730 | 11.093890 | 10.986263 | 10.983512 | 11.002771 |
| aps_lift | 10.797064 | 10.852142 | 10.800602 | 9.329290 | 9.583920 | 9.481182 |
| bs_alt | 0.012040 | 0.011377 | 0.011437 | 0.029730 | 0.026486 | 0.024324 |
| bs_nil | 0.081859 | 0.081859 | 0.081859 | 0.081974 | 0.081974 | 0.081974 |
| bss | 0.852915 | 0.861018 | 0.860282 | 0.637327 | 0.676891 | 0.703267 |
| ll_alt | 0.043320 | 0.039864 | 0.039599 | 0.090397 | 0.080423 | 0.073280 |
| ll_nil | 0.302422 | 0.302422 | 0.302422 | 0.302746 | 0.302746 | 0.302746 |
| lss | 0.856757 | 0.868184 | 0.869059 | 0.701409 | 0.734355 | 0.757949 |
| efron_alt | 23.960000 | 22.640000 | 22.760000 | 6.600000 | 5.880000 | 5.400000 |
| efron_nil | 162.898995 | 162.898995 | 162.898995 | 18.198202 | 18.198202 | 18.198202 |
| efron | 0.852915 | 0.861018 | 0.860282 | 0.637327 | 0.676891 | 0.703267 |
| mcfadden_alt | 0.043320 | 0.039864 | 0.039599 | 0.090397 | 0.080423 | 0.073280 |
| mcfadden_nil | 0.302422 | 0.302422 | 0.302422 | 0.302746 | 0.302746 | 0.302746 |
| mcfadden | 0.856757 | 0.868184 | 0.869059 | 0.701409 | 0.734355 | 0.757949 |
Now we look at the feature importances of all models. The first model is obviously quite different from the other two.
[11]:
def get_feature_importances(m, X):
    return pd.Series(m.feature_importances_, X.columns)

pd.DataFrame({
    'm1': get_feature_importances(m1, X),
    'm2': get_feature_importances(m2, X),
    'm3': get_feature_importances(m3, X),
})
[11]:
|  | m1 | m2 | m3 |
|---|---|---|---|
| x0 | 0.056587 | 0.028781 | 0.030102 |
| x1 | 0.053240 | 0.034572 | 0.025111 |
| x2 | 0.038195 | 0.022131 | 0.023673 |
| x3 | 0.062370 | 0.023034 | 0.035275 |
| x4 | 0.262307 | 0.460612 | 0.448095 |
| x5 | 0.022753 | 0.023950 | 0.019149 |
| x6 | 0.072132 | 0.028148 | 0.030798 |
| x7 | 0.044623 | 0.037524 | 0.050079 |
| x8 | 0.046933 | 0.026547 | 0.029018 |
| x9 | 0.340858 | 0.314702 | 0.308701 |
We quantify the agreement of the feature importances across the models using the Spearman rank correlation. The interesting thing to note is that the agreement between the first and third models is higher than that between the second and third models.
[12]:
pd.DataFrame({
    'm1': get_feature_importances(m1, X),
    'm2': get_feature_importances(m2, X),
    'm3': get_feature_importances(m3, X),
}).corr(method='spearman')
[12]:
|  | m1 | m2 | m3 |
|---|---|---|---|
| m1 | 1.000000 | 0.575758 | 0.781818 |
| m2 | 0.575758 | 1.000000 | 0.709091 |
| m3 | 0.781818 | 0.709091 | 1.000000 |