15. Explanation vs Prediction, Imbalanced Data

Let’s continue our exploration of explanation vs prediction. We argue that the goals of explanation and prediction are very different and, when optimized for, may point in different directions for model selection. In this example, we generate data for a classification problem and observe the validated prediction and explanation performance metrics. For prediction, we look at

  • the area under the Receiver Operating Characteristic curve (AUC-ROC), and

  • the area under the Precision-Recall curve (AUC-PR).

For explanation performance metrics, we look at the following.

  • Brier score

  • Log loss score

  • Brier Skill Score

  • Efron’s pseudo R-squared

  • McFadden’s pseudo R-squared

15.1. Data

Here, we generate data with 2 classes and downsample the minority class (y=1), keeping only 10% of its samples, to induce class imbalance; the minority class ends up at roughly 9% of the data.

[1]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import random
import numpy as np
import pandas as pd

random.seed(37)
np.random.seed(37)

def get_Xy(n_samples=4_000, p_imbalance=0.1):
    # generate a (roughly) balanced 2-class problem
    X, y = make_classification(**{
        'n_samples': n_samples,
        'n_features': 10,
        'n_informative': 2,
        'n_redundant': 0,
        'n_repeated': 0,
        'n_classes': 2,
        'n_clusters_per_class': 1,
        'flip_y': 0.1,
        'class_sep': 0.5,
        'random_state': 37
    })

    # keep all y=0 rows but only a fraction (p_imbalance) of the y=1 rows
    Xy = pd.DataFrame(X, columns=[f'x{i}' for i in range(X.shape[1])]).assign(y=y)
    Xy_0 = Xy[Xy['y']==0]
    Xy_1 = Xy[Xy['y']==1].sample(frac=p_imbalance)

    # shuffle and reset the index
    return pd.concat([Xy_0, Xy_1]) \
        .sample(frac=1.0) \
        .reset_index(drop=True)

Xy = get_Xy()

X, y = Xy.drop(columns=['y']), Xy['y']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=37, shuffle=True, stratify=Xy['y'])
[2]:
X.shape, y.shape
[2]:
((2212, 10), (2212,))
[3]:
y_tr.value_counts() / y_tr.shape[0], y_te.value_counts() / y_te.shape[0]
[3]:
(0    0.91005
 1    0.08995
 Name: y, dtype: float64,
 0    0.90991
 1    0.09009
 Name: y, dtype: float64)

15.2. Logistic Regression

The first classifier algorithm we will use is logistic regression. In Scikit-Learn, we note that there is a class_weight parameter that we can specify to combat data imbalance. There will be two models, m1 and m2, identical in every way except that the first model, m1, will not have class_weight specified. We want to see how the two models differ in terms of explanation and prediction performance.
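
As an aside, class_weight='balanced' weights each class inversely proportional to its frequency, i.e. n_samples / (n_classes * n_c) for class c. Below is a minimal sketch of what those weights look like for our training labels; it assumes y_tr and np from the cells above and uses sklearn.utils.class_weight.compute_class_weight.

from sklearn.utils.class_weight import compute_class_weight

# weights used by class_weight='balanced': n_samples / (n_classes * bincount(y))
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_tr)

# the minority class (y=1) gets a weight roughly 10x that of the majority class
print(dict(zip([0, 1], weights.round(3))))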

[4]:
from sklearn.linear_model import LogisticRegression

m1 = LogisticRegression(solver='saga', random_state=37, n_jobs=-1, max_iter=5_000)
m2 = LogisticRegression(solver='saga', random_state=37, n_jobs=-1, max_iter=5_000, class_weight='balanced')

m1.fit(X_tr, y_tr)
m2.fit(X_tr, y_tr)
[4]:
LogisticRegression(class_weight='balanced', max_iter=5000, n_jobs=-1,
                   random_state=37, solver='saga')
[5]:
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss, log_loss

def get_performance(m, X, y_true):
    # base rate P(y=1) estimated from the full dataset (the global y);
    # the null model always predicts this constant probability
    p1 = (y.value_counts() / y.shape[0]).sort_index().iloc[1]

    y_pred = m.predict_proba(X)[:,1]
    y_null = np.full(y_true.shape, p1)

    # prediction (ranking) metrics
    roc = roc_auc_score(y_true, y_pred)
    aps = average_precision_score(y_true, y_pred)

    roc_lift = roc / p1
    aps_lift = aps / p1

    # explanation (calibration/goodness-of-fit) metrics
    bs_alt = brier_score_loss(y_true, y_pred)
    bs_nil = brier_score_loss(y_true, y_null)

    ll_alt = log_loss(y_true, y_pred)
    ll_nil = log_loss(y_true, y_null)

    # Efron's pseudo R-squared: 1 - SSD(alt) / SSD(null)
    efron_alt = np.sum(np.power(y_true - y_pred, 2.0))
    efron_nil = np.sum(np.power(y_true - y_null, 2.0))
    efron = 1 - (efron_alt / efron_nil)

    # McFadden's pseudo R-squared: 1 - lnL(alt) / lnL(null); using the mean
    # log loss gives the same ratio as using the total log-likelihood
    mcfadden_alt = ll_alt
    mcfadden_nil = ll_nil
    mcfadden = 1 - (mcfadden_alt / mcfadden_nil)

    return pd.Series({
        'p1': p1,
        'roc': roc,
        'aps': aps,
        'roc_lift': roc_lift,
        'aps_lift': aps_lift,
        'bs_alt': bs_alt,
        'bs_nil': bs_nil,
        'bss': 1 - (bs_alt / bs_nil),
        'll_alt': ll_alt,
        'll_nil': ll_nil,
        'lss': 1 - (ll_alt / ll_nil),
        'efron_alt': efron_alt,
        'efron_nil': efron_nil,
        'efron': efron,
        'mcfadden_alt': mcfadden_alt,
        'mcfadden_nil': mcfadden_nil,
        'mcfadden': mcfadden
    })

Here is a key to the performance metrics; the skill score and pseudo R-squared formulas are summarized right after this list.

  • p1: the proportion of data with y=1 or \(P(y=1)\)

  • roc: AUC-ROC

  • aps: AUC-PR

  • roc_lift: AUC-ROC divided by p1 (as computed in the code above; note that 0.5, i.e. random guessing, is the conventional AUC-ROC baseline)

  • aps_lift: AUC-PR divided by p1 (p1 is the baseline for AUC-PR)

  • bs_alt: the Brier score (lower is better) for the alternative model (e.g. m1 or m2)

  • bs_nil: the Brier score (lower is better) for the null model, which always predicts the base rate p1 (the climatological average)

  • bss: the Brier Skill Score (higher is better)

  • ll_alt: the log loss (lower is better) for the alternative model (e.g. m1 or m2)

  • ll_nil: the log loss (lower is better) for the null model

  • lss: the relative improvement over the null model in terms of log loss (higher is better)

  • efron_alt: the sum of squared differences (SSD) between the labels and the predicted probabilities (lower is better) for the alternative model (e.g. m1 or m2)

  • efron_nil: the SSD (lower is better) for the null model

  • efron: Efron’s pseudo R-squared (higher is better)

  • mcfadden_alt: the log loss (mean negative log-likelihood) of the alternative model (lower is better)

  • mcfadden_nil: the log loss of the null model (lower is better)

  • mcfadden: McFadden’s pseudo R-squared (higher is better)
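
For reference, the skill-score style metrics above (bss, lss, efron, mcfadden) all share the same structure of comparing the alternative model against the null model. Using \(\hat{p}_i\) for the predicted probability, \(\bar{y}\) for the base rate p1, and \(L\) for the likelihood:

\[
\text{BSS} = 1 - \frac{\text{BS}_{\text{alt}}}{\text{BS}_{\text{nil}}}, \qquad
R^2_{\text{Efron}} = 1 - \frac{\sum_i (y_i - \hat{p}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad
R^2_{\text{McFadden}} = 1 - \frac{\ln L_{\text{alt}}}{\ln L_{\text{nil}}}
\]

Note that, as implemented above, bss coincides with efron (the Brier score is just the mean of the squared differences) and lss coincides with mcfadden (the log loss is the mean negative log-likelihood, and averaging both terms over the same \(n\) leaves the ratio unchanged), which is why those pairs of rows are identical in the tables below.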

Below, we have results for

  • m1_tr: model 1 (no class_weight) with the training data

  • m1_te: model 1 with the testing data (validated results)

  • m2_tr: model 2 (balanced class_weight to combat data imbalance) with the training data

  • m2_te: model 2 with the testing data

[6]:
pd.DataFrame({
    'm1_tr': get_performance(m1, X_tr, y_tr),
    'm2_tr': get_performance(m2, X_tr, y_tr),
    'm1_te': get_performance(m1, X_te, y_te),
    'm2_te': get_performance(m2, X_te, y_te)
})
[6]:
m1_tr m2_tr m1_te m2_te
p1 0.089964 0.089964 0.089964 0.089964
roc 0.947133 0.949313 0.986634 0.988119
aps 0.657234 0.654689 0.835388 0.861084
roc_lift 10.527925 10.552168 10.967003 10.983512
aps_lift 7.305533 7.277244 9.285822 9.571450
bs_alt 0.047177 0.060593 0.032510 0.040741
bs_nil 0.081859 0.081859 0.081974 0.081974
bss 0.423676 0.259786 0.603406 0.502994
ll_alt 0.160113 0.304208 0.110582 0.211115
ll_nil 0.302422 0.302422 0.302746 0.302746
lss 0.470563 -0.005906 0.634737 0.302665
efron_alt 93.882567 120.580071 7.217293 9.044612
efron_nil 162.898995 162.898995 18.198202 18.198202
efron 0.423676 0.259786 0.603406 0.502994
mcfadden_alt 0.160113 0.304208 0.110582 0.211115
mcfadden_nil 0.302422 0.302422 0.302746 0.302746
mcfadden 0.470563 -0.005906 0.634737 0.302665

In general, you can see from the table above that the logistic regression model with class_weight=balanced does better on the prediction metrics. On the other hand, you can also see that the logistic regression model with class_weight=None does better on the explanation metrics. Which model we would use depends on what we are trying to do (build a predictive model or an explanatory one).
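
One way to see why the calibration-oriented (explanation) metrics degrade for the class-weighted model is to look at a reliability table: balanced class weights push the predicted probabilities for the minority class upward relative to the true base rate, which is exactly what the Brier score and log loss penalize. Below is a minimal sketch using sklearn.calibration.calibration_curve; it assumes m1, m2, X_te, y_te and pd from the cells above, and the choice of n_bins=5 is arbitrary.

from sklearn.calibration import calibration_curve

for name, m in [('m1', m1), ('m2', m2)]:
    # fraction of actual positives vs mean predicted probability in each bin
    frac_pos, mean_pred = calibration_curve(y_te, m.predict_proba(X_te)[:, 1], n_bins=5)
    print(name)
    print(pd.DataFrame({'mean_pred': mean_pred, 'frac_pos': frac_pos}))

A well-calibrated model should have frac_pos close to mean_pred in every bin.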

Below is a table summarizing the intercepts and coefficients of the two models. Interestingly, the coefficients are essentially the same; the difference shows up in the intercepts!

[7]:
def get_params(m, X):
    # stack the intercept and the coefficients into a single labeled series
    return pd.concat([
        pd.Series(m.intercept_, ['intercept']),
        pd.Series(m.coef_[0], X.columns)
    ])

pd.DataFrame({
    'm1': get_params(m1, X),
    'm2': get_params(m2, X)
})
[7]:
m1 m2
intercept -1.857950 0.699405
x0 0.161702 0.161702
x1 0.023234 0.023234
x2 -0.114094 -0.114094
x3 -0.206260 -0.206260
x4 -2.341687 -2.341687
x5 -0.072094 -0.072094
x6 0.142848 0.142848
x7 -0.052862 -0.052862
x8 -0.059727 -0.059727
x9 1.532503 1.532503

Here, we show the Spearman rank correlation coefficient between the two parameter vectors (intercept plus coefficients).

[8]:
pd.DataFrame({
    'm1': get_params(m1, X),
    'm2': get_params(m2, X)
}).corr(method='spearman')
[8]:
m1 m2
m1 1.000000 0.672727
m2 0.672727 1.000000

15.3. Random Forest

In this section, we will repeat what we did with the logistic regression classifiers, but now with random forest classifiers. Note that there is an additional model, m3, with class_weight=balanced_subsample; this mode is the same as balanced, except that the weights are recomputed from the bootstrap sample for every tree grown.

[9]:
from sklearn.ensemble import RandomForestClassifier

m1 = RandomForestClassifier(random_state=37, n_jobs=-1, n_estimators=5)
m2 = RandomForestClassifier(random_state=37, n_jobs=-1, n_estimators=5, class_weight='balanced')
m3 = RandomForestClassifier(random_state=37, n_jobs=-1, n_estimators=5, class_weight='balanced_subsample')

m1.fit(X_tr, y_tr)
m2.fit(X_tr, y_tr)
m3.fit(X_tr, y_tr)
[9]:
RandomForestClassifier(class_weight='balanced_subsample', n_estimators=5,
                       n_jobs=-1, random_state=37)

Based on the table below, the prediction and explanation performances are more aligned here: on the validation data, the third model tends to do best on most metrics, both prediction and explanation. So, is it the case that non-linear models tend to be more consistent at being good as both an explanatory and a predictive model?

[10]:
pd.DataFrame({
    'm1_tr': get_performance(m1, X_tr, y_tr),
    'm2_tr': get_performance(m2, X_tr, y_tr),
    'm3_tr': get_performance(m3, X_tr, y_tr),
    'm1_te': get_performance(m1, X_te, y_te),
    'm2_te': get_performance(m2, X_te, y_te),
    'm3_te': get_performance(m3, X_te, y_te)
})
[10]:
m1_tr m2_tr m3_tr m1_te m2_te m3_te
p1 0.089964 0.089964 0.089964 0.089964 0.089964 0.089964
roc 0.998103 0.998394 0.998049 0.988366 0.988119 0.989851
aps 0.971345 0.976300 0.971664 0.839299 0.862206 0.852963
roc_lift 11.094490 11.097730 11.093890 10.986263 10.983512 11.002771
aps_lift 10.797064 10.852142 10.800602 9.329290 9.583920 9.481182
bs_alt 0.012040 0.011377 0.011437 0.029730 0.026486 0.024324
bs_nil 0.081859 0.081859 0.081859 0.081974 0.081974 0.081974
bss 0.852915 0.861018 0.860282 0.637327 0.676891 0.703267
ll_alt 0.043320 0.039864 0.039599 0.090397 0.080423 0.073280
ll_nil 0.302422 0.302422 0.302422 0.302746 0.302746 0.302746
lss 0.856757 0.868184 0.869059 0.701409 0.734355 0.757949
efron_alt 23.960000 22.640000 22.760000 6.600000 5.880000 5.400000
efron_nil 162.898995 162.898995 162.898995 18.198202 18.198202 18.198202
efron 0.852915 0.861018 0.860282 0.637327 0.676891 0.703267
mcfadden_alt 0.043320 0.039864 0.039599 0.090397 0.080423 0.073280
mcfadden_nil 0.302422 0.302422 0.302422 0.302746 0.302746 0.302746
mcfadden 0.856757 0.868184 0.869059 0.701409 0.734355 0.757949

Now we look at the feature importances of all three models. The first model’s importances are clearly quite different from those of the other two.

[11]:
def get_feature_importances(m, X):
    return pd.Series(m.feature_importances_, X.columns)

pd.DataFrame({
    'm1': get_feature_importances(m1, X),
    'm2': get_feature_importances(m2, X),
    'm3': get_feature_importances(m3, X),
})
[11]:
m1 m2 m3
x0 0.056587 0.028781 0.030102
x1 0.053240 0.034572 0.025111
x2 0.038195 0.022131 0.023673
x3 0.062370 0.023034 0.035275
x4 0.262307 0.460612 0.448095
x5 0.022753 0.023950 0.019149
x6 0.072132 0.028148 0.030798
x7 0.044623 0.037524 0.050079
x8 0.046933 0.026547 0.029018
x9 0.340858 0.314702 0.308701

We quantify the agreement of the feature importances across the three models using Spearman rank correlation. The interesting thing to note is that the agreement between the first and third models is higher than that between the second and third models.

[12]:
pd.DataFrame({
    'm1': get_feature_importances(m1, X),
    'm2': get_feature_importances(m2, X),
    'm3': get_feature_importances(m3, X),
}).corr(method='spearman')
[12]:
m1 m2 m3
m1 1.000000 0.575758 0.781818
m2 0.575758 1.000000 0.709091
m3 0.781818 0.709091 1.000000
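
As a final check on the impurity-based importances compared above, permutation importance on the held-out data is a useful complement, since impurity-based importances are computed from the training data and can favor high-cardinality features. Below is a minimal sketch using sklearn.inspection.permutation_importance; it assumes m3, X_te, y_te and pd from the cells above, and the choices of scoring='roc_auc' and n_repeats=10 are arbitrary.

from sklearn.inspection import permutation_importance

# mean drop in AUC-ROC when each feature is independently shuffled on the test set
result = permutation_importance(m3, X_te, y_te, scoring='roc_auc', n_repeats=10, random_state=37, n_jobs=-1)

print(pd.Series(result.importances_mean, index=X_te.columns).sort_values(ascending=False))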