22. Log Loss

The log loss is a strictly proper scoring rule used to evaluate probabilistic classifiers. It is also known as logistic loss or cross-entropy loss, and is defined as follows.

\(L_{\log} = -\dfrac{1}{N} \sum_t (y_t \log p_t + (1 - y_t) \log (1 - p_t))\), where

  • \(N\) is the total number of observations,

  • \(y_t\) is the \(t\)-th observed label (0 or 1), and

  • \(p_t\) is the predicted probability that \(y_t = 1\).
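The formula can be sketched directly in NumPy. This is a minimal illustration, not the implementation used later in this section: the function name `manual_log_loss`, the clipping constant `eps`, and the toy labels and probabilities are all assumptions made for the example.

```python
import numpy as np

def manual_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood for binary labels (sketch of the formula above)."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so log() stays finite (eps is an assumption).
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy labels and predicted probabilities (illustrative, not from the dataset).
y_toy = [1, 0, 1, 1]
p_toy = [0.9, 0.2, 0.6, 0.4]

toy_loss = manual_log_loss(y_toy, p_toy)
print(toy_loss)  # roughly 0.439
```

Each observation contributes the negative log of the probability assigned to its true class, so the last toy observation (true label 1, predicted 0.4) contributes the most to the average.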

The value of the log loss score is in the range \([0, \infty)\), where

  • a value closer to 0 indicates better skill (0 is “perfect skill”), and

  • a value growing towards \(\infty\) indicates worse skill.
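To see how the score behaves at the two ends of this range, the sketch below evaluates the loss contributed by a single positive observation under increasingly wrong predictions. The helper `single_loss` and the chosen probabilities are hypothetical, introduced only for illustration.

```python
import math

def single_loss(y, p):
    """Log loss contributed by one observation with label y and predicted P(y=1) = p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A positive observation (y = 1) scored under increasingly wrong predictions.
for p in [0.99, 0.5, 0.01, 1e-6]:
    print(f"p = {p:<8g} loss = {single_loss(1, p):.4f}")
```

The loss is near 0 when the predicted probability of the true class is near 1, equals log 2 for an uninformative prediction of 0.5, and grows without bound as that probability approaches 0.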

This article discusses some advantages and disadvantages of the log loss and Brier scores. In short, the Brier score is more robust to calibration issues and is preferable when the predicted probabilities are not well calibrated, while the log loss is appropriate when they are well calibrated.
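One way to see why the two scores can disagree is to compare how each reacts to a single confidently wrong prediction. The sketch below is only an illustration of that sensitivity, not a reproduction of the article's argument; the toy arrays and the helper names `log_loss_np` and `brier_np` are assumptions.

```python
import numpy as np

def log_loss_np(y, p):
    """Mean binary log loss (no clipping; assumes 0 < p < 1)."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def brier_np(y, p):
    """Mean squared difference between predicted probability and label."""
    return np.mean((p - y) ** 2)

y_toy = np.array([1, 1, 1, 0])
p_ok = np.array([0.8, 0.7, 0.9, 0.2])
# Same predictions, except the third is confidently wrong (0.001 for a true 1).
p_bad = np.array([0.8, 0.7, 0.001, 0.2])

ll_ok, ll_bad = log_loss_np(y_toy, p_ok), log_loss_np(y_toy, p_bad)
brier_ok, brier_bad = brier_np(y_toy, p_ok), brier_np(y_toy, p_bad)

print(f"log loss: {ll_ok:.3f} -> {ll_bad:.3f}")
print(f"Brier:    {brier_ok:.3f} -> {brier_bad:.3f}")
```

Each observation can add at most 1 to the Brier sum, whereas its log loss contribution is unbounded, so one overconfident error moves the log loss much further in absolute terms.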

22.1. Load data

Let’s import a dataset about students and whether they have conducted research. The independent variables in X are the students’ scores and performance measures, and the dependent variable y indicates whether they have done research (y = 1) or not (y = 0).

[1]:
import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/selva86/datasets/master/Admission.csv'
Xy = pd.read_csv(url) \
    .drop(columns=['Chance of Admit ', 'Serial No.'])

Xy.shape
[1]:
(400, 7)
[2]:
Xy.columns
[2]:
Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA',
       'Research'],
      dtype='object')
[3]:
Xy.head()
[3]:
GRE Score TOEFL Score University Rating SOP LOR CGPA Research
0 337 118 4 4.5 4.5 9.65 1
1 324 107 4 4.0 4.5 8.87 1
2 316 104 3 3.0 3.5 8.00 1
3 322 110 3 3.5 2.5 8.67 1
4 314 103 2 2.0 3.0 8.21 0

22.2. Create X and y

We will split Xy into X and y.

[4]:
X = Xy.drop(columns=['Research'])
y = Xy['Research']

X.shape, y.shape
[4]:
((400, 6), (400,))

22.3. Split Xy into training and testing

We then split X and y into training and testing folds.

[5]:
from sklearn.model_selection import StratifiedKFold

tr_idx, te_idx = next(StratifiedKFold(n_splits=10, random_state=37, shuffle=True).split(X, y))

X_tr, X_te, y_tr, y_te = X.loc[tr_idx], X.loc[te_idx], y.loc[tr_idx], y.loc[te_idx]
X_tr.shape, X_te.shape, y_tr.shape, y_te.shape
[5]:
((360, 6), (40, 6), (360,), (40,))

Let’s make sure the proportions of 1’s and 0’s are preserved with the splitting.

[6]:
y_tr.value_counts() / y_tr.value_counts().sum()
[6]:
1    0.547222
0    0.452778
Name: Research, dtype: float64
[7]:
y_te.value_counts() / y_te.value_counts().sum()
[7]:
1    0.55
0    0.45
Name: Research, dtype: float64

22.4. Model learning

Let’s train a logistic regression model on the training data.

[8]:
from sklearn.linear_model import LogisticRegression

m = LogisticRegression(solver='saga', max_iter=5_000, random_state=37, n_jobs=-1)
m.fit(X_tr, y_tr)
[8]:
LogisticRegression(max_iter=5000, n_jobs=-1, random_state=37, solver='saga')

22.5. Scoring

The null (reference) model always predicts the same probability for y = 1, namely the observed proportion of 1’s, defined as follows.

\(\hat{y} = \dfrac{1}{N} \sum_t y_t\)

Note that the code computes this constant prediction from the training data, not the testing data (although the split preserved the class proportions, so the two values are nearly identical).

[9]:
p1 = (y_tr.value_counts() / y_tr.value_counts().sum()).sort_index().loc[1]

y_null = np.full(y_te.shape, p1)
y_pred = m.predict_proba(X_te)[:,1]

The log loss scores of the null and alternative (logistic regression) models are shown below. Since lower is better, the alternative model has the better skill.

[10]:
from sklearn.metrics import log_loss

b = pd.Series([
    log_loss(y_te, y_null),
    log_loss(y_te, y_pred)
], ['null', 'alt'])

b
[10]:
null    0.688154
alt     0.531422
dtype: float64
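The null score can be checked by hand, because a constant prediction reduces the log loss formula to two terms weighted by the test-set class proportions. The sketch below assumes the counts implied by the proportions printed earlier (197 of 360 training labels and 22 of 40 test labels equal to 1); these counts are inferred, not read from the data.

```python
import math

# Counts inferred from the printed proportions (assumption):
# 197/360 = 0.547222 in training, 22/40 = 0.55 in testing.
p1 = 197 / 360    # constant predicted probability of y = 1
q_te = 22 / 40    # share of 1's in the test labels

# With a constant prediction, the average over observations collapses to
# a weighted sum over the two classes.
null_loss = -(q_te * math.log(p1) + (1 - q_te) * math.log(1 - p1))
print(null_loss)
```

Under these assumptions the result agrees with the reported null score of about 0.688, which is close to log 2 (about 0.693), the score of always predicting 0.5.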

The relative improvement of the alternative model over the null one is about 23%.

[ ]:
1 - (b.loc['alt'] / b.loc['null'])
0.22775752288568218