21. Brier Score
The Brier score, BS, is a strictly proper scoring rule used to judge probabilistic classifiers. It is defined as follows.
\(BS = \dfrac{1}{N} \sum \limits _{t=1}^{N}(y_t - \hat{y}_t)^2\), where
\(N\) is the number of samples,
\(y_t\) is the t-th y observation (zero or one), and
\(\hat{y}_t\) is the probabilistic prediction corresponding to \(y_t\).
The value of BS is in the range \([0, 1]\), where
a value closer to 0 indicates “perfect skill”, and
a value closer to 1 indicates the “worst skill”.
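As a quick sanity check on the formula, here is a minimal sketch that computes BS by hand on a small made-up set of labels and predicted probabilities and compares it with scikit-learn's brier_score_loss. The toy arrays below are illustrative only.

import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0, 1, 1, 0])            # toy labels (illustrative only)
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1])  # toy predicted probabilities of y=1

bs_manual = np.mean((y_true - y_prob) ** 2)   # BS straight from the definition
bs_sklearn = brier_score_loss(y_true, y_prob) # same score via scikit-learn

print(bs_manual, bs_sklearn)                  # identical values, ~0.116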
When BS is close to 0.5, the Brier Skill Score, BSS, is recommended instead. BSS is defined as follows.
\(BSS = 1 - \dfrac{BS}{BS_r}\), where
\(BS_r\) is the Brier score for a reference model (e.g., a null model).
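Continuing the toy sketch above (reusing y_true, y_prob, and brier_score_loss), the BSS can be computed against a reference model that always predicts the base rate of y=1; this choice of reference is an assumption made here for illustration.

y_ref = np.full_like(y_prob, y_true.mean())   # reference model: constant base-rate prediction
bs = brier_score_loss(y_true, y_prob)         # Brier score of the candidate model
bs_r = brier_score_loss(y_true, y_ref)        # Brier score of the reference model

bss = 1 - bs / bs_r                           # positive: the candidate beats the reference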
The Brier score also goes by the names quadratic loss or mean squared error.
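To see the mean squared error connection concretely, scikit-learn's mean_squared_error applied to the labels and predicted probabilities gives the same number as brier_score_loss (continuing the toy arrays above).

from sklearn.metrics import mean_squared_error

mean_squared_error(y_true, y_prob)            # identical to brier_score_loss(y_true, y_prob)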
21.1. Load data
Let’s import a dataset about students and whether they have conducted research. The independent variables in X are the student’s scores and performance measures, and the dependent variable y is whether they have done research (y = 1) or not (y = 0).
[1]:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/selva86/datasets/master/Admission.csv'
Xy = pd.read_csv(url) \
.drop(columns=['Chance of Admit ', 'Serial No.'])
Xy.shape
[1]:
(400, 7)
[2]:
Xy.columns
[2]:
Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA',
'Research'],
dtype='object')
[3]:
Xy.head()
[3]:
   | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research
---|-----------|-------------|-------------------|-----|-----|------|---------
 0 | 337       | 118         | 4                 | 4.5 | 4.5 | 9.65 | 1
 1 | 324       | 107         | 4                 | 4.0 | 4.5 | 8.87 | 1
 2 | 316       | 104         | 3                 | 3.0 | 3.5 | 8.00 | 1
 3 | 322       | 110         | 3                 | 3.5 | 2.5 | 8.67 | 1
 4 | 314       | 103         | 2                 | 2.0 | 3.0 | 8.21 | 0
21.2. Create Xy
We will split Xy into X and y.
[4]:
X = Xy.drop(columns=['Research'])
y = Xy['Research']
X.shape, y.shape
[4]:
((400, 6), (400,))
21.3. Split Xy into training and testing
We then split X and y into training and testing folds.
[5]:
from sklearn.model_selection import StratifiedKFold
tr_idx, te_idx = next(StratifiedKFold(n_splits=10, random_state=37, shuffle=True).split(X, y))
X_tr, X_te, y_tr, y_te = X.loc[tr_idx], X.loc[te_idx], y.loc[tr_idx], y.loc[te_idx]
X_tr.shape, X_te.shape, y_tr.shape, y_te.shape
[5]:
((360, 6), (40, 6), (360,), (40,))
Let’s make sure the proportions of 1’s and 0’s are preserved with the splitting.
[6]:
y_tr.value_counts() / y_tr.value_counts().sum()
[6]:
1    0.547222
0    0.452778
Name: Research, dtype: float64
[7]:
y_te.value_counts() / y_te.value_counts().sum()
[7]:
1 0.55
0 0.45
Name: Research, dtype: float64
21.4. Model learning
Let’s train a logistic regression model on the training data.
[8]:
from sklearn.linear_model import LogisticRegression
m = LogisticRegression(solver='saga', max_iter=5_000, random_state=37, n_jobs=-1)
m.fit(X_tr, y_tr)
[8]:
LogisticRegression(max_iter=5000, n_jobs=-1, random_state=37, solver='saga')
21.5. Scoring
The null (reference) model always predicts a constant value, the expected probability of y=1, defined as follows.
\(\hat{y} = \dfrac{1}{N} \sum \limits _{t=1}^{N} y_t\)
Notice in the code that we compute this constant prediction from the training data, not the testing data (although the split preserves the class proportions).
[9]:
# expected probability of y=1, estimated from the training data
p1 = (y_tr.value_counts() / y_tr.value_counts().sum()).sort_index().loc[1]
y_null = np.full(y_te.shape, p1)       # null model: the same constant prediction for every test sample
y_pred = m.predict_proba(X_te)[:,1]    # alternative model: predicted P(y=1) from logistic regression
The BS values of the null and alternative models are shown below. Since lower is better for BS, the alternative model has the better skill.
[10]:
from sklearn.metrics import brier_score_loss
b = pd.Series([
brier_score_loss(y_te, y_null),
brier_score_loss(y_te, y_pred)
], ['null', 'alt'])
b
[10]:
null 0.247508
alt 0.180261
dtype: float64
21.6. Brier Skill Score
The BSS is interpreted as the percentage improvement in BS over the reference model. A negative BSS means the alternative model is worse than the reference model. The BSS relates to BS much as R-squared relates to mean squared error (MSE).
In the results below, the alternative model improves 27% over the null (reference) one.
[11]:
1 - (b.loc['alt'] / b.loc['null'])
[11]:
0.27169455444933743
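To illustrate the negative-BSS case mentioned above, here is a minimal sketch (reusing y_te, b, np, and brier_score_loss from the cells above) that scores a deliberately miscalibrated, hypothetical model: one that always predicts 0.95 even though the base rate of y=1 is only about 0.55. Its BS exceeds the null model’s, so its BSS comes out negative (roughly -0.65 on this split).

y_bad = np.full(y_te.shape, 0.95)             # hypothetical model: always predict 0.95 (overshoots the base rate)
bs_bad = brier_score_loss(y_te, y_bad)        # worse (higher) than the null model's BS
bss_bad = 1 - bs_bad / b.loc['null']          # negative: less skill than the null model
bss_bad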