7. Logistic Regression on Probabilities

Typical uses of logistic regression have the dependent variable encoded as 1 or 0. Sometimes, however, the dependent variable comes as probabilities. Scikit-Learn's LogisticRegression will not work when the dependent variable is not encoded as binary values.

What can we do? We may choose an arbitrary threshold (e.g. 0.5) and set all values greater than or equal to that threshold to 1 and the rest to 0. But the threshold is arbitrary. Of course, we can run additional analyses to see which threshold is best, but the learned model (the learned coefficients) will still change with the threshold.

You can actually perform logistic regression on a continuous dependent variable; however, you will have to code up the coefficient-learning algorithm yourself. You may be wondering: when would I want to perform logistic regression on a continuous output? Take the Bradley-Terry model, which estimates paired ranking preferences. In these models, the design matrix X is regressed against continuous values to estimate rankings.
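For reference, the Bradley-Terry model assumes the log-odds of player \(i\) beating player \(j\) is the difference of their ability parameters:

  • \(\text{logit}(\pi_{ij}) = \beta_i - \beta_j\)

  • \(\pi_{ij} = \dfrac{\exp(\beta_i - \beta_j)}{1 + \exp(\beta_i - \beta_j)}\)

With a design matrix that codes player \(i\) as 1 and player \(j\) as -1 in each row, \(Xw\) is exactly \(\beta_i - \beta_j\), so a logistic regression without an intercept recovers the abilities.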

7.1. Data

This data comes from A. Agresti's book An Introduction to Categorical Data Analysis. The data is composed of head-to-head matches between 5 professional men's tennis players. For each row, \(n_{ij}\) is the number of wins and \(n_{ji}\) is the number of losses for the player marked 1 against the player marked -1. The \(y\) value is derived as \(y = \dfrac{n_{ij}}{n_{ij} + n_{ji}}\). The design matrix X is just the data without \(n_{ij}\), \(n_{ji}\) and \(y\). The dependent variable \(y\) is continuous and each value is a probability. If we perform the logistic regression \(y \sim X\), the coefficients give us the rankings of the players.

[1]:
import autograd.numpy as np
import pandas as pd
from autograd import grad

df = pd.DataFrame([
    [1, -1, 0, 0, 0, 9, 6],
    [1, 0, -1, 0, 0, 14, 3],
    [1, 0, 0, -1, 0, 9, 2],
    [1, 0, 0, 0, -1, 4, 3],
    [0, 1, -1, 0, 0, 5, 0],
    [0, 1, 0, -1, 0, 5, 1],
    [0, 1, 0, 0, -1, 7, 2],
    [0, 0, 1, -1, 0, 2, 4],
    [0, 0, 1, 0, -1, 2, 2],
    [0, 0, 0, 1, -1, 4, 3]
], columns=['Djokovic', 'Federer', 'Murray', 'Nadal', 'Wawrinka', 'n_ij', 'n_ji'])
df['y'] = df.n_ij / (df.n_ij + df.n_ji)

df
[1]:
   Djokovic  Federer  Murray  Nadal  Wawrinka  n_ij  n_ji         y
0         1       -1       0      0         0     9     6  0.600000
1         1        0      -1      0         0    14     3  0.823529
2         1        0       0     -1         0     9     2  0.818182
3         1        0       0      0        -1     4     3  0.571429
4         0        1      -1      0         0     5     0  1.000000
5         0        1       0     -1         0     5     1  0.833333
6         0        1       0      0        -1     7     2  0.777778
7         0        0       1     -1         0     2     4  0.333333
8         0        0       1      0        -1     2     2  0.500000
9         0        0       0      1        -1     4     3  0.571429
[2]:
X = df[[c for c in df.columns if c not in ['n_ij', 'n_ji', 'y']]].values
y = df.y.values

7.2. Scikit-Learn failure

We can see that Scikit-Learn cannot perform logistic regression when \(y\) is continuous.

[3]:
from sklearn.linear_model import LogisticRegression

m = LogisticRegression()
m.fit(X, y)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-f98d958045d5> in <module>
      2
      3 m = LogisticRegression()
----> 4 m.fit(X, y)

~/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py in fit(self, X, y, sample_weight)
   1345                                    order="C",
   1346                                    accept_large_sparse=solver != 'liblinear')
-> 1347         check_classification_targets(y)
   1348         self.classes_ = np.unique(y)
   1349

~/anaconda3/lib/python3.8/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
    181     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    182                       'multilabel-indicator', 'multilabel-sequences']:
--> 183         raise ValueError("Unknown label type: %r" % y_type)
    184
    185

ValueError: Unknown label type: 'continuous'
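As mentioned in the introduction, one workaround is to binarize \(y\) at an arbitrary threshold so that Scikit-Learn accepts it. A minimal sketch (remember that the choice of threshold changes the learned coefficients):

from sklearn.linear_model import LogisticRegression

y_binary = (y >= 0.5).astype(int)            # arbitrary 0.5 threshold

m = LogisticRegression(fit_intercept=False)  # no intercept, matching the Bradley-Terry setup below
m.fit(X, y_binary)
m.coef_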

7.3. Log loss

We can use gradient descent to learn the model (estimate the coefficients) on our own. The loss function we are trying to optimize is given as follows.

  • \(\hat{y} = Xw\)

  • \(L(w) = \dfrac{\sum_i \left( -y_i \hat{y}_i + \log\left(1 + \exp(\hat{y}_i)\right) \right)}{n}\)
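For context, this is just the binary cross-entropy (the negative Bernoulli log-likelihood), which stays well defined when \(y\) is a probability rather than a 0/1 label:

  • \(-\left[ y \log \sigma(\hat{y}) + (1 - y) \log(1 - \sigma(\hat{y})) \right] = -y\hat{y} + \log(1 + \exp(\hat{y}))\), where \(\sigma(z) = \dfrac{1}{1 + \exp(-z)}\)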

[4]:
from autograd.numpy import exp, log

def loss(w, X, y):
    """Log loss (binary cross-entropy), which works with a continuous y."""
    n = float(len(X))
    y_pred = np.dot(X, w)
    return np.sum(-(y_pred * y) + log(1.0 + exp(y_pred))) / n

loss_grad = grad(loss)       # autograd builds the gradient function for us
w = np.zeros(X.shape[1])     # start all coefficients at 0
alpha = 0.05                 # learning rate

for i in range(10_000):
    g = loss_grad(w, X, y)   # gradient at the current w (keep `loss` unshadowed)
    w = w - alpha * g        # gradient descent update

Here are the weights/coefficients learned. Notice there is no intercept estimation in the Bradley-Terry model; if we needed an intercept, we would add a column of 1s to X. These weights correspond to the rankings of the players (sort them in descending order to see who is on top; it should be Federer, Djokovic, Nadal, Wawrinka and then Murray), as shown in the small sketch after the output below.

[5]:
w
[5]:
array([ 0.75898769,  0.96401452, -0.94732759, -0.38448982, -0.39118479])
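To see the ranking explicitly, one way (a small sketch reusing the first five columns of df as player names) is to pair each coefficient with its player and sort in descending order:

# players in the same order as the columns of X
pd.Series(w, index=df.columns[:5]).sort_values(ascending=False)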

Do the following to calculate the probability of Federer beating Djokovic.

[6]:
np.exp(w[1] - w[0]) / (1 + np.exp(w[1] - w[0]))
[6]:
0.5510779076999136

Do the following to calculate the probability of Djokovic beating Federer.

[7]:
np.exp(w[0] - w[1]) / (1 + np.exp(w[0] - w[1]))
[7]:
0.4489220923000864

Compare the true \(y\) values to the predicted ones, \(\hat{y}\). Notice that Federer has beaten Murray 5-0, or 100%, but the model estimates an 87% chance of a win for Federer.

[8]:
import pandas as pd

pd.DataFrame({
    'y_true': y,
    'y_pred': 1 / (1 + np.exp(-X.dot(w)))
})
[8]:
     y_true    y_pred
0  0.600000  0.448922
1  0.823529  0.846358
2  0.818182  0.758318
3  0.571429  0.759542
4  1.000000  0.871170
5  0.833333  0.793885
6  0.777778  0.794978
7  0.333333  0.362891
8  0.500000  0.364440
9  0.571429  0.501674

7.4. MSE loss

Here’s another interesting thing to do: let’s define the loss function using Mean Squared Error (MSE) and learn the model with it.

  • \(\hat{y} = \dfrac{1}{1 + \exp(-Xw)}\)

  • \(L(w) = \dfrac{\sum_i (\hat{y}_i - y_i)^2}{n}\)

[9]:
def loss(w, X, y):
    """Mean squared error between the sigmoid of Xw and y."""
    y_pred = np.dot(X, w)
    y_pred = 1 / (1 + np.exp(-y_pred))   # sigmoid
    return ((y_pred - y) ** 2.0).mean()

loss_grad = grad(loss)
w = np.zeros(X.shape[1])
alpha = 0.05

for i in range(20_000):
    g = loss_grad(w, X, y)   # gradient, not the loss value
    w = w - alpha * g

Here are the weights. Sorting them gives the same player ranking as the log-loss fit (Federer, Djokovic, Nadal, Wawrinka, Murray), though the values differ.

[10]:
w
[10]:
array([ 0.8115388 ,  0.86375836, -0.87645027, -0.35292713, -0.44591976])

What’s the chance of Federer beating Djokovic?

[11]:
np.exp(w[1] - w[0]) / (1 + np.exp(w[1] - w[0]))
[11]:
0.5130519259636259

What’s the chance of Djokovic beating Federer?

[12]:
np.exp(w[0] - w[1]) / (1 + np.exp(w[0] - w[1]))
[12]:
0.486948074036374

Compare the \(y\) against \(\hat{y}\).

[13]:
pd.DataFrame({
    'y_true': y,
    'y_pred': 1 / (1 + np.exp(-X.dot(w)))
})
[13]:
     y_true    y_pred
0  0.600000  0.486948
1  0.823529  0.843960
2  0.818182  0.762143
3  0.571429  0.778588
4  1.000000  0.850714
5  0.833333  0.771480
6  0.777778  0.787459
7  0.333333  0.372029
8  0.500000  0.394000
9  0.571429  0.523231
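To compare the two fits quantitatively, we could score each set of predictions against \(y\) with, for example, the mean absolute error. A minimal sketch, assuming the two learned weight vectors were saved as w_logloss and w_mse (hypothetical names; the notebook above reuses w for both fits):

# hypothetical variables: w_logloss and w_mse hold the weights from sections 7.3 and 7.4
def mean_abs_error(w):
    y_pred = 1 / (1 + np.exp(-X.dot(w)))   # predicted win probabilities
    return np.mean(np.abs(y - y_pred))

mean_abs_error(w_logloss), mean_abs_error(w_mse)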