# 4. Logistic Regression on Probabilities

In typical uses of logistic regression, the dependent variable is encoded as 1 or 0. Sometimes, however, the dependent variable comes as probabilities. Scikit-Learn's `LogisticRegression` will refuse to fit when the dependent variable is not encoded as binary values.

What can we do? We may choose an arbitrary threshold (e.g. 0.5) and set all values greater than or equal to that threshold to 1 and the rest to 0. But the threshold is arbitrary. We could run additional analyses to find the best threshold, yet the learned model (the learned coefficients) will still change with whichever threshold we pick.
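A minimal sketch (with made-up probability values) shows how the binarized labels, and hence the coefficients a model would learn from them, shift with the threshold:

```python
import numpy as np

# Hypothetical probability-valued targets.
y_prob = np.array([0.82, 0.55, 0.48, 0.31, 0.60])

# Binarize at two different thresholds; the resulting labels disagree,
# so a model fit on each would learn different coefficients.
y_at_05 = (y_prob >= 0.5).astype(int)
y_at_06 = (y_prob >= 0.6).astype(int)

print(y_at_05)  # [1 1 0 0 1]
print(y_at_06)  # [1 0 0 0 1]
```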

You can actually perform logistic regression on a continuous dependent variable, but you will have to code up the algorithm to learn the coefficients yourself. When would you want to perform logistic regression on a continuous output? Take the Bradley-Terry model, which estimates paired ranking preferences. In these models, the design matrix `X` is regressed against continuous values to estimate rankings.

## 4.1. Data

This data comes from A. Agresti's book *An Introduction to Categorical Data Analysis*. The data is composed of head-to-head matches between 5 professional men's tennis players. For each row, \(n_{ij}\) is the number of wins and \(n_{ji}\) the number of losses for the player marked 1 against the player marked -1. The value \(y\) is derived as \(y = \dfrac{n_{ij}}{n_{ij} + n_{ji}}\). The design matrix `X` is just the data without \(n_{ij}\), \(n_{ji}\) and \(y\). The dependent variable \(y\) is continuous, and each value is a probability. If we perform logistic regression \(y \sim X\), the coefficients give us the rankings of the players.

```
[1]:
```

```
import autograd.numpy as np
import pandas as pd

from autograd import grad

df = pd.DataFrame([
    [1, -1, 0, 0, 0, 9, 6],
    [1, 0, -1, 0, 0, 14, 3],
    [1, 0, 0, -1, 0, 9, 2],
    [1, 0, 0, 0, -1, 4, 3],
    [0, 1, -1, 0, 0, 5, 0],
    [0, 1, 0, -1, 0, 5, 1],
    [0, 1, 0, 0, -1, 7, 2],
    [0, 0, 1, -1, 0, 2, 4],
    [0, 0, 1, 0, -1, 2, 2],
    [0, 0, 0, 1, -1, 4, 3]
], columns=['Djokovic', 'Federer', 'Murray', 'Nadal', 'Wawrinka', 'n_ij', 'n_ji'])

df['y'] = df.n_ij / (df.n_ij + df.n_ji)
df
```

```
[1]:
```

| | Djokovic | Federer | Murray | Nadal | Wawrinka | n_ij | n_ji | y |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -1 | 0 | 0 | 0 | 9 | 6 | 0.600000 |
| 1 | 1 | 0 | -1 | 0 | 0 | 14 | 3 | 0.823529 |
| 2 | 1 | 0 | 0 | -1 | 0 | 9 | 2 | 0.818182 |
| 3 | 1 | 0 | 0 | 0 | -1 | 4 | 3 | 0.571429 |
| 4 | 0 | 1 | -1 | 0 | 0 | 5 | 0 | 1.000000 |
| 5 | 0 | 1 | 0 | -1 | 0 | 5 | 1 | 0.833333 |
| 6 | 0 | 1 | 0 | 0 | -1 | 7 | 2 | 0.777778 |
| 7 | 0 | 0 | 1 | -1 | 0 | 2 | 4 | 0.333333 |
| 8 | 0 | 0 | 1 | 0 | -1 | 2 | 2 | 0.500000 |
| 9 | 0 | 0 | 0 | 1 | -1 | 4 | 3 | 0.571429 |

```
[2]:
```

```
X = df[[c for c in df.columns if c not in ['n_ij', 'n_ji', 'y']]].values
y = df.y.values
```

## 4.2. Scikit failure

We can see that Scikit-Learn cannot perform logistic regression when \(y\) is continuous.

```
[3]:
```

```
from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
m.fit(X, y)
```

```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-f98d958045d5> in <module>
2
3 m = LogisticRegression()
----> 4 m.fit(X, y)
~/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py in fit(self, X, y, sample_weight)
1345 order="C",
1346 accept_large_sparse=solver != 'liblinear')
-> 1347 check_classification_targets(y)
1348 self.classes_ = np.unique(y)
1349
~/anaconda3/lib/python3.8/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
181 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
182 'multilabel-indicator', 'multilabel-sequences']:
--> 183 raise ValueError("Unknown label type: %r" % y_type)
184
185
ValueError: Unknown label type: 'continuous'
```
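As a sanity check, the same estimator fits without complaint once the target is binarized (a sketch with a toy design matrix and made-up probabilities; as noted earlier, the coefficients then depend on the arbitrary threshold):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy design matrix and probability-valued target (hypothetical values).
X_toy = np.array([[1., -1., 0.], [1., 0., -1.], [0., 1., -1.], [-1., 1., 0.]])
y_prob = np.array([0.60, 0.82, 0.83, 0.40])

m = LogisticRegression()
m.fit(X_toy, (y_prob >= 0.5).astype(int))  # binarized target: fit succeeds
print(m.coef_.shape)  # (1, 3)
```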

## 4.3. Log loss

We can use gradient descent to learn the model (estimate the coefficients) on our own. The loss function we are trying to optimize is given as follows.

\(\hat{y} = Xw\)

\(L = \dfrac{1}{n} \sum_i \left( -y_i \hat{y}_i + \log(1 + \exp(\hat{y}_i)) \right)\)
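For reference, differentiating this loss with respect to \(w\) gives the familiar logistic-regression gradient that autograd computes for us (with \(\sigma\) denoting the sigmoid function):

\(\nabla_w L = \dfrac{1}{n} X^T \left( \sigma(Xw) - y \right)\)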

```
[4]:
```

```
from autograd.numpy import exp, log

def loss(w, X, y):
    n = float(len(X))
    y_pred = np.dot(X, w)
    return np.sum(-(y_pred * y) + log(1.0 + exp(y_pred))) / n

loss_grad = grad(loss)

w = np.array([0.0 for _ in range(X.shape[1])])
alpha = 0.05

# Gradient descent: step against the gradient of the loss.
for i in range(10_000):
    g = loss_grad(w, X, y)
    w = w - (g * alpha)
```

Here are the learned weights/coefficients. Notice there is no intercept in the Bradley-Terry model; if we needed an intercept, we would add a column of 1s to `X`. These weights correspond to the rankings of the players (sort them in descending order to see who is on top; it should be Federer, Djokovic, Nadal, Wawrinka and then Murray).

```
[5]:
```

```
w
```

```
[5]:
```

```
array([ 0.75898769, 0.96401452, -0.94732759, -0.38448982, -0.39118479])
```
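To read off the ranking, sort the players by their coefficients in descending order (a quick sketch using the fitted values above):

```python
import numpy as np

players = ['Djokovic', 'Federer', 'Murray', 'Nadal', 'Wawrinka']
w = np.array([0.75898769, 0.96401452, -0.94732759, -0.38448982, -0.39118479])

# argsort on -w orders the coefficients from largest to smallest.
ranking = [players[i] for i in np.argsort(-w)]
print(ranking)  # ['Federer', 'Djokovic', 'Nadal', 'Wawrinka', 'Murray']
```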

Do the following to calculate the probability of Federer beating Djokovic.

```
[6]:
```

```
np.exp(w[1] - w[0]) / (1 + np.exp(w[1] - w[0]))
```

```
[6]:
```

```
0.5510779076999136
```

Do the following to calculate the probability of Djokovic beating Federer.

```
[7]:
```

```
np.exp(w[0] - w[1]) / (1 + np.exp(w[0] - w[1]))
```

```
[7]:
```

```
0.4489220923000864
```
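Note that the two probabilities are complementary by construction, since \(\sigma(x) + \sigma(-x) = 1\). A quick check with the fitted coefficients:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Difference of the fitted Federer and Djokovic coefficients from above.
d = 0.96401452 - 0.75898769

p_federer = sigmoid(d)    # P(Federer beats Djokovic)
p_djokovic = sigmoid(-d)  # P(Djokovic beats Federer)

print(round(p_federer + p_djokovic, 12))  # 1.0
```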

Compare the true \(y\) values to the predicted ones \(\hat{y}\). Notice that Federer beat Murray 5-0, or 100%, but the model estimates an 87% chance of a win for Federer.

```
[8]:
```

```
import pandas as pd
pd.DataFrame({
'y_true': y,
'y_pred': 1 / (1 + np.exp(-X.dot(w)))
})
```

```
[8]:
```

| | y_true | y_pred |
|---|---|---|
| 0 | 0.600000 | 0.448922 |
| 1 | 0.823529 | 0.846358 |
| 2 | 0.818182 | 0.758318 |
| 3 | 0.571429 | 0.759542 |
| 4 | 1.000000 | 0.871170 |
| 5 | 0.833333 | 0.793885 |
| 6 | 0.777778 | 0.794978 |
| 7 | 0.333333 | 0.362891 |
| 8 | 0.500000 | 0.364440 |
| 9 | 0.571429 | 0.501674 |

## 4.4. MSE loss

Here’s another interesting experiment: define the loss function as the Mean Squared Error (MSE) and learn the model with it.

\(\hat{y} = \dfrac{1}{1 + \exp(-Xw)}\)

\(L = \dfrac{1}{n} \sum_i (\hat{y}_i - y_i)^2\)
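For reference, applying the chain rule (with \(\sigma\) denoting the sigmoid, so \(\hat{y} = \sigma(Xw)\), and \(\odot\) elementwise multiplication), the gradient autograd computes here is:

\(\nabla_w L = \dfrac{2}{n} X^T \left( (\hat{y} - y) \odot \hat{y} \odot (1 - \hat{y}) \right)\)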

```
[9]:
```

```
def loss(w, X, y):
    n = float(len(X))
    y_pred = np.dot(X, w)
    y_pred = 1 / (1 + np.exp(-y_pred))
    sq_err = (y_pred - y) ** 2.0
    return sq_err.mean(axis=None)

loss_grad = grad(loss)

w = np.array([0.0 for _ in range(X.shape[1])])
alpha = 0.05

# Gradient descent on the MSE loss.
for i in range(20_000):
    g = loss_grad(w, X, y)
    w = w - (g * alpha)
```

Here are the weights.

```
[10]:
```

```
w
```

```
[10]:
```

```
array([ 0.8115388 , 0.86375836, -0.87645027, -0.35292713, -0.44591976])
```

What’s the chance of Federer beating Djokovic?

```
[11]:
```

```
np.exp(w[1] - w[0]) / (1 + np.exp(w[1] - w[0]))
```

```
[11]:
```

```
0.5130519259636259
```

What’s the chance of Djokovic beating Federer?

```
[12]:
```

```
np.exp(w[0] - w[1]) / (1 + np.exp(w[0] - w[1]))
```

```
[12]:
```

```
0.486948074036374
```

Compare \(y\) against \(\hat{y}\).

```
[13]:
```

```
pd.DataFrame({
'y_true': y,
'y_pred': 1 / (1 + np.exp(-X.dot(w)))
})
```

```
[13]:
```

| | y_true | y_pred |
|---|---|---|
| 0 | 0.600000 | 0.486948 |
| 1 | 0.823529 | 0.843960 |
| 2 | 0.818182 | 0.762143 |
| 3 | 0.571429 | 0.778588 |
| 4 | 1.000000 | 0.850714 |
| 5 | 0.833333 | 0.771480 |
| 6 | 0.777778 | 0.787459 |
| 7 | 0.333333 | 0.372029 |
| 8 | 0.500000 | 0.394000 |
| 9 | 0.571429 | 0.523231 |