# 2. Pseudo r-squared for logistic regression

In ordinary least squares (OLS) regression, the $$R^2$$ statistic measures the amount of variance explained by the regression model. The value of $$R^2$$ ranges in $$[0, 1]$$, with larger values indicating that more variance is explained by the model (higher is better). For OLS regression, $$R^2$$ is defined as follows.

$$R^2 = 1 - \frac{ \sum (y_i - \hat{y}_i)^2 }{ \sum (y_i - \bar{y})^2 }$$

where

• $$y_i$$ is the i-th observed value

• $$\hat{y}_i$$ is the i-th predicted value

• $$\bar{y}$$ is the mean of $$y$$
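As a quick illustration, the OLS $$R^2$$ above can be computed directly with NumPy (a minimal sketch; the data values are made up).

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])      # observed values
y_hat = np.array([2.8, 5.1, 7.2, 8.9])  # predicted values

ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1.0 - ss_res / ss_tot              # ≈ 0.995 for these values
```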

The three main ways to interpret $$R^2$$ are as follows.

• explained variability: how much of the variability in the data is explained by the model

• goodness-of-fit: how well the model fits the data

• correlation: the correlation between the predictions and true values

For logistic regression, many pseudo-$$R^2$$ measures have been proposed. A non-exhaustive list is shown below.

• Efron’s $$R^2$$

• McFadden’s $$R^2$$

• McFadden’s Adjusted $$R^2$$

• Cox & Snell $$R^2$$

• Nagelkerke/Cragg & Uhler’s $$R^2$$

• McKelvey & Zavoina $$R^2$$

• Count $$R^2$$

• Adjusted Count $$R^2$$

In this notebook, we show how to compute some of these pseudo-$$R^2$$ measures. We will not compute those based on the raw likelihood (Cox & Snell and Nagelkerke/Cragg & Uhler), since the raw likelihood may lead to underflow. Note the following.

• these pseudo-$$R^2$$ values generally lie in $$[0, 1]$$, with values closer to 1 indicating a better fit

• Daniel McFadden suggested that a pseudo-$$R^2$$ in the range of 0.2 to 0.4 represents an excellent fit

• McFadden’s $$R^2$$ can, however, be negative

• these pseudo-$$R^2$$ values may be wildly different from one another

• these pseudo-$$R^2$$ values cannot be interpreted like OLS $$R^2$$

## 2.1. Simulate data

Here, we simulate data for logistic regression analysis. The functional form looks like the following.

$$\log \frac{p}{1-p} = 1.0 + 2.0 x_1 + 3.0 x_2$$

[1]:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from numpy.random import binomial, normal
from scipy.stats import bernoulli, binom

np.random.seed(37)
sns.set(color_codes=True)

n = 10000
X = np.hstack([
    np.array([1 for _ in range(n)]).reshape(n, 1),
    normal(0.0, 1.0, n).reshape(n, 1),
    normal(0.0, 1.0, n).reshape(n, 1)
])
z = np.dot(X, np.array([1.0, 2.0, 3.0])) + normal(0.0, 1.0, n)
p = 1.0 / (1.0 + np.exp(-z))
y = binom.rvs(1, p)


## 2.2. Learn the model parameters

Now, we use Scikit-learn’s logistic regression solver to learn the model parameters (weights/coefficients). Note that we introduced noise during the simulation, so the coefficients of the learned model do not exactly recover those of the generating model.

[2]:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(fit_intercept=False, solver='lbfgs')
lr.fit(X, y)

w = np.array(lr.coef_).transpose()
y_pred = lr.predict_proba(X)[:, 1]

print(lr.coef_)

[[0.89312519 1.71445566 2.59091944]]


## 2.3. Performance

### 2.3.1. Efron’s $$R^2$$

$$R^2 = 1 - \frac{\sum (y_i - \pi_i)^2}{\sum (y_i - \bar{y})^2}$$

• $$y_i$$ is the i-th outcome label (e.g. 1 or 0)

• $$\pi_i$$ is the i-th predicted outcome probability

• $$\bar{y}$$ is the mean of the observed outcomes $$y = [y_1, \ldots, y_n]$$, i.e. $$\bar{y} = \frac{\sum y_i}{n}$$

[3]:

def efron_rsquare(y, y_pred):
    n = float(len(y))
    t1 = np.sum(np.power(y - y_pred, 2.0))
    t2 = np.sum(np.power(y - (np.sum(y) / n), 2.0))
    return 1.0 - (t1 / t2)

[4]:

efron_rsquare(y, y_pred)

[4]:

0.5513983981594343


### 2.3.2. McFadden’s $$R^2$$

$$R^2 = 1 - \frac{\ln \hat{L}_{full}}{\ln \hat{L}_{null}}$$

• $$\hat{L}_{full}$$ is the estimated likelihood of the full model

• $$\hat{L}_{null}$$ is the estimated likelihood of the null model (model with only intercept)

[5]:

def full_log_likelihood(w, X, y):
    score = np.dot(X, w).reshape(1, X.shape[0])
    return np.sum(-np.log(1 + np.exp(score))) + np.sum(y * score)

def null_log_likelihood(w, X, y):
    # zero out all weights except the intercept
    z = np.array([w if i == 0 else 0.0
                  for i, w in enumerate(w.reshape(1, X.shape[1])[0])]).reshape(X.shape[1], 1)
    score = np.dot(X, z).reshape(1, X.shape[0])
    return np.sum(-np.log(1 + np.exp(score))) + np.sum(y * score)

def mcfadden_rsquare(w, X, y):
    return 1.0 - (full_log_likelihood(w, X, y) / null_log_likelihood(w, X, y))

def mcfadden_adjusted_rsquare(w, X, y):
    k = float(X.shape[1])
    return 1.0 - ((full_log_likelihood(w, X, y) - k) / null_log_likelihood(w, X, y))

[6]:

mcfadden_rsquare(w, X, y)

[6]:

0.5173841067369636
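As a cross-check on the log-likelihood functions above, scikit-learn’s `log_loss` returns the mean negative log-likelihood of the predicted probabilities, so multiplying by $$-n$$ should recover the same full-model log-likelihood. A sketch on freshly simulated data (variable names here are illustrative):

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(37)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
w = np.array([1.0, 2.0, 3.0])
p = 1.0 / (1.0 + np.exp(-X @ w))
y = rng.binomial(1, p)

# log-likelihood computed from the linear score, as in full_log_likelihood
score = X @ w
ll = np.sum(-np.log1p(np.exp(score))) + np.sum(y * score)

# log_loss is the mean negative log-likelihood of the probabilities
ll_from_sklearn = -n * log_loss(y, p)

assert np.isclose(ll, ll_from_sklearn)
```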


### 2.3.3. McFadden’s Adjusted $$R^2$$

$$R^2 = 1 - \frac{\ln \hat{L}_{full} - K}{\ln \hat{L}_{null}}$$

• $$\hat{L}_{full}$$ is the estimated likelihood of the full model

• $$\hat{L}_{null}$$ is the estimated likelihood of the null model (model with only intercept)

• $$K$$ is the number of parameters (e.g. number of covariates associated with non-zero coefficients)

[7]:

mcfadden_adjusted_rsquare(w, X, y)

[7]:

0.5169598888329648


### 2.3.4. McKelvey & Zavoina $$R^2$$

$$R^2 = \frac{ \sigma^2(\hat{y}) }{ \sigma^2(\hat{y}) + \frac{\pi^2}{3} }$$

• $$\sigma^2(\hat{y})$$ is the variance of the predictions; McKelvey & Zavoina originally define $$\hat{y}$$ as the latent linear predictor $$X\hat{\beta}$$, though predicted probabilities are used here

[8]:

def mz_rsquare(y_pred):
    return np.var(y_pred) / (np.var(y_pred) + (np.power(np.pi, 2.0) / 3.0))

[9]:

mz_rsquare(y_pred)

[9]:

0.03882393736712112
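For comparison, the latent-scale form of McKelvey & Zavoina’s $$R^2$$ uses the variance of the linear predictor $$X\hat{\beta}$$, with $$\pi^2/3$$ being the variance of the standard logistic error term. A hedged sketch of that variant on simulated data (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(37)
n = 10000
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
w = np.array([1.0, 2.0, 3.0])

# latent-scale variant: variance of the linear predictor X @ w;
# pi^2 / 3 is the variance of the standard logistic error term
y_star = X @ w
r2_mz_latent = np.var(y_star) / (np.var(y_star) + np.pi ** 2 / 3.0)
```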


### 2.3.5. Count $$R^2$$

$$R^2=\frac{C}{T}$$

• $$C$$ is the total number of correctly classified observations, treating a predicted probability below 0.5 as a 0 and at or above 0.5 as a 1

• $$T$$ is the total number of observations

[10]:

def get_num_correct(y, y_pred, t=0.5):
    y_correct = np.array([0.0 if p < t else 1.0 for p in y_pred])
    return sum([1.0 for p, p_pred in zip(y, y_correct) if p == p_pred])

def count_rsquare(y, y_pred, t=0.5):
    n = float(len(y))
    num_correct = get_num_correct(y, y_pred, t)
    return num_correct / n

[11]:

count_rsquare(y, y_pred)

[11]:

0.8469
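Count $$R^2$$ is simply classification accuracy at the 0.5 threshold, so it can also be cross-checked against scikit-learn’s `accuracy_score`. A small sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.4, 0.1, 0.9])  # predicted probabilities

y_hat = (y_prob >= 0.5).astype(int)  # threshold at 0.5
count_r2 = np.mean(y == y_hat)       # fraction correctly classified, 0.8 here

assert np.isclose(count_r2, accuracy_score(y, y_hat))
```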


### 2.3.6. Adjusted count $$R^2$$

$$R^2=\frac{C - n}{T - n}$$

• $$C$$ is the total number of correctly classified observations, treating a predicted probability below 0.5 as a 0 and at or above 0.5 as a 1

• $$T$$ is the total number of observations

• $$n$$ is the count of the most frequent outcome

[12]:

def get_count_most_freq_outcome(y):
    num_0 = 0
    num_1 = 0
    for p in y:
        if p == 1.0:
            num_1 += 1
        else:
            num_0 += 1
    return float(max(num_0, num_1))

def count_adjusted_rsquare(y, y_pred, t=0.5):
    correct = get_num_correct(y, y_pred, t)
    total = float(len(y))
    n = get_count_most_freq_outcome(y)
    return (correct - n) / (total - n)

[13]:

count_adjusted_rsquare(y, y_pred)

[13]:

0.6243866535819431


### 2.3.7. Comparison with other model performance metrics

Here, we show other performance metrics to see how these might compare to pseudo-$$R^2$$.

[14]:

from sklearn.metrics import average_precision_score, brier_score_loss, \
    log_loss, roc_auc_score
from sklearn.metrics import explained_variance_score, mean_absolute_error, \
    mean_squared_error, median_absolute_error, r2_score

m_names = [
    'average_precision_score', 'brier_score_loss',
    'log_loss', 'roc_auc_score',
    'explained_variance_score', 'mean_absolute_error',
    'mean_squared_error', 'median_absolute_error', 'r2_score'
]
metrics = [average_precision_score, brier_score_loss, log_loss, roc_auc_score,
           explained_variance_score, mean_absolute_error, mean_squared_error,
           median_absolute_error, r2_score]
for n, m in zip(m_names, metrics):
    print('{:.5f} : {}'.format(m(y, y_pred), n))


0.94774 : average_precision_score
0.10832 : brier_score_loss
0.34130 : log_loss
0.92479 : roc_auc_score
0.55140 : explained_variance_score
0.21691 : mean_absolute_error
0.10832 : mean_squared_error
0.11089 : median_absolute_error
0.55140 : r2_score
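Note that `r2_score` (and `explained_variance_score`) applied to the predicted probabilities has the same form as Efron’s $$R^2$$, which is why the value 0.55140 appears in both places above. A quick check on synthetic values (names are illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(37)
y = rng.binomial(1, 0.5, size=100).astype(float)
y_prob = np.clip(y * 0.7 + rng.uniform(0.0, 0.3, size=100), 0.0, 1.0)

# Efron's R^2 is the OLS R^2 formula with probabilities as predictions
efron = 1.0 - np.sum((y - y_prob) ** 2) / np.sum((y - y.mean()) ** 2)

assert np.isclose(efron, r2_score(y, y_prob))
```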