2. Regression Errors
Let’s talk about errors in regression problems. Typically, in regression, we have a variable \(y\) that we want a model to predict. The model’s prediction is usually denoted as \(\hat{y}\). The error \(e\) is then defined as follows.
\(e = y - \hat{y}\)
Since we have many pairs of the truth \(y\) and the prediction \(\hat{y}\), we want to average over the differences, where \(n\) is the number of pairs. I will denote this error as the Mean Error (ME).
\(\mathrm{ME} = \frac{1}{n} \sum (y - \hat{y})\)
The problem with ME is that averaging over the differences may result in something close to zero, because the positive and negative differences cancel each other out. For this reason, ME is rarely used on its own to judge the error of a regression model.
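A minimal sketch of the cancelling effect, using made-up numbers that are not part of the example later in this section: every prediction below is off by 5, yet ME comes out as zero.

```python
import numpy as np

# Hypothetical values chosen so that over- and under-predictions balance out.
y_true = np.array([10.0, 10.0, 10.0, 10.0])
y_pred = np.array([15.0, 5.0, 15.0, 5.0])

# Signed errors: -5, 5, -5, 5
errors = y_true - y_pred

# The positive and negative errors cancel when averaged.
me = errors.mean()

print(errors)
print(me)
```

A model that is always wrong by 5 would look perfect if judged by ME alone.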
A better way is to consider the Mean Absolute Error (MAE), where we take the average of the absolute differences.
\(\mathrm{MAE} = \frac{1}{n} \sum |y - \hat{y}|\)
In MAE, since \(|y - \hat{y}|\) is never negative, we avoid the cancelling effect of positive and negative values when averaging. Often, data scientists want to penalize predictions that are far from the truth more heavily. In that case, the Root Mean Squared Error (RMSE) is used.
\(\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum (y - \hat{y})^2}\)
In RMSE, we take neither the raw difference as in ME nor the absolute difference as in MAE; instead, we square the difference. The idea is that when a model’s prediction is far from the truth, we should exaggerate the consequences, reflecting the view that a large miss is disproportionately worse than a small one. However, squaring the difference produces a value that is no longer in the unit of \(y\), so we take the square root to bring the value back into the unit of \(y\).
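To illustrate how squaring penalizes large misses, here is a small sketch with two hypothetical sets of predictions. The helper names `mae` and `rmse` and all the numbers are made up for this illustration. Both sets have the same MAE, but the one with a single large miss has the larger RMSE.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([10.0, 10.0, 10.0, 10.0])
pred_a = np.array([12.0, 8.0, 12.0, 8.0])    # every prediction off by 2
pred_b = np.array([10.0, 10.0, 10.0, 18.0])  # one prediction off by 8

print(mae(y_true, pred_a), rmse(y_true, pred_a))  # 2.0 2.0
print(mae(y_true, pred_b), rmse(y_true, pred_b))  # 2.0 4.0
```

MAE treats the two sets as equally good, while RMSE ranks the set with the single large miss as worse.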
For all these measures of performance, the closer the value is to zero, the better.
Let’s look at the following made-up example where a hypothetical model has made some predictions \(\hat{y}\), stored as `y_pred`, and for each of these predictions we have the ground truth \(y\), stored as `y_true`.
[1]:
import pandas as pd

# Ground truth and model predictions for five made-up observations
df = pd.DataFrame({
    'y_true': [10, 8, 7, 9, 4],
    'y_pred': [11, 7, 5, 11, 1]
})
df
[1]:
| | y_true | y_pred |
|---|---|---|
| 0 | 10 | 11 |
| 1 | 8 | 7 |
| 2 | 7 | 5 |
| 3 | 9 | 11 |
| 4 | 4 | 1 |
We will now compute the error `E`, absolute error `AE` and squared error `SE` for each pair.
[2]:
import numpy as np
df['E'] = df.y_true - df.y_pred
df['AE'] = np.abs(df.y_true - df.y_pred)
df['SE'] = np.power(df.y_true - df.y_pred, 2.0)
df
[2]:
| | y_true | y_pred | E | AE | SE |
|---|---|---|---|---|---|
| 0 | 10 | 11 | -1 | 1 | 1.0 |
| 1 | 8 | 7 | 1 | 1 | 1.0 |
| 2 | 7 | 5 | 2 | 2 | 4.0 |
| 3 | 9 | 11 | -2 | 2 | 4.0 |
| 4 | 4 | 1 | 3 | 3 | 9.0 |
From E, AE and SE, we can compute ME, MAE and RMSE, respectively: ME and MAE are the means of E and AE, while RMSE is the square root of the mean of SE.
[3]:
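# Column means give ME, MAE and the mean squared error (MSE)
errors = df[['E', 'AE', 'SE']].mean()
# Take the square root of the MSE to obtain RMSE; the label 'SE' is case-sensitive
errors['SE'] = np.sqrt(errors['SE'])
errors.index = ['ME', 'MAE', 'RMSE']
errors
[3]:
ME 0.600000
MAE 1.800000
RMSE 1.949359
dtype: float64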
As you can see, these measures of error report different numbers, and it is hard to tell from any one of them whether the model is good or bad. We know ME is defective, so we will not interpret it. MAE says we can expect to be 1.8 off from the truth, while RMSE says we can expect to be about 1.95 off (the square root of the mean squared error, 3.8). RMSE is never smaller than MAE, and the gap widens when a few predictions are far off. Still, neither value is meaningful without the scale of \(y\): being 1.8 off may be tolerably good when \(y\) spans thousands and bad when \(y\) spans single digits.
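As a cross-check, the three summary values can be recomputed straight from the `y_true` and `y_pred` columns with NumPy alone. This is a sketch rather than part of the original notebook, and the variable names are illustrative. Note that the mean of the squared errors is 3.8, and RMSE is its square root.

```python
import numpy as np

# Same values as the y_true and y_pred columns in the example data.
y_true = np.array([10, 8, 7, 9, 4])
y_pred = np.array([11, 7, 5, 11, 1])

diff = y_true - y_pred

me = diff.mean()           # mean of the signed errors
mae = np.abs(diff).mean()  # mean of the absolute errors
mse = (diff ** 2).mean()   # mean of the squared errors, in squared units of y
rmse = np.sqrt(mse)        # back in the units of y

print(f'ME={me:.3f} MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f}')
```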
One thing we can try to do is to normalize these values. Let’s just look at RMSE. Here are some ways we can normalize RMSE.
- using the mean of y, denoted as \(\bar{y}\)
- using the standard deviation of y, denoted as \(\sigma_y\)
- using the range of y, denoted as \(y_{\mathrm{max}} - y_{\mathrm{min}}\)
- using the interquartile range of y, denoted as \(Q_y^3 - Q_y^1\)
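Written out, each option divides RMSE by the chosen statistic of y. The subscripts me, sd, ra and iq are shorthand labels for the four denominators, and the interquartile range is taken as the third quartile minus the first.

```latex
\begin{aligned}
\mathrm{RMSE}_{\mathrm{me}} &= \frac{\mathrm{RMSE}}{\bar{y}} \\
\mathrm{RMSE}_{\mathrm{sd}} &= \frac{\mathrm{RMSE}}{\sigma_y} \\
\mathrm{RMSE}_{\mathrm{ra}} &= \frac{\mathrm{RMSE}}{y_{\mathrm{max}} - y_{\mathrm{min}}} \\
\mathrm{RMSE}_{\mathrm{iq}} &= \frac{\mathrm{RMSE}}{Q_y^3 - Q_y^1}
\end{aligned}
```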
In the code below, these are named as follows.
- \(\bar{y}\) is `me_y`
- \(\sigma_y\) is `sd_y`
- \(y_{\mathrm{max}} - y_{\mathrm{min}}\) is `ra_y`
- \(Q_y^3 - Q_y^1\) is `iq_y`
Since these are used to divide RMSE, let’s group them under a series named `denominators`.
[4]:
from scipy.stats import iqr
me_y = df.y_true.mean()
sd_y = df.y_true.std()
ra_y = df.y_true.max() - df.y_true.min()
iq_y = iqr(df.y_true)
denominators = pd.Series([me_y, sd_y, ra_y, iq_y], index=['me_y', 'sd_y', 'ra_y', 'iq_y'])
denominators
[4]:
me_y 7.600000
sd_y 2.302173
ra_y 6.000000
iq_y 2.000000
dtype: float64
Here are the results of normalizing RMSE with the mean `me`, standard deviation `sd`, range `ra` and interquartile range `iq`.
[5]:
pd.DataFrame([{
r'$\mathrm{RMSE}_{\mathrm{me}}$': errors.RMSE / denominators.me_y,
r'$\mathrm{RMSE}_{\mathrm{sd}}$': errors.RMSE / denominators.sd_y,
r'$\mathrm{RMSE}_{\mathrm{ra}}$': errors.RMSE / denominators.ra_y,
r'$\mathrm{RMSE}_{\mathrm{iq}}$': errors.RMSE / denominators.iq_y,
}]).T.rename(columns={0: 'values'})
[5]:
| | values |
|---|---|
| $\mathrm{RMSE}_{\mathrm{me}}$ | 0.256495 |
| $\mathrm{RMSE}_{\mathrm{sd}}$ | 0.846747 |
| $\mathrm{RMSE}_{\mathrm{ra}}$ | 0.324893 |
| $\mathrm{RMSE}_{\mathrm{iq}}$ | 0.974679 |
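The four normalizations can also be wrapped in one reusable function. The sketch below assumes NumPy only; `normalized_rmse` is a hypothetical helper name, not something defined elsewhere in this notebook. Here RMSE is the square root of the mean squared error (about 1.949 for the example data), the standard deviation is the sample standard deviation as in pandas, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

def normalized_rmse(y_true, y_pred):
    """Return RMSE divided by the mean, std, range and IQR of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

    # Third and first quartiles of the ground truth.
    q3, q1 = np.percentile(y_true, [75, 25])

    return {
        'me': rmse / y_true.mean(),
        # ddof=1 gives the sample standard deviation, like pandas' Series.std().
        'sd': rmse / y_true.std(ddof=1),
        'ra': rmse / (y_true.max() - y_true.min()),
        'iq': rmse / (q3 - q1),
    }

result = normalized_rmse([10, 8, 7, 9, 4], [11, 7, 5, 11, 1])
for name, value in result.items():
    print(f'{name}: {value:.6f}')
```

Because each denominator measures the scale of y differently, the four normalized values are not comparable with one another; pick one normalization and use it consistently when comparing models.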