2. Regression Errors

Let’s talk about errors in regression problems. Typically, in regression, we have a variable $y$ that we want a model to learn to predict. The prediction from the model is usually denoted as $\hat{y}$. The error $e$ is thus defined as follows.

$e = y - \hat{y}$

Since we have many pairs of the truth $y$ and the prediction $\hat{y}$, we want to average over the differences. I will denote this error as the Mean Error ME.

$\mathrm{ME} = \frac{1}{n} \sum (y - \hat{y})$

The problem with ME is that averaging over the differences may result in something close to zero, because the positive and negative differences cancel each other out. For this reason, ME is rarely used to judge the accuracy of a regression model.
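To make the cancelling effect concrete, here is a minimal sketch using made-up numbers (the arrays below are hypothetical and are not the example used later in this section). Every prediction but one is off by 5, yet the mean error comes out as exactly zero.

```python
import numpy as np

# hypothetical ground truth and predictions, chosen so that the
# over- and under-predictions balance out exactly
y_true = np.array([10, 8, 7, 9, 4])
y_pred = np.array([15, 3, 12, 4, 4])

e = y_true - y_pred   # per-pair errors: [-5, 5, -5, 5, 0]
me = e.mean()         # the positive and negative errors cancel

print(e)
print(me)
```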

A better way is to consider the Mean Absolute Error MAE, where we take the average of the absolute differences.

$\mathrm{MAE} = \frac{1}{n} \sum |y - \hat{y}|$

In MAE, since $|y - \hat{y}|$ is never negative, we avoid the cancelling effect of positive and negative values when averaging. Often, data scientists also want to penalize more heavily a model whose predictions are far from the truth. In that case, the Root Mean Squared Error RMSE is used.

$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum (y - \hat{y})^2}$

In RMSE, we take neither the raw difference as in ME nor the absolute difference as in MAE; instead, we square the difference. The idea is that when a model’s prediction is far from the truth, we should exaggerate the consequences: squaring makes the penalty grow quadratically, so an error twice as large counts four times as much. However, the squared difference is no longer in the unit of $y$, so we take the square root to bring the value back into the unit of $y$.
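The effect of squaring can be seen in a small sketch. The two error vectors below are hypothetical: both have the same MAE, but in the second one all of the error is concentrated in a single large miss, which RMSE penalizes much more heavily. The helper names mae and rmse are introduced here only for illustration.

```python
import numpy as np

def mae(e):
    """Mean absolute error from an array of errors y - y_hat."""
    return np.mean(np.abs(e))

def rmse(e):
    """Root mean squared error from an array of errors y - y_hat."""
    return np.sqrt(np.mean(np.square(e)))

# hypothetical errors: five small misses vs. one large miss
e_spread = np.array([2.0, -2.0, 2.0, -2.0, 2.0])
e_single = np.array([0.0, 0.0, 0.0, 0.0, 10.0])

print(mae(e_spread), rmse(e_spread))  # MAE and RMSE are both 2.0
print(mae(e_single), rmse(e_single))  # MAE is still 2.0, RMSE is sqrt(20), about 4.47
```

When the errors all have the same size, RMSE equals MAE; the more uneven the errors are, the further RMSE rises above MAE.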

For all these measures of performance, the closer the value is to zero, the better.

Let’s look at the following made-up example where a hypothetical model has made some predictions $\hat{y}$ or y_pred, and for each of these predictions we have the ground truth $y$ or y_true.

[1]:

import pandas as pd

df = pd.DataFrame({
    'y_true': [10, 8, 7, 9, 4],
    'y_pred': [11, 7, 5, 11, 1]
})

df

[1]:

	y_true	y_pred
0	10	11
1	8	7
2	7	5
3	9	11
4	4	1

We will now compute the error E, absolute error AE and squared error SE for each pair.

[2]:

import numpy as np

df['E'] = df.y_true - df.y_pred
df['AE'] = np.abs(df.y_true - df.y_pred)
df['SE'] = np.power(df.y_true - df.y_pred, 2.0)

df

[2]:

	y_true	y_pred	E	AE	SE
0	10	11	-1	1	1.0
1	8	7	1	1	1.0
2	7	5	2	2	4.0
3	9	11	-2	2	4.0
4	4	1	3	3	9.0

From E, AE and SE, we can compute ME, MAE and RMSE, respectively, by averaging each column and then taking the square root of the mean squared error.

[3]:

errors = df[['E', 'AE', 'SE']].mean()
# take the square root of the mean squared error, updating the entry by label
errors['SE'] = np.sqrt(errors['SE'])

errors.index = ['ME', 'MAE', 'RMSE']
errors

[3]:

ME      0.600000
MAE     1.800000
RMSE    1.949359
dtype: float64

As you can see, these measures of error say different things and may lead you to different conclusions. We know ME is defective, so we will not interpret it. MAE says we can expect to be about 1.8 off from the truth, while RMSE says we can expect to be about 1.95 off. RMSE is the larger of the two because squaring gives more weight to the bigger misses. On their own, though, neither number tells us whether the model is good or bad; that depends on the scale of $y$.
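As a cross-check, the three measures can also be computed directly from their formulas without the intermediate columns. The helper name regression_errors below is hypothetical, introduced only for illustration; this is a sketch using NumPy on the same made-up data as above.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return ME, MAE and RMSE for paired truths and predictions."""
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return {
        'ME': e.mean(),
        'MAE': np.abs(e).mean(),
        'RMSE': np.sqrt(np.square(e).mean()),
    }

# same made-up data as in the example above
result = regression_errors([10, 8, 7, 9, 4], [11, 7, 5, 11, 1])
print(result)
```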

One thing we can try to do is to normalize these values. Let’s just look at RMSE. Here are some ways we can normalize RMSE.

using the mean of y, denoted as $\bar{y}$
using the standard deviation of y, denoted as $\sigma_y$
using the range of y, denoted as $y_{\mathrm{max}} - y_{\mathrm{min}}$
using the interquartile range of y, denoted as $Q_y^3 - Q_y^1$

In the code, these quantities are named as follows.

$\bar{y}$ is me_y
$\sigma_y$ is sd_y
$y_{\mathrm{max}} - y_{\mathrm{min}}$ is ra_y
$Q_y^3 - Q_y^1$ is iq_y

Since these are used to divide RMSE, let’s group them in a series called denominators.

[4]:

from scipy.stats import iqr

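me_y = df.y_true.mean()                   # mean of y
sd_y = df.y_true.std()                    # sample standard deviation (pandas default, ddof=1)
ra_y = df.y_true.max() - df.y_true.min()  # range of y
iq_y = iqr(df.y_true)                     # interquartile range, Q3 - Q1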

denominators = pd.Series([me_y, sd_y, ra_y, iq_y], index=['me_y', 'sd_y', 'ra_y', 'iq_y'])
denominators

[4]:

me_y    7.600000
sd_y    2.302173
ra_y    6.000000
iq_y    2.000000
dtype: float64

Here are the results of normalizing RMSE with the mean me, standard deviation sd, range ra and interquartile range iq.

[5]:

pd.DataFrame([{
    r'$\mathrm{RMSE}_{\mathrm{me}}$': errors.RMSE / denominators.me_y,
    r'$\mathrm{RMSE}_{\mathrm{sd}}$': errors.RMSE / denominators.sd_y,
    r'$\mathrm{RMSE}_{\mathrm{ra}}$': errors.RMSE / denominators.ra_y,
    r'$\mathrm{RMSE}_{\mathrm{iq}}$': errors.RMSE / denominators.iq_y,
}]).T.rename(columns={0: 'values'})

[5]:

	values
$\mathrm{RMSE}_{\mathrm{me}}$	0.256495
$\mathrm{RMSE}_{\mathrm{sd}}$	0.846747
$\mathrm{RMSE}_{\mathrm{ra}}$	0.324893
$\mathrm{RMSE}_{\mathrm{iq}}$	0.974679
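
The four normalizations can be wrapped into one function. This is a sketch under stated assumptions: the name normalized_rmse and its method argument are hypothetical, the standard deviation uses ddof=1 to match the pandas default, and the interquartile range is taken from np.percentile with its default linear interpolation.

```python
import numpy as np

def normalized_rmse(y_true, y_pred, method='sd'):
    """Divide RMSE by a measure of the scale of y_true.

    method: 'me' (mean), 'sd' (sample standard deviation),
            'ra' (range) or 'iq' (interquartile range).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

    if method == 'me':
        denominator = y_true.mean()
    elif method == 'sd':
        denominator = y_true.std(ddof=1)
    elif method == 'ra':
        denominator = y_true.max() - y_true.min()
    elif method == 'iq':
        q1, q3 = np.percentile(y_true, [25, 75])
        denominator = q3 - q1
    else:
        raise ValueError(f'unknown method: {method}')

    return rmse / denominator

# same made-up data as in the example above
y_true = [10, 8, 7, 9, 4]
y_pred = [11, 7, 5, 11, 1]
for method in ['me', 'sd', 'ra', 'iq']:
    print(method, normalized_rmse(y_true, y_pred, method))
```

Because the denominators differ, the four normalized values are not directly comparable with each other; it is probably best to pick one normalization and use it consistently when comparing models.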