# 14. Regression Errors

Let’s talk about errors in regression problems. Typically, in regression, we have a variable \(y\) that we want a model to predict. The model's prediction is usually denoted \(\hat{y}\). The error \(e\) is then defined as follows.

\(e = y - \hat{y}\)

Since we have many pairs of the truth \(y\) and the prediction \(\hat{y}\), we want to average over the differences. I will denote this average as the Mean Error `ME`.

\(\mathrm{ME} = \frac{1}{n} \sum (y - \hat{y})\)

The problem with ME is that averaging over the differences may result in something close to zero, because the positive and negative differences cancel each other out. No one really computes the error of a regression model in this way.
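To make the cancelling effect concrete, here is a minimal sketch with made-up numbers. The values and the variable names are assumptions chosen purely for illustration: every prediction misses by 5, yet the signed average is exactly zero.

```python
# Hypothetical values chosen so that the misses cancel exactly.
y_true = [10, 10, 10, 10]
y_pred = [15, 5, 15, 5]

diffs = [t - p for t, p in zip(y_true, y_pred)]   # [-5, 5, -5, 5]

me = sum(diffs) / len(diffs)                      # signed differences cancel
mean_size = sum(abs(d) for d in diffs) / len(diffs)  # average size of a miss

print(f"ME = {me}")                  # ME = 0.0
print(f"mean miss size = {mean_size}")  # mean miss size = 5.0
```

A model that is never right still gets a perfect ME here, which is why ME on its own says little about accuracy.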

A better way is to consider the Mean Absolute Error `MAE`, where we take the average of the absolute differences.

\(\mathrm{MAE} = \frac{1}{n} \sum |y - \hat{y}|\)

In MAE, since \(|y - \hat{y}|\) is never negative, we avoid the cancelling effect of positive and negative values when averaging. Often, data scientists want to punish models more heavily for predictions that are further from the truth. In that case, the Root Mean Squared Error `RMSE` is used.

\(\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum (y - \hat{y})^2}\)

In RMSE, we take neither the raw difference as in ME nor the absolute difference as in MAE; instead, we square the difference. The idea is that squaring exaggerates large misses, so a prediction that is far from the truth is penalized disproportionately more than one that is close. However, the squared difference is no longer in the unit of \(y\), so we take the square root to bring the value back into the same unit as \(y\).
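The following sketch illustrates how squaring changes the picture. The two error vectors are hypothetical and were picked so that both have the same MAE; the helper names `mae` and `rmse` are likewise just illustrative.

```python
import math

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical error vectors with the same total absolute error (8).
even_errors = [2, 2, 2, 2]    # consistently a little off
spiky_errors = [0, 0, 0, 8]   # usually exact, once far off

print(mae(even_errors), rmse(even_errors))    # 2.0 2.0
print(mae(spiky_errors), rmse(spiky_errors))  # 2.0 4.0
```

MAE cannot tell the two apart, while RMSE doubles for the vector that contains the single large miss.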

For all these measures of performance, the closer the value is to zero, the better.

Let’s look at the following made-up example where a hypothetical model has made some predictions \(\hat{y}\) or `y_pred`, and for each of these predictions, we have the ground truth \(y\) or `y_true`.

```
import pandas as pd

# Ground truth and model predictions for five made-up observations.
df = pd.DataFrame({
    'y_true': [10, 8, 7, 9, 4],
    'y_pred': [11, 7, 5, 11, 1]
})
df
```

| | y_true | y_pred |
|---|---|---|
| 0 | 10 | 11 |
| 1 | 8 | 7 |
| 2 | 7 | 5 |
| 3 | 9 | 11 |
| 4 | 4 | 1 |

We will now compute the error `E`, absolute error `AE` and squared error `SE` for each pair.

```
import numpy as np

# Per-row signed, absolute and squared errors.
df['E'] = df.y_true - df.y_pred
df['AE'] = np.abs(df['E'])
df['SE'] = np.power(df['E'], 2.0)
df
```

| | y_true | y_pred | E | AE | SE |
|---|---|---|---|---|---|
| 0 | 10 | 11 | -1 | 1 | 1.0 |
| 1 | 8 | 7 | 1 | 1 | 1.0 |
| 2 | 7 | 5 | 2 | 2 | 4.0 |
| 3 | 9 | 11 | -2 | 2 | 4.0 |
| 4 | 4 | 1 | 3 | 3 | 9.0 |

From E, AE and SE, we can compute ME, MAE and RMSE, respectively: ME and MAE are the means of E and AE, and RMSE is the square root of the mean of SE.

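```
# Column means give ME, MAE and the mean squared error (MSE).
errors = df[['E', 'AE', 'SE']].mean()
# Take the square root of the MSE so that the last entry is RMSE.
errors['SE'] = np.sqrt(errors['SE'])
errors.index = ['ME', 'MAE', 'RMSE']
errors
```

```
ME      0.600000
MAE     1.800000
RMSE    1.949359
dtype: float64
```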
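As a cross-check that does not depend on pandas, the same three numbers can be recomputed with the standard library. This is only a sketch: the function name `regression_errors` is a hypothetical helper introduced here. The mean squared error for this data is 3.8, so RMSE should come out as its square root, about 1.95.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (ME, MAE, RMSE) for two equal-length sequences."""
    diffs = [t - p for t, p in zip(y_true, y_pred)]
    n = len(diffs)
    me = sum(diffs) / n                                # signed mean
    mae = sum(abs(d) for d in diffs) / n               # mean of magnitudes
    rmse = math.sqrt(sum(d * d for d in diffs) / n)    # root of mean square
    return me, mae, rmse

me, mae, rmse = regression_errors([10, 8, 7, 9, 4], [11, 7, 5, 11, 1])
print(f"ME={me:.4f} MAE={mae:.4f} RMSE={rmse:.4f}")
# ME=0.6000 MAE=1.8000 RMSE=1.9494
```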

As you can see, these measures of error say different things and can lead to different conclusions, for example when comparing models. We know ME is defective, so we will not interpret it. MAE says we can expect to be about `1.8` off from the truth, while RMSE, the square root of the mean squared error `3.8`, says about `1.95`. RMSE is never smaller than MAE, and the gap between them grows when a few predictions are far off. On their own, though, neither value tells us whether the model is `good` or `bad`: being off by about `2` matters much more when \(y\) is around `8` than when it is around `800`.

One thing we can try to do is to `normalize` these values. Let’s just look at RMSE. Here are some ways we can normalize RMSE.

- using the `mean` of y, denoted as \(\bar{y}\)
- using the `standard deviation` of y, denoted as \(\sigma_y\)
- using the range of y, denoted as \(y_{\mathrm{max}} - y_{\mathrm{min}}\)
- using the interquartile range of y, denoted as \(Q_y^3 - Q_y^1\)
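Written out, each normalized variant divides RMSE by one of these quantities. The subscripts `me`, `sd`, `ra` and `iq` are shorthand labels rather than standard notation, and the interquartile range is taken as the third quartile minus the first so that it is non-negative.

\(\mathrm{RMSE}_{\mathrm{me}} = \frac{\mathrm{RMSE}}{\bar{y}}\)

\(\mathrm{RMSE}_{\mathrm{sd}} = \frac{\mathrm{RMSE}}{\sigma_y}\)

\(\mathrm{RMSE}_{\mathrm{ra}} = \frac{\mathrm{RMSE}}{y_{\mathrm{max}} - y_{\mathrm{min}}}\)

\(\mathrm{RMSE}_{\mathrm{iq}} = \frac{\mathrm{RMSE}}{Q_y^3 - Q_y^1}\)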

The code below computes these quantities under the following names.

- \(\bar{y}\) is `me_y`
- \(\sigma_y\) is `sd_y`
- \(y_{\mathrm{max}} - y_{\mathrm{min}}\) is `ra_y`
- \(Q_y^3 - Q_y^1\) is `iq_y`

Since these are used to divide RMSE, let’s group them in a series called `denominators`.

```
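from scipy.stats import iqr

# Candidate denominators, all computed from the ground truth.
me_y = df.y_true.mean()                    # mean
sd_y = df.y_true.std()                     # sample standard deviation (ddof=1)
ra_y = df.y_true.max() - df.y_true.min()   # range
iq_y = iqr(df.y_true)                      # interquartile range
denominators = pd.Series([me_y, sd_y, ra_y, iq_y], index=['me_y', 'sd_y', 'ra_y', 'iq_y'])
denominators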
```

```
me_y 7.600000
sd_y 2.302173
ra_y 6.000000
iq_y 2.000000
dtype: float64
```
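One detail worth keeping in mind is which standard deviation is being used. pandas `Series.std()` defaults to the sample standard deviation (dividing by \(n - 1\)), whereas NumPy's `np.std` defaults to the population version (dividing by \(n\)). The sketch below shows the difference for this `y_true` using only the standard library; the variable names are illustrative.

```python
import statistics

y_true = [10, 8, 7, 9, 4]

sample_sd = statistics.stdev(y_true)       # divides by n - 1
population_sd = statistics.pstdev(y_true)  # divides by n

print(round(sample_sd, 6))      # 2.302173
print(round(population_sd, 6))  # 2.059126
```

With only five observations the two differ noticeably, so a normalized RMSE should state which one it divides by.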

Here are the results of normalizing RMSE with the mean `me`, standard deviation `sd`, range `ra` and interquartile range `iq`.

```
pd.DataFrame([{
r'$\mathrm{RMSE}_{\mathrm{me}}$': errors.RMSE / denominators.me_y,
r'$\mathrm{RMSE}_{\mathrm{sd}}$': errors.RMSE / denominators.sd_y,
r'$\mathrm{RMSE}_{\mathrm{ra}}$': errors.RMSE / denominators.ra_y,
r'$\mathrm{RMSE}_{\mathrm{iq}}$': errors.RMSE / denominators.iq_y,
}]).T.rename(columns={0: 'values'})
```

| | values |
|---|---|
| $\mathrm{RMSE}_{\mathrm{me}}$ | 0.256495 |
| $\mathrm{RMSE}_{\mathrm{sd}}$ | 0.846747 |
| $\mathrm{RMSE}_{\mathrm{ra}}$ | 0.324893 |
| $\mathrm{RMSE}_{\mathrm{iq}}$ | 0.974679 |
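The four ratios can also be bundled into one small helper. The sketch below uses only the standard library. The function name `normalized_rmse` and the dictionary keys are hypothetical, and the `inclusive` quartile method is assumed to agree with the default linear interpolation of `scipy.stats.iqr`, which it does for this data. RMSE here is the square root of the mean squared error, about 1.95 for this example.

```python
import math
import statistics

def normalized_rmse(y_true, y_pred):
    """Return RMSE divided by the mean, sample SD, range and IQR of y_true."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    # 'inclusive' quartiles interpolate linearly between data points.
    q1, _, q3 = statistics.quantiles(y_true, n=4, method="inclusive")
    return {
        "me": rmse / statistics.mean(y_true),
        "sd": rmse / statistics.stdev(y_true),
        "ra": rmse / (max(y_true) - min(y_true)),
        "iq": rmse / (q3 - q1),
    }

result = normalized_rmse([10, 8, 7, 9, 4], [11, 7, 5, 11, 1])
for name, value in result.items():
    print(f"{name}: {value:.4f}")
# me: 0.2565
# sd: 0.8467
# ra: 0.3249
# iq: 0.9747
```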

Now that we have normalized RMSE, we can interpret it a little better.

- \(\mathrm{RMSE}_{\mathrm{me}}\) is saying we can expect to be off by about 26% of the mean of \(y\).

- \(\mathrm{RMSE}_{\mathrm{sd}}\) is saying we can expect to be off by about 0.85 standard deviations of \(y\).

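- \(\mathrm{RMSE}_{\mathrm{ra}}\) is saying we can expect to be off by about 32% of the range of \(y\).

- \(\mathrm{RMSE}_{\mathrm{iq}}\) is saying we can expect to be off by about 0.97 interquartile ranges of \(y\).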