2. Regression, Gradient Descent

This notebook shows how to solve regression problems with gradient descent using Autograd. In many machine learning algorithms, a loss function must be defined to gauge how much learning the algorithm has accomplished. After the loss function is defined, you also need the gradient of that loss function. For linear regression the loss function is convex, so you may use gradient descent to minimize it (i.e. find the weights at which the loss is smallest). However, the gradient of the loss function is usually non-trivial to express in code. With Autograd, you simply define the loss function, and the package gives you its gradient (first derivative).
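
To make this concrete, here is a minimal, illustrative sketch (not one of the notebook's cells) of what grad does: you hand it a function that returns a scalar, and it hands back a function that computes the derivative.

from autograd import grad

def f(x):
    return x ** 2.0   # a simple scalar function

df = grad(f)          # df(x) computes the derivative of f at x, i.e. 2x

print(df(3.0))        # prints 6.0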

For linear regression, \(y = b + w'x\), the loss function is defined as follows.

\(L(\hat{Y}, Y) = \frac{1}{N} \sum{(y - \hat{y})^2} = \frac{1}{N} \sum{(y - (b + w'x))^2}\)

where

  • \(\hat{Y}\) is the predicted value,

  • \(Y\) is the real value,

  • \(b\) is the y-intercept,

  • \(w\) is the vector of weights (coefficients), and

  • \(x\) is the vector of observations.

The gradient of the loss function, \(\nabla L(\hat{Y}, Y)\), is simply the vector of partial derivatives with respect to the variables \(b\) and \(w\).

  • \(\nabla_b = \frac{\partial L}{\partial b} = \frac{2}{N} \sum{-(y - (b + w'x))}\)

  • \(\nabla_{w_{1}} = \frac{\partial L}{\partial w_1} = \frac{2}{N} \sum{-x_1 (y - (b + w'x))}\)

  • \(\ldots\)

  • \(\nabla_{w_{n}} = \frac{\partial L}{\partial w_n} = \frac{2}{N} \sum{-x_n (y - (b + w'x))}\)

With Autograd, you do not have to write out all of these gradients; you only write out the loss function. Your code becomes noticeably more concise and easier to understand.
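
If you want to convince yourself that Autograd reproduces the partial derivatives above, the following sketch compares a hand-coded gradient of a one-feature model against the one Autograd returns (the toy data and function names here are made up for illustration); the two printed arrays should agree.

import autograd.numpy as np
from autograd import grad

def toy_loss(params, x, y):
    # params[0] is the intercept b, params[1] is the single weight w
    b, w = params[0], params[1]
    return np.mean((y - (b + w * x)) ** 2.0)

def toy_grad_by_hand(params, x, y):
    # the partial derivatives written out explicitly, as in the bullets above
    b, w = params[0], params[1]
    r = y - (b + w * x)
    return np.array([-2.0 * np.mean(r), -2.0 * np.mean(x * r)])

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
params = np.array([0.5, 0.5])

print(grad(toy_loss)(params, x, y))    # gradient computed by Autograd
print(toy_grad_by_hand(params, x, y))  # hand-derived gradient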

2.1. Examples

2.1.1. Imports

Notice that you do not import numpy directly, but autograd.numpy instead. It is a thin wrapper around NumPy that lets Autograd trace the array operations it needs to differentiate.

[1]:
import autograd.numpy as np
from autograd.numpy.random import normal
import pandas as pd
import matplotlib.pyplot as plt
from autograd import grad

np.random.seed(37)

2.1.2. Define the loss function and its gradient using Autograd

[2]:
def loss(w, X, y_true):
    """
    Mean squared error: 1/n * sum((y_pred - y_true)^2).
    Note that w is the first argument; grad differentiates with
    respect to the first argument by default.
    """
    y_pred = np.dot(X, w)
    sq_errors = (y_pred - y_true) ** 2.0
    return sq_errors.mean(axis=None)

# the magic line that gives you the gradient of the loss function
loss_grad = grad(loss)

# we simulate n = 5,000 samples for each of the two examples below
n = 5000
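
As a quick, illustrative check (not one of the original cells), calling loss_grad on some throwaway data shows that it returns one partial derivative per entry of w. This relies on the imports and loss_grad defined above; the demo arrays are arbitrary.

w_demo = np.zeros(3)
X_demo = np.arange(12.0).reshape(4, 3)           # 4 samples, 3 columns
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
print(loss_grad(w_demo, X_demo, y_demo).shape)   # (3,), the same shape as w_demo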

2.1.3. Example 1

Here, we specify a simple linear model \(y = 5 + 2.0x_0\).

  • \(X_0 \sim \mathcal{N}(2, 1)\)

  • \(Y \sim \mathcal{N}(5 + 2.0x_0, 1)\)

[3]:
X = np.hstack([
        np.ones(n).reshape(n, 1),           # column of ones for the intercept b
        normal(2.0, 1.0, n).reshape(n, 1)   # x_0 ~ N(2, 1)
    ])
y = normal(5.0 + 2.0 * X[:, 1], 1, n)
w = np.array([np.random.randn() for _ in range(X.shape[1])])  # random initial [b, w_0]

for i in range(1000):
    w = w - loss_grad(w, X, y) * 0.01   # one gradient-descent step, learning rate 0.01

print('intercept + weights: {}'.format(w))
intercept + weights: [4.92702295 2.027472  ]
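
The recovered values are close to the true intercept 5 and weight 2; the small gap is consistent with sampling noise and the finite number of gradient steps. As an optional sanity check (not part of the original notebook), you could compare against the closed-form ordinary least squares solution obtained from the normal equations.

# optional sanity check: solve the normal equations X'X w = X'y
# for the closed-form least squares estimate and compare with w above
w_ols = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
print('closed-form intercept + weights: {}'.format(w_ols))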

2.1.4. Example 2

Here, we specify a more complicated linear model with \(y = 5 + 2x_0 + 1x_1 + 3x_2 + 0.5x_3 + 1.5x_4\).

Note that we can reuse the same loss function and its Autograd-derived gradient for both simple and multiple linear regression problems.

  • \(X_0 \sim \mathcal{N}(2, 1)\)

  • \(X_1 \sim \mathcal{N}(1, 1)\)

  • \(X_2 \sim \mathcal{N}(-1, 1)\)

  • \(X_3 \sim \mathcal{N}(-2, 1)\)

  • \(X_4 \sim \mathcal{N}(0.5, 1)\)

  • \(Y \sim \mathcal{N}(5 + 2x_0 + 1x_1 + 3x_2 + 0.5x_3 + 1.5x_4, 1)\)

[4]:
X = np.hstack([
        np.ones(n).reshape(n, 1),
        normal(2.0, 1.0, n).reshape(n, 1),
        normal(1.0, 1.0, n).reshape(n, 1),
        normal(-1.0, 1.0, n).reshape(n, 1),
        normal(-2.0, 1.0, n).reshape(n, 1),
        normal(0.5, 1.0, n).reshape(n, 1)
    ])
y = normal(5.0 + 2.0 * X[:,1] + 1.0 * X[:,2] + 3.0 * X[:,3] + 0.5 * X[:,4] + 1.5 * X[:, 5], 1.0, n)
w = np.array([np.random.randn() for _ in range(X.shape[1])])

for i in range(1000):
    w = w - loss_grad(w, X, y) * 0.01

print('intercept + weights: {}'.format(w))
intercept + weights: [4.36847588 2.10524944 1.03287861 2.92075188 0.39290605 1.5431474 ]
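
The estimates here sit a little further from the true values (for example, the recovered intercept is about 4.37 rather than 5), which suggests that 1,000 steps at a learning rate of 0.01 are not quite enough for this larger model. Two optional follow-ups, sketched below and not part of the original notebook, are to keep iterating or to compare against the closed-form solution.

# continue the gradient descent from where Example 2 left off; the estimates
# should move closer to the true values [5, 2, 1, 3, 0.5, 1.5]
for i in range(10000):
    w = w - loss_grad(w, X, y) * 0.01

print('intercept + weights after more iterations: {}'.format(w))

# or compare against the closed-form least squares solution
w_ols = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
print('closed-form intercept + weights: {}'.format(w_ols))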