6. Stochastic Gradient Descent for Online Learning

Let’s take a look at how to use Stochastic Gradient Descent (SGD) for online learning.

6.1. Data

We will simulate data for a regression problem. The data are generated as follows.

  • \(X_1 \sim \mathcal{N}(2, 1)\)

  • \(X_2 \sim \mathcal{N}(8.8, 1)\)

  • \(Y \sim \mathcal{N}(5.0 + 2 X_1 - 1.5 X_2, 1)\)

Note that we create three Xy datasets; they will serve as successive batches of data in the online-learning examples below.

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx

np.random.seed(37)
num_samples = 100

def get_Xy():
    # X1 ~ N(2, 1), X2 ~ N(8.8, 1)
    x1 = 2.0 + np.random.standard_normal(num_samples)
    x2 = 8.8 + np.random.standard_normal(num_samples)

    # y = 5.0 + 2.0 * X1 - 1.5 * X2 + N(0, 1) noise
    y = 5.0 + 2.0 * x1 - 1.5 * x2 + np.random.standard_normal(num_samples)

    X = np.column_stack((x1, x2))
    return X, y

# three independently sampled Xy datasets
data = [get_Xy(), get_Xy(), get_Xy()]
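
As a quick sanity check (not part of the original notebook), we can confirm that the first simulated dataset has roughly the stated means; the exact values depend on the seed.

[ ]:
X, y = data[0]

# sample means should be near 2.0, 8.8, and 5.0 + 2.0*2.0 - 1.5*8.8 = -4.2
print(X[:, 0].mean(), X[:, 1].mean(), y.mean())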

6.2. Scikit-Learn, Linear Regression

We can fit the data with ordinary least squares (LinearRegression), Lasso, and Ridge models.

Note that the three approaches produce different intercepts and coefficients; in particular, Lasso's default L1 penalty shrinks the coefficients substantially.

[2]:
from sklearn.linear_model import LinearRegression

X, y = data[0]

lr = LinearRegression()
lr.fit(X, y)
print(lr.intercept_, lr.coef_)
4.6219320466614855 [ 2.16262535 -1.50176798]
[3]:
from sklearn.linear_model import Lasso

X, y = data[0]

lr = Lasso()
lr.fit(X, y)
print(lr.intercept_, lr.coef_)
-3.02407792798139 [ 1.11832673 -0.3863843 ]
[4]:
from sklearn.linear_model import Ridge

X, y = data[0]

lr = Ridge()
lr.fit(X, y)
print(lr.intercept_, lr.coef_)
4.516849362852162 [ 2.14079312 -1.48468401]
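
To make the comparison concrete, the sketch below (not part of the original notebook) refits all three models on the first dataset and scores them on the second one as a makeshift hold-out set.

[ ]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error

X_train, y_train = data[0]
X_test, y_test = data[1]

# fit on the first dataset, evaluate on the second
for model in [LinearRegression(), Lasso(), Ridge()]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(type(model).__name__, model.intercept_, model.coef_, mse)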

6.3. Scikit-Learn, SGD Regression

If we need to do online learning with Scikit-Learn, we can use SGDRegressor. Pay attention to the partial_fit() method, which performs a single epoch of training on the data passed to it; an epoch is just one pass over that data. In this example, we run 20,000 epochs on the first sampled Xy dataset, followed by another 20,000 epochs on the second.

After the first round of training, the estimates are already in the right neighborhood, but the intercept (about 4.62) is still noticeably below its true value of 5.0.

[5]:
from sklearn.linear_model import SGDRegressor

X, y = data[0]
sgd = SGDRegressor()

for _ in range(20_000):
    sgd.partial_fit(X, y)

print(sgd.intercept_[0], sgd.coef_)
4.621363227569006 [ 2.16332485 -1.49836651]

After continuing training on the second dataset (the online-learning step), the estimates move much closer to the true values.

[6]:
X, y = data[1]

for _ in range(20_000):
    sgd.partial_fit(X, y)

print(sgd.intercept_[0], sgd.coef_)
5.106512059932266 [ 1.97471252 -1.52916083]
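
Since partial_fit() accepts arbitrarily small batches, a more literal streaming setup would feed the model a few rows at a time as they arrive. The sketch below (with a hypothetical chunk size of 10) applies this to the third simulated dataset.

[ ]:
X, y = data[2]
chunk_size = 10  # hypothetical chunk size

# feed the model one small chunk at a time, as if rows were streaming in
for start in range(0, X.shape[0], chunk_size):
    sgd.partial_fit(X[start:start + chunk_size], y[start:start + chunk_size])

print(sgd.intercept_[0], sgd.coef_)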

6.4. SGD

We can also implement gradient descent by hand. The step() function below evaluates the gradients of the intercept and the weights for the mean squared error loss and returns the updated parameters after a single gradient step; the freeze argument lets us hold selected coefficients fixed (unused here).
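
For reference, step() minimizes the mean squared error

\[
L(b, \mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2, \qquad \hat{y}_i = b + \mathbf{w} \cdot \mathbf{x}_i,
\]

whose gradients are

\[
\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right), \qquad
\frac{\partial L}{\partial w_j} = -\frac{2}{N} \sum_{i=1}^{N} x_{ij} \left( y_i - \hat{y}_i \right).
\]

The .mean() calls in the code compute exactly these averages, and each call to step() applies the updates \(b \leftarrow b - \alpha \, \partial L / \partial b\) and \(w_j \leftarrow w_j - \alpha \, \partial L / \partial w_j\).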

[7]:
def step(X, y, b, w, alpha=0.005, freeze=frozenset()):
    def get_w_grad(c, y_diff):
        # coefficients listed in freeze get a zero gradient and stay unchanged
        if c in freeze:
            return 0.0
        else:
            return -2 * (X[:,c] * y_diff).mean()

    y_pred = b + X.dot(w)
    y_diff = y - y_pred

    # gradients of the mean squared error with respect to b and w
    b_grad = -2 * y_diff.mean()
    w_grad = np.array([get_w_grad(c, y_diff) for c in range(X.shape[1])])

    # one gradient-descent update
    b_new = b - alpha * b_grad
    w_new = w - alpha * w_grad

    return b_new, w_new

6.4.1. Gradient descent, batch

With batch gradient descent (BGD), each epoch feeds the complete training dataset into step() to evaluate the gradients. We run 30,000 epochs of batch training with a learning rate of 0.005; even then, the learned parameters, especially the intercept, are still not that close to the true values.

Note that we initialize the intercept and weights to zero.

[8]:
X, y = data[0]

b = 0.0
w = np.zeros(X.shape[1])
alpha = 0.005

for i in range(30_000):
    b, w = step(X, y, b, w, alpha=alpha)
    if i % 1000 == 0:
        print(f'{i}: b = {b}, w = {w}')

print(f'final: b = {b}, w = {w}')
0: b = -0.04113658111994803, w = [-0.0634133  -0.37421837]
1000: b = 0.4597592414630508, w = [ 2.22724475 -1.04881832]
2000: b = 0.9316684307305253, w = [ 2.22000329 -1.10019403]
3000: b = 1.3500730784489052, w = [ 2.21349774 -1.14572477]
4000: b = 1.721038717105341, w = [ 2.20772979 -1.18609319]
5000: b = 2.049944014408199, w = [ 2.20261582 -1.22188462]
6000: b = 2.341557801026509, w = [ 2.19808167 -1.25361799]
7000: b = 2.6001082143011445, w = [ 2.19406161 -1.28175341]
8000: b = 2.8293440024292913, w = [ 2.19049735 -1.30669881]
9000: b = 3.032588877938021, w = [ 2.1873372  -1.32881589]
10000: b = 3.2127897085232853, w = [ 2.18453535 -1.34842532]
11000: b = 3.372559243978179, w = [ 2.18205118 -1.36581142]
12000: b = 3.514213998712492, w = [ 2.17984867 -1.38122627]
13000: b = 3.6398078391258446, w = [ 2.17789587 -1.39489337]
14000: b = 3.7511617628212286, w = [ 2.17616449 -1.40701089]
15000: b = 3.849890301430676, w = [ 2.17462941 -1.41775451]
16000: b = 3.9374249298701827, w = [ 2.17326838 -1.42728002]
17000: b = 4.015034821437141, w = [ 2.17206167 -1.43572551]
18000: b = 4.083845249680377, w = [ 2.17099177 -1.44321346]
19000: b = 4.144853903853434, w = [ 2.17004318 -1.44985241]
20000: b = 4.198945354510451, w = [ 2.16920214 -1.45573863]
21000: b = 4.246903878982752, w = [ 2.16845646 -1.46095747]
22000: b = 4.289424832694, w = [ 2.16779532 -1.4655846 ]
23000: b = 4.327124731187709, w = [ 2.16720915 -1.46968709]
24000: b = 4.360550189047555, w = [ 2.16668943 -1.47332445]
25000: b = 4.3901858453166644, w = [ 2.16622865 -1.47654939]
26000: b = 4.416461390327421, w = [ 2.1658201 -1.4794087]
27000: b = 4.439757795824325, w = [ 2.16545788 -1.48194381]
28000: b = 4.460412838711318, w = [ 2.16513672 -1.48419149]
29000: b = 4.478725998512731, w = [ 2.16485198 -1.48618432]
final: b = 4.494947519211368, w = [ 2.16459976 -1.48794954]

6.4.2. Gradient descent, stochastic

For SGD, during each epoch we shuffle the data and feed one sample at a time into step() to evaluate the gradients and update the parameters. This time, we run 10,000 epochs. SGD is often preferred over BGD because its per-sample updates typically make faster progress per pass over the data, at the cost of noisier estimates (visible in the intermediate values below).

Again, we initialize the intercept and weights to zero.

[9]:
from random import shuffle

X, y = data[0]

b = 0.0
w = np.zeros(X.shape[1])
alpha = 0.01

for i in range(10_000):
    indices = list(range(X.shape[0]))
    shuffle(indices)

    # one pass over the shuffled data, updating after every sample
    for r in indices:
        b, w = step(X[r,:].reshape(1,-1), y[r], b, w, alpha=alpha)

    if i % 1000 == 0:
        print(f'{i}: b = {b}, w = {w}')

print(f'final: b = {b}, w = {w}')
0: b = 0.1302635564715835, w = [ 2.40138751 -0.24934505]
1000: b = 4.666510646800165, w = [ 2.19419193 -1.18461641]
2000: b = 5.057244621464196, w = [ 2.21689666 -1.40694691]
3000: b = 4.479227120983244, w = [ 2.01991825 -1.24664366]
4000: b = 4.56540770809679, w = [ 2.52343085 -1.30036674]
5000: b = 4.419414341369592, w = [ 1.80293121 -1.4658816 ]
6000: b = 4.440845181302684, w = [ 2.26876065 -1.07881225]
7000: b = 5.027389277961371, w = [ 2.27691608 -1.22757039]
8000: b = 4.637299491755886, w = [ 1.87026482 -1.2841124 ]
9000: b = 4.44058042372063, w = [ 2.28106253 -1.63058405]
final: b = 4.416424432571243, w = [ 2.33365453 -1.48993813]

6.4.3. Gradient descent, stochastic, online

Now we keep the intercept and weights learned above and seed the algorithm with them (rather than re-initializing everything to zero). We then switch to the second dataset, as if new data had arrived and we needed to update the parameters, and run a single epoch over it.

[10]:
X, y = data[1]

for i in range(1):
    indices = list(range(X.shape[0]))
    shuffle(indices)

    # a single pass over the new data, starting from the current b and w
    for r in indices:
        b, w = step(X[r,:].reshape(1,-1), y[r], b, w, alpha=alpha)

    print(f'{i}: b = {b}, w = {w}')

print(f'final: b = {b}, w = {w}')
0: b = 4.411319583679032, w = [ 1.85441694 -1.28729906]
final: b = 4.411319583679032, w = [ 1.85441694 -1.28729906]
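
If yet another batch arrives (here, the third simulated dataset), we can repeat the same single-pass update, continuing from the current b and w. The sketch below is not part of the original run.

[ ]:
X, y = data[2]

indices = list(range(X.shape[0]))
shuffle(indices)

# one pass over the newest batch, continuing from the learned parameters
for r in indices:
    b, w = step(X[r,:].reshape(1,-1), y[r], b, w, alpha=alpha)

print(f'b = {b}, w = {w}')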