# 6. Stochastic Gradient Descent for Online Learning

Let’s take a look at how to use Stochastic Gradient Descent (SGD) for online learning.

## 6.1. Data

We will simulate data for a regression problem as follows.

• $$X_1 \sim \mathcal{N}(2, 1)$$

• $$X_2 \sim \mathcal{N}(8.8, 1)$$

• $$Y \sim \mathcal{N}(5.0 + 2 X_1 - 1.5 X_2, 1)$$

Note how we create three Xy datasets, each with 100 samples.

[1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx

np.random.seed(37)
num_samples = 100

def get_Xy():
    x1 = 2.0 + np.random.standard_normal(num_samples)
    x2 = 8.8 + np.random.standard_normal(num_samples)

    y = 5.0 + 2.0 * x1 - 1.5 * x2 + np.random.standard_normal(num_samples)

    X = np.column_stack((x1, x2))
    return X, y

data = [get_Xy(), get_Xy(), get_Xy()]
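
As a quick sanity check, we can verify that the sample means of each dataset land near the specified parameters; in particular, $$E[Y] = 5.0 + 2.0 \cdot 2.0 - 1.5 \cdot 8.8 = -4.2$$. A minimal sketch:

# the sample means should be close to 2.0, 8.8, and -4.2, respectively
for i, (X, y) in enumerate(data):
    print(f'dataset {i}: X1 mean = {X[:,0].mean():.2f}, X2 mean = {X[:,1].mean():.2f}, y mean = {y.mean():.2f}')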


## 6.2. Scikit-Learn, Linear Regression

We can apply several Scikit-Learn regression models to the data: ordinary least squares (LinearRegression), Lasso, and Ridge.

Clearly, the three approaches produce different intercepts and weights.

[2]:

from sklearn.linear_model import LinearRegression

X, y = data[0]

lr = LinearRegression()
lr.fit(X, y)
print(lr.intercept_, lr.coef_)

4.6219320466614855 [ 2.16262535 -1.50176798]

[3]:

from sklearn.linear_model import Lasso

X, y = data[0]

lr = Lasso()
lr.fit(X, y)
print(lr.intercept_, lr.coef_)

-3.02407792798139 [ 1.11832673 -0.3863843 ]

[4]:

from sklearn.linear_model import Ridge

X, y = data[0]

lr = Ridge()
lr.fit(X, y)
print(lr.intercept_, lr.coef_)

4.516849362852162 [ 2.14079312 -1.48468401]


## 6.3. Scikit-Learn, SGD Regression

If we need to do online learning with Scikit-Learn, we can use SGDRegressor. Pay attention to the partial_fit() method, which executes one epoch of training on the data it is given; an epoch is just one pass of weight updates. In this example, we learn the coefficients/weights with 20,000 epochs on the first sampled Xy dataset, followed by another 20,000 epochs on the second sampled Xy dataset.

After the first round of training, the coefficients are still somewhat off from the true ones.

[5]:

from sklearn.linear_model import SGDRegressor

X, y = data[0]
sgd = SGDRegressor()

for _ in range(20_000):
    sgd.partial_fit(X, y)

print(sgd.intercept_[0], sgd.coef_)

4.621363227569006 [ 2.16332485 -1.49836651]


With online learning on the second dataset, the coefficients are now very close to the true values.

[6]:

X, y = data[1]

for _ in range(20_000):
    sgd.partial_fit(X, y)

print(sgd.intercept_[0], sgd.coef_)

5.106512059932266 [ 1.97471252 -1.52916083]
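
One way to gauge the online result is to pool the first two datasets and fit a plain LinearRegression for comparison; a batch solution over the pooled data should land near the coefficients that SGDRegressor reached through online updates. A minimal sketch (lr_all is an illustrative name):

# batch fit over the pooled first and second datasets, for comparison
X_all = np.concatenate([data[0][0], data[1][0]])
y_all = np.concatenate([data[0][1], data[1][1]])

lr_all = LinearRegression().fit(X_all, y_all)
print(lr_all.intercept_, lr_all.coef_)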


## 6.4. SGD

We can also implement gradient descent by hand. The function step() below evaluates the gradients of the intercept and the weights and returns their updated values after one gradient descent step; the freeze argument lets us hold selected weights fixed.
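
For reference, step() is based on the mean squared error loss; writing $$\hat{y}_i = b + w \cdot x_i$$, the gradients the code evaluates are

$$L(b, w) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$$

$$\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right), \qquad \frac{\partial L}{\partial w_c} = -\frac{2}{N} \sum_{i=1}^{N} x_{ic} \left( y_i - \hat{y}_i \right)$$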

[7]:

def step(X, y, b, w, N, alpha=0.005, freeze={}):
    def grad(c):
        # gradient with respect to w[c]; frozen weights get a zero gradient
        if c in freeze:
            return 0.0
        else:
            return -2 * (X[:,c] * y_diff).mean()

    y_pred = b + X.dot(w)
    y_diff = y - y_pred

    # gradients of the mean squared error with respect to b and w
    # (N, the total sample count, is kept in the signature but unused here)
    b_grad = -2 * y_diff.mean()
    w_grad = np.array([grad(c) for c in range(X.shape[1])])

    # one gradient descent update
    b_new = b - alpha * b_grad
    w_new = w - alpha * w_grad

    return b_new, w_new
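
The freeze argument is not exercised in the cells that follow; as a minimal sketch (b_demo and w_demo are illustrative names), passing a set of column indices zeroes out those gradients so the corresponding weights stay fixed.

# freeze w[1]: only b and w[0] are updated by this step
X, y = data[0]
b_demo, w_demo = step(X, y, 0.0, np.zeros(X.shape[1]), X.shape[0], alpha=0.01, freeze={1})
print(b_demo, w_demo)  # w_demo[1] remains 0.0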


### 6.4.1. Gradient descent, batch

With Batch Gradient Descent (BGD), for each epoch, we feed the complete training data to evaluate the gradients of the weights. We run 30,000 epochs of batch training here, and even so, the learned weights are still not close to the true values.

Note that we initialize the intercept and weights to all zeros.

[8]:

X, y = data[0]

b = 0.0
w = np.zeros(X.shape[1])
alpha = 0.01
N = X.shape[0]

for i in range(30_000):
    b_new, w_new = step(X, y, b, w, N, alpha=alpha)
    b = b_new
    w = w_new
    if i % 1000 == 0:
        print(f'{i}: b = {b_new}, w = {w_new}')

print(f'final: b = {b}, w = {w}')

0: b = -0.04113658111994803, w = [-0.0634133  -0.37421837]
1000: b = 0.4597592414630508, w = [ 2.22724475 -1.04881832]
2000: b = 0.9316684307305253, w = [ 2.22000329 -1.10019403]
3000: b = 1.3500730784489052, w = [ 2.21349774 -1.14572477]
4000: b = 1.721038717105341, w = [ 2.20772979 -1.18609319]
5000: b = 2.049944014408199, w = [ 2.20261582 -1.22188462]
6000: b = 2.341557801026509, w = [ 2.19808167 -1.25361799]
7000: b = 2.6001082143011445, w = [ 2.19406161 -1.28175341]
8000: b = 2.8293440024292913, w = [ 2.19049735 -1.30669881]
9000: b = 3.032588877938021, w = [ 2.1873372  -1.32881589]
10000: b = 3.2127897085232853, w = [ 2.18453535 -1.34842532]
11000: b = 3.372559243978179, w = [ 2.18205118 -1.36581142]
12000: b = 3.514213998712492, w = [ 2.17984867 -1.38122627]
13000: b = 3.6398078391258446, w = [ 2.17789587 -1.39489337]
14000: b = 3.7511617628212286, w = [ 2.17616449 -1.40701089]
15000: b = 3.849890301430676, w = [ 2.17462941 -1.41775451]
16000: b = 3.9374249298701827, w = [ 2.17326838 -1.42728002]
17000: b = 4.015034821437141, w = [ 2.17206167 -1.43572551]
18000: b = 4.083845249680377, w = [ 2.17099177 -1.44321346]
19000: b = 4.144853903853434, w = [ 2.17004318 -1.44985241]
20000: b = 4.198945354510451, w = [ 2.16920214 -1.45573863]
21000: b = 4.246903878982752, w = [ 2.16845646 -1.46095747]
22000: b = 4.289424832694, w = [ 2.16779532 -1.4655846 ]
23000: b = 4.327124731187709, w = [ 2.16720915 -1.46968709]
24000: b = 4.360550189047555, w = [ 2.16668943 -1.47332445]
25000: b = 4.3901858453166644, w = [ 2.16622865 -1.47654939]
26000: b = 4.416461390327421, w = [ 2.1658201 -1.4794087]
27000: b = 4.439757795824325, w = [ 2.16545788 -1.48194381]
28000: b = 4.460412838711318, w = [ 2.16513672 -1.48419149]
29000: b = 4.478725998512731, w = [ 2.16485198 -1.48618432]
final: b = 4.494947519211368, w = [ 2.16459976 -1.48794954]


### 6.4.2. Gradient descent, stochastic

With SGD, during each epoch, we feed one randomly chosen sample at a time to evaluate the gradients of the weights. This time, we run only 10,000 epochs (each epoch makes one shuffled pass over all the samples). Generally speaking, SGD is preferred over BGD since it tends to converge faster.

Note that we again initialize the intercept and weights to all zeros.

[9]:

from random import shuffle

X, y = data[0]

b = 0.0
w = np.zeros(X.shape[1])
alpha = 0.01
N = X.shape[0]

for i in range(10_000):
    indices = list(range(X.shape[0]))
    shuffle(indices)

    for r in indices:
        b_new, w_new = step(X[r,:].reshape(1,-1), y[r], b, w, N, alpha=alpha)
        b = b_new
        w = w_new

    if i % 1000 == 0:
        print(f'{i}: b = {b_new}, w = {w_new}')

print(f'final: b = {b}, w = {w}')

0: b = 0.1302635564715835, w = [ 2.40138751 -0.24934505]
1000: b = 4.666510646800165, w = [ 2.19419193 -1.18461641]
2000: b = 5.057244621464196, w = [ 2.21689666 -1.40694691]
3000: b = 4.479227120983244, w = [ 2.01991825 -1.24664366]
4000: b = 4.56540770809679, w = [ 2.52343085 -1.30036674]
5000: b = 4.419414341369592, w = [ 1.80293121 -1.4658816 ]
6000: b = 4.440845181302684, w = [ 2.26876065 -1.07881225]
7000: b = 5.027389277961371, w = [ 2.27691608 -1.22757039]
8000: b = 4.637299491755886, w = [ 1.87026482 -1.2841124 ]
9000: b = 4.44058042372063, w = [ 2.28106253 -1.63058405]
final: b = 4.416424432571243, w = [ 2.33365453 -1.48993813]


### 6.4.3. Gradient descent, stochastic, online

Now we keep the learned weights from before and seed the algorithm with them (we do not reset the intercept and weights to zero). However, we switch to the second dataset, as if new data were streaming in and we needed to update the parameters/weights. We run a single epoch over the new data.

[10]:

X, y = data[1]
N = X.shape[0]

for i in range(1):
    indices = list(range(X.shape[0]))
    shuffle(indices)

    for r in indices:
        b_new, w_new = step(X[r,:].reshape(1,-1), y[r], b, w, N, alpha=alpha)
        b = b_new
        w = w_new

    if i % 1000 == 0:
        print(f'{i}: b = {b_new}, w = {w_new}')

print(f'final: b = {b}, w = {w}')

0: b = 4.411319583679032, w = [ 1.85441694 -1.28729906]
final: b = 4.411319583679032, w = [ 1.85441694 -1.28729906]
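
The third simulated dataset, data[2], has not been used; to keep learning online, we would make the same single-epoch, shuffled pass over it, again seeding with the current b and w. A minimal sketch:

# one more online pass, this time over the third dataset
X, y = data[2]
N = X.shape[0]

indices = list(range(N))
shuffle(indices)

for r in indices:
    b, w = step(X[r,:].reshape(1,-1), y[r], b, w, N, alpha=alpha)

print(f'b = {b}, w = {w}')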