5. Outlier Detection with Autoencoders

Autoencoders are a type of deep-learning architectures that are used to learn how to compress and decompress data faithfully. The compression layers are referred to as the encoding layers, and the decompression layers are referred to as the decoding layers. Conceptually, when we are encoding data, we are squeezing the n-dimensional data vector to successively smaller dimensionsional space representation. If a vector is 256 in length (or has 256 dimensions), we might have successive layers to encode this vector into 128, 64, 36, 18 and 9 dimensions. These smaller dimensional space are said to be latent space, or higher-order (although they are lower dimensions) space. The second part of the autoencoder architecture is the decoding layers, which is essentially a reversal of the encoding layers. In the example here, we will have layers that builds the 9-dimensional vector to 18, 36, 64, 128 and 256 dimensions. A lot of the information is lost during encoding, but this information that is lost may be looked at as noise or non-essential. Once we are done with encoding, what remains is the essential information to reconstruct the input vector, and the decoding layers attempts to reconstruct the output to be like the input based on the essential information (or latent representation).

Autoencoders may be useful for a variety of things, including, anomaly or outlier/inlier detection. We may train an autoencoder on inliers and compute the expected error for inliers. When a new observation comes through, we feed this observation into the autoencoder and compute its reconstruction error; if it is different above a threshold from the expected error for inliers, then such observation may be considered an outlier. Let’s see how autoencoders may be used to detect outliers and inliers.

5.1. Data

The data is sampled from \(X \sim \mathcal{N}(0, 1)\).

[1]:

import numpy as np
import pandas as pd
import random as rand

np.random.seed(37)
rand.seed(37)

X = np.random.normal(loc=0, scale=1, size=1_000).reshape(-1, 1)

print(f'X shape = {X.shape}')

X shape = (1000, 1)

5.2. Dataset and Data Loader

We will use PyTorch to build an autoencoder, and as such, will construct a dataset and data loader from the sampled data.

[2]:

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.transforms import *

class GaussianDataset(Dataset):
    def __init__(self, X, device, clazz=0):
        self.__device = device
        self.__clazz = clazz
        self.__X = X

    def __len__(self):
        return self.__X.shape[0]

    def __getitem__(self, idx):
        item = self.__X[idx,:]

        return item, self.__clazz

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

dataset = GaussianDataset(X=X, device=device)
data_loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=1)

cuda

5.3. Autoencoder architecture

Look at the architecture in this autoencoder; it has 5 encoding layers and 5 decoding layers. All activation layers between these layers are ReLU.

[3]:

from torchvision import datasets
from torchvision import transforms

class AE(torch.nn.Module):
    def __init__(self, input_size):
        super().__init__()

        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(input_size, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 36),
            torch.nn.ReLU(),
            torch.nn.Linear(36, 18),
            torch.nn.ReLU(),
            torch.nn.Linear(18, 9)
        )

        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(9, 18),
            torch.nn.ReLU(),
            torch.nn.Linear(18, 36),
            torch.nn.ReLU(),
            torch.nn.Linear(36, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, input_size)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

5.4. Learning

We will define our loss function to be mean-squared error (MSE) and use the Adam optimizer.

[4]:

model = AE(input_size=X.shape[1]).double().to(device)
loss_function = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-8)

[5]:

epochs = 20
loss_df = []

for epoch in range(epochs):
    losses = []

    for (items, _) in data_loader:
        items = items.to(device)
        optimizer.zero_grad()

        reconstructed = model(items)
        loss = loss_function(reconstructed, items)

        loss.backward()

        optimizer.step()

        losses.append(loss.detach().cpu().numpy().item())

    losses = np.array(losses)

    loss_df.append({
        'epoch': epoch + 1,
        'loss': losses.mean()
    })

loss_df = pd.DataFrame(loss_df)
loss_df.index = loss_df['epoch']
loss_df = loss_df.drop(columns=['epoch'])

The average loss over each epoch will differ base on the batch size (set to 64 earlier).

[6]:

import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

ax = loss_df['loss'].plot(kind='line', figsize=(15, 4), title='MSE Loss', ylabel='MSE')
_ = ax.set_xticks(list(range(1, 21, 1)))

Once we learn the parameters for the autoencoder, we can predict one observation at a time from the dataset.

[7]:

pd.DataFrame([{'y_true': dataset[r][0][0], 'y_pred': model(torch.from_numpy(dataset[r][0]).to(device)).detach().cpu().item()}
              for r in range(10)])

[7]:

	y_true	y_pred
0	-0.054464	-0.053576
1	0.674308	0.672243
2	0.346647	0.347807
3	-1.300346	-1.301217
4	1.518512	1.517482
5	0.989824	0.993620
6	0.277681	0.275364
7	-0.448589	-0.445697
8	0.961966	0.965153
9	-0.827579	-0.826105

Batch prediction from the data loader is also possible.

[8]:

import itertools

res = ((np.ravel(items.numpy()), np.ravel(model(items.to(device)).detach().cpu().numpy()))
       for i, (items, _) in enumerate(data_loader) if i < 10)
res = map(lambda tup: [{'y_true': t, 'y_pred': p} for t, p in zip(tup[0], tup[1])], res)
res = itertools.chain(*res)
res = pd.DataFrame(res)
res

[8]:

	y_true	y_pred
0	-0.508864	-0.507587
1	2.295016	2.298257
2	0.952336	0.955154
3	0.357611	0.359933
4	-0.800504	-0.800884
...	...	...
635	0.526008	0.524745
636	0.677452	0.675304
637	0.321861	0.320346
638	-1.066025	-1.066037
639	-0.301855	-0.301612

640 rows × 2 columns

5.5. Detecting outliers

Now, let’s see if we can detect outliers with the autoencoder. We will sample 4 data sets as follows.

\(A \sim \mathcal{N}(\mu_X, \sigma_X)\)
\(B \sim \mathcal{N}(\mu_X + 1, \sigma_X)\)
\(C \sim \mathcal{N}(\mu_X + 1, \sigma_X + 2)\)
\(D \sim \mathcal{N}(\mu_X + 10, \sigma_X)\)

[9]:

A = np.random.normal(loc=0, scale=1, size=1_000).reshape(-1, 1)
B = np.random.normal(loc=1, scale=1, size=1_000).reshape(-1, 1)
C = np.random.normal(loc=1, scale=2, size=1_000).reshape(-1, 1)
D = np.random.normal(loc=10, scale=1, size=1_000).reshape(-1, 1)

The mean average error from the training data \(X\) may be computed and used as a threshold to consider if a new observation is an outlier or inlier.

[10]:

def predict(m, y_true, device):
    y_pred = m(torch.from_numpy(y_true).to(device)).detach().cpu().item()
    return {'y_true': y_true[0], 'y_pred': y_pred}

pred_df = pd.DataFrame([predict(model, X[r,:], device) for r in range(X.shape[0])])
mae = np.abs(pred_df.y_true - pred_df.y_pred).mean()
print(f'MAE = {mae:.5f}')

MAE = 0.00175

Detecting outliers from A. Note that 0 indicates inlier and 1 indicates outlier.

[11]:

def detect_outlier(m, X, device, mae):
    df = pd.DataFrame([predict(model, X[r,:], device) for r in range(X.shape[0])])
    df['error'] = np.abs(df.y_true - df.y_pred)
    df['outlier'] = df['error'].apply(lambda e: 1 if e > mae else 0)
    return df

detect_outlier(model, A, device, mae).head(n=10).style\
    .applymap(lambda s: 'background: rgba(255, 0, 0, 0.5)' if s == 1 else None)

[11]:

	y_true	y_pred	error	outlier
0	-0.659954	-0.663455	0.003501	1
1	-0.072388	-0.073349	0.000961	0
2	-0.988276	-0.988742	0.000466	0
3	0.510436	0.509313	0.001123	0
4	0.442500	0.444572	0.002073	1
5	-0.238447	-0.239949	0.001502	0
6	-1.113931	-1.113480	0.000451	0
7	-0.500849	-0.498968	0.001881	1
8	0.486258	0.485348	0.000910	0
9	-0.754691	-0.758206	0.003515	1

[12]:

_ = detect_outlier(model, A, device, mae)['outlier']\
    .value_counts()\
    .sort_index()\
    .plot(kind='bar', title=r'$A \sim \mathcal{N}(\mu_X, \sigma_X)$ outliers')

Detecting outliers from B.

[13]:

detect_outlier(model, B, device, mae).head(n=10).style\
    .applymap(lambda s: 'background: rgba(255, 0, 0, 0.5)' if s == 1 else None)

[13]:

	y_true	y_pred	error	outlier
0	1.054522	1.052046	0.002475	1
1	-0.118952	-0.120751	0.001799	1
2	0.781712	0.781686	0.000025	0
3	-0.947611	-0.947822	0.000211	0
4	1.566011	1.565508	0.000504	0
5	1.937544	1.936715	0.000829	0
6	-1.634584	-1.636273	0.001689	0
7	1.241298	1.240053	0.001245	0
8	1.521333	1.520347	0.000986	0
9	3.097915	3.075820	0.022095	1

[14]:

_ = detect_outlier(model, B, device, mae)['outlier']\
    .value_counts()\
    .sort_index()\
    .plot(kind='bar', title=r'$B \sim \mathcal{N}(\mu_X + 1, \sigma_X)$ outliers')

Detecting outliers from C.

[15]:

detect_outlier(model, C, device, mae).head(n=10).style\
    .applymap(lambda s: 'background: rgba(255, 0, 0, 0.5)' if s == 1 else None)

[15]:

	y_true	y_pred	error	outlier
0	-1.326775	-1.327718	0.000943	0
1	0.833414	0.833913	0.000500	0
2	-0.872608	-0.871269	0.001340	0
3	3.983235	3.870642	0.112593	1
4	-1.085848	-1.085467	0.000382	0
5	0.385625	0.389884	0.004259	1
6	2.213190	2.214973	0.001783	1
7	0.457719	0.458371	0.000652	0
8	2.334519	2.337720	0.003201	1
9	-0.450000	-0.447052	0.002949	1

[16]:

_ = detect_outlier(model, C, device, mae)['outlier']\
    .value_counts()\
    .sort_index()\
    .plot(kind='bar', title=r'$C \sim \mathcal{N}(\mu_X + 1, \sigma_X + 2)$ outliers')

Detecting outliers from D.

[17]:

detect_outlier(model, D, device, mae).head(n=10).style\
    .applymap(lambda s: 'background: rgba(255, 0, 0, 0.5)' if s == 1 else None)

[17]:

	y_true	y_pred	error	outlier
0	11.090525	10.215066	0.875459	1
1	9.474894	8.776244	0.698650	1
2	9.906585	9.160721	0.745864	1
3	10.568700	9.750421	0.818279	1
4	8.904352	8.268102	0.636249	1
5	10.612258	9.789214	0.823043	1
6	7.439080	6.959486	0.479594	1
7	10.321630	9.530373	0.791257	1
8	10.054895	9.292811	0.762084	1
9	9.925478	9.177548	0.747930	1

[18]:

_ = detect_outlier(model, D, device, mae)['outlier']\
    .value_counts()\
    .sort_index()\
    .plot(kind='bar', title=r'$D \sim \mathcal{N}(\mu_X + 10, \sigma_X)$ outliers')

As we get get farther from \(X\), going from \(A\) to \(D\), more and more of the samples are considered as outliers.