7. Autoencoders, Detecting Malicious URLs

Let’s see how we can use autoencoders to detect malicious URLs. In this notebook, we use two datasets: one of malicious URLs and one of fake URLs. The malicious URLs are real and taken from URLhaus; the fake URLs are generated with Faker.
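
If you do not already have the URLhaus dump locally, a sketch like the following can fetch it. This assumes the current URLhaus CSV endpoint (https://urlhaus.abuse.ch/downloads/csv/) still serves a zip archive containing csv.txt; the raw dump also carries commented header lines, so some light cleanup may be needed before reading it with pandas.

import io
import zipfile

import requests

# download the zipped URLhaus dump and extract csv.txt (assumed endpoint and layout)
resp = requests.get('https://urlhaus.abuse.ch/downloads/csv/')
with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    zf.extract('csv.txt', './cybersecurity')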

7.1. Malicious URLs

[1]:
import pandas as pd

mdf = pd.read_csv('./cybersecurity/csv.txt', index_col=0)
mdf.shape
[1]:
(294903, 8)
[2]:
mdf.head()
[2]:
id       dateadded            url                                  url_status  last_online  threat            tags                  urlhaus_link                           reporter
1959127  2022-01-09 04:09:07  http://113.178.137.103:43063/bin.sh  online      None         malware_download  32-bit,elf,mips,Mozi  https://urlhaus.abuse.ch/url/1959127/  geenensp
1959126  2022-01-09 04:09:06  http://117.195.92.136:52193/bin.sh   online      None         malware_download  32-bit,elf,mips,Mozi  https://urlhaus.abuse.ch/url/1959126/  geenensp
1959125  2022-01-09 04:08:05  http://42.230.86.76:39194/i          online      None         malware_download  32-bit,elf,mips,Mozi  https://urlhaus.abuse.ch/url/1959125/  geenensp
1959124  2022-01-09 04:07:06  http://45.206.219.185:39528/Mozi.m   online      None         malware_download  elf,Mozi              https://urlhaus.abuse.ch/url/1959124/  lrz_urlhaus
1959123  2022-01-09 04:07:05  http://182.117.30.44:43217/bin.sh    online      None         malware_download  32-bit,elf,mips,Mozi  https://urlhaus.abuse.ch/url/1959123/  geenensp

7.2. Fake URLs

[3]:
from faker import Faker
import numpy as np
import random

np.random.seed(37)
random.seed(37)
Faker.seed(37)

def get_fake_url(f):
    # index of the third '/', i.e. the slash that terminates the host[:port] part
    def get_slash_index(u):
        indices = [i for i, c in enumerate(u) if c == '/']
        return indices[2]

    url = f.uri()

    # about 70% of the time, splice a random port into the URL so the fake
    # URLs mimic the host:port pattern common in the malicious set
    if np.random.random() > 0.3:
        slash_index = get_slash_index(url)
        first = url[:slash_index]
        second = url[slash_index:]
        port = f.port_number()
        url = f'{first}:{port}{second}'

    return url

fake = Faker()

fdf = pd.DataFrame({'url': [get_fake_url(fake) for _ in range(mdf.shape[0])]})
fdf.shape
[3]:
(294903, 1)
[4]:
fdf.head()
[4]:
url
0 http://www.terrell.info:18926/posts/explore/ex...
1 http://stephens.biz:47993/
2 http://www.butler.com/explore/faq.htm
3 https://moore.com:40837/
4 https://www.nelson-harris.com:34916/main/

7.3. Visualize URL length and character distributions

Let’s observe the distributions of URL lengths and of the characters appearing in the URLs for both datasets.

[5]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.style.use('fivethirtyeight')

ax = mdf.url.apply(len).value_counts().sort_index()\
    .plot(kind='bar', figsize=(25, 4), title=f'Distribution of Malicious URL Lengths, min={mdf.url.apply(len).min()}, max={mdf.url.apply(len).max()}')
_ = ax.xaxis.set_major_locator(ticker.MultipleLocator(10))
[Figure: Distribution of Malicious URL Lengths]
[6]:
ax = fdf.url.apply(len).value_counts().sort_index()\
    .plot(kind='bar', figsize=(25, 4), title=f'Distribution of Fake URL Lengths, min={fdf.url.apply(len).min()}, max={fdf.url.apply(len).max()}')
_ = ax.xaxis.set_major_locator(ticker.MultipleLocator(10))
[Figure: Distribution of Fake URL Lengths]
[7]:
import itertools
from collections import Counter

def get_character_distribution(df):
    # frequency of each character across all URLs, as a Series
    # indexed by character and sorted by character
    s = sorted(Counter(itertools.chain(*df.url.apply(list))).items())
    s = pd.Series([v for _, v in s], index=[k for k, _ in s])
    return s

ax = get_character_distribution(mdf).plot(kind='bar', figsize=(25, 4), title='Distribution of Malicious Letters')
[Figure: Distribution of Malicious Letters]
[8]:
ax = get_character_distribution(fdf).plot(kind='bar', figsize=(15, 4), title='Distribution of Fake Letters')
[Figure: Distribution of Fake Letters]
[9]:
def get_ord_distribution(df):
    # same as above, but keyed by the integer (ord) code of each character
    s = sorted(Counter(itertools.chain(*df.url.apply(lambda u: [ord(c) for c in u]))).items())
    s = pd.Series([v for _, v in s], index=[k for k, _ in s])
    return s

ax = get_ord_distribution(mdf).plot(kind='bar', figsize=(25, 4), title='Distribution of Malicious Letters (Integer)')
[Figure: Distribution of Malicious Letters (Integer)]
[10]:
ax = get_ord_distribution(fdf).plot(kind='bar', figsize=(15, 4), title='Distribution of Fake Letters (Integer)')
[Figure: Distribution of Fake Letters (Integer)]

7.4. Vectorize URLs

We will vectorize each URL using the ASCII codes of its characters. Each vector will be 450 elements long (the longer of 450 and the longest URL in either dataset; here, no URL exceeds 450 characters), zero-padded at the end where necessary.

[11]:
def vectorize(df, max_length):
    # map each URL to the integer codes of its characters,
    # zero-padded on the right to max_length
    X = np.array(list(df.url.apply(lambda u: [ord(c) for c in u] + [0] * (max_length - len(u)))), dtype=np.double)
    return X

max_length = max(max(mdf.url.apply(len).max(), fdf.url.apply(len).max()), 450)
print(max_length)

M = vectorize(mdf, max_length)
F = vectorize(fdf, max_length)

M.shape, F.shape
450
[11]:
((294903, 450), (294903, 450))
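
As a quick sanity check, here is a sketch (using the vectorize function defined above, not part of the original notebook) of what the encoding looks like for a toy URL padded to length 20:

demo = pd.DataFrame({'url': ['http://a.b/c']})
vectorize(demo, 20)
# array([[104., 116., 116., 112.,  58.,  47.,  47.,  97.,  46.,  98.,  47.,
#          99.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])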

7.5. Autoencoding

7.5.1. Datasets and data loaders

[12]:
import torch
from torch.utils.data import Dataset, DataLoader

class UrlDataset(Dataset):
    # thin wrapper over the vectorized URL matrix; rows are returned as
    # numpy arrays and moved to the device inside the training loop
    def __init__(self, X, device):
        self.__device = device
        self.__X = X

    def __len__(self):
        return self.__X.shape[0]

    def __getitem__(self, idx):
        return self.__X[idx, :]

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

m_dataset = UrlDataset(X=M, device=device)
f_dataset = UrlDataset(X=F, device=device)

m_dataloader = DataLoader(m_dataset, batch_size=64, shuffle=True, num_workers=1)
f_dataloader = DataLoader(f_dataset, batch_size=64, shuffle=True, num_workers=1)
cuda
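
As a quick check on the loaders (a sketch, not in the original notebook), pull one batch and confirm its shape and type:

batch = next(iter(m_dataloader))
batch.shape, batch.dtype
# (torch.Size([64, 450]), torch.float64)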

7.5.2. AE model

[13]:

class AE(torch.nn.Module):
    def __init__(self, input_size):
        super().__init__()

        # encoder: progressively compress the input to a 9-dimensional code
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(input_size, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 36),
            torch.nn.ReLU(),
            torch.nn.Linear(36, 18),
            torch.nn.ReLU(),
            torch.nn.Linear(18, 9)
        )

        # decoder: mirror the encoder to reconstruct the input
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(9, 18),
            torch.nn.ReLU(),
            torch.nn.Linear(18, 36),
            torch.nn.ReLU(),
            torch.nn.Linear(36, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, input_size)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
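
A quick shape check of the model (a sketch, not part of the original notebook): the encoder maps a batch of 450-dimensional inputs to 9-dimensional codes, and the full autoencoder maps them back to 450 dimensions.

ae = AE(input_size=450).double()
x = torch.rand(2, 450, dtype=torch.double)
ae.encoder(x).shape, ae(x).shape
# (torch.Size([2, 9]), torch.Size([2, 450]))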

7.5.3. Learning

[14]:
def learn(data_loader, input_size, device, epochs=20):
    model = AE(input_size=input_size).double().to(device)
    model.train()

    loss_function = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-8)

    loss_df = []

    for epoch in range(epochs):
        losses = []

        for items in data_loader:
            items = items.to(device)
            optimizer.zero_grad()

            reconstructed = model(items)
            loss = loss_function(reconstructed, items)

            loss.backward()

            optimizer.step()

            losses.append(loss.item())

        losses = np.array(losses)

        loss_df.append({
            'epoch': epoch + 1,
            'loss': losses.mean()
        })

        print(f'{epoch + 1:03}, {losses.mean():.5f}')

    loss_df = pd.DataFrame(loss_df)
    loss_df.index = loss_df['epoch']
    loss_df = loss_df.drop(columns=['epoch'])

    return model, loss_df
[15]:
m_model, m_loss = learn(m_dataloader, M.shape[1], device)
001, 40.69729
002, 23.34037
003, 21.57728
004, 20.66604
005, 19.53925
006, 18.43908
007, 17.31271
008, 16.88268
009, 16.67882
010, 16.04619
011, 15.73320
012, 15.36572
013, 15.19503
014, 14.95645
015, 14.73917
016, 14.61808
017, 14.53333
018, 14.43027
019, 14.37942
020, 14.32605
[16]:
f_model, f_loss = learn(f_dataloader, F.shape[1], device)
001, 68.04555
002, 48.74619
003, 43.61743
004, 37.87440
005, 31.67343
006, 28.65478
007, 27.01096
008, 25.89886
009, 24.50368
010, 23.09080
011, 22.30850
012, 21.22333
013, 20.54085
014, 20.09835
015, 19.71120
016, 19.35925
017, 18.92354
018, 18.19207
019, 17.65860
020, 17.37112
[17]:
ax = m_loss['loss'].plot(kind='line', figsize=(15, 4), title='Malicious MSE Loss', ylabel='MSE')
_ = ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
[Figure: Malicious MSE Loss]
[18]:
ax = f_loss['loss'].plot(kind='line', figsize=(15, 4), title='Fake MSE Loss', ylabel='MSE')
_ = ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
[Figure: Fake MSE Loss]

7.5.4. Evaluation

[28]:
_ = m_model.eval()
_ = f_model.eval()
[108]:
def evaluate(m, f, X, device, urls):
    # reconstruct X with both autoencoders and compute, per URL, the Euclidean
    # distance between the original vector and its reconstruction
    T = torch.from_numpy(X).to(device)

    with torch.no_grad():
        m_preds = m(T).cpu().numpy()
        f_preds = f(T).cpu().numpy()

    m_dist = np.linalg.norm(X - m_preds, 2, axis=1)
    f_dist = np.linalg.norm(X - f_preds, 2, axis=1)

    return pd.DataFrame({
        'urls': list(urls),
        'm_dist': m_dist,
        'f_dist': f_dist
    })

m_result = evaluate(m_model, f_model, M, device, mdf.url)
f_result = evaluate(m_model, f_model, F, device, fdf.url)

m_result.shape, f_result.shape
[108]:
((294903, 3), (294903, 3))
[109]:
m_result.head()
[109]:
   urls                                 m_dist     f_dist
0  http://113.178.137.103:43063/bin.sh  12.796917  147.769682
1  http://117.195.92.136:52193/bin.sh   14.110015  157.243083
2  http://42.230.86.76:39194/i          13.937121  129.116049
3  http://45.206.219.185:39528/Mozi.m   13.059710  163.499153
4  http://182.117.30.44:43217/bin.sh    12.304893  145.153160
[110]:
f_result.head()
[110]:
   urls                                                m_dist      f_dist
0  http://www.terrell.info:18926/posts/explore/ex...  215.396759  121.751853
1  http://stephens.biz:47993/                          94.481302   51.613355
2  http://www.butler.com/explore/faq.htm              141.791605  102.246879
3  https://moore.com:40837/                           119.615730   26.017324
4  https://www.nelson-harris.com:34916/main/          151.358979   53.734932

These are the mean distances. As expected, each autoencoder reconstructs URLs from its own training distribution with a much smaller error than URLs from the other distribution.

[111]:
m_result[['m_dist', 'f_dist']].mean()
[111]:
m_dist     42.201429
f_dist    202.676389
dtype: float64
[112]:
f_result[['m_dist', 'f_dist']].mean()
[112]:
m_dist    150.676333
f_dist     82.323130
dtype: float64

We can use the mean distance plus one standard deviation as a threshold to determine whether an observation is an outlier with respect to the distribution an autoencoder was trained on.

[113]:
m_mean = m_result.m_dist.mean()
f_mean = f_result.f_dist.mean()

m_std = m_result.m_dist.std()
f_std = f_result.f_dist.std()

m_mean, m_std, f_mean, f_std
[113]:
(42.2014289456109, 67.19424483605282, 82.3231299935951, 31.48630596249839)
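
With these in hand, a single URL can be scored end to end. Below is a minimal sketch, where predict_url is a hypothetical helper (not part of the notebook) that applies the same vectorization and uses mean plus one standard deviation as the cutoff:

def predict_url(url, max_length=450):
    # hypothetical helper: vectorize one URL, reconstruct it with both
    # autoencoders, and compare the distances against the thresholds
    x = np.array([[ord(c) for c in url] + [0] * (max_length - len(url))], dtype=np.double)
    t = torch.from_numpy(x).to(device)
    with torch.no_grad():
        m_dist = np.linalg.norm(x - m_model(t).cpu().numpy(), 2, axis=1)[0]
        f_dist = np.linalg.norm(x - f_model(t).cpu().numpy(), 2, axis=1)[0]
    return {
        'm_dist': m_dist,
        'f_dist': f_dist,
        'looks_malicious': m_dist <= m_mean + m_std,
        'looks_fake': f_dist <= f_mean + f_std
    }

predict_url('http://42.230.86.76:39194/i')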

Here is the performance of the malicious autoencoder. A URL is predicted to be malicious (1) when its reconstruction distance under the malicious autoencoder is at most m_mean + m_std, and non-malicious (0) otherwise.

[116]:
from sklearn.metrics import confusion_matrix, roc_auc_score, average_precision_score, f1_score

# malicious URLs should reconstruct well, i.e. fall at or below the threshold (label 1)
p_1 = m_result.m_dist.apply(lambda v: 1 if v <= m_mean + m_std else 0)
y_1 = np.ones(len(p_1))

# fake URLs should reconstruct poorly under the malicious autoencoder (label 0)
p_0 = f_result.m_dist.apply(lambda v: 1 if v <= m_mean + m_std else 0)
y_0 = np.zeros(len(p_0))

y_pred = pd.concat([p_1, p_0]).values
y_true = np.concatenate([y_1, y_0])

confusion_matrix(y_true, y_pred)
[116]:
array([[282663,  12240],
       [ 41393, 253510]])
[117]:
roc_auc_score(y_true, y_pred), average_precision_score(y_true, y_pred), f1_score(y_true, y_pred)
[117]:
(0.9090667100707691, 0.8902257823896167, 0.9043383340497598)

Here is the performance of the fake autoencoder, using f_mean + f_std as the cutoff.

[118]:
p_1 = f_result.f_dist.apply(lambda v: 1 if v <= f_mean + f_std else 0)
y_1 = np.ones(len(p_1))

p_0 = m_result.f_dist.apply(lambda v: 1 if v <= f_mean + f_std else 0)
y_0 = np.zeros(len(p_0))

y_pred = pd.concat([p_1, p_0]).values
y_true = np.concatenate([y_1, y_0])

confusion_matrix(y_true, y_pred)
[118]:
array([[289469,   5434],
       [ 52385, 242518]])
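
The corresponding summary metrics for the fake autoencoder can be computed the same way (a sketch mirroring the earlier cell, using the y_true and y_pred defined just above):

roc_auc_score(y_true, y_pred), average_precision_score(y_true, y_pred), f1_score(y_true, y_pred)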