7. Autoencoders: Detecting Malicious URLs
Let’s see how we can use autoencoders to detect malicious URLs. In this notebook, we work with two sets of URLs: malicious and fake. The malicious URLs are real and taken from URLhaus; the fake URLs are generated with Faker.
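The detection idea is reconstruction error: an autoencoder trained on one class of URLs should reconstruct URLs from that class well and URLs from the other class poorly. Below is a minimal sketch of that scoring idea, assuming two already-trained models (the callables m_ae and f_ae are hypothetical stand-ins for the autoencoders we build later in this notebook).

import numpy as np

def reconstruction_error(ae, x):
    # L2 distance between a URL vector and its reconstruction
    return np.linalg.norm(x - ae(x), 2)

def classify(x, m_ae, f_ae):
    # a URL should reconstruct best under the model trained on its own class
    if reconstruction_error(m_ae, x) < reconstruction_error(f_ae, x):
        return 'malicious'
    return 'fake'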
7.1. Malicious URLs
[1]:
import pandas as pd
mdf = pd.read_csv('./cybersecurity/csv.txt', index_col=0)
mdf.shape
[1]:
(294903, 8)
[2]:
mdf.head()
[2]:
id | dateadded | url | url_status | last_online | threat | tags | urlhaus_link | reporter
---|---|---|---|---|---|---|---|---
1959127 | 2022-01-09 04:09:07 | http://113.178.137.103:43063/bin.sh | online | None | malware_download | 32-bit,elf,mips,Mozi | https://urlhaus.abuse.ch/url/1959127/ | geenensp
1959126 | 2022-01-09 04:09:06 | http://117.195.92.136:52193/bin.sh | online | None | malware_download | 32-bit,elf,mips,Mozi | https://urlhaus.abuse.ch/url/1959126/ | geenensp
1959125 | 2022-01-09 04:08:05 | http://42.230.86.76:39194/i | online | None | malware_download | 32-bit,elf,mips,Mozi | https://urlhaus.abuse.ch/url/1959125/ | geenensp
1959124 | 2022-01-09 04:07:06 | http://45.206.219.185:39528/Mozi.m | online | None | malware_download | elf,Mozi | https://urlhaus.abuse.ch/url/1959124/ | lrz_urlhaus
1959123 | 2022-01-09 04:07:05 | http://182.117.30.44:43217/bin.sh | online | None | malware_download | 32-bit,elf,mips,Mozi | https://urlhaus.abuse.ch/url/1959123/ | geenensp
7.2. Fake URLs
[3]:
from faker import Faker
import numpy as np
import random
np.random.seed(37)
random.seed(37)
Faker.seed(37)
def get_fake_url(f):
    def get_slash_index(u):
        # index of the third '/', i.e. the end of the scheme-plus-host
        # portion (e.g. 'http://host<HERE>/path')
        indices = [i for i, c in enumerate(u) if c == '/']
        return indices[2]

    url = f.uri()
    if np.random.random() > 0.3:
        # ~70% of the time, splice a random port after the host so the fake
        # URLs resemble the malicious ones, which typically carry ports
        slash_index = get_slash_index(url)
        first = url[:slash_index]
        second = url[slash_index:]
        port = f.port_number()
        url = f'{first}:{port}{second}'
    return url
fake = Faker()
fdf = pd.DataFrame({'url': [get_fake_url(fake) for _ in range(mdf.shape[0])]})
fdf.shape
[3]:
(294903, 1)
[4]:
fdf.head()
[4]:
 | url
---|---
0 | http://www.terrell.info:18926/posts/explore/ex...
1 | http://stephens.biz:47993/
2 | http://www.butler.com/explore/faq.htm
3 | https://moore.com:40837/
4 | https://www.nelson-harris.com:34916/main/
7.3. Visualize URL length and character distributions
Let’s observe the distributions of URL lengths and characters for both datasets.
[5]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
plt.style.use('fivethirtyeight')
lengths = mdf.url.apply(len)
ax = lengths.value_counts().sort_index()\
    .plot(kind='bar', figsize=(25, 4),
          title=f'Distribution of Malicious URL Lengths, min={lengths.min()}, max={lengths.max()}')
_ = ax.xaxis.set_major_locator(ticker.MultipleLocator(10))
[6]:
lengths = fdf.url.apply(len)
ax = lengths.value_counts().sort_index()\
    .plot(kind='bar', figsize=(25, 4),
          title=f'Distribution of Fake URL Lengths, min={lengths.min()}, max={lengths.max()}')
_ = ax.xaxis.set_major_locator(ticker.MultipleLocator(10))
[7]:
import itertools
from collections import Counter
def get_character_distribution(df):
    # count every character across all URLs, then sort by character
    s = sorted([(k, v) for k, v in Counter(itertools.chain(*df.url.apply(lambda u: [c for c in u]))).items()])
    s = pd.Series([v for _, v in s], index=[k for k, _ in s])
    return s
ax = get_character_distribution(mdf).plot(kind='bar', figsize=(25, 4), title='Distribution of Malicious Letters')
[8]:
ax = get_character_distribution(fdf).plot(kind='bar', figsize=(15, 4), title='Distribution of Fake Letters')
[9]:
def get_ord_distribution(df):
    # same as above, but count characters by their integer (ordinal) code
    s = sorted([(k, v) for k, v in Counter(itertools.chain(*df.url.apply(lambda u: [ord(c) for c in u]))).items()])
    s = pd.Series([v for _, v in s], index=[k for k, _ in s])
    return s
ax = get_ord_distribution(mdf).plot(kind='bar', figsize=(25, 4), title='Distribution of Malicious Letters (Integer)')
[10]:
ax = get_ord_distribution(fdf).plot(kind='bar', figsize=(15, 4), title='Distribution of Fake Letters (Integer)')
7.4. Vectorize URLs
We will vectorize each URL using its ASCII code representation. Each vector is 450 elements long, zero-padded on the right where necessary (no URL in either dataset exceeds 450 characters).
[11]:
import numpy as np
def vectorize(df, max_length):
    # map each character to its ordinal code, right-padded with zeros to max_length
    X = np.array(list(df.url.apply(lambda u: [ord(c) for c in u] + [0 for _ in range(len(u), max_length)])), dtype=np.double)
    return X
max_length = max(max(mdf.url.apply(len).max(), fdf.url.apply(len).max()), 450)
print(max_length)
M = vectorize(mdf, max_length)
F = vectorize(fdf, max_length)
M.shape, F.shape
450
[11]:
((294903, 450), (294903, 450))
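To make the padding concrete, here is a small worked example with a hypothetical toy URL.

url = 'http://a.io/'
vec = [ord(c) for c in url] + [0] * (max_length - len(url))
len(vec), vec[:15]  # (450, [104, 116, 116, 112, 58, 47, 47, 97, 46, 105, 111, 47, 0, 0, 0])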
7.5. Autoencoding
7.5.1. Datasets and data loaders
[12]:
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.transforms import *
class UrlDataset(Dataset):
    def __init__(self, X, device):
        self.__device = device
        self.__X = X

    def __len__(self):
        return self.__X.shape[0]

    def __getitem__(self, idx):
        item = self.__X[idx, :]
        return item
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)
m_dataset = UrlDataset(X=M, device=device)
f_dataset = UrlDataset(X=F, device=device)
m_dataloader = DataLoader(m_dataset, batch_size=64, shuffle=True, num_workers=1)
f_dataloader = DataLoader(f_dataset, batch_size=64, shuffle=True, num_workers=1)
cuda
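As a quick check (an addition, not in the original notebook), one batch from the loader should come back as a 64 × 450 tensor.

batch = next(iter(m_dataloader))
batch.shape  # torch.Size([64, 450])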
7.5.2. AE model
[13]:
from torchvision import datasets
from torchvision import transforms
class AE(torch.nn.Module):
    def __init__(self, input_size):
        super().__init__()
        # encoder compresses the 450-dimensional URL vector down to 9 dimensions
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(input_size, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 36),
            torch.nn.ReLU(),
            torch.nn.Linear(36, 18),
            torch.nn.ReLU(),
            torch.nn.Linear(18, 9)
        )
        # decoder mirrors the encoder to reconstruct the input
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(9, 18),
            torch.nn.ReLU(),
            torch.nn.Linear(18, 36),
            torch.nn.ReLU(),
            torch.nn.Linear(36, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, input_size)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
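As a quick sanity check (not part of the original notebook), we can push a random batch through the model and confirm the reconstruction has the same shape as the input.

_ae = AE(input_size=450).double()
_x = torch.rand(8, 450, dtype=torch.double)
assert _ae(_x).shape == _x.shape  # reconstruction matches input shape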
7.5.3. Learning
[14]:
def learn(data_loader, input_size, device, epochs=20):
    model = AE(input_size=input_size).double().to(device)
    model.train()

    # the reconstruction target is the input itself; no labels are needed
    loss_function = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-8)

    loss_df = []
    for epoch in range(epochs):
        losses = []
        for items in data_loader:
            items = items.to(device)

            optimizer.zero_grad()
            reconstructed = model(items)
            loss = loss_function(reconstructed, items)
            loss.backward()
            optimizer.step()

            losses.append(loss.detach().cpu().numpy().item())

        losses = np.array(losses)
        loss_df.append({
            'epoch': epoch + 1,
            'loss': losses.mean()
        })
        print(f'{epoch + 1:03}, {losses.mean():.5f}')

    loss_df = pd.DataFrame(loss_df)
    loss_df.index = loss_df['epoch']
    loss_df = loss_df.drop(columns=['epoch'])

    return model, loss_df
[15]:
m_model, m_loss = learn(m_dataloader, M.shape[1], device)
001, 40.69729
002, 23.34037
003, 21.57728
004, 20.66604
005, 19.53925
006, 18.43908
007, 17.31271
008, 16.88268
009, 16.67882
010, 16.04619
011, 15.73320
012, 15.36572
013, 15.19503
014, 14.95645
015, 14.73917
016, 14.61808
017, 14.53333
018, 14.43027
019, 14.37942
020, 14.32605
[16]:
f_model, f_loss = learn(f_dataloader, F.shape[1], device)
001, 68.04555
002, 48.74619
003, 43.61743
004, 37.87440
005, 31.67343
006, 28.65478
007, 27.01096
008, 25.89886
009, 24.50368
010, 23.09080
011, 22.30850
012, 21.22333
013, 20.54085
014, 20.09835
015, 19.71120
016, 19.35925
017, 18.92354
018, 18.19207
019, 17.65860
020, 17.37112
[17]:
ax = m_loss['loss'].plot(kind='line', figsize=(15, 4), title='Malicious MSE Loss', ylabel='MSE')
_ = ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
[18]:
ax = f_loss['loss'].plot(kind='line', figsize=(15, 4), title='Fake MSE Loss', ylabel='MSE')
_ = ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
7.5.4. Evaluation
[28]:
_ = m_model.eval()
_ = f_model.eval()
[108]:
def evaluate(m, f, X, device, urls):
    T = torch.from_numpy(X).to(device)
    m_preds = m(T).detach().cpu().numpy()
    f_preds = f(T).detach().cpu().numpy()

    # L2 distance between each URL vector and its reconstruction under each model
    m_dist = np.linalg.norm(X - m_preds, 2, axis=1)
    f_dist = np.linalg.norm(X - f_preds, 2, axis=1)

    return pd.DataFrame({
        'urls': list(urls),
        'm_dist': m_dist,
        'f_dist': f_dist
    })
m_result = evaluate(m_model, f_model, M, device, mdf.url)
f_result = evaluate(m_model, f_model, F, device, fdf.url)
m_result.shape, f_result.shape
[108]:
((294903, 3), (294903, 3))
[109]:
m_result.head()
[109]:
 | urls | m_dist | f_dist
---|---|---|---
0 | http://113.178.137.103:43063/bin.sh | 12.796917 | 147.769682
1 | http://117.195.92.136:52193/bin.sh | 14.110015 | 157.243083
2 | http://42.230.86.76:39194/i | 13.937121 | 129.116049
3 | http://45.206.219.185:39528/Mozi.m | 13.059710 | 163.499153
4 | http://182.117.30.44:43217/bin.sh | 12.304893 | 145.153160
[110]:
f_result.head()
[110]:
 | urls | m_dist | f_dist
---|---|---|---
0 | http://www.terrell.info:18926/posts/explore/ex... | 215.396759 | 121.751853
1 | http://stephens.biz:47993/ | 94.481302 | 51.613355
2 | http://www.butler.com/explore/faq.htm | 141.791605 | 102.246879
3 | https://moore.com:40837/ | 119.615730 | 26.017324
4 | https://www.nelson-harris.com:34916/main/ | 151.358979 | 53.734932
These are the mean distances under each autoencoder. Each class of URL should reconstruct with a smaller mean distance under the autoencoder trained on that class than under the other.
[111]:
m_result[['m_dist', 'f_dist']].mean()
[111]:
m_dist 42.201429
f_dist 202.676389
dtype: float64
[112]:
f_result[['m_dist', 'f_dist']].mean()
[112]:
m_dist 150.676333
f_dist 82.323130
dtype: float64
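These means already suggest a simple threshold-free decision rule, sketched below (this is not what the notebook does next, which thresholds each model separately): assign each URL to whichever autoencoder reconstructs it with the smaller distance.

# fraction of each class assigned correctly by direct comparison
(m_result.m_dist < m_result.f_dist).mean(), (f_result.f_dist < f_result.m_dist).mean()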
We can use the mean distance plus its standard deviation as a threshold to determine whether an observation is an outlier with respect to each autoencoder.
[113]:
m_mean = m_result.m_dist.mean()
f_mean = f_result.f_dist.mean()
m_std = m_result.m_dist.std()
f_std = f_result.f_dist.std()
m_mean, m_std, f_mean, f_std
[113]:
(42.2014289456109, 67.19424483605282, 82.3231299935951, 31.48630596249839)
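From these statistics, the outlier cut-offs used below follow directly.

m_mean + m_std, f_mean + f_std  # (~109.396, ~113.809)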
Here’s the performance of the malicious autoencoder.
[116]:
from sklearn.metrics import confusion_matrix, roc_auc_score, average_precision_score, f1_score
p_1 = m_result.m_dist.apply(lambda v: 1 if v <= m_mean + m_std else 0)
y_1 = np.ones(len(p_1))
p_0 = f_result.m_dist.apply(lambda v: 1 if v <= m_mean + m_std else 0)
y_0 = np.zeros(len(p_0))
y_pred = pd.concat([p_1, p_0]).values
y_true = np.concatenate([y_1, y_0])
confusion_matrix(y_true, y_pred)
[116]:
array([[282663, 12240],
[ 41393, 253510]])
[117]:
roc_auc_score(y_true, y_pred), average_precision_score(y_true, y_pred), f1_score(y_true, y_pred)
[117]:
(0.9090667100707691, 0.8902257823896167, 0.9043383340497598)
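Note that roc_auc_score above is computed from hard 0/1 predictions. A sketch of the threshold-free alternative is to score with the negated reconstruction distance, so that larger scores mean “more malicious.”

scores = -np.concatenate([m_result.m_dist.values, f_result.m_dist.values])
roc_auc_score(y_true, scores)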
Here’s the performance of the fake autoencoder.
[118]:
p_1 = f_result.f_dist.apply(lambda v: 1 if v <= f_mean + f_std else 0)
y_1 = np.ones(len(p_1))
p_0 = m_result.f_dist.apply(lambda v: 1 if v <= f_mean + f_std else 0)
y_0 = np.zeros(len(p_0))
y_pred = pd.concat([p_1, p_0]).values
y_true = np.concatenate([y_1, y_0])
confusion_matrix(y_true, y_pred)
[118]:
array([[289469, 5434],
[ 52385, 242518]])
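For symmetry with the malicious autoencoder, the same summary metrics can be computed here as well (output omitted).

roc_auc_score(y_true, y_pred), average_precision_score(y_true, y_pred), f1_score(y_true, y_pred)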