3. Chernoff Faces, Classification

In this notebook, we take the synthetic data generated from chernoff-faces.ipynb and apply two common classification techniques, logistic regression and random forest, to the data.

3.1. Data

We first have to load the data. There are two data sets we are acquiring: the training data set T and the validation data set V. We will use T for learning and validate the model against V (which the model has not seen before).

[1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from collections import namedtuple

np.random.seed(37)

tr_df = pd.read_csv('./faces/data-train.csv')
va_df = pd.read_csv('./faces/data-valid.csv')

Data = namedtuple('Data', 'X y')
T = Data(tr_df[[i for i in tr_df.columns if i != 'y']], tr_df['y'])
V = Data(va_df[[i for i in va_df.columns if i != 'y']], va_df['y'])

3.2. Classification performance
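
The helper below fits a model on T, predicts on V, and computes per-class sensitivity, specificity, accuracy, F1 score and AUC from the one-vs-rest confusion matrices.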

[2]:
from sklearn.metrics import accuracy_score, multilabel_confusion_matrix
from sklearn.metrics import roc_auc_score

Metric = namedtuple('Metric', 'clazz tn fp fn tp sen spe acc f1 auc')

def get_classification_metrics(model, T, V):
    def get_metrics(clazz, cmatrix):
        # unpack the 2x2 one-vs-rest confusion matrix for this class
        tn, fp, fn, tp = cmatrix[0][0], cmatrix[0][1], cmatrix[1][0], cmatrix[1][1]
        sen = tp / (tp + fn)
        spe = tn / (tn + fp)
        acc = (tp + tn) / (tp + fp + fn + tn)
        f1 = (2.0 * tp) / (2 * tp + fp + fn)
        return clazz, tn, fp, fn, tp, sen, spe, acc, f1

    model.fit(T.X, T.y)
    y_pred = model.predict(V.X)
    # one 2x2 confusion matrix per class, in sorted label order
    cmatrices = multilabel_confusion_matrix(V.y, y_pred)

    try:
        clazzes = sorted(T.y.value_counts().index)
    except AttributeError:
        # T.y is a plain numpy array rather than a pandas Series
        clazzes = np.unique(T.y).astype(int)

    # class probabilities are needed for the per-class AUCs
    y_prob = model.predict_proba(V.X)
    metrics = []
    for clazz in clazzes:
        clazz, tn, fp, fn, tp, sen, spe, acc, f1 = get_metrics(clazz, cmatrices[clazz])
        y_true = [1 if y == clazz else 0 for y in V.y]
        auc = roc_auc_score(y_true, y_prob[:, clazz])
        metrics.append(Metric(clazz, tn, fp, fn, tp, sen, spe, acc, f1, auc))
    return metrics

def print_classification_metrics(metrics):
    for m in metrics:
        print('{}: sen = {:.5f}, spe = {:.5f}, acc = {:.5f}, f1 = {:.5f}, auc = {:.5f}'
              .format(m.clazz, m.sen, m.spe, m.acc, m.f1, m.auc))
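
As a quick sanity check of these helpers, here is a toy run on a linearly separable two-class problem (the data below is made up purely for illustration); every metric should come out to 1.

from sklearn.linear_model import LogisticRegression

# toy data (illustrative only): two well-separated classes
X_toy = pd.DataFrame({'x1': [0, 1, 0, 1, 5, 6, 5, 6],
                      'x2': [0, 0, 1, 1, 5, 5, 6, 6]})
y_toy = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])
toy = Data(X_toy, y_toy)

# training and validating on the same toy data; all metrics should be 1.0
print_classification_metrics(
    get_classification_metrics(LogisticRegression(random_state=37), toy, toy))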

3.3. Logistic regression

As the results below show, logistic regression does not do so well on classes 2 and 3 in terms of sensitivity. The problem might be due to data imbalance in the training data.

[3]:
from sklearn.linear_model import LogisticRegression

print_classification_metrics(
    get_classification_metrics(
        LogisticRegression(random_state=37, multi_class='ovr', solver='newton-cg'), T, V))
0: sen = 1.00000, spe = 0.82667, acc = 0.87000, f1 = 0.79365, auc = 1.00000
1: sen = 1.00000, spe = 0.74667, acc = 0.81000, f1 = 0.72464, auc = 1.00000
2: sen = 0.72000, spe = 1.00000, acc = 0.93000, f1 = 0.83721, auc = 1.00000
3: sen = 0.00000, spe = 1.00000, acc = 0.75000, f1 = 0.00000, auc = 0.66347
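
We can check the suspected imbalance directly by counting the labels in the training data (a quick sketch; the exact counts depend on how the data was generated in chernoff-faces.ipynb).

# per-class counts in the training labels; a skewed distribution here
# would explain the poor sensitivity on the rarer classes
print(tr_df['y'].value_counts().sort_index())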

3.4. Random forest

Random forest does well across all classification measures, even in the face of data imbalance.

[4]:
from sklearn.ensemble import RandomForestClassifier

print_classification_metrics(
    get_classification_metrics(
        RandomForestClassifier(n_estimators=100, random_state=37), T, V))
0: sen = 1.00000, spe = 0.98667, acc = 0.99000, f1 = 0.98039, auc = 1.00000
1: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
2: sen = 0.88000, spe = 0.93333, acc = 0.92000, f1 = 0.84615, auc = 0.98107
3: sen = 0.76000, spe = 0.96000, acc = 0.91000, f1 = 0.80851, auc = 0.96720

3.5. Sample data to counter data imbalance

Let’s see if we can combat the data imbalance by learning a distribution per class and then sampling new data from each of those distributions. Note that we will learn from this new data set, not the original synthetic one. Furthermore, we will generate 10,000 new samples per class, both to make sure we have enough samples and to remove the imbalance. We call this new data set S to distinguish it from T. Below, we learn models from S and validate against V.

[5]:
from scipy.stats import multivariate_normal

def sample(mvn, N=10000):
    # draw N samples from the fitted class-conditional Gaussian
    X = multivariate_normal.rvs(mean=mvn.mean, cov=mvn.cov, size=N)
    y = np.full((N, 1), mvn.clazz, dtype=np.int32)
    return np.hstack([X, y])

Mvn = namedtuple('Mvn', 'clazz mean cov')

X_cols = [i for i in tr_df.columns if i != 'y']

# estimate a multivariate normal (mean vector, covariance matrix) per class
mvns = { clazz: Mvn(clazz,
                    tr_df[tr_df['y'] == clazz][X_cols].mean().values,
                    tr_df[tr_df['y'] == clazz][X_cols].cov().values)
        for clazz in sorted(tr_df['y'].value_counts().index) }

S = np.vstack([sample(mvn) for mvn in mvns.values()])
print(S.shape)

# split the stacked array back into features and labels
X = S[:, :-1]
y = S[:, -1]

S = Data(X, y)
(40000, 19)
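
As a sanity check on the sampling (a sketch, not part of the original analysis), the per-class feature means of S should sit close to the fitted class means used to generate it.

# maximum absolute deviation between each class's sampled feature means
# and the mean of the Gaussian it was drawn from; should be near zero
for clazz, mvn in mvns.items():
    deviation = np.abs(S.X[S.y == clazz].mean(axis=0) - mvn.mean).max()
    print(clazz, deviation)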

3.6. Logistic regression applied to sampled data

So, sampling really does help logistic regression. Look at the sensitivities for classes 2 and 3 go up (with class imbalance gone)!

[6]:
print_classification_metrics(
    get_classification_metrics(
        LogisticRegression(random_state=37, multi_class='ovr', solver='newton-cg'), S, V))
0: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
1: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
2: sen = 0.88000, spe = 0.97333, acc = 0.95000, f1 = 0.89796, auc = 0.99467
3: sen = 0.92000, spe = 0.96000, acc = 0.95000, f1 = 0.90196, auc = 0.99253

3.7. Random forest applied to sampled data

Random forest is also a beneficiary of sampling!

[7]:
print_classification_metrics(
    get_classification_metrics(
        RandomForestClassifier(n_estimators=100, random_state=37), S, V))
0: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
1: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
2: sen = 0.96000, spe = 0.97333, acc = 0.97000, f1 = 0.94118, auc = 0.99787
3: sen = 0.92000, spe = 0.98667, acc = 0.97000, f1 = 0.93878, auc = 0.99733

3.8. Comparing classifications

Now, let’s compare and contrast the classification performance of the Inception V3 network against that of random forest.

Inception V3 had the following validation results (note that its last column is MCC rather than AUC).

0: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, mcc = 1.00000
1: sen = 1.00000, spe = 0.97333, acc = 0.98000, f1 = 0.96154, mcc = 0.94933
2: sen = 0.80000, spe = 0.94667, acc = 0.91000, f1 = 0.81633, mcc = 0.75703
3: sen = 0.76000, spe = 0.93333, acc = 0.89000, f1 = 0.77551, mcc = 0.70296

Random forest had the following validation results.

0: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
1: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
2: sen = 0.96000, spe = 0.97333, acc = 0.97000, f1 = 0.94118, auc = 0.99787
3: sen = 0.92000, spe = 0.98667, acc = 0.97000, f1 = 0.93878, auc = 0.99733

For classes 0 and 1, both techniques produced nearly identical results; this is not surprising, since those classes were relatively abundant in the training data. The performances for classes 2 and 3 are where we see the biggest differences: clearly, random forest does a whole lot better with the help of sampling. Inception V3 should also benefit from sampling, if there is a way to sample from the images (we pretend we do NOT know that the Chernoff faces came from multivariate, multi-level Gaussian distributions).
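
Pulling the two rare classes out of the lists above makes the gap plain.

class    Inception V3 (sen / f1)    random forest (sen / f1)
2        0.80000 / 0.81633          0.96000 / 0.94118
3        0.76000 / 0.77551          0.92000 / 0.93878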