3. Chernoff Faces, Classification

In this notebook, we take the synthetic data generated from chernoff-faces.ipynb and apply two common classification techniques, logistic regression and random forest, to the data.

3.1. Data

We first have to load the data. There are two data sets we are acquiring: the training data set T and the validation data set V. We will use T for learning and validate the model against V (which the model has not seen before).

[1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from collections import namedtuple

np.random.seed(37)

tr_df = pd.read_csv('./faces/data-train.csv')
va_df = pd.read_csv('./faces/data-valid.csv')

Data = namedtuple('Data', 'X y')
T = Data(tr_df[[i for i in tr_df.columns if i != 'y']], tr_df['y'])
V = Data(va_df[[i for i in va_df.columns if i != 'y']], va_df['y'])

3.2. Classification performance
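
The helper below fits a model on T, predicts on V, and computes per-class sensitivity, specificity, accuracy, F1 score and AUC from the one-vs-rest confusion matrices.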

[2]:
from sklearn.metrics import accuracy_score, multilabel_confusion_matrix
from sklearn.metrics import roc_auc_score

Metric = namedtuple('Metric', 'clazz tn fp fn tp sen spe acc f1 auc')

def get_classification_metrics(model, T, V):
    def get_metrics(clazz, cmatrix):
        # unpack the 2x2 one-vs-rest confusion matrix for this class
        tn, fp, fn, tp = cmatrix[0][0], cmatrix[0][1], cmatrix[1][0], cmatrix[1][1]
        sen = tp / (tp + fn)
        spe = tn / (tn + fp)
        acc = (tp + tn) / (tp + fp + fn + tn)
        f1 = (2.0 * tp) / (2 * tp + fp + fn)
        return clazz, tn, fp, fn, tp, sen, spe, acc, f1

    model.fit(T.X, T.y)
    y_pred = model.predict(V.X)
    # one 2x2 confusion matrix per class, in sorted label order
    cmatrices = multilabel_confusion_matrix(V.y, y_pred)

    try:
        clazzes = sorted(T.y.value_counts().index)
    except AttributeError:
        # T.y is a plain numpy array rather than a pandas Series
        clazzes = np.unique(T.y).astype(int)

    # class probabilities are needed for the per-class AUCs
    y_prob = model.predict_proba(V.X)
    metrics = []
    for clazz in clazzes:
        clazz, tn, fp, fn, tp, sen, spe, acc, f1 = get_metrics(clazz, cmatrices[clazz])
        y_true = [1 if y == clazz else 0 for y in V.y]
        auc = roc_auc_score(y_true, y_prob[:, clazz])
        metrics.append(Metric(clazz, tn, fp, fn, tp, sen, spe, acc, f1, auc))
    return metrics

def print_classification_metrics(metrics):
    for m in metrics:
        print('{}: sen = {:.5f}, spe = {:.5f}, acc = {:.5f}, f1 = {:.5f}, auc = {:.5f}'
              .format(m.clazz, m.sen, m.spe, m.acc, m.f1, m.auc))
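
As a quick sanity check of these helpers, here is a toy run on a linearly separable two-class problem (the data below is made up purely for illustration); every metric should come out to 1.

from sklearn.linear_model import LogisticRegression

# toy data (illustrative only): two well-separated classes
X_toy = pd.DataFrame({'x1': [0, 1, 0, 1, 5, 6, 5, 6],
                      'x2': [0, 0, 1, 1, 5, 5, 6, 6]})
y_toy = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])
toy = Data(X_toy, y_toy)

# training and validating on the same toy data; all metrics should be 1.0
print_classification_metrics(
    get_classification_metrics(LogisticRegression(random_state=37), toy, toy))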

3.3. Logistic regression

As the results below show, logistic regression does not do so well on classes 2 and 3 in terms of sensitivity. The problem might be due to data imbalance in the training data.

[3]:
from sklearn.linear_model import LogisticRegression

print_classification_metrics(
    get_classification_metrics(
        LogisticRegression(random_state=37, multi_class='ovr', solver='newton-cg'), T, V))
0: sen = 1.00000, spe = 0.82667, acc = 0.87000, f1 = 0.79365, auc = 1.00000
1: sen = 1.00000, spe = 0.74667, acc = 0.81000, f1 = 0.72464, auc = 1.00000
2: sen = 0.72000, spe = 1.00000, acc = 0.93000, f1 = 0.83721, auc = 1.00000
3: sen = 0.00000, spe = 1.00000, acc = 0.75000, f1 = 0.00000, auc = 0.66347
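
We can check the suspected imbalance directly by counting the labels in the training data (a quick sketch; the exact counts depend on how the data was generated in chernoff-faces.ipynb).

# per-class counts in the training labels; a skewed distribution here
# would explain the poor sensitivity on the rarer classes
print(tr_df['y'].value_counts().sort_index())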

3.4. Random forest

Random forest does well across all classification measures, even in the face of data imbalance.

[4]:
from sklearn.ensemble import RandomForestClassifier

print_classification_metrics(
    get_classification_metrics(
        RandomForestClassifier(n_estimators=100, random_state=37), T, V))
0: sen = 1.00000, spe = 0.98667, acc = 0.99000, f1 = 0.98039, auc = 1.00000
1: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
2: sen = 0.88000, spe = 0.93333, acc = 0.92000, f1 = 0.84615, auc = 0.98107
3: sen = 0.76000, spe = 0.96000, acc = 0.91000, f1 = 0.80851, auc = 0.96720

3.5. Sample data to counter data imbalance

Let’s see if we can combat the data imbalance by learning a distribution per class and then sampling new data from each of those distributions. Note that we will learn from this new data set, not the original synthetic one. Furthermore, we will generate 10,000 new samples per class, both to make sure we have enough samples and to remove the imbalance. We call this new data set S to distinguish it from T. Below, we learn models from S and validate against V.

[5]:
from scipy.stats import multivariate_normal

def sample(mvn, N=10000):
    # draw N samples from the fitted class-conditional Gaussian
    X = multivariate_normal.rvs(mean=mvn.mean, cov=mvn.cov, size=N)
    y = np.full((N, 1), mvn.clazz, dtype=np.int32)
    return np.hstack([X, y])

Mvn = namedtuple('Mvn', 'clazz mean cov')

X_cols = [i for i in tr_df.columns if i != 'y']

# estimate a multivariate normal (mean vector, covariance matrix) per class
mvns = { clazz: Mvn(clazz,
                    tr_df[tr_df['y'] == clazz][X_cols].mean().values,
                    tr_df[tr_df['y'] == clazz][X_cols].cov().values)
        for clazz in sorted(tr_df['y'].value_counts().index) }

S = np.vstack([sample(mvn) for mvn in mvns.values()])
print(S.shape)

# split the stacked array back into features and labels
X = S[:, :-1]
y = S[:, -1]

S = Data(X, y)
(40000, 19)
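
As a sanity check on the sampling (a sketch, not part of the original analysis), the per-class feature means of S should sit close to the fitted class means used to generate it.

# maximum absolute deviation between each class's sampled feature means
# and the mean of the Gaussian it was drawn from; should be near zero
for clazz, mvn in mvns.items():
    deviation = np.abs(S.X[S.y == clazz].mean(axis=0) - mvn.mean).max()
    print(clazz, deviation)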

3.6. Logistic regression applied to sampled data

So, sampling really does help logistic regression. Look at the sensitivities for classes 2 and 3 go up (with class imbalance gone)!

[6]:
print_classification_metrics(
    get_classification_metrics(
        LogisticRegression(random_state=37, multi_class='ovr', solver='newton-cg'), S, V))
0: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
1: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
2: sen = 0.88000, spe = 0.97333, acc = 0.95000, f1 = 0.89796, auc = 0.99467
3: sen = 0.92000, spe = 0.96000, acc = 0.95000, f1 = 0.90196, auc = 0.99253

3.7. Random forest applied to sampled data

Random forest is also a beneficiary of sampling!

[7]:
print_classification_metrics(
    get_classification_metrics(
        RandomForestClassifier(n_estimators=100, random_state=37), S, V))
0: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
1: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
2: sen = 0.96000, spe = 0.97333, acc = 0.97000, f1 = 0.94118, auc = 0.99787
3: sen = 0.92000, spe = 0.98667, acc = 0.97000, f1 = 0.93878, auc = 0.99733

3.8. Comparing classifications

Now, let’s compare and contrast the classification performance of the Inception V3 network against that of random forest.

Inception V3 had the following validation results (note that its last column is MCC rather than AUC).

0: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, mcc = 1.00000
1: sen = 1.00000, spe = 0.97333, acc = 0.98000, f1 = 0.96154, mcc = 0.94933
2: sen = 0.80000, spe = 0.94667, acc = 0.91000, f1 = 0.81633, mcc = 0.75703
3: sen = 0.76000, spe = 0.93333, acc = 0.89000, f1 = 0.77551, mcc = 0.70296

Random forest had the following validation results.

0: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
1: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
2: sen = 0.96000, spe = 0.97333, acc = 0.97000, f1 = 0.94118, auc = 0.99787
3: sen = 0.92000, spe = 0.98667, acc = 0.97000, f1 = 0.93878, auc = 0.99733

For classes 0 and 1, both techniques produced nearly identical results; this is not surprising, since those classes were relatively abundant in the training data. The performances for classes 2 and 3 are where we see the biggest differences: clearly, random forest does a whole lot better with the help of sampling. Inception V3 should also benefit from sampling, if there is a way to sample from the images (we pretend we do NOT know that the Chernoff faces came from multivariate, multi-level Gaussian distributions).
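
Pulling the two rare classes out of the lists above makes the gap plain.

class    Inception V3 (sen / f1)    random forest (sen / f1)
2        0.80000 / 0.81633          0.96000 / 0.94118
3        0.76000 / 0.77551          0.92000 / 0.93878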