3. Chernoff Faces, Classification
In this notebook, we take the synthetic data generated from chernoff-faces.ipynb and apply two common classification techniques (logistic regression and random forest) to classify the data.
3.1. Data
We first load the data. There are two data sets: the training set T and the validation set V. We will use T for learning and validate the model against V (which the model has not seen before).
[1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from collections import namedtuple
np.random.seed(37)
tr_df = pd.read_csv('./faces/data-train.csv')
va_df = pd.read_csv('./faces/data-valid.csv')
Data = namedtuple('Data', 'X y')
T = Data(tr_df[[i for i in tr_df.columns if i != 'y']], tr_df['y'])
V = Data(va_df[[i for i in va_df.columns if i != 'y']], va_df['y'])
3.2. Classification performance
[2]:
from sklearn.metrics import accuracy_score, multilabel_confusion_matrix
from sklearn.metrics import roc_auc_score
Metric = namedtuple('Metric', 'clazz tn fp fn tp sen spe acc f1 auc')
def get_classification_metrics(model, T, V):
    def get_metrics(clazz, cmatrix):
        # each per-class matrix is laid out as [[tn, fp], [fn, tp]]
        tn, fp, fn, tp = cmatrix[0][0], cmatrix[0][1], cmatrix[1][0], cmatrix[1][1]
        sen = tp / (tp + fn)
        spe = tn / (tn + fp)
        acc = (tp + tn) / (tp + fp + fn + tn)
        f1 = (2.0 * tp) / (2 * tp + fp + fn)
        return clazz, tn, fp, fn, tp, sen, spe, acc, f1

    # learn from T, then predict hard labels for V
    model.fit(T.X, T.y)
    y_pred = model.predict(V.X)
    cmatrices = multilabel_confusion_matrix(V.y, y_pred)
    try:
        # T.y is a pandas Series
        clazzes = sorted(list(T.y.value_counts().index))
    except AttributeError:
        # T.y is a numpy array, which has no value_counts()
        clazzes = np.unique(T.y).astype(int)
    # class probabilities are needed for the one-vs-rest AUC
    y_pred = model.predict_proba(V.X)
    metrics = []
    for clazz in clazzes:
        clazz, tn, fp, fn, tp, sen, spe, acc, f1 = get_metrics(clazz, cmatrices[clazz])
        y_true = [1 if y == clazz else 0 for y in V.y]
        auc = roc_auc_score(y_true, y_pred[:,clazz])
        metric = Metric(clazz, tn, fp, fn, tp, sen, spe, acc, f1, auc)
        metrics.append(metric)
    return metrics

def print_classification_metrics(metrics):
    for m in metrics:
        print('{}: sen = {:.5f}, spe = {:.5f}, acc = {:.5f}, f1 = {:.5f}, auc = {:.5f}'
              .format(m.clazz, m.sen, m.spe, m.acc, m.f1, m.auc))
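To make the per-class measures concrete, here is a small NumPy-only sketch that counts one class against all the others, using the same [[tn, fp], [fn, tp]] layout the function above reads from each per-class matrix. The helper name `one_vs_rest_counts` and the toy labels are illustrative and are not part of the notebook.

```python
import numpy as np

def one_vs_rest_counts(y_true, y_pred, clazz):
    """Return (tn, fp, fn, tp), treating `clazz` as positive and all other classes as negative."""
    t = np.asarray(y_true) == clazz
    p = np.asarray(y_pred) == clazz
    tp = int(np.sum(t & p))
    fn = int(np.sum(t & ~p))
    fp = int(np.sum(~t & p))
    tn = int(np.sum(~t & ~p))
    return tn, fp, fn, tp

# Toy labels (hypothetical): one class-1 sample is mistaken for class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

tn, fp, fn, tp = one_vs_rest_counts(y_true, y_pred, clazz=2)
sen = tp / (tp + fn)                    # sensitivity (recall)
spe = tn / (tn + fp)                    # specificity
acc = (tp + tn) / (tp + fp + fn + tn)   # accuracy
f1 = (2.0 * tp) / (2 * tp + fp + fn)    # F1 score
print(tn, fp, fn, tp)  # 3 1 0 2
```

For class 2 in this toy case, sensitivity is perfect while specificity drops, because the single error is a false positive.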
3.3. Logistic regression
As the output below shows, logistic regression does poorly on classes 2 and 3 in terms of sensitivity (0.72 and 0.00, respectively). The problem might be due to class imbalance in the training data.
[3]:
from sklearn.linear_model import LogisticRegression
print_classification_metrics(
get_classification_metrics(
LogisticRegression(random_state=37, multi_class='ovr', solver='newton-cg'), T, V))
0: sen = 1.00000, spe = 0.82667, acc = 0.87000, f1 = 0.79365, auc = 1.00000
1: sen = 1.00000, spe = 0.74667, acc = 0.81000, f1 = 0.72464, auc = 1.00000
2: sen = 0.72000, spe = 1.00000, acc = 0.93000, f1 = 0.83721, auc = 1.00000
3: sen = 0.00000, spe = 1.00000, acc = 0.75000, f1 = 0.00000, auc = 0.66347
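The imbalance hypothesis can be checked by counting labels per class. The following is a minimal sketch on hypothetical toy labels; the actual class counts in data-train.csv are not shown in this notebook, and `class_proportions` is an illustrative helper name.

```python
import numpy as np

def class_proportions(y):
    """Map each class label to its share of the samples."""
    classes, counts = np.unique(y, return_counts=True)
    total = counts.sum()
    return {int(c): float(n / total) for c, n in zip(classes, counts)}

# Hypothetical, deliberately imbalanced labels (not the notebook's data).
y_toy = np.array([0] * 50 + [1] * 30 + [2] * 15 + [3] * 5)
props = class_proportions(y_toy)
print(props)
```

On the real data, the same call could be applied to `T.y`; strongly unequal proportions would support the imbalance explanation.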
3.4. Random forest
Random forest does well across all classification measures, even in the face of class imbalance, although sensitivity for classes 2 and 3 (0.88 and 0.76) still lags behind classes 0 and 1.
[4]:
from sklearn.ensemble import RandomForestClassifier
print_classification_metrics(
get_classification_metrics(
RandomForestClassifier(n_estimators=100, random_state=37), T, V))
0: sen = 1.00000, spe = 0.98667, acc = 0.99000, f1 = 0.98039, auc = 1.00000
1: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
2: sen = 0.88000, spe = 0.93333, acc = 0.92000, f1 = 0.84615, auc = 0.98107
3: sen = 0.76000, spe = 0.96000, acc = 0.91000, f1 = 0.80851, auc = 0.96720
3.5. Sample data to counter data imbalance
Let’s see if we can combat class imbalance by learning a distribution per class (a multivariate normal with that class's mean and covariance) and then sampling new data from each of these distributions. Note that we will learn from this new data set, not the original synthetic one. We generate 10,000 new samples per class, both to make sure we have enough samples and to make every class equally represented. We call this new data set S to distinguish it from T. Below, we will learn models from S and validate against V.
[5]:
from scipy.stats import multivariate_normal
def sample(mvn, N=10000):
    X = np.array([multivariate_normal.rvs(mean=mvn.mean, cov=mvn.cov) for _ in range(N)])
    y = np.full((N, 1), mvn.clazz, dtype=np.int32)
    return np.hstack([X, y])
Mvn = namedtuple('Mvn', 'clazz mean cov')
X_cols = [i for i in tr_df.columns if i != 'y']
mvns = { clazz: Mvn(clazz,
                    tr_df[tr_df['y'] == clazz][X_cols].mean().values,
                    tr_df[tr_df['y'] == clazz][X_cols].cov().values)
         for clazz in list(sorted(tr_df['y'].value_counts().index)) }
S = np.vstack([sample(mvn) for mvn in mvns.values()])
print(S.shape)
X = S[:, 0:S.shape[1] - 1]
y = S[:, S.shape[1] - 1]
S = Data(X, y)
(40000, 19)
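The same idea (fit one Gaussian per class, then draw an equal number of rows from each) can be sketched with NumPy alone. This is an illustrative variant on hypothetical toy data, not the notebook's implementation: `sample_balanced` is an assumed helper name, and it draws all rows for a class in one call to `Generator.multivariate_normal` instead of looping over `multivariate_normal.rvs`.

```python
import numpy as np

def sample_balanced(X, y, n_per_class=1000, seed=37):
    """Fit one multivariate normal per class and draw n_per_class rows from each."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for clazz in np.unique(y):
        X_c = X[y == clazz]
        mean = X_c.mean(axis=0)
        # sample covariance (ddof=1), the same estimator DataFrame.cov() uses
        cov = np.cov(X_c, rowvar=False)
        X_parts.append(rng.multivariate_normal(mean, cov, size=n_per_class))
        y_parts.append(np.full(n_per_class, clazz))
    return np.vstack(X_parts), np.concatenate(y_parts)

# Toy, imbalanced input (hypothetical): 200 rows of class 0, 20 rows of class 1.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)),
                   rng.normal(3.0, 1.0, size=(20, 3))])
y_toy = np.array([0] * 200 + [1] * 20)

X_new, y_new = sample_balanced(X_toy, y_toy, n_per_class=1000)
print(X_new.shape, np.bincount(y_new))
```

Whatever the original class counts were, every class ends up with the same number of rows, which is the property the resampled set relies on.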
3.6. Logistic regression applied to sampled data
Sampling really does help logistic regression. With the class imbalance gone, the sensitivities for classes 2 and 3 go up from 0.72 and 0.00 to 0.88 and 0.92!
[6]:
print_classification_metrics(
get_classification_metrics(
LogisticRegression(random_state=37, multi_class='ovr', solver='newton-cg'), S, V))
0: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
1: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
2: sen = 0.88000, spe = 0.97333, acc = 0.95000, f1 = 0.89796, auc = 0.99467
3: sen = 0.92000, spe = 0.96000, acc = 0.95000, f1 = 0.90196, auc = 0.99253
3.7. Random forest applied to sampled data
Random forest also benefits from sampling! The sensitivities for classes 2 and 3 go up from 0.88 and 0.76 to 0.96 and 0.92.
[7]:
print_classification_metrics(
get_classification_metrics(
RandomForestClassifier(n_estimators=100, random_state=37), S, V))
0: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
1: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
2: sen = 0.96000, spe = 0.97333, acc = 0.97000, f1 = 0.94118, auc = 0.99787
3: sen = 0.92000, spe = 0.98667, acc = 0.97000, f1 = 0.93878, auc = 0.99733
3.8. Comparing classifications
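Now, let’s compare the classification performance of the Inception V3 network against random forest trained on the sampled data. Note that the last column differs between the two sets of results: MCC for Inception V3 and AUC for random forest.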
Inception V3 had the following validation results.
0: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, mcc = 1.00000
1: sen = 1.00000, spe = 0.97333, acc = 0.98000, f1 = 0.96154, mcc = 0.94933
2: sen = 0.80000, spe = 0.94667, acc = 0.91000, f1 = 0.81633, mcc = 0.75703
3: sen = 0.76000, spe = 0.93333, acc = 0.89000, f1 = 0.77551, mcc = 0.70296
Random forest had the following validation results.
0: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
1: sen = 1.00000, spe = 1.00000, acc = 1.00000, f1 = 1.00000, auc = 1.00000
2: sen = 0.96000, spe = 0.97333, acc = 0.97000, f1 = 0.94118, auc = 0.99787
3: sen = 0.92000, spe = 0.98667, acc = 0.97000, f1 = 0.93878, auc = 0.99733
For classes 0 and 1, both techniques produced nearly identical results, which is not surprising since those classes were relatively abundant in the training data. The biggest differences are in classes 2 and 3, where random forest, with the help of sampling, does much better (sensitivities of 0.96 and 0.92 versus 0.80 and 0.76 for Inception V3). Inception V3 should also benefit from sampling, if there is a way to sample from the images (we pretend we do NOT know that the Chernoff faces came from multivariate, multi-level Gaussian distributions).
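Because the Inception V3 results end in MCC and the random forest results end in AUC, the last columns cannot be compared directly. As a rough sketch, MCC can be computed from the four confusion counts. The counts below are not printed anywhere in this notebook; they are inferred from the reported sensitivity and specificity under the assumption that V holds 100 samples with 25 per class. The function name `mcc_from_counts` is illustrative.

```python
import math

def mcc_from_counts(tn, fp, fn, tp):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when a marginal is empty and the denominator is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Inferred counts for class 2 (assumption: 25 positives and 75 negatives in V).
inception_c2 = mcc_from_counts(tn=71, fp=4, fn=5, tp=20)  # sen 0.80, spe 0.94667
forest_c2 = mcc_from_counts(tn=73, fp=2, fn=1, tp=24)     # sen 0.96, spe 0.97333
print('{:.5f} {:.5f}'.format(inception_c2, forest_c2))
```

Under that assumption, the first value reproduces the 0.75703 reported for Inception V3 on class 2, which suggests the inferred counts are consistent with the results above.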