5. Gaussian Mixture Models
This notebook shows how to use scikit-learn's Gaussian Mixture Model (GMM).
5.1. Motivation, A Simple Example
Oftentimes, you see observations and you want to understand the distributions from which they came. Let’s make it simple and say you observe the lengths of left arms. Now, these lengths may come from a child (age < 18) or an adult (age >= 18); however, you do not know that there are indeed two populations (children and adults). If you plot the density of the lengths, you will most likely observe a multi-modal distribution (one mode for the children and another for the adults). If you simply assume a single Gaussian distribution, you would be wrong. GMMs assume that the distribution you are seeing actually comes from multiple underlying Gaussian distributions that are “mixing” to produce what you see.
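To make the “mixing” idea concrete, here is a minimal sketch (not part of the original data or code) that pools samples drawn from two assumed Gaussians; the means and standard deviations are invented purely for illustration.

import numpy as np

rng = np.random.default_rng(37)
# hypothetical components: children's and adults' arm lengths (made-up parameters)
children = rng.normal(loc=18.0, scale=2.0, size=100)
adults = rng.normal(loc=24.5, scale=1.5, size=100)
# the "observed" data is the pooled sample; which component each value came from is hidden
observed = np.concatenate([children, adults])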
5.2. The Data, Arm Lengths
Let’s start out by looking at the data. First, these are real data from children and adults at One-Off Coder. Second, although we know that the data come from two different populations, we are going to pretend as if we only observe the arm lengths.
[1]:
import numpy as np
import pandas as pd
# these are the arm lengths
# a is for adult
# c is child
a_lengths = np.array([24, 23.5, 25.5, 22.0, 25.5, 25.5, 24.0, 28.0, 24.5])
c_lengths = np.array([24.0, 23.0, 21.0, 15.5, 17.5, 17.5, 19.0, 17.0, 17.0, 27.0])
# these are the corresponding labels
# children are marked as 0
# adults are marked as 1
a_labels = np.ones(a_lengths.shape[0], dtype=int)
c_labels = np.zeros(c_lengths.shape[0], dtype=int)
# these are the natural corresponding labels
a_nlabels = ['adult' for _ in range(a_lengths.shape[0])]
c_nlabels = ['child' for _ in range(c_lengths.shape[0])]
# we build a pandas dataframe from the data
lengths = np.concatenate([c_lengths, a_lengths])
labels = np.concatenate([c_labels, a_labels])
nlabels = np.concatenate([c_nlabels, a_nlabels])
df = pd.DataFrame({ 'length': lengths, 'label': labels, 'nlabel': nlabels})
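As a quick sanity check (not in the original notebook), we might summarize the lengths per group with pandas:

# per-group summary statistics (count, mean, std, min, quartiles, max)
print(df.groupby('nlabel')['length'].describe())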
[2]:
df
[2]:
 | length | label | nlabel |
---|---|---|---|
0 | 24.0 | 0 | child |
1 | 23.0 | 0 | child |
2 | 21.0 | 0 | child |
3 | 15.5 | 0 | child |
4 | 17.5 | 0 | child |
5 | 17.5 | 0 | child |
6 | 19.0 | 0 | child |
7 | 17.0 | 0 | child |
8 | 17.0 | 0 | child |
9 | 27.0 | 0 | child |
10 | 24.0 | 1 | adult |
11 | 23.5 | 1 | adult |
12 | 25.5 | 1 | adult |
13 | 22.0 | 1 | adult |
14 | 25.5 | 1 | adult |
15 | 25.5 | 1 | adult |
16 | 24.0 | 1 | adult |
17 | 28.0 | 1 | adult |
18 | 24.5 | 1 | adult |
5.3. Visualize the Data
Now, we plot the distributions: one plot with separate densities for children and adults, and another with all of the arm lengths together. As you can see, when all the data are plotted together, there are two “humps”. This combined data (with the children's and adults' arm lengths lumped together) is what we pretend we are seeing.
5.3.1. Distribution Plots
[3]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
sns.distplot(df[df['label'] == 0]['length'], label='children', hist=False, ax=ax1)
sns.distplot(df[df['label'] == 1]['length'], label='adults', hist=False, ax=ax1)
sns.distplot(df['length'], label='all', hist=False, ax=ax2)
ax1.set_title('Density Plots of Children and Adults\' Arm Lengths')
ax2.set_title('Density Plot of All Arm Lengths')
plt.tight_layout()
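Note that sns.distplot has since been deprecated in newer seaborn releases; the same plots can be drawn with kdeplot (a sketch assuming seaborn 0.11 or later, not part of the original notebook):

# equivalent density plots using kdeplot instead of the deprecated distplot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
sns.kdeplot(data=df[df['label'] == 0], x='length', label='children', ax=ax1)
sns.kdeplot(data=df[df['label'] == 1], x='length', label='adults', ax=ax1)
sns.kdeplot(data=df, x='length', label='all', ax=ax2)
ax1.set_title('Density Plots of Children and Adults\' Arm Lengths')
ax2.set_title('Density Plot of All Arm Lengths')
ax1.legend()
plt.tight_layout()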
5.3.2. Box and Swarm Plots
Here are some box and swarm plots to see if we can visually detect outliers. As you can see, there is tremendous variance in the children's arm lengths, while the adults' arm lengths are tightly clustered.
[4]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.boxplot(x='nlabel', y='length', data=df, ax=ax[0])
sns.swarmplot(x='nlabel', y='length', data=df, ax=ax[1])
ax[0].set_title('Box Plot')
ax[1].set_title('Swarm Plot')
plt.tight_layout()
5.4. Data Science
In GMMs, we want to understand and recover the underlying, “mixing” distributions. Since we do not directly observe these distributions and only hypothesize that they exist, they are often referred to as “hidden” or “latent”. Again, we do not even know how many hidden Gaussian distributions there are! When we apply GMMs to a dataset to understand and recover its underlying distributions, we typically have to guess the number of such distributions beforehand. Typically, we would guess that there are 2, 3, 4, and so on, underlying distributions. At each step, we might apply a “goodness of fit” test to see how well the models (with 2, 3, 4, … assumed hidden Gaussians) fit the data and pick the best-fitting one.
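For example, one common goodness-of-fit approach (a sketch only; this notebook uses a different, k-means-based approach below) is to fit a GaussianMixture for each candidate number of components and compare an information criterion such as the BIC, which penalizes extra components; lower BIC is better.

from sklearn.mixture import GaussianMixture

X_bic = df['length'].values.reshape(-1, 1)
# fit one GMM per candidate number of components and record its BIC
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=37).fit(X_bic)
    print(k, gmm.bic(X_bic))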
5.4.1. k-means clustering
In this notebook, we avoid completely “guessing” the number of hidden Gaussians by using k-means clustering to identify how many hidden Gaussians might actually be in the data. We start out by assuming 2, 3, …, 10 clusters in the data; in this example, this assumption means that the lengths we are seeing come from 2, 3, …, 10 different groups. We use the silhouette score to determine which k is best and use that k to create and learn a GMM.
Let’s use k-means to see how many clusters are in the population. Remember, pretend that you do not know how many clusters (or populations) are in the data. We will set k to 2, 3, …, 10 and see which number of clusters is the best according to the silhouette score. As you can see below, k=2 is the best. Additionally, we plot the score of the GMM model corresponding to each k. A higher silhouette score indicates better clustering; the GMM score computed here is the sum of the per-sample likelihoods, so higher values mean the model places more density on the observed data (it tends to increase with k, so it is less useful on its own for choosing k).
[5]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
def get_silhouette_score(X, k):
    # cluster the lengths into k clusters and score the clustering
    model = KMeans(n_clusters=k, random_state=37)
    model.fit(X)
    labels = model.predict(X)
    score = silhouette_score(X, labels)
    return score

def get_gmm_score(X, k):
    # fit a GMM with k components and sum the per-sample likelihoods
    gmm = GaussianMixture(n_components=k, max_iter=50)
    gmm.fit(X)
    gmm_scores = gmm.score_samples(X)
    score = np.exp(gmm_scores).sum()
    return score

def get_scores(X, k):
    return k, get_silhouette_score(X, k), get_gmm_score(X, k)

X = df['length'].values.reshape(-1, 1)
score_df = pd.DataFrame([get_scores(X, k) for k in range(2, 11, 1)], columns=['k', 'sil', 'gmm'])
[6]:
fig, ax1 = plt.subplots(figsize=(10, 5))
k_min = np.min(score_df['k'])
k_max = np.max(score_df['k'])
line1 = ax1.plot(score_df['k'], score_df['sil'], color='tab:blue', label='Silhouette')
ax2 = ax1.twinx()
line2 = ax2.plot(score_df['k'], score_df['gmm'], color='tab:red', label='GMM')
lines = line1 + line2
labels = [l.get_label() for l in lines]
ax1.set_title('Silhouette Score and GMM Score vs Number of Clusters, k')
ax1.legend(lines, labels, loc=9)
ax1.set_xlabel('k')
ax1.set_ylabel('Silhouette Score')
ax1.set_xlim([k_min, k_max])
ax2.set_ylabel('GMM Score')
fig.tight_layout()
5.4.2. GMM
So we have determined that k=2 produces the best clustering of the data. Thus, we will assume that there are two mixing Gaussian distributions. Below, we use k=2 to learn a GMM model.
Note that when a GMM learns the parameters for the hidden distributions, it will arbitrarily assign 0, 1, 2, … as a label to each one, and these labels need not have anything to do with how you labeled the data (in our case, 0 = child, 1 = adult). In the helper functions below we therefore remap (reverse) the component labels; as the prediction table shows, even after remapping, the GMM's labels do not have to line up with our original encoding. In the real world, you will have to deal with the arbitrary labelling of these hidden distributions and use your domain knowledge to interpret what each label represents.
[7]:
def get_kmeans_labels(X, k):
    model = KMeans(n_clusters=k, random_state=37)
    model.fit(X)
    labels = model.predict(X)
    # flip the arbitrary cluster ids (0 <-> 1)
    labels = np.array([0 if label == 1 else 1 for label in labels])
    return labels, model

def get_gmm_labels(X, k):
    gmm = GaussianMixture(n_components=k, max_iter=50, random_state=37)
    gmm.fit(X)
    labels = gmm.predict(X)
    # flip the arbitrary component ids (0 <-> 1)
    labels = np.array([0 if label == 1 else 1 for label in labels])
    return labels, gmm

prediction_df = df.copy()
# take the k with the highest silhouette score
best_k = int(score_df.sort_values(['sil'], ascending=False).iloc[0]['k'])
prediction_df['kms_label'], kms = get_kmeans_labels(X, best_k)
prediction_df['gmm_label'], gmm = get_gmm_labels(X, best_k)
print('best k = {}'.format(best_k))
best k = 2
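Because the component labels are arbitrary, one way to line them up with known labels (when you have them, as we do here for evaluation) is to map each predicted cluster to the true label it co-occurs with most often. Here is a minimal sketch (not part of the original notebook); the helper name align_labels is invented for illustration.

def align_labels(y_true, y_pred):
    # map each predicted cluster id to the true label it co-occurs with most often
    ct = pd.crosstab(pd.Series(y_pred, name='pred'), pd.Series(y_true, name='true'))
    mapping = ct.idxmax(axis=1).to_dict()
    return np.array([mapping[p] for p in y_pred])

aligned = align_labels(prediction_df['label'].values, prediction_df['gmm_label'].values)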
[8]:
prediction_df
[8]:
 | length | label | nlabel | kms_label | gmm_label |
---|---|---|---|---|---|
0 | 24.0 | 0 | child | 0 | 0 |
1 | 23.0 | 0 | child | 0 | 0 |
2 | 21.0 | 0 | child | 1 | 0 |
3 | 15.5 | 0 | child | 1 | 1 |
4 | 17.5 | 0 | child | 1 | 1 |
5 | 17.5 | 0 | child | 1 | 1 |
6 | 19.0 | 0 | child | 1 | 1 |
7 | 17.0 | 0 | child | 1 | 1 |
8 | 17.0 | 0 | child | 1 | 1 |
9 | 27.0 | 0 | child | 0 | 0 |
10 | 24.0 | 1 | adult | 0 | 0 |
11 | 23.5 | 1 | adult | 0 | 0 |
12 | 25.5 | 1 | adult | 0 | 0 |
13 | 22.0 | 1 | adult | 0 | 0 |
14 | 25.5 | 1 | adult | 0 | 0 |
15 | 25.5 | 1 | adult | 0 | 0 |
16 | 24.0 | 1 | adult | 0 | 0 |
17 | 28.0 | 1 | adult | 0 | 0 |
18 | 24.5 | 1 | adult | 0 | 0 |
5.4.3. GMM Results Visualized
Here are the clustering results and the GMM model's predictions visualized.
[9]:
import matplotlib.colors as colors
from matplotlib.patches import Ellipse

fig, ax = plt.subplots(figsize=(10, 10))
colorset = ['blue', 'red']
pcs = []

for cluster, label in zip(range(best_k), ['children', 'adults']):
    cluster_df = prediction_df[prediction_df['gmm_label'] == cluster]
    # empirical mean and variance of the lengths assigned to this cluster
    mu = cluster_df.mean().loc['length']
    cov = cluster_df.var().loc['length']
    cov = np.array([[cov]])
    # eigendecomposition of the (1x1) covariance matrix; with 1-dimensional data
    # this is degenerate, but it keeps the code close to the usual 2D recipe
    eva, eve = np.linalg.eigh(cov)
    order = eva.argsort()[::-1]
    eva, eve = eva[order], eve[:, order]
    vx, vy = eve[:, 0][0], eve[:, 0][0]
    theta = np.arctan2(vy, vx)
    color = colors.to_rgba(colorset[cluster])
    print('cluster={}, color={}'.format(cluster, color))
    # draw ellipses at 1, 2, and 3 standard deviations around the cluster mean
    for cov_factor in range(1, 4):
        size = np.sqrt(eva[0]) * cov_factor * 2
        angle = np.degrees(theta)
        ell = Ellipse(
            xy=(mu, mu),
            width=size,
            height=size,
            angle=angle,
            linewidth=2)
        ell.set_facecolor((color[0], color[1], color[2], 1.0 / (cov_factor * 4.5)))
        ax.add_artist(ell)
    c = [colorset[v] for v in cluster_df['label'].values]
    pc = ax.scatter(mu, mu, marker='+', s=100, c=colorset[cluster])
    ax.scatter(mu, mu, marker='+', s=1000, c=colorset[cluster])
    ax.scatter(cluster_df['length'], cluster_df['length'], label=label, c=c, marker='o')
    pcs.append(pc)

ax.set_title('Clustering and GMM Results')
ax.set_xlabel('length')
ax.set_ylabel('length')
ax.legend(pcs, ['children', 'adults'])
cluster=0, color=(0.0, 0.0, 1.0, 1.0)
cluster=1, color=(1.0, 0.0, 0.0, 1.0)
[9]:
<matplotlib.legend.Legend at 0x7f6cc7a1d910>
5.4.4. GMM Performances
So how well did the GMM model do?
[10]:
prediction_df = pd.concat([prediction_df, pd.DataFrame(gmm.predict_proba(X), columns=['gmm_1', 'gmm_0'])], axis=1)
prediction_df
[10]:
 | length | label | nlabel | kms_label | gmm_label | gmm_1 | gmm_0 |
---|---|---|---|---|---|---|---|
0 | 24.0 | 0 | child | 0 | 0 | 2.706874e-10 | 1.000000 |
1 | 23.0 | 0 | child | 0 | 0 | 1.380768e-07 | 1.000000 |
2 | 21.0 | 0 | child | 1 | 0 | 4.759825e-03 | 0.995240 |
3 | 15.5 | 0 | child | 1 | 1 | 9.999280e-01 | 0.000072 |
4 | 17.5 | 0 | child | 1 | 1 | 9.984682e-01 | 0.001532 |
5 | 17.5 | 0 | child | 1 | 1 | 9.984682e-01 | 0.001532 |
6 | 19.0 | 0 | child | 1 | 1 | 9.184310e-01 | 0.081569 |
7 | 17.0 | 0 | child | 1 | 1 | 9.994453e-01 | 0.000555 |
8 | 17.0 | 0 | child | 1 | 1 | 9.994453e-01 | 0.000555 |
9 | 27.0 | 0 | child | 0 | 0 | 3.613962e-20 | 1.000000 |
10 | 24.0 | 1 | adult | 0 | 0 | 2.706874e-10 | 1.000000 |
11 | 23.5 | 1 | adult | 0 | 0 | 6.649432e-09 | 1.000000 |
12 | 25.5 | 1 | adult | 0 | 0 | 6.662496e-15 | 1.000000 |
13 | 22.0 | 1 | adult | 0 | 0 | 3.596142e-05 | 0.999964 |
14 | 25.5 | 1 | adult | 0 | 0 | 6.662496e-15 | 1.000000 |
15 | 25.5 | 1 | adult | 0 | 0 | 6.662496e-15 | 1.000000 |
16 | 24.0 | 1 | adult | 0 | 0 | 2.706874e-10 | 1.000000 |
17 | 28.0 | 1 | adult | 0 | 0 | 4.815558e-24 | 1.000000 |
18 | 24.5 | 1 | adult | 0 | 0 | 9.314752e-12 | 1.000000 |
These are the precision, recall, and f1-score for the GMM model. Keep in mind that the GMM's cluster labels are arbitrary, so these numbers depend on how those labels happen to line up with the true labels (0 = child, 1 = adult).
[11]:
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(prediction_df['label'], prediction_df['gmm_label']))
precision recall f1-score support
0 0.31 0.40 0.35 10
1 0.00 0.00 0.00 9
accuracy 0.21 19
macro avg 0.15 0.20 0.17 19
weighted avg 0.16 0.21 0.18 19
This is the precision, recall, and f1-score for the kmeans model.
[12]:
print(classification_report(prediction_df['label'], prediction_df['kms_label']))
precision recall f1-score support
0 0.25 0.30 0.27 10
1 0.00 0.00 0.00 9
accuracy 0.16 19
macro avg 0.12 0.15 0.14 19
weighted avg 0.13 0.16 0.14 19
This is the ROC AUC score for the GMM model.
[13]:
roc_auc_score(prediction_df['label'], prediction_df['gmm_1'])
[13]:
0.1333333333333333
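Since the component labelling is arbitrary, we could just as well score the complementary probability column; for complementary scores, roc_auc_score gives exactly one minus the value above (a sketch, not in the original notebook):

# AUC using the other component's probability; equals 1 minus the AUC above
roc_auc_score(prediction_df['label'], prediction_df['gmm_0'])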
[14]:
# the means learned from GMM
print(gmm.means_)
# the empirical means from the data
print(df[df['label'] == 0]['length'].mean())
print(df[df['label'] == 1]['length'].mean())
[[17.23018165]
[24.38989552]]
19.85
24.72222222222222
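Beyond the means, the fitted GMM also exposes the learned mixing weights and covariances, which we might compare against the empirical class proportions and variances (a small sketch, not in the original notebook):

# learned mixing weights and covariances
print(gmm.weights_)
print(gmm.covariances_)
# empirical class proportions and per-class variances for comparison
print(df['label'].value_counts(normalize=True))
print(df.groupby('label')['length'].var())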
5.4.5. Cross validation
Note that these results are not validated against a holdout set but rather against the training data itself. Let’s do a simple leave-one-out (LOO) validation. In LOO validation, we hold 1 example out, learn a model from the rest, and then use that learned model to make a prediction on the 1 example we left out; this procedure is repeated N times for N data points, and we average the performance over the N held-out predictions.
[15]:
from sklearn.model_selection import LeaveOneOut
X = df['length'].values.reshape(-1, 1)
y = df['label'].values
loo = LeaveOneOut()
N = loo.get_n_splits(X)
t = 0
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # pick k by silhouette score (note: computed on all of X, as in the original run)
    score_df = pd.DataFrame([(k, get_silhouette_score(X, k)) for k in range(2, 11, 1)], columns=['k', 'sil'])
    best_k = int(score_df.sort_values(['sil'], ascending=False).iloc[0]['k'])
    gmm = GaussianMixture(n_components=best_k, max_iter=50, random_state=37)
    # GaussianMixture is unsupervised; the y_train argument is ignored by fit
    gmm.fit(X_train, y_train)
    y_true = y_test[0]
    # flip the predicted component label, as before
    y_pred = [0 if p == 1 else 1 for p in gmm.predict(X_test)][0]
    t = t + (1 if y_true == y_pred else 0)
accuracy = t / N
print('LOO accuracy is {:.5f}'.format(accuracy))
LOO accuracy is 0.21053
Note that the LOO accuracy (simply the fraction of predictions that match the true labels) is printed as about 21%. Because the GMM's component labels are arbitrary, swapping the two labels to line up with our encoding (0 = child, 1 = adult) gives about 79%, which is not that bad.
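Since a two-cluster labelling can always be flipped, a label-agnostic way to report this is to take the better of the two possible assignments (a sketch using the t and N computed in the loop above, not part of the original notebook):

# the cluster-to-label assignment is arbitrary, so report the better of the two flips
label_agnostic_accuracy = max(t, N - t) / N
print('label-agnostic LOO accuracy is {:.5f}'.format(label_agnostic_accuracy))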