2. Principal Component Analysis
This notebook demonstrates the mechanics of Principal Component Analysis (PCA). One common use of PCA is dimension reduction. For example, if you have a data matrix \(X\) with \(n\) rows and \(m\) columns, you might want to transform it into \(E\), a matrix of \(n\) rows and \(k\) columns (\(k \ll m\)). Transforming \(X\) to \(E\) can be useful for downstream machine learning tasks such as clustering, where distance metrics break down in high-dimensional space.
The PCA algorithm applied to a data matrix \(X\) is as follows.
\(Z = T_Z(X)\), where \(T_Z\) converts the data in \(X\) into z-scores
\(C = T_C(Z)\), where \(T_C\) computes the covariance matrix of \(Z\)
\(USV' = T_S(C)\), where \(T_S\) performs singular value decomposition (SVD) on \(C\); since \(C\) is symmetric, \(V = U\)
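These three transforms can be sketched in a few lines of NumPy. The helper name pca_svd below is hypothetical, and X stands for any numeric data matrix; it is a compact preview of the manual computation carried out in the rest of this section.
import numpy as np

def pca_svd(X):
    # T_Z: convert each column of X to z-scores
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # T_C: covariance matrix of the z-scored data
    C = Z.T.dot(Z) / Z.shape[0]
    # T_S: singular value decomposition of the covariance matrix
    U, S, VT = np.linalg.svd(C)
    return Z, C, U, S, VT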
2.1. Manual PCA computation
2.1.1. Sampling
Below, we sample data from two multivariate normal distributions to create \(X\).
[1]:
%matplotlib inline
import matplotlib.pylab as plt
import numpy as np
import random
from numpy.random import normal
from numpy.linalg import inv, svd
import pandas as pd
from collections import namedtuple
import seaborn as sns
DATA = namedtuple('DATA', 'X y df')
_seed = 37
random.seed(_seed)
np.random.seed(_seed)
np.set_printoptions(precision=3, suppress=True)
plt.style.use('ggplot')
def get_data(N=1000):
    # sample the first group from a multivariate normal distribution
    x0 = normal(1, 1, N)
    x1 = normal(3 + 2 * x0, 1, N)
    x2 = normal(2.5 + 0.8 * x1, N)
    x3 = normal(5, 2, N)
    x4 = normal(0.8 + 0.8 * x3, N)
    x5 = normal(10, 5, N)
    X0 = [x0, x1, x2, x3, x4, x5]
    X0 = np.hstack([x.reshape(-1, 1) for x in X0])

    # sample the second group from a multivariate normal distribution
    x0 = normal(1.5, 1, N)
    x1 = normal(7.3 + 2 * x0, 1, N)
    x2 = normal(3.6 + 0.9 * x1, N)
    x3 = normal(3.3, 2.5, N)
    x4 = normal(0.9 - 0.4 * x3, N)
    x5 = normal(10, 5, N)
    X1 = [x0, x1, x2, x3, x4, x5]
    X1 = np.hstack([x.reshape(-1, 1) for x in X1])

    # stack the two groups and keep track of the class labels
    X = np.vstack([X0, X1])
    y = np.hstack([np.full((1, X0.shape[0]), 0), np.full((1, X1.shape[0]), 1)])[0]
    c = ['zero' if val == 0.0 else 'one' for val in y]

    columns = ['X{}'.format(i) for i in range(X.shape[1])]
    X_df = pd.DataFrame(X, columns=columns)
    X_df['y'] = c

    return DATA(X, y, X_df)

def z_scale(D):
    # convert each column to z-scores using its mean and standard deviation
    m = D.X.mean(axis=0)
    s = D.X.std(axis=0)
    Z = (D.X - m) / s
    return Z

def get_cov(X):
    # compute the covariance matrix, dividing by the number of rows
    m = X.mean(axis=0)
    C = ((X - m).T.dot(X - m)) / X.shape[0]
    return C
Note:
- D is the data
- Z is the scaled data
- C is the covariance matrix
[2]:
D = get_data()
[3]:
Z = z_scale(D)
print(Z.shape)
(2000, 6)
[4]:
C = get_cov(Z)
print(C.shape)
(6, 6)
2.1.2. Covariance
The covariance matrix is plotted below.
[5]:
fig, ax = plt.subplots(figsize=(10, 5))
x_labels = ['X{}'.format(i) for i in range(C.shape[0])]
sns.heatmap(C, xticklabels=x_labels, yticklabels=x_labels, ax=ax)
ax.set_title('Covariance Plot')
plt.show()
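Because Z holds z-scores, C is equivalently the correlation matrix of the original data. A quick sanity check against np.corrcoef (a sketch):
print(np.allclose(C, np.corrcoef(D.X, rowvar=False)))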
2.1.3. Singular value decomposition
Now we are ready to apply singular value decomposition (SVD) to C. SVD produces three matrices.
- U: the left-hand eigenvectors
- S: the eigenvalues (a vector, not a matrix, unless you diagonalize it)
- VT: the right-hand eigenvectors
\(USV' = C\)
We will see later that in Scikit-Learn, the PCA class exposes S as explained_variance_ and VT as components_.
[6]:
U, S, VT = svd(C, full_matrices=False)
[7]:
print('U')
print(U)
print(U.shape)
print()
print('S')
print(S)
print(S.shape)
print()
print('VT')
print(VT)
print(VT.shape)
U
[[-0.651 0.054 0.239 -0.043 0.243 -0.675]
[-0.687 0.047 0.083 -0.004 0.058 0.718]
[ 0.012 0.728 -0.53 0.164 0.404 -0.007]
[ 0.309 0.001 0.546 -0.11 0.752 0.172]
[ 0.086 0.672 0.577 0.003 -0.457 0.009]
[-0.001 0.121 -0.159 -0.979 -0.03 0.006]]
(6, 6)
S
[1.835 1.025 1.017 0.999 0.89 0.234]
(6,)
VT
[[-0.651 -0.687 0.012 0.309 0.086 -0.001]
[ 0.054 0.047 0.728 0.001 0.672 0.121]
[ 0.239 0.083 -0.53 0.546 0.577 -0.159]
[-0.043 -0.004 0.164 -0.11 0.003 -0.979]
[ 0.243 0.058 0.404 0.752 -0.457 -0.03 ]
[-0.675 0.718 -0.007 0.172 0.009 0.006]]
(6, 6)
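Since C is symmetric and positive semi-definite, its SVD coincides with its eigendecomposition. A quick sanity check with np.linalg.eigh (a sketch):
from numpy.linalg import eigh

evals, evecs = eigh(C)               # eigh returns eigenvalues in ascending order
print(np.allclose(evals[::-1], S))   # reversed, they match the singular values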
The variances are as follows:
- v_ratios is the proportion of variance explained by each principal component, and
- c_var is the cumulative variance, accumulated from the largest to the smallest component.
The eigenvalues in S represent the magnitude of the variance along their associated eigenvectors.
[8]:
v_ratios = S / S.sum()
c_var = np.array([v_ratios[0:v+1].sum() for v in range(len(v_ratios))])
print(v_ratios)
print(c_var)
[0.306 0.171 0.17 0.167 0.148 0.039]
[0.306 0.477 0.646 0.813 0.961 1. ]
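A common way to choose the reduced dimension \(k\) is to take the smallest \(k\) whose cumulative variance ratio clears a threshold. A minimal sketch with an (arbitrarily chosen) 80% threshold:
threshold = 0.8
k = int(np.argmax(c_var >= threshold) + 1)   # smallest k whose cumulative variance reaches the threshold
print(k)                                     # 4, given the cumulative variances printed above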
[9]:
fig, ax = plt.subplots(1, 2, figsize=(15, 4))
n_components = len(v_ratios)
ax[0].bar(range(n_components), v_ratios)
ax[1].plot(c_var, label='Cumulative')
ax[0].set_title('Variance Ratios')
ax[1].set_title('Cumulative Variance')
ax[0].set_xlabel('Principal Components')
ax[0].set_ylabel('Variance Ratio')
ax[1].set_xlabel('Principal Components')
ax[1].set_ylabel('Cumulative Variance Ratio')
# draw horizontal reference lines at 50% and 80% explained variance
for v, c in zip([0.5, 0.8], ['r--', 'g--']):
    label = '{:.0f}% variance'.format(v * 100.0)
    ax[1].plot(
        (0, n_components - 1),
        (v, v), c, alpha=0.3,
        label=label)
_ = ax[1].legend()
2.1.4. Reconstructing the covariance matrix
You may reconstruct C from U, S, and VT as follows. We use np.allclose to check that the reconstruction matches C.
\(C = USV'\)
[10]:
print(np.allclose(C, U.dot(np.diag(S)).dot(VT)))
print(np.allclose(C, np.dot(U * S, VT)))
print(np.allclose(C, np.dot(U, np.dot(np.diag(S), VT))))
True
True
True
2.1.5. Principal components to variables
For PCA, VT or \(V'\) is actually just \(U'\), since \(C\) is symmetric; a quick check of this claim follows. We then plot the influence of the principal components on the variables.
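A quick check of the claim (a sketch; for a symmetric positive semi-definite C, the left and right singular vectors coincide):
print(np.allclose(U.T, VT))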
[11]:
import seaborn as sns
fig, ax = plt.subplots(figsize=(10, 5))
x_labels = ['X{}'.format(i) for i in range(VT.shape[1])]
y_labels = ['PC{}'.format(i) for i in range(VT.shape[0])]
sns.heatmap(VT, xticklabels=x_labels, yticklabels=y_labels, center=0.0, ax=ax)
plt.show()
2.1.6. Variables to principal components
Here, we plot U. Notice how this graph is just the transpose of the one above. The way to interpret the plot is to go one PC at a time: when PC0 is high, \(X0\) and \(X1\) are low; when PC1 is high, \(X2\) and \(X4\) are high; and so on.
[12]:
fig, ax = plt.subplots(figsize=(10, 5))
y_labels = ['X{}'.format(i) for i in range(U.shape[1])]
x_labels = ['PC{}'.format(i) for i in range(U.shape[0])]
sns.heatmap(U, xticklabels=x_labels, yticklabels=y_labels, center=0.0, ax=ax)
plt.show()
2.1.7. Loadings squared
To understand how much each variable’s variance contributes to each PC, we compute the loadings squared.
- L is the matrix of factor loadings, where columns correspond to PCs and rows correspond to variables; the (i, j)-th element is the Pearson correlation coefficient between the i-th variable and the j-th PC (verified below).
- I is the percentage of variance of a variable explained by a PC; the (i, j)-th element is the percentage of variance of the i-th variable explained by the j-th PC.
[13]:
L = VT.T * np.sqrt(S)
print('loadings')
print(L)
I = np.power(L, 2)
print()
print('percentage of variance explained per variable per PC')
print(I)
# summing I down the columns gives back the eigenvalues;
# dividing the eigenvalues by the number of variables gives the proportion of total variance per PC
print()
print('percentage of variance explained over all variables per PC')
print(I.sum(axis=0) / I.shape[0])
print()
print('percentage of variance explained over all variables per PC*')
print(S / I.shape[0])
loadings
[[-0.882 0.054 0.241 -0.043 0.229 -0.326]
[-0.931 0.047 0.083 -0.004 0.055 0.347]
[ 0.017 0.736 -0.534 0.164 0.381 -0.003]
[ 0.419 0.001 0.55 -0.11 0.709 0.083]
[ 0.117 0.68 0.582 0.003 -0.431 0.005]
[-0.002 0.123 -0.16 -0.979 -0.028 0.003]]
percentage of variance explained per variable per PC
[[0.779 0.003 0.058 0.002 0.052 0.106]
[0.867 0.002 0.007 0. 0.003 0.12 ]
[0. 0.542 0.285 0.027 0.145 0. ]
[0.176 0. 0.303 0.012 0.502 0.007]
[0.014 0.462 0.338 0. 0.186 0. ]
[0. 0.015 0.026 0.958 0.001 0. ]]
percentage of variance explained over all variables per PC
[0.306 0.171 0.17 0.167 0.148 0.039]
percentage of variance explained over all variables per PC*
[0.306 0.171 0.17 0.167 0.148 0.039]
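The correlation interpretation of the loadings can be checked directly. The sketch below correlates each column of Z with each raw component score (column of \(ZV\)) and compares the result to L:
scores = Z.dot(VT.T)   # raw component scores
R = np.corrcoef(np.hstack([Z, scores]), rowvar=False)
print(np.allclose(R[:Z.shape[1], Z.shape[1]:], L))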
[14]:
fig, ax = plt.subplots(figsize=(10, 5))
y_labels = ['X{}'.format(i) for i in range(I.shape[1])]
x_labels = ['PC{}'.format(i) for i in range(I.shape[0])]
sns.heatmap(I, xticklabels=x_labels, yticklabels=y_labels, ax=ax)
plt.show()
2.1.8. Projecting data into PC space
If we had new data, we could project it (after it has been scaled) into PC space. To project scaled values into PC space, compute \(ZV\). These are called the raw component scores, and they have variance equal to the eigenvalues.
[15]:
print(np.dot(Z, VT.T))
[[ 0.785 0.533 0.36 1.555 -1.048 -0.61 ]
[ 0.036 -1.228 0.604 0.229 0.063 -0.512]
[ 0.385 -0.152 -0.344 -0.837 -0.442 -0.739]
...
[-0.575 1.009 -1.673 0.675 0.014 0.213]
[-1.966 0.449 -0.822 -0.61 1.139 0.674]
[-0.893 0.409 0.265 -1.126 0.491 0.434]]
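For genuinely new data, the same projection applies after scaling with the means and standard deviations of the original data. A minimal sketch, where X_new is a hypothetical matrix of new observations with the same six columns:
m, s = D.X.mean(axis=0), D.X.std(axis=0)
X_new = normal(0, 1, (5, D.X.shape[1]))   # hypothetical new observations
Z_new = (X_new - m) / s                   # scale with the original means and standard deviations
E_new = Z_new.dot(VT.T)                   # project into PC space
print(E_new.shape)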
The standardized component scores are computed as follows, where \(S\) is treated as the diagonal matrix of eigenvalues.
\(Z S^{-1} L\)
[16]:
B = inv(np.diag(S)).dot(L)
print(Z.dot(B))
[[ 0.943 0.042 1.261 6.329 -0.879 -0.345]
[ 0.199 -1.473 0.768 1.763 0.096 -0.189]
[ 0.531 0.19 -0.816 -3.609 -0.498 -0.329]
...
[-0.829 0.865 -1.54 1.538 0.058 0.076]
[-2.255 0.647 -1.387 -3.018 0.942 0.459]
[-0.975 0.851 -0.361 -4.645 0.297 0.292]]
2.1.9. Projecting data from PC space back to scaled space
If we want to project from PC space back to the scaled space, we compute \(EV^{-1}\), which equals \(EV'\) since \(V\) is orthogonal.
[17]:
print('Z')
print(Z)
print('Z from PC space')
print(np.dot(np.dot(Z, VT.T), inv(VT.T)))
Z
[[-0.305 -0.99 0.042 -0.624 1.112 -1.489]
[ 0.407 -0.397 -1.147 0.274 -0.506 -0.474]
[ 0.086 -0.853 -0.235 -0.435 -0.075 0.864]
...
[-0.14 0.455 1.728 -1.117 -0.339 -0.271]
[ 0.955 1.857 1.094 -0.017 -0.859 0.755]
[ 0.541 1. 0.157 0.437 0.126 1.099]]
Z from PC space
[[-0.305 -0.99 0.042 -0.624 1.112 -1.489]
[ 0.407 -0.397 -1.147 0.274 -0.506 -0.474]
[ 0.086 -0.853 -0.235 -0.435 -0.075 0.864]
...
[-0.14 0.455 1.728 -1.117 -0.339 -0.271]
[ 0.955 1.857 1.094 -0.017 -0.859 0.755]
[ 0.541 1. 0.157 0.437 0.126 1.099]]
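To go one step further and recover the original (unscaled) data, undo the z-scaling by multiplying by the column standard deviations and adding back the column means. A minimal sketch:
m, s = D.X.mean(axis=0), D.X.std(axis=0)
Z_back = np.dot(np.dot(Z, VT.T), inv(VT.T))   # PC space -> scaled space
X_back = Z_back * s + m                       # scaled space -> original space
print(np.allclose(X_back, D.X))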
2.1.10. Visualize data in PC space
Plot the data in PC space.
[18]:
df = pd.DataFrame(np.dot(Z, VT.T), columns=['PC{}'.format(i) for i in range(VT.T.shape[1])])
df['y'] = D.df['y']
g = sns.pairplot(df, hue='y')
_ = g.map_lower(sns.kdeplot, **{'shade': False, 'n_levels': 10})
2.2. Verification with Scikit-Learn
Now we are going to do the same thing but using Scikit-Learn.
[19]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scaler = StandardScaler()
pca = PCA(n_components=3, random_state=37)
pipeline = Pipeline([('scaler', scaler), ('pca', pca)])
pipeline.fit(D.X)
[19]:
Pipeline(memory=None,
steps=[('scaler',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('pca',
PCA(copy=True, iterated_power='auto', n_components=3,
random_state=37, svd_solver='auto', tol=0.0,
whiten=False))],
verbose=False)
2.2.1. Right side eigenvectors
The components are the right-hand eigenvectors, VT or \(V'\).
[20]:
print(pca.components_)
[[-0.651 -0.687 0.012 0.309 0.086 -0.001]
[-0.054 -0.047 -0.728 -0.001 -0.672 -0.121]
[ 0.239 0.083 -0.53 0.546 0.577 -0.159]]
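Up to a possible sign flip per component, these match the first three rows of the manually computed VT. A quick check (a sketch):
print(np.allclose(np.abs(pca.components_), np.abs(VT[:3]), atol=1e-6))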
2.2.2. Eigenvalues
The eigenvalues or explained variance.
[21]:
print(pca.explained_variance_)
[1.836 1.025 1.018]
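These differ slightly from the manual eigenvalues because scikit-learn divides by \(n - 1\) rather than \(n\) when computing the covariance. A quick check:
n = D.X.shape[0]
print(np.allclose(pca.explained_variance_, S[:3] * n / (n - 1)))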
2.2.3. Loadings
The loadings.
[22]:
print(pca.components_.T * np.sqrt(pca.explained_variance_))
[[-0.883 -0.054 0.241]
[-0.932 -0.048 0.083]
[ 0.017 -0.737 -0.534]
[ 0.419 -0.001 0.551]
[ 0.117 -0.68 0.582]
[-0.002 -0.123 -0.16 ]]
2.2.4. Explained variance
The contribution of each variable's variance to each PC (the squared loadings).
[23]:
print(np.power(pca.components_.T * np.sqrt(pca.explained_variance_), 2))
[[0.779 0.003 0.058]
[0.868 0.002 0.007]
[0. 0.543 0.286]
[0.176 0. 0.303]
[0.014 0.462 0.339]
[0. 0.015 0.026]]
2.2.5. Projecting data into PC space
Transforming data into PC space.
[24]:
print(pipeline.transform(D.X))
[[ 0.785 -0.533 0.36 ]
[ 0.036 1.228 0.604]
[ 0.385 0.152 -0.344]
...
[-0.575 -1.009 -1.673]
[-1.966 -0.449 -0.822]
[-0.893 -0.409 0.265]]
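Up to the same per-component sign ambiguity, this matches the first three columns of the manual projection \(ZV\). A quick check (a sketch):
print(np.allclose(np.abs(pipeline.transform(D.X)), np.abs(np.dot(Z, VT.T)[:, :3]), atol=1e-6))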
2.2.6. Visualize data in PC space
[25]:
E = pipeline.transform(D.X)
df = pd.DataFrame(E, columns=['PC{}'.format(p) for p in range(E.shape[1])])
df['y'] = D.df['y']
g = sns.pairplot(df, hue='y')
_ = g.map_lower(sns.kdeplot, **{'shade': False, 'n_levels': 10})