4. Dimension Reduction or Concept Learning?
Reducing the dimensionality of the data is a common preprocessing step for many machine learning tasks (e.g., classification, regression, clustering). High dimensionality causes different problems for different tasks; in clustering, for example, it invites the curse of dimensionality. There is a multitude of approaches for reducing the dimensions of a dataset, and these techniques typically project the data from the original high-dimensional space into a new, lower-dimensional space. The key word here is space: the original high-dimensional space is the raw space where the data lives, and the new low-dimensional space is the space into which the data is projected.
But what is the meaning of the new space? The lower-dimensional space can be interpreted in many different but related ways. One interpretation is that, although it has fewer dimensions, the new space organizes the original dimensions into higher-order meaning. That is the source of the confusion: the lower-dimensional space represents higher-order meaning. This new space is also called the hidden space or latent space. When dimension reduction techniques are used, it is better to understand them as reprojecting the data into a space of higher-order meaning, even though that space has fewer dimensions. The new dimensions are often called components when using techniques like principal component analysis (PCA).
Here’s a motivating example. Imagine we have news articles with a lot of words. These documents could be represented as a vector space model (VSM), where documents correspond to rows and words correspond to columns. Each element in the VSM could simply be the count of how many times a word appears in a document. We can project this VSM into a new space where the axes (columns/fields) are higher-order meanings or concepts. For example, the words could be reprojected to their types (e.g., adjectives, verbs, nouns).
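To make this concrete, here is a toy sketch (the documents, words, and counts are made up purely for illustration and are not part of the dataset used below). Four short documents over four words are reduced to two latent axes with truncated SVD, the technique behind latent semantic analysis; each new axis is a weighted mixture of words that behaves like a topic or concept rather than any single word.

# Toy sketch with made-up counts: a tiny document-term matrix (VSM) projected
# onto two latent "concept" axes using truncated SVD (latent semantic analysis).
import pandas as pd
from sklearn.decomposition import TruncatedSVD

vsm = pd.DataFrame(
    [[3, 2, 0, 0],
     [2, 3, 1, 0],
     [0, 0, 3, 2],
     [0, 1, 2, 3]],
    columns=['goal', 'score', 'election', 'vote'],
    index=['doc1', 'doc2', 'doc3', 'doc4']
)

svd = TruncatedSVD(n_components=2, random_state=37)
Z = svd.fit_transform(vsm)  # each document re-expressed in the 2D latent space

# each latent axis is a weighted combination of all the words
print(pd.DataFrame(svd.components_, columns=vsm.columns, index=['C0', 'C1']))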
In the example below, we will play around with the breast cancer data from Scikit-Learn and apply a few different “dimension reduction” techniques. Let’s see if these techniques can recover any concepts or higher-order meaning from the raw features.
4.1. Breast cancer data
There’s nothing important here; we simply load the data. Note that this dataset is intended for a classification problem. Interestingly, 0 denotes malignant and 1 denotes benign. There are 30 features in this dataset.
[1]:
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
y_map = {0: 'malignant', 1: 'benign'}
X.shape, y.shape
[1]:
((569, 30), (569,))
[2]:
X.head()
[2]:
mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 30 columns
[3]:
y.value_counts().sort_index()
[3]:
target
0 212
1 357
Name: count, dtype: int64
4.2. Decomposition
We will use 4 different dimension reduction algorithms:

- PCA, denoted as pca
- kernel PCA, denoted as kpca
- factor analysis, denoted as fa
- tSNE, denoted as tsne

In the code below, M stores the models and D stores the projected data corresponding to each model.

Note that the number of components is hard-coded to 2 for illustration purposes. Ideally, we would want to tune this hyperparameter (one way to do so for PCA is sketched after the code cell below).
[4]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA, FactorAnalysis
from sklearn.manifold import TSNE
M = {
'pca': Pipeline([
('scaler', StandardScaler()),
('decomposer', PCA(n_components=2, random_state=37))
]),
'kpca': Pipeline([
('scaler', StandardScaler()),
('decomposer', KernelPCA(n_components=2, kernel='rbf', random_state=37))
]),
'fa': Pipeline([
('scaler', StandardScaler()),
('decomposer', FactorAnalysis(n_components=2, random_state=37))
]),
'tsne': Pipeline([
('scaler', StandardScaler()),
('decomposer', TSNE(n_components=2, n_jobs=-1, random_state=37))
])
}
to_df = lambda mat: pd.DataFrame(mat, columns=[f'C{i}' for i in range(mat.shape[1])])
D = {k: to_df(m.fit_transform(X)) for k, m in M.items()}
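As an aside, the choice of 2 components is arbitrary. A minimal sketch of one way to guide that choice for PCA (not run here, and only applicable to the variance-based techniques) is to look at the cumulative explained variance ratio of a full PCA fit:

# Aside (sketch): cumulative explained variance ratio of a full PCA fit, one
# common guide for choosing the number of components.
evr = Pipeline([
    ('scaler', StandardScaler()),
    ('decomposer', PCA(random_state=37))
]).fit(X).named_steps['decomposer'].explained_variance_ratio_

print(np.cumsum(evr)[:5])  # variance retained by the first 1..5 components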
4.3. Density plots
Here are the density plots for the first (C0) and second (C1) components for each technique.
[5]:
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(1, 4, figsize=(15, 2.5))
for algo, ax, color in zip(D, np.ravel(axes), sns.color_palette('tab10')):
D[algo]['C0'].plot(kind='kde', ax=ax, title=f'{algo}, C0', color=color)
fig.tight_layout()
[6]:
fig, axes = plt.subplots(1, 4, figsize=(15, 2.5))
for algo, ax, color in zip(D, np.ravel(axes), sns.color_palette('tab10')):
D[algo]['C1'].plot(kind='kde', ax=ax, title=f'{algo}, C1', color=color)
fig.tight_layout()
Here are some bivariate density plots for the two components learned by each algorithm. PCA and kernel PCA seem to have 2 centers; factor analysis has one, and tSNE seems to have 4.
[7]:
fig, axes = plt.subplots(1, 4, figsize=(15, 2.5))
for algo, ax, color in zip(D, np.ravel(axes), sns.color_palette('tab10')):
sns.kdeplot(data=D[algo], x='C0', y='C1', ax=ax, color=color)
ax.set_title(algo)
fig.tight_layout()
4.4. Scatter plots
Take a look at the reprojected data color-coded by class (malignant vs benign). Interestingly, going from 30 dimensions to 2, there seems to be linear separability for all techniques!
[8]:
import seaborn as sns
def scatter_plot(d, k, ax):
df = d.assign(y=y)
for y_val, color in zip(y.unique(), sns.color_palette('tab10')):
label = y_map[y_val]
df[df['y']==y_val].plot(
kind='scatter',
x='C0',
y='C1',
ax=ax,
color=color,
label=label,
s=2
)
ax.set_title(k)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
fig, axes = plt.subplots(2, 2, figsize=(10, 5))
for k, ax in zip(D.keys(), np.ravel(axes)):
scatter_plot(D[k], k, ax)
fig.tight_layout()
4.5. Classification
If the new dimensions are meaningful, they should help with classification. Below are the non-validated (training-set) classification performances for each technique. Evidently, PCA and tSNE have the best performance in terms of area under the ROC curve and average precision score.
[9]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
def get_performance(Z, algo):
m = LogisticRegression(random_state=37, n_jobs=-1, solver='saga', max_iter=10_000)
m.fit(Z, y)
y_pred = m.predict(Z)
roc, aps = roc_auc_score(y, y_pred), average_precision_score(y, y_pred)
return {
'algo': algo,
'roc': roc,
'aps': aps
}
pd.DataFrame([get_performance(D[algo], algo) for algo in D.keys()])
[9]:
algo | roc | aps | |
---|---|---|---|
0 | pca | 0.950617 | 0.949288 |
1 | kpca | 0.922903 | 0.923345 |
2 | fa | 0.910153 | 0.909794 |
3 | tsne | 0.954376 | 0.952908 |
Here’s the kicker: a logistic regression classification model learned from the raw dimensions performs worse than one learned from the new, fewer dimensions.
[10]:
m = LogisticRegression(random_state=37, n_jobs=-1, solver='saga', max_iter=10_000)
m.fit(X, y)
y_pred = m.predict(X)
roc_auc_score(y, y_pred), average_precision_score(y, y_pred)
[10]:
(0.9086795095396649, 0.9067761384577263)
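Keep in mind that all of these are training-set scores. A rough cross-validated comparison is sketched below; note that the components were computed on the full dataset, which leaks a little information into the folds, and tSNE cannot embed held-out points at all, so this is only indicative.

from sklearn.model_selection import cross_val_score

# Rough sketch: 5-fold CV of the same classifier on the scaled raw features vs. the PCA components.
lr = LogisticRegression(random_state=37, solver='saga', max_iter=10_000)
raw_auc = cross_val_score(lr, StandardScaler().fit_transform(X), y, cv=5, scoring='roc_auc').mean()
pca_auc = cross_val_score(lr, D['pca'], y, cv=5, scoring='roc_auc').mean()
print(raw_auc, pca_auc)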
4.6. Coefficients
If we want to know how the old dimensions influence the new ones, we can perform regression. Since there are 2 components, for each algorithm we regress each component on the old dimensions. Below, after we get the regression coefficients (by default, feature importances from a random forest), we rescale them to the range [20, 100] to make side-by-side comparison of the coefficients easier.
[11]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
def get_components(algo, regressor='rf'):
def get_regressor():
regressors = {
'linear': lambda: LinearRegression(fit_intercept=False, n_jobs=-1),
'ridge': lambda: Ridge(fit_intercept=False, random_state=37),
'lasso': lambda: Lasso(fit_intercept=False, random_state=37),
'en': lambda: ElasticNet(l1_ratio=0.5, fit_intercept=False, random_state=37),
'rf': lambda: RandomForestRegressor(random_state=37, n_jobs=-1)
}
if regressor in regressors:
return regressors[regressor]
return regressors['linear']
def get_influence(c):
r = get_regressor()()
q = pd.DataFrame(StandardScaler().fit_transform(D[algo]), columns=D[algo].columns)
q = q[c]
r.fit(Z, q)
if regressor == 'rf':
f = r.feature_importances_
else:
f = r.coef_
return pd.Series(np.abs(f), index=X.columns)
Z = M[algo].named_steps['scaler'].transform(X)
return pd.DataFrame({
f'{algo}_C0': get_influence('C0'),
f'{algo}_C1': get_influence('C1')
})
def rescale(X):
S = MinMaxScaler((20, 100)).fit_transform(X)
S = pd.DataFrame(S, columns=X.columns, index=X.index) \
.assign(**{
            f'__{X.columns[0]}': S[:,0]  # repeat the first feature so the polar plots below can close the loop
})
return S
I = get_components('pca') \
.join(get_components('kpca')) \
.join(get_components('fa')) \
.join(get_components('tsne')).T
I = rescale(I)
I
[11]:
mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | __mean radius | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
pca_C0 | 20.000000 | 27.660293 | 21.012388 | 20.089757 | 24.708053 | 25.028562 | 64.230320 | 100.000000 | 23.131311 | 20.070961 | ... | 25.415234 | 31.177773 | 23.188636 | 23.694922 | 57.657200 | 40.778190 | 38.716474 | 25.768649 | 20.242635 | 20.000000 |
pca_C1 | 22.456687 | 25.707383 | 20.694835 | 57.455354 | 35.779462 | 21.683326 | 20.272224 | 20.750085 | 38.187873 | 100.000000 | ... | 26.703584 | 33.359398 | 62.578362 | 51.705521 | 37.667882 | 38.185203 | 20.203508 | 33.673474 | 20.712339 | 22.456687 |
kpca_C0 | 20.124217 | 60.862211 | 20.000000 | 20.338962 | 34.834190 | 25.588299 | 100.000000 | 24.587450 | 27.726151 | 21.120664 | ... | 31.514524 | 59.600710 | 35.243865 | 40.871446 | 57.364089 | 31.329845 | 21.401781 | 27.499165 | 20.917665 | 20.124217 |
kpca_C1 | 29.785557 | 37.624838 | 22.388116 | 93.877927 | 56.684593 | 22.079040 | 21.095430 | 20.598039 | 35.412102 | 74.147545 | ... | 31.779264 | 65.516090 | 100.000000 | 47.628771 | 51.862636 | 34.163889 | 20.482895 | 34.958548 | 20.524924 | 29.785557 |
fa_C0 | 100.000000 | 20.000000 | 100.000000 | 100.000000 | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 20.000000 | ... | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 20.000000 | 100.000000 |
fa_C1 | 20.485784 | 29.111360 | 20.073381 | 22.051355 | 75.004853 | 100.000000 | 21.330962 | 20.438261 | 57.870181 | 49.368413 | ... | 24.869753 | 20.951976 | 21.719272 | 65.038759 | 83.453114 | 100.000000 | 20.252925 | 33.431891 | 100.000000 | 20.485784 |
tsne_C0 | 20.133475 | 69.253239 | 20.226565 | 20.000000 | 52.425656 | 41.197356 | 29.660382 | 60.896868 | 24.332261 | 20.166627 | ... | 80.989377 | 100.000000 | 40.055801 | 100.000000 | 100.000000 | 34.865060 | 100.000000 | 27.238558 | 20.766349 | 20.133475 |
tsne_C1 | 22.389117 | 100.000000 | 21.237619 | 41.932225 | 100.000000 | 30.457442 | 20.524996 | 21.595629 | 100.000000 | 44.820743 | ... | 100.000000 | 43.921935 | 40.136060 | 89.301745 | 83.488993 | 67.899110 | 21.297724 | 100.000000 | 23.009203 | 22.389117 |
8 rows × 31 columns
You can read the table above by going down a column or across a row. Going down a column, the values indicate how strongly that original dimension influences each of the new ones. Going across a row, the values indicate how much each original dimension contributes to that component in magnitude (note that direction was removed by taking the absolute value of all coefficients).
For PCA, if we use the linear regressor, this regression of each new dimension on the old (standardized) ones recovers the eigenvectors (up to scale), because each component is an exact linear combination of the standardized features (the table above uses random forest importances instead).
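As a quick sanity check of that claim, here is a minimal sketch reusing the fitted M['pca'] pipeline from above: ordinary least squares of the raw first PCA score on the standardized features should match the first loading vector almost exactly.

# Sketch: OLS of the first PCA score on the standardized features recovers the first eigenvector.
Z = M['pca'].named_steps['scaler'].transform(X)
scores = M['pca'].named_steps['decomposer'].transform(Z)
w = LinearRegression(fit_intercept=False, n_jobs=-1).fit(Z, scores[:, 0]).coef_

print(np.allclose(w, M['pca'].named_steps['decomposer'].components_[0], atol=1e-6))  # expected: True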
4.7. Parallel coordinates
Now we make a desperate attempt at visualizing whether there is any sensible higher-order meaning in the components. There’s a lot to look at, but for tSNE it seems the first component is about the “worst” features.
[12]:
fig, ax = plt.subplots(figsize=(20, 5), dpi=100)
pd.plotting.parallel_coordinates(
I.assign(y=I.index)[[True if i.endswith('C0') else False for i in I.index]] \
.drop(columns=['__mean radius']),
'y',
color=sns.color_palette('tab10'),
sort_labels=False,
axvlines=False,
ax=ax
)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
fig.tight_layout()
When looking at the second component for tSNE, it seems this component is about the “error” features.
[13]:
fig, ax = plt.subplots(figsize=(20, 5), dpi=100)
pd.plotting.parallel_coordinates(
I.assign(y=I.index)[[True if i.endswith('C1') else False for i in I.index]] \
.drop(columns=['__mean radius']),
'y',
color=sns.color_palette('tab10'),
sort_labels=False,
axvlines=False,
ax=ax
)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
fig.tight_layout()
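Ranking the influences numerically can back up these visual reads. Here’s a quick sketch using the I table from above (remember its values were rescaled to [20, 100]):

# Sketch: top-5 original features by rescaled influence on each tSNE component.
top = lambda row: I.loc[row].drop('__mean radius').sort_values(ascending=False).head(5)
print(top('tsne_C0'))
print(top('tsne_C1'))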
4.8. Radial plots
Perhaps radial plots can also give us better intuition about what the concepts or higher-order meanings of the components are. The radial plot for tSNE gravitates towards the “worst” features, which confirms the interpretation from the parallel coordinates plot. Interestingly, for PCA, the first component is about the “mean” features.
[14]:
fig, ax = plt.subplots(figsize=(10, 5), dpi=100)
pd.plotting.radviz(
I.assign(y=I.index)[[True if i.endswith('C0') else False for i in I.index]] \
.drop(columns=['__mean radius']),
'y',
color=sns.color_palette('tab10'),
ax=ax
)
ax.axis('off')
fig.tight_layout()
Let me know if you think there’s any signal with visualizing the coefficients of the features on the second component. It seems tSNE is moving towards the “error” features.
[15]:
fig, ax = plt.subplots(figsize=(10, 5), dpi=100)
pd.plotting.radviz(
I.assign(y=I.index)[[True if i.endswith('C1') else False for i in I.index]] \
.drop(columns=['__mean radius']),
'y',
color=sns.color_palette('tab10'),
ax=ax
)
ax.axis('off')
fig.tight_layout()
4.9. Polar plots
What about polar plots?
[16]:
from matplotlib.text import Text
def plot_polar(X, markersize=3, linewidth=1, alpha=1.0):
fig, ax = plt.subplots(subplot_kw={'projection': 'polar'}, figsize=(10, 7))
d = X.shape[1]
s = 360 / (d - 1)
m = s * d + 1
degrees = np.arange(s, m, s)
radians = np.deg2rad(degrees)
colors = sns.color_palette('tab10')
for (name, r), color in zip(X.iterrows(), colors):
ax.plot(
radians,
r,
marker='o',
markersize=markersize,
linewidth=linewidth,
color=color,
alpha=alpha,
label=name
)
ax.set_yticklabels([])
ax.set_ylim(0, 150)
ax.spines['polar'].set_alpha(0.1)
    # map each plotted angle to its feature name (the duplicated closing column shares the first tick)
    x2l = {k: v for k, v in zip(radians, X.columns[:X.shape[1] - 1])}
x_ticks = [x for x, _ in x2l.items()]
x_labels = [Text(x, 0, label) for x, label in x2l.items()]
ax.xaxis.set_ticks(x_ticks)
ax.xaxis.set_ticklabels(x_labels)
ax.legend(loc='center right', bbox_to_anchor=(1.35, 0.5))
for gl in ax.xaxis.get_gridlines():
gl.set_alpha(0.1)
for gl in ax.yaxis.get_gridlines():
gl.set_alpha(0.1)
fig.tight_layout()
This polar plot is of the coefficients of the raw features on the projection into the first component.
[17]:
plot_polar(I[[True if i.endswith('C0') else False for i in I.index]])
This polar plot is of the coefficients of the raw features on the projection into the second component.
[18]:
plot_polar(I[[True if i.endswith('C1') else False for i in I.index]])
What if we just look at PCA?
[19]:
plot_polar(I[[True if i.startswith('pca') else False for i in I.index]])
What if we just looked at kernel PCA?
[20]:
plot_polar(I[[True if i.startswith('kpca') else False for i in I.index]])
Or just factor analysis?
[21]:
plot_polar(I[[True if i.startswith('fa') else False for i in I.index]])
And lastly, for tSNE.
[22]:
plot_polar(I[[True if i.startswith('tsne') else False for i in I.index]])
4.10. Summary
Dimension reduction may be necessary to avoid certain problems associated with high dimensionality. However, when reducing the dimensions, we are not simply doing “feature selection” (e.g., taking the 30 features here and keeping only 2 of them). The dimension reduction techniques above actually project the original dimensions into new ones, and those new (but fewer) dimensions carry higher-order meaning, although it may not be easy to interpret exactly and precisely what those meanings are.
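To drive home that difference, here is a small contrast sketch (using SelectKBest as a stand-in for a feature selection method): selection keeps 2 of the original 30 columns untouched, whereas each PCA component is a weighted mixture of all 30.

# Contrast sketch: feature selection keeps original columns; PCA mixes all of them.
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=2).fit(X, y)
print(X.columns[selector.get_support()])  # the 2 original features, kept as-is

loadings = M['pca'].named_steps['decomposer'].components_
print((np.abs(loadings) > 0).sum(axis=1))  # each PCA component draws on all 30 features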