1. Missing Data
In this notebook, we show how to deal with missing data. We will generate two synthetic datasets and then randomly knock out some of their values so that they are missing. The first dataset will have its parameters manually specified, while the second will have its parameters randomly generated.
1.1. Manually created example
1.1.1. Synthetic data
In this section we generate a dataset with manually specified parameters. We generate 3 variables whose means are \([10, 20, 30]\) and whose covariance matrix (the identity matrix) is as follows.
\(\left[ \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array} \right]\)
The full data is stored in data.
[1]:
%matplotlib inline
import numpy as np

np.random.seed(37)

# manually specified parameters: means and an identity covariance matrix
means = np.array([10.0, 20.0, 30.0], dtype=np.float64)
cov = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], dtype=np.float64)

# draw 100 samples from the corresponding multivariate normal distribution
data = np.random.multivariate_normal(means, cov, 100)
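As a quick sanity check (run in the same session as the cell above), the sample statistics should be close to the specified parameters; exact agreement is not expected with only 100 samples.

print(data.shape)         # (100, 3)
print(data.mean(axis=0))  # approximately [10, 20, 30]
print(np.cov(data.T))     # approximately the identity matrix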
Now we randomly generate (row, column) index pairs whose corresponding elements will be made missing.
[2]:
def get_null_indices(num_rows, num_cols, num_nulls=50):
    """Samples num_nulls unique (row, column) index pairs."""
    null_indices = []
    indices_pairs = set()
    while len(null_indices) < num_nulls:
        r = np.random.randint(num_rows)
        c = np.random.randint(num_cols)
        if (r, c) not in indices_pairs:
            indices_pairs.add((r, c))
            null_indices.append((r, c))
    return null_indices

null_indices = get_null_indices(data.shape[0], data.shape[1])
Let’s create the missing data, m_data.
[3]:
def knockout_data(data, null_indices):
    """Returns a copy of data with the elements at null_indices set to NaN."""
    m_data = np.copy(data)
    for r, c in null_indices:
        m_data[r, c] = np.nan
    return m_data

m_data = knockout_data(data, null_indices)
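As a quick check on the cell above, exactly 50 elements should now be NaN, since get_null_indices sampled 50 unique index pairs.

assert np.isnan(m_data).sum() == 50
print(np.isnan(m_data).sum(axis=0))  # missing count per column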
1.1.2. Visualize
[4]:
import missingno as msno
import matplotlib
import pandas as pd
data_df = pd.DataFrame(m_data, columns=['x{}'.format(c) for c in range(data.shape[1])])
msno.matrix(data_df)
msno.bar(data_df)
msno.heatmap(data_df)
_ = msno.dendrogram(data_df)
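The matrix plot shows the nullity pattern of each row, the bar chart shows the count of non-missing values per column, the heatmap shows the pairwise nullity correlations between columns, and the dendrogram clusters columns by the similarity of their missingness patterns. Since the knockouts here are completely random, no strong nullity correlations should appear.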
1.1.3. Imputation
Let’s impute the missing data using several different techniques.
[5]:
import warnings

from fancyimpute import KNN
from fancyimpute import NuclearNormMinimization
from fancyimpute import SoftImpute
from fancyimpute import IterativeImputer
from fancyimpute import BiScaler

with warnings.catch_warnings(record=True):
    # iterative imputation, modeling each variable from the others
    ii_data = IterativeImputer(verbose=0).fit_transform(m_data)

    # k-nearest neighbors imputation
    nn_data = KNN(k=7, verbose=False).fit_transform(m_data)

    # SoftImpute is typically applied to BiScaler-normalized data,
    # hence the BiScaler step first
    bs_data = BiScaler(verbose=False).fit_transform(m_data)
    si_data = SoftImpute(verbose=False).fit_transform(bs_data)
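If fancyimpute is not available, comparable imputers exist in scikit-learn; the sketch below is a minimal alternative, assuming scikit-learn 0.22 or later (results will differ in detail from fancyimpute's).

# minimal scikit-learn alternatives to the fancyimpute calls above
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer as SkIterativeImputer
from sklearn.impute import KNNImputer

sk_ii_data = SkIterativeImputer().fit_transform(m_data)       # iterative imputation
sk_nn_data = KNNImputer(n_neighbors=7).fit_transform(m_data)  # 7-nearest-neighbor imputation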
1.1.4. Performance
Let’s measure the performance of the imputation techniques. We compute three mean squared errors (MSEs): over the imputed values at the knocked-out positions, over the per-variable means, and over the covariance matrices.
[6]:
def get_mse(f_data, i_data, null_indices):
    """MSE between the true and imputed values at the knocked-out positions."""
    t = []
    for r, c in null_indices:
        v1 = f_data[r][c]
        v2 = i_data[r][c]
        diff = np.power(v1 - v2, 2.0)
        t.append(diff)
    return np.mean(np.array(t))

def get_mean_mse(f_data, i_data):
    """MSE between the per-variable means of the full and imputed data."""
    m1 = np.mean(f_data, axis=0)
    m2 = np.mean(i_data, axis=0)
    t = []
    for v1, v2 in zip(m1, m2):
        diff = np.power(v1 - v2, 2.0)
        t.append(diff)
    return np.mean(np.array(t))

def get_cov_mse(f_data, i_data):
    """MSE between the covariance matrices of the full and imputed data."""
    cov1 = np.cov(f_data.T)
    cov2 = np.cov(i_data.T)
    num_rows, num_cols = cov1.shape
    t = []
    for r in range(num_rows):
        for c in range(num_cols):
            v1 = cov1[r][c]
            v2 = cov2[r][c]
            diff = np.power(v1 - v2, 2.0)
            t.append(diff)
    return np.mean(np.array(t))
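Since these helpers operate elementwise on aligned arrays, the same quantities can also be computed in vectorized form; the sketch below is an equivalent formulation under the same definitions.

# vectorized equivalents of the helper functions above
def get_mse_v(f_data, i_data, null_indices):
    idx = np.array(null_indices)
    rows, cols = idx[:, 0], idx[:, 1]
    return np.mean((f_data[rows, cols] - i_data[rows, cols]) ** 2)

def get_mean_mse_v(f_data, i_data):
    return np.mean((f_data.mean(axis=0) - i_data.mean(axis=0)) ** 2)

def get_cov_mse_v(f_data, i_data):
    return np.mean((np.cov(f_data.T) - np.cov(i_data.T)) ** 2)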
[7]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

# SoftImpute was run on BiScaler-normalized data, so we evaluate it
# against the standardized data rather than the raw data
scaler = StandardScaler()
s_data = scaler.fit_transform(data)

perfs = pd.DataFrame([(get_mse(data, ii_data, null_indices),
                       get_mean_mse(data, ii_data),
                       get_cov_mse(data, ii_data)),
                      (get_mse(data, nn_data, null_indices),
                       get_mean_mse(data, nn_data),
                       get_cov_mse(data, nn_data)),
                      (get_mse(s_data, si_data, null_indices),
                       get_mean_mse(s_data, si_data),
                       get_cov_mse(s_data, si_data))],
                     columns=['MSE', 'Average MSE', 'Covariance MSE'],
                     index=['Iterative Imputer', 'KNN Imputer', 'Soft Imputer'])
perfs
[7]:
|                   | MSE      | Average MSE | Covariance MSE |
|-------------------|----------|-------------|----------------|
| Iterative Imputer | 0.886978 | 0.003315    | 0.008129       |
| KNN Imputer       | 1.338415 | 0.003687    | 0.005270       |
| Soft Imputer      | 0.988587 | 0.000258    | 0.115665       |
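Note that the Soft Imputer row is computed against the standardized data (since SoftImpute was run on normalized input), so its values are on a different scale and are not directly comparable to the other two rows.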
1.2. Randomly generated example
1.2.1. Synthetic data
Now we generate a dataset of 10 random variables with the parameters (means and covariance matrix) generated randomly (as opposed to manually specified as before). We go through the whole process of

- parameter generation,
- data generation,
- missing data creation, and
- imputation.

We then visualize the missingness, followed by computing the performances of the imputation techniques.
[8]:
num_vars = 10

# random means in [20, 100)
means = np.array([np.random.randint(20, 100) for i in range(num_vars)], dtype=np.float64)

# random covariance matrix with unit variances; the result is not
# guaranteed to be symmetric or positive semi-definite, which is why
# the sampling warnings are suppressed below
cov = []
for i in range(num_vars):
    for j in range(num_vars):
        if i == j:
            cov.append(1.0)
        else:
            cov.append(np.random.randint(1, 10))

with warnings.catch_warnings(record=True):
    cov = np.array(cov, dtype=np.float64).reshape((num_vars, num_vars))
    data = np.random.multivariate_normal(means, cov, 500)

null_indices = get_null_indices(data.shape[0], data.shape[1], num_nulls=1000)
m_data = knockout_data(data, null_indices)
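The randomly filled covariance matrix above is generally neither symmetric nor positive semi-definite, which is why numpy may warn while sampling (and why the warnings are suppressed). Below is a minimal sketch of one way to repair such a matrix into a valid covariance matrix by symmetrizing it and clipping its eigenvalues; the 1e-6 floor is an arbitrary choice here, not part of the original notebook.

# repair a random square matrix into a valid (symmetric PSD) covariance matrix
def make_psd(m, eps=1e-6):
    sym = (m + m.T) / 2.0        # force symmetry
    w, v = np.linalg.eigh(sym)   # eigendecomposition of the symmetric part
    w = np.clip(w, eps, None)    # clip eigenvalues to be at least eps
    return v @ np.diag(w) @ v.T

cov_psd = make_psd(cov)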
1.2.2. Imputation
[9]:
with warnings.catch_warnings(record=True):
    ii_data = IterativeImputer(verbose=0).fit_transform(m_data)
    nn_data = KNN(k=7, verbose=False).fit_transform(m_data)
    bs_data = BiScaler(verbose=False).fit_transform(m_data)
    si_data = SoftImpute(verbose=False).fit_transform(bs_data)

scaler = StandardScaler()
s_data = scaler.fit_transform(data)
1.2.3. Visualize
[10]:
data_df = pd.DataFrame(m_data, columns=['x{}'.format(c) for c in range(data.shape[1])])
msno.matrix(data_df)
msno.bar(data_df)
msno.heatmap(data_df)
_ = msno.dendrogram(data_df)
1.2.4. Performance
As can be seen below, among the directly comparable techniques, iterative imputation does the best at preserving the mean of each variable and also the covariance. (The Soft Imputer row is again computed against the standardized data, so its values are on a different scale.)
[11]:
scaler = StandardScaler()
s_data = scaler.fit_transform(data)

perfs = pd.DataFrame([(get_mse(data, ii_data, null_indices),
                       get_mean_mse(data, ii_data),
                       get_cov_mse(data, ii_data)),
                      (get_mse(data, nn_data, null_indices),
                       get_mean_mse(data, nn_data),
                       get_cov_mse(data, nn_data)),
                      (get_mse(s_data, si_data, null_indices),
                       get_mean_mse(s_data, si_data),
                       get_cov_mse(s_data, si_data))],
                     columns=['MSE', 'Average MSE', 'Covariance MSE'],
                     index=['Iterative Imputer', 'KNN Imputer', 'Soft Imputer'])
perfs
[11]:
|                   | MSE      | Average MSE | Covariance MSE |
|-------------------|----------|-------------|----------------|
| Iterative Imputer | 6.554412 | 0.001915    | 0.141364       |
| KNN Imputer       | 8.686041 | 0.007308    | 0.623918       |
| Soft Imputer      | 0.909642 | 0.000009    | 0.184934       |