1. COVID Hubei Data
The original article references this spreadsheet as containing patient-level
data on patients verified to have contracted COVID-19. However, that spreadsheet now points to a
GitHub location that stores a different data set; the original CSV files appear to live in a different location within the same GitHub repository. To our knowledge, no other dataset with patient-level
data is available online. The original data is an Excel file with two data sheets: one for patients
inside of Hubei
and one for those outside. Let’s see what we have.
1.1. Load Data
We have two data sheets; let’s merge them into one. Does it really matter to the world at large whether the patients are inside or outside of Hubei?
[1]:
import pandas as pd
inside = pd.read_csv('./covid/data/00/COVID19_2020_open_line_list - Hubei.csv', low_memory=False)
outside = pd.read_csv('./covid/data/00/COVID19_2020_open_line_list - outside_Hubei.csv', low_memory=False)
outside = outside.drop(['data_moderator_initials'], axis=1)
data = pd.concat([inside, outside])
[2]:
data.columns
[2]:
Index(['ID', 'age', 'sex', 'city', 'province', 'country',
'wuhan(0)_not_wuhan(1)', 'latitude', 'longitude', 'geo_resolution',
'date_onset_symptoms', 'date_admission_hospital', 'date_confirmation',
'symptoms', 'lives_in_Wuhan', 'travel_history_dates',
'travel_history_location', 'reported_market_exposure',
'additional_information', 'chronic_disease_binary', 'chronic_disease',
'source', 'sequence_available', 'outcome', 'date_death_or_discharge',
'notes_for_discussion', 'location', 'admin3', 'admin2', 'admin1',
'country_new', 'admin_id'],
dtype='object')
The raw data has 47,201 records across 32 columns.
[3]:
data.shape
[3]:
(47201, 32)
1.2. Age
The age field shows that age data is not kept consistently: in some cases age is a whole number, in others a float, and in many cases it is a range.
[4]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')
[5]:
fig, ax = plt.subplots(figsize=(25, 5))
s = data.age.value_counts().sort_index()
n = s.sum()
_ = s.plot(kind='bar', ax=ax, title=f'Raw Age Values, n={n:,}')
Let’s just look at the unique age values.
[6]:
for i in data.age.value_counts().sort_index().index:
    print(f"'{i}'", end=',')
'0-10','0-17','0-18','0-19','0-6','0-60','0-9','0.08333','0.25','0.58333','0.6666666667','1','1-19','1.75','10','10-19','11','12','13','13-19','13-69','14','15','15-88','16','16-17','16-80','17','17-66','18','18 - 100','18-','18-20','18-49','18-50','18-60','18-64','18-65','18-99','19','19-65','19-77','2','20','20-29','20-30','20-39','20-69','20-70','21','21-39','21-72','22','22-80','23','23-72','24','25','25-89','26','27','27-29','27-40','28','29','3','30','30-35','30-39','30-40','30-69','30-70','31','32','33','34','34-66','35','35-54','36','36-45','37','38','39','4','40','40-45','40-49','40-50','40-69','40-89','41','41-60','42','43','44','45','46','47','48','49','5','5-59','50','50-100','50-59','50-60','50-69','50-99','51','52','53','54','54-56','55','55-','55-74','56','57','58','59','60','60-','60-100','60-60','60-69','60-70','60-99','61','61-80','62','63','64','65','65-','65-99','66','67','68','69','7','70','70-100','70-70','70-79','70-82','71','72','73','74','74-76','75','75-','76','77','78','79','8','8-68','80','80-80','80-89','81','82','83','84','85','86','87','88','89','9','90-99','91','92','94','95','96',
The map below attempts to bin the raw age values into the following age ranges.
[0, 19],
[20, 29],
[30, 39],
[40, 49],
[50, 59],
[60, 69],
[70, 79],
[80, 89], and
[90, 99].
[7]:
import numpy as np
age_map = {
    '00_19': set(['0-10','0-17','0-18','0-19','0-6','0-9','0.08333','0.25','0.58333','0.6666666667','1-19','1.75','10','10-19','13-19','16-17'] + [f"'{i}'" for i in range(20)]),
    '20_29': set(['20-29','27-29'] + [f"'{i}'" for i in range(20, 30)]),
    '30_39': set(['30-35','30-39'] + [f"'{i}'" for i in range(30, 40)]),
    '40_49': set(['40-45','40-49'] + [f"'{i}'" for i in range(40, 50)]),
    '50_59': set(['50-59','54-56', '55-'] + [f"'{i}'" for i in range(50, 60)]),
    '60_69': set(['60-','60-60','60-69','65-'] + [f"'{i}'" for i in range(60, 70)]),
    '70_79': set(['70-70','70-79','74-76','75-'] + [f"'{i}'" for i in range(70, 80)]),
    '80_89': set(['80-80','80-89'] + [f"'{i}'" for i in range(80, 90)]),
    '90_99': set(['90-99'] + [f"'{i}'" for i in range(90, 100)]),
}
get_age_group = lambda x: [k for k, v in age_map.items() if f"'{x}'" in v]
fig, ax = plt.subplots(figsize=(20, 5))
s = pd.Series(list(map(lambda arr: arr[0] if len(arr) > 0 else None, data.age.apply(lambda x: get_age_group(x))))).value_counts().sort_index()
n = s.sum()
_ = s.plot(kind='bar', ax=ax, title=f'Mapped Age Values, n={n:,}')
totals = [i.get_height() for i in ax.patches]
total = sum(totals)
for i in ax.patches:
    x = i.get_x() + 0.1
    y = i.get_height() + 5.0
    p = i.get_height() / total * 100.0
    s = f'{i.get_height()} ({p:.1f}%)'
    ax.text(x, y, s, fontsize=10, color='red')
1.3. Sex
[8]:
fig, ax = plt.subplots(figsize=(10, 5))
s = data.sex.value_counts().sort_index()
n = s.sum()
_ = s.plot(kind='bar', ax=ax, title=f'Gender, n={n:,}')
for i in ax.patches:
    x = i.get_x() + 0.1
    y = i.get_height() + 15.0
    p = i.get_height() / n * 100.0
    s = f'{i.get_height():,} ({p:.1f}%)'
    ax.text(x, y, s, fontsize=10, color='red')
1.4. Chronic Disease
Only a very small number of patients had a yes or no answer recorded to indicate whether they had chronic diseases. May we assume that missing values indicate no? Not really: of the patients with a recorded yes or no to having at least one chronic disease, over 86% reported yes.
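As a rough check of how sparse this field is, we can look at the fraction of records with any value at all. This is a minimal snippet that is not part of the original analysis; it assumes chronic_disease_binary is read in as a numeric column with missing values.

# fraction of records with any chronic disease indication recorded
print(f'{data.chronic_disease_binary.notna().mean():.2%} of records have a recorded value')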
[9]:
fig, ax = plt.subplots(figsize=(10, 5))
s = data.chronic_disease_binary.apply(lambda v: 'No' if v == 0.0 else 'Yes' if v == 1.0 else None).value_counts().sort_index()
n = s.sum()
_ = s.plot(kind='bar', ax=ax, title=f'Chronic Disease Indication, n={n}')
totals = [i.get_height() for i in ax.patches]
total = sum(totals)
for i in ax.patches:
    x = i.get_x() + 0.2
    y = i.get_height() + 0.5
    p = i.get_height() / n * 100.0
    s = f'{i.get_height()} ({p:.1f}%)'
    ax.text(x, y, s, fontsize=10, color='red')
Here, we map the diseases to 4 main categories:
hypertension,
diabetes,
coronary heart disease, and
other.
The other category lumps together diseases that each had a count of only 1.
[10]:
import itertools
def tokenize(s):
    if s is None or isinstance(s, float) or len(s) < 1 or pd.isna(s):
        return None
    try:
        delim = ';' if ';' in s else ','
        return [t.strip().lower() for t in s.split(delim) if len(t.strip()) > 0]
    except:
        return s
def map_to_disease(d):
    if 'hypertension' in d:
        return 'hypertension'
    elif 'diabetes' in d:
        return 'diabetes'
    elif 'coronary' in d or 'pulmonary' in d:
        return 'coronary_hearth_disease'
    else:
        return 'other'
diseases = [tokenize(s) for s in data.chronic_disease if s is not None]
diseases = [tokens for tokens in diseases if tokens is not None and len(tokens) > 0]
n = len(diseases)
diseases = list(itertools.chain(*[item for item in diseases]))
key = lambda s: map_to_disease(s)
diseases = sorted([(k, len(list(g))) for k, g in itertools.groupby(sorted(diseases, key=key), key)], key=lambda tup: tup[1], reverse=True)
s = pd.Series([tup[1] for tup in diseases], [tup[0] for tup in diseases]).sort_index()
fig, ax = plt.subplots(figsize=(10, 5))
_ = s.plot(kind='bar', ax=ax, title=f'Chronic Diseases, n={n}')
for i in ax.patches:
    x = i.get_x() + 0.10
    y = i.get_height() + 0.5
    p = i.get_height() / n * 100.0
    s = f'{i.get_height()} ({p:.1f}%)'
    ax.text(x, y, s, fontsize=10, color='red')
1.5. Symptoms
Symptoms are also grouped. Note that the grouping approach is far from medically rigorous; this exercise merely aggregates raw values that appear similar in spelling or meaning. Still, the groupings do cover some of the major symptoms reportedly associated with COVID-19 (reference needed), namely
fever,
coughing, and
joint/muscle pain.
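The raw-to-group mapping lives in symptom-mapping.json, which is loaded in the next cell but not reproduced here. Judging from how map_to_symptoms uses it, the file is expected to map each canonical symptom name to a list of raw spellings; the excerpt below is purely hypothetical, and the actual keys and spellings in the file may differ.

# hypothetical illustration of the expected structure of symptom-mapping.json
example_symptom_map = {
    'fever': ['fever', 'low fever', 'mild fever'],
    'dry_cough': ['dry cough', 'cough', 'coughing'],
    'muscle': ['muscle pain', 'joint pain', 'myalgia'],
}
# map_to_symptoms (next cell) returns every key whose list contains the raw symptom string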
[11]:
import json
import itertools
with open('./covid/data/00/symptom-mapping.json', 'r') as f:
    symptom_map = json.load(f)
def map_to_symptoms(s):
    if s.startswith('fever') or s.startswith('low fever'):
        return ['fever']
    return [k for k, v in symptom_map.items() if s in v]
symptoms = [tokenize(s) for s in data.symptoms if s is not None]
symptoms = [tokens for tokens in symptoms if tokens is not None and len(tokens) > 0]
n = len(symptoms)
symptoms = list(itertools.chain(*[item for item in symptoms]))
symptoms = list(itertools.chain(*[map_to_symptoms(s) for s in symptoms]))
key = lambda s: s
symptoms = sorted([(k, len(list(g))) for k, g in itertools.groupby(sorted(symptoms, key=key), key) if k != 'other'], key=lambda tup: tup[1], reverse=True)
s = pd.Series([tup[1] for tup in symptoms], [tup[0] for tup in symptoms]).sort_index()
fig, ax = plt.subplots(2, 1, figsize=(20, 10))
_ = s.plot(kind='bar', ax=ax[0], title=f'Symptoms by percentage, n={n}')
_ = s.plot(kind='bar', ax=ax[1], title=f'Symptoms by counts, n={n}')
for i in ax[0].patches:
    x = i.get_x() + 0.0
    y = i.get_height() + 10.0
    p = i.get_height() / n * 100.0
    s = f'{p:.1f}%'
    ax[0].text(x, y, s, fontsize=10, color='red')
for i in ax[1].patches:
    x = i.get_x() + 0.0
    y = i.get_height() + 10.0
    p = i.get_height() / n * 100.0
    s = f'{i.get_height()}'
    ax[1].text(x, y, s, fontsize=10, color='red')
plt.tight_layout()
1.6. Wuhan Residency
Wuhan is the capital of the Hubei province in China and the original epicenter of the COVID-19 outbreak. This data is mainly focused on patients infected in Hubei. Let’s see how many patients lived in Wuhan.
[12]:
def map_wuhan_residency(v):
    if v is None or pd.isna(v):
        return None
    r = v.lower()
    if 'no' in r:
        return 'no'
    return 'yes'
s = data.lives_in_Wuhan.apply(lambda v: map_wuhan_residency(v))
s = s[s != None]
s = s.value_counts()
n = s.sum()
fig, ax = plt.subplots(figsize=(10, 5))
_ = s.plot(kind='bar', ax=ax, title=f'Wuhan Residency, n={n}')
for i in ax.patches:
    x = i.get_x() + 0.20
    y = i.get_height() + 5.0
    p = i.get_height() / n * 100.0
    s = f'{i.get_height()} ({p:.1f}%)'
    ax.text(x, y, s, fontsize=10, color='red')
1.7. Travel History
Let’s see how many people recorded travel. Note that we assume a record without travel dates means the patient did NOT travel; this assumption is likely wrong. With it, less than 2% of the patients have a recorded travel history.
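As with chronic disease, a quick check of how sparse the underlying field is can support this; a minimal snippet, not part of the original analysis:

# fraction of records that list any travel_history_dates value at all
print(f'{data.travel_history_dates.notna().mean():.2%} of records have travel dates recorded')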
[13]:
def map_travel(v):
    if v is None or pd.isna(v):
        return 0
    return 1
s = data.travel_history_dates.apply(lambda v: map_travel(v))
s = s[s != None]
s = s.value_counts()
n = s.sum()
fig, ax = plt.subplots(figsize=(10, 5))
_ = s.plot(kind='bar', ax=ax, title=f'Travel History, n={n:,}')
for i in ax.patches:
    x = i.get_x() + 0.15
    y = i.get_height() + 1000.0
    p = i.get_height() / n * 100.0
    s = f'{i.get_height():,} ({p:.1f}%)'
    ax.text(x, y, s, fontsize=10, color='red')
1.8. Outcomes
Surprisingly, fewer than 20 cases were reported to end in death, while the vast majority of patients were discharged, followed by those listed in stable condition.
[14]:
def map_outcome(v):
    if v is None:
        return None
    if isinstance(v, float):
        return None
    o = v.lower()
    if 'discharge' in o:
        return 'discharged'
    elif 'stable' in o:
        return 'stable'
    elif 'death' in o or 'dead' in o:
        return 'death'
    elif 'recov' in o:
        return 'recovered'
    else:
        return 'other'
s = data.outcome.apply(lambda v: map_outcome(v)).value_counts().sort_index()
n = s.sum()
fig, ax = plt.subplots(figsize=(15, 5))
_ = s.plot(kind='bar', ax=ax, title=f'COVID Outcomes, n={n:,}')
for i in ax.patches:
    x = i.get_x() + 0.05
    y = i.get_height() + 2.0
    p = i.get_height() / n * 100.0
    s = f'{i.get_height()} ({p:.1f}%)'
    ax.text(x, y, s, fontsize=10, color='red')
1.9. Modeling
Let’s try to model how age, gender, chronic diseases, and symptoms influence the prognosis of death. Filtering through the data, we lose ever more records: from over 47,000 down to a little over 200. Without further data manipulation, we are left with only 205 of 310 (about 66%) patients with a known reported outcome for the analysis. We also transform the data to be nearly all binary.
[15]:
data[['age', 'sex', 'chronic_disease', 'symptoms', 'outcome']].shape
[15]:
(47201, 5)
[16]:
data[['age', 'sex', 'chronic_disease', 'symptoms', 'outcome']].dropna(how='all').shape
[16]:
(3605, 5)
[17]:
def get_single_age_group(x):
    groups = get_age_group(x)
    return None if len(groups) < 1 else groups[0]
d = data[['age', 'sex', 'chronic_disease', 'symptoms', 'outcome', 'travel_history_dates', 'lives_in_Wuhan']].dropna(how='all').copy(deep=True)
d.age = d.age.apply(lambda x: get_single_age_group(x))
d = d[pd.notna(d.age)]
d.age = d.age.apply(lambda a: int(a.split('_')[0]))
print(d.shape)
d.sex = d.sex.apply(lambda v: 1 if v == 'male' else 0)
d.outcome = d.outcome.apply(lambda v: map_outcome(v))
d = d[pd.notna(d.outcome)]
d.outcome = d.outcome.apply(lambda v: 1 if v == 'death' else 0)
print(d.shape)
d.chronic_disease = d.chronic_disease.apply(lambda s: tokenize(s))
d.chronic_disease = d.chronic_disease.apply(lambda tokens: [map_to_disease(d) for d in tokens] if tokens is not None else None)
d.symptoms = d.symptoms.apply(lambda s: tokenize(s))
d.symptoms = d.symptoms.apply(lambda tokens: [map_to_symptoms(s) for s in tokens] if tokens is not None else None)
d.symptoms = d.symptoms.apply(lambda arrs: None if arrs is None else list(itertools.chain(*arrs)))
d['hypertension'] = d.chronic_disease.apply(lambda arr: 0 if arr is None else 1 if 'hypertension' in arr else 0)
d['chd'] = d.chronic_disease.apply(lambda arr: 0 if arr is None else 1 if 'coronary_hearth_disease' in arr else 0)
d['diabetes'] = d.chronic_disease.apply(lambda arr: 0 if arr is None else 1 if 'diabetes' in arr else 0)
for s in symptom_map.keys():
    d[s] = d.symptoms.apply(lambda arr: 0 if arr is None else 1 if s in arr else 0)
d['wuhan'] = d.lives_in_Wuhan.apply(lambda v: 1 if map_wuhan_residency(v) == 'yes' else 0)
d['travel'] = d.travel_history_dates.apply(lambda v: map_travel(v))
d = d.drop(['chronic_disease', 'symptoms', 'travel_history_dates', 'lives_in_Wuhan'], axis=1)
print(d.shape)
d
(1674, 7)
(205, 7)
(205, 40)
[17]:
age | sex | outcome | hypertension | chd | diabetes | anosmia | sputum | muscle | chills | ... | discharge | dizziness | dryness | digestive | coronary | mild | myelofibrosis | sweating | wuhan | travel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 60 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 40 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 | 80 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
21840 | 90 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
23099 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
23100 | 30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
23102 | 60 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
24528 | 70 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
205 rows × 40 columns
[18]:
d.describe()
[18]:
age | sex | outcome | hypertension | chd | diabetes | anosmia | sputum | muscle | chills | ... | discharge | dizziness | dryness | digestive | coronary | mild | myelofibrosis | sweating | wuhan | travel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 205.000000 | 205.00000 | 205.000000 | 205.000000 | 205.000000 | 205.000000 | 205.0 | 205.000000 | 205.000000 | 205.000000 | ... | 205.0 | 205.000000 | 205.0 | 205.0 | 205.0 | 205.0 | 205.000000 | 205.0 | 205.000000 | 205.000000 |
mean | 44.292683 | 0.57561 | 0.048780 | 0.073171 | 0.034146 | 0.058537 | 0.0 | 0.019512 | 0.024390 | 0.029268 | ... | 0.0 | 0.004878 | 0.0 | 0.0 | 0.0 | 0.0 | 0.004878 | 0.0 | 0.317073 | 0.278049 |
std | 20.889818 | 0.49546 | 0.215936 | 0.261054 | 0.182050 | 0.235330 | 0.0 | 0.138655 | 0.154635 | 0.168970 | ... | 0.0 | 0.069843 | 0.0 | 0.0 | 0.0 | 0.0 | 0.069843 | 0.0 | 0.466475 | 0.449134 |
min | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
25% | 30.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
50% | 40.000000 | 1.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
75% | 60.000000 | 1.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 1.000000 | 1.000000 |
max | 90.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | 1.000000 | 1.000000 | ... | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | 0.0 | 1.000000 | 1.000000 |
8 rows × 40 columns
Note the class imbalance, where the positive class is about 5%!
[19]:
d.outcome.value_counts()
[19]:
0 195
1 10
Name: outcome, dtype: int64
1.9.1. Bivariate Correlation Visualization
Here, we use the Jaccard similarity and the point-biserial correlation to evaluate the strength of associations between binary-binary and continuous-binary pairs of variables. Diabetes and fever have comparatively high correlations with hypertension and cough, respectively. Age also has scattered, relatively high correlations with other variables.
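For reference, the two measures, written in the notation of the code below (s_11, s_01, s_10 are the counts of the joint binary patterns; m_1 and m_0 are the means of the continuous variable in each group and s its standard deviation; n_1, n_0, n are the group and total counts), are

$$J = \frac{s_{11}}{s_{11} + s_{01} + s_{10}}, \qquad r_{pb} = \frac{m_1 - m_0}{s}\sqrt{\frac{n_1 n_0}{n^2}}$$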
[20]:
def get_jaccard_sim(df, c1, c2):
    s_11 = len(df[(df[c1] == 1) & (df[c2] == 1)])
    s_01 = len(df[(df[c1] == 0) & (df[c2] == 1)])
    s_10 = len(df[(df[c1] == 1) & (df[c2] == 0)])
    denominator = (s_11 + s_01 + s_10)
    return s_11 / denominator if denominator > 0 else 0.0
def get_point_biserial(df, c, d):
    n_0 = 0
    n_1 = 0
    c_0 = []
    c_1 = []
    for _, r in df.iterrows():
        d_v = r[d]
        c_v = r[c]
        if pd.isna(d_v) or pd.isna(c_v):
            continue
        if d_v == 0:
            n_0 += 1
            c_0.append(c_v)
        elif d_v == 1:
            n_1 += 1
            c_1.append(c_v)
    n = n_0 + n_1
    a = c_0 + c_1
    c_0 = pd.Series(c_0)
    c_1 = pd.Series(c_1)
    a = pd.Series(a)
    m_0 = c_0.mean()
    m_1 = c_1.mean()
    s = a.std()
    r = (m_1 - m_0) / s if s != 0.0 else 0.0
    r = r * np.sqrt((n_1 * n_0) / (n * n))
    return r
cols = d.columns
n = len(cols)
corr = np.zeros((n, n))
for i, c1 in enumerate(cols):
    for j, c2 in enumerate(cols):
        if j < i:
            continue
        if c1 == 'age':
            r = get_point_biserial(d, c1, c2)
        else:
            r = get_jaccard_sim(d, c1, c2)
        corr[i][j] = corr[j][i] = r
[21]:
import seaborn as sns
fig, ax = plt.subplots(figsize=(12, 10))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
_ = sns.heatmap(pd.DataFrame(corr, columns=cols, index=cols), ax=ax, mask=mask, linewidths=.5)
bottom, top = ax.get_ylim()
_ = ax.set_ylim(bottom + 0.5, top - 0.5)
_ = ax.set_title('Bivariate Heatmap of Associations between Transformed Variables')
1.9.2. Logistic Regression
A simple logistic regression model shows very unpromising results as measured by the area under the curve (AUC) for ROC (5%) and PR (3%). Furthermore, the sensitivity (recall on the positive class) is 10% and the specificity (recall on the negative class) is 99%; the classifier is specific, but not sensitive.
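For reference, in terms of confusion-matrix counts,

$$\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}$$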
[22]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score, classification_report, accuracy_score
from imblearn.metrics import sensitivity_score, specificity_score
X = d.drop('outcome', axis=1)
y = d.outcome
print(X.shape, y.shape)
(205, 39) (205,)
[23]:
model = LogisticRegression(solver='lbfgs', penalty='l2', max_iter=10000, random_state=37)
model.fit(X, y)
y_pred = model.predict(X)
y_prob = model.predict_proba(X)
acc = accuracy_score(y, y_pred)
sen = sensitivity_score(y, y_pred)
spe = specificity_score(y, y_pred)
roc = roc_auc_score(y, y_prob[:,0])
pr = average_precision_score(y, y_prob[:,0])
print(f'acc={acc:.5f}, sen={sen:.5f}, spe={spe:.5f}, roc={roc:.5f}, pr={pr:.5f}')
print(classification_report(y, y_pred))
acc=0.94634, sen=0.10000, spe=0.98974, roc=0.04410, pr=0.02843
precision recall f1-score support
0 0.96 0.99 0.97 195
1 0.33 0.10 0.15 10
accuracy 0.95 205
macro avg 0.64 0.54 0.56 205
weighted avg 0.93 0.95 0.93 205
The only variable with a harmful effect on prognosis is age.
[24]:
s = pd.Series(model.coef_[0], X.columns)
fig, ax = plt.subplots(figsize=(15, 5))
_ = s.plot(kind='bar', ax=ax, title='Logistic Regression Coefficients')
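Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which can be easier to interpret; this is a small add-on to the original analysis.

# odds ratios: values above 1 suggest higher odds of death, values below 1 lower odds
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns).sort_values(ascending=False)
print(odds_ratios.head(10))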
The AUC values for ROC and PR above were computed against the probability of the negative class (column 0); scoring against the positive-class probability (column 1) instead gives much better results.
[25]:
roc = roc_auc_score(y, y_prob[:,1])
pr = average_precision_score(y, y_prob[:,1])
print(f'roc={roc:.5f}, pr={pr:.5f}')
roc=0.95590, pr=0.39988
1.9.3. Oversampling
Let’s try some oversampling techniques; a rough sketch of the core SMOTE idea follows the list.
Random oversampling
SMOTE
ADASYN
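Before running them, here is a minimal sketch of the core SMOTE idea: a synthetic minority sample is an interpolation between a minority point and one of its nearest minority neighbors. This is only an illustration under simplifying assumptions, not the imblearn implementation.

from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=37):
    # X_min: NumPy array of minority-class rows; n_new: number of synthetic rows to create
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)      # idx[:, 0] is each point itself
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))   # pick a minority sample at random
        j = rng.choice(idx[i, 1:])     # pick one of its k nearest minority neighbors
        lam = rng.random()             # interpolation factor in [0, 1)
        new_rows.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new_rows)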
[26]:
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
X = d.drop('outcome', axis=1)
y = d.outcome
print(X.shape, y.shape)
ros = RandomOverSampler(random_state=37)
X, y = ros.fit_resample(X, y)
print(sorted(Counter(y).items()))
(205, 39) (205,)
[(0, 195), (1, 195)]
[27]:
model = LogisticRegression(solver='lbfgs', penalty='l2', max_iter=10000, random_state=37)
model.fit(X, y)
y_pred = model.predict(X)
y_prob = model.predict_proba(X)
acc = accuracy_score(y, y_pred)
sen = sensitivity_score(y, y_pred)
spe = specificity_score(y, y_pred)
roc = roc_auc_score(y, y_prob[:,0])
pr = average_precision_score(y, y_prob[:,0])
print(f'acc={acc:.5f}, sen={sen:.5f}, spe={spe:.5f}, roc={roc:.5f}, pr={pr:.5f}')
print(classification_report(y, y_pred))
acc=0.94103, sen=1.00000, spe=0.88205, roc=0.04247, pr=0.34502
precision recall f1-score support
0 1.00 0.88 0.94 195
1 0.89 1.00 0.94 195
accuracy 0.94 390
macro avg 0.95 0.94 0.94 390
weighted avg 0.95 0.94 0.94 390
[28]:
roc = roc_auc_score(y, y_prob[:,1])
pr = average_precision_score(y, y_prob[:,1])
print(f'roc={roc:.5f}, pr={pr:.5f}')
roc=0.95753, pr=0.92515
[29]:
from imblearn.over_sampling import SMOTE, ADASYN
X = d.drop('outcome', axis=1)
y = d.outcome
print(X.shape, y.shape)
X, y = SMOTE().fit_resample(X, y)
print(sorted(Counter(y).items()))
model = LogisticRegression(solver='lbfgs', penalty='l2', max_iter=10000, random_state=37)
model.fit(X, y)
y_pred = model.predict(X)
y_prob = model.predict_proba(X)
acc = accuracy_score(y, y_pred)
sen = sensitivity_score(y, y_pred)
spe = specificity_score(y, y_pred)
roc = roc_auc_score(y, y_prob[:,0])
pr = average_precision_score(y, y_prob[:,0])
print(f'acc={acc:.5f}, sen={sen:.5f}, spe={spe:.5f}, roc={roc:.5f}, pr={pr:.5f}')
print(classification_report(y, y_pred))
(205, 39) (205,)
[(0, 195), (1, 195)]
acc=0.95385, sen=0.97436, spe=0.93333, roc=0.02886, pr=0.34509
precision recall f1-score support
0 0.97 0.93 0.95 195
1 0.94 0.97 0.95 195
accuracy 0.95 390
macro avg 0.95 0.95 0.95 390
weighted avg 0.95 0.95 0.95 390
[30]:
X = d.drop('outcome', axis=1)
y = d.outcome
print(X.shape, y.shape)
X, y = ADASYN().fit_resample(X, y)
print(sorted(Counter(y).items()))
model = LogisticRegression(solver='lbfgs', penalty='l2', max_iter=10000, random_state=37)
model.fit(X, y)
y_pred = model.predict(X)
y_prob = model.predict_proba(X)
acc = accuracy_score(y, y_pred)
sen = sensitivity_score(y, y_pred)
spe = specificity_score(y, y_pred)
roc = roc_auc_score(y, y_prob[:,0])
pr = average_precision_score(y, y_prob[:,0])
print(f'acc={acc:.5f}, sen={sen:.5f}, spe={spe:.5f}, roc={roc:.5f}, pr={pr:.5f}')
print(classification_report(y, y_pred))
(205, 39) (205,)
[(0, 195), (1, 195)]
acc=0.95128, sen=0.96923, spe=0.93333, roc=0.03074, pr=0.35215
precision recall f1-score support
0 0.97 0.93 0.95 195
1 0.94 0.97 0.95 195
accuracy 0.95 390
macro avg 0.95 0.95 0.95 390
weighted avg 0.95 0.95 0.95 390
1.9.4. Undersampling
We are curious to see what undersampling may do.
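For intuition, ClusterCentroids roughly replaces the majority class with the centroids of a k-means clustering whose number of clusters equals the minority-class count. The sketch below illustrates that idea under this assumption; it is not the imblearn implementation.

from sklearn.cluster import KMeans

def cluster_centroids_sketch(X_maj, n_minority, seed=37):
    # cluster the majority-class rows and keep only the cluster centers
    km = KMeans(n_clusters=n_minority, random_state=seed, n_init=10).fit(X_maj)
    return km.cluster_centers_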
[31]:
from imblearn.under_sampling import ClusterCentroids
X = d.drop('outcome', axis=1)
y = d.outcome
print(X.shape, y.shape)
X, y = ClusterCentroids(random_state=37).fit_resample(X, y)
print(sorted(Counter(y).items()))
model = LogisticRegression(solver='lbfgs', penalty='l2', max_iter=10000, random_state=37)
model.fit(X, y)
y_pred = model.predict(X)
y_prob = model.predict_proba(X)
acc = accuracy_score(y, y_pred)
sen = sensitivity_score(y, y_pred)
spe = specificity_score(y, y_pred)
roc = roc_auc_score(y, y_prob[:,0])
pr = average_precision_score(y, y_prob[:,0])
print(f'acc={acc:.5f}, sen={sen:.5f}, spe={spe:.5f}, roc={roc:.5f}, pr={pr:.5f}')
print(classification_report(y, y_pred))
(205, 39) (205,)
[(0, 10), (1, 10)]
acc=0.95000, sen=1.00000, spe=0.90000, roc=0.09000, pr=0.35476
precision recall f1-score support
0 1.00 0.90 0.95 10
1 0.91 1.00 0.95 10
accuracy 0.95 20
macro avg 0.95 0.95 0.95 20
weighted avg 0.95 0.95 0.95 20
1.9.5. K-Fold Cross Validation
Let’s do 5-fold cross validation, oversampling each training fold with SMOTE, and look at the average results for the following performance metrics: accuracy, sensitivity, specificity, and the ROC and PR AUCs computed against each probability column.
[32]:
from sklearn.model_selection import StratifiedKFold
X = d.drop('outcome', axis=1).to_numpy()
y = d.outcome.to_numpy()
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=37)
results = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    X_train, y_train = SMOTE().fit_resample(X_train, y_train)
    model = LogisticRegression(solver='lbfgs', penalty='l2', max_iter=10000, random_state=37)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)
    acc = accuracy_score(y_test, y_pred)
    sen = sensitivity_score(y_test, y_pred)
    spe = specificity_score(y_test, y_pred)
    roc = roc_auc_score(y_test, y_prob[:,0])
    pr = average_precision_score(y_test, y_prob[:,0])
    iroc = roc_auc_score(y_test, y_prob[:,1])
    ipr = average_precision_score(y_test, y_prob[:,1])
    metrics = [acc, sen, spe, roc, pr, iroc, ipr]
    print(','.join([f'{m:.5f}' for m in metrics]))
    results.append(metrics)
0.97561,1.00000,0.97436,0.01923,0.03721,0.98077,0.58333
0.95122,0.50000,0.97436,0.04487,0.03790,0.95513,0.45000
0.78049,0.50000,0.79487,0.17949,0.04316,0.82051,0.16234
0.87805,1.00000,0.87179,0.02564,0.03721,0.97436,0.50000
0.90244,1.00000,0.89744,0.03205,0.03782,0.96795,0.50000
The average performances from k-fold cross validation of the classifier are shown below.
[33]:
pd.DataFrame(results, columns=['acc', 'sen', 'spe', 'roc', 'pr', 'iroc', 'ipr']).mean()
[33]:
acc 0.897561
sen 0.800000
spe 0.902564
roc 0.060256
pr 0.038660
iroc 0.939744
ipr 0.439134
dtype: float64
1.10. Survival Analysis
Now we will estimate survival from the recorded date of onset of symptoms using Kaplan-Meier estimation and Cox Proportional Hazards (CPH).
1.10.1. Kaplan-Meier Survival Estimation
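For reference, the Kaplan-Meier estimator of the survival function is

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$

where d_i is the number of deaths observed at time t_i and n_i is the number of patients still at risk just before t_i.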
[34]:
from datetime import datetime
def get_start_date(r):
    if not pd.isna(r.date_onset_symptoms) and len(r.date_onset_symptoms) > 0:
        s = r.date_onset_symptoms
    elif not pd.isna(r.date_admission_hospital) and len(r.date_admission_hospital) > 0:
        s = r.date_admission_hospital
    else:
        s = r.date_confirmation
    s = s.replace('-', '').strip()
    s = s.split(' ')[0].strip()
    return s
def parse_date(v):
    s = v.split(' ')[0]
    if s.endswith('20202'):
        s = s[0:len(s) - 1]
    try:
        return datetime.strptime(s, '%d.%m.%Y')
    except:
        return datetime.strptime(s, '%m.%d.%Y')
d = data[['date_onset_symptoms', 'date_admission_hospital', 'date_confirmation', 'date_death_or_discharge', 'outcome']].copy(deep=True)
d.shape
# drop nulls
d = d.dropna(axis=0, how='all')
print(d.shape, 'all nulls')
d['start'] = d.apply(lambda r: get_start_date(r), axis=1)
d = d[pd.notna(d.start)]
d.start = d.start.apply(lambda s: parse_date(s))
print(d.shape, 'start')
d['stop'] = d.date_death_or_discharge
d = d[pd.notna(d.stop)]
d.stop = d.stop.apply(lambda s: parse_date(s))
print(d.shape, 'stop')
d.outcome = d.outcome.apply(lambda v: map_outcome(v))
d = d[pd.notna(d.outcome)]
d['E'] = d.outcome.apply(lambda v: 1 if v == 'death' else None if v is None else 0)
print(d.shape, 'outcome')
d['T'] = d.apply(lambda r: int((r.stop - r.start).days), axis=1)
d['T'] = d['T'].apply(lambda t: abs(t))
d = d[d['T'] >= 0]
print(d.shape, 'time')
d = d.drop(['date_onset_symptoms', 'date_admission_hospital', 'date_confirmation', 'date_death_or_discharge', 'outcome', 'start', 'stop'], axis=1)
print(d.shape, 'final')
(46240, 5) all nulls
(46240, 6) start
(212, 7) stop
(210, 8) outcome
(210, 9) time
(210, 2) final
[35]:
from lifelines import KaplanMeierFitter
T = d['T']
E = d.E
kmf = KaplanMeierFitter()
_ = kmf.fit(T, event_observed=E)
The median survival time is the time at which 50% of the patients have suffered the outcome. Here, the median survival time is infinite, meaning patients have a good chance of surviving the disease. For the patients in this part of the study, all deaths occurred within 41 days; after that point, no further deaths were observed.
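The fitted object exposes this estimate directly, assuming a reasonably recent lifelines version that provides the median_survival_time_ attribute.

# the smallest t with S(t) <= 0.5; here it comes out as inf
print(kmf.median_survival_time_)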
[36]:
# kmf.timeline
[37]:
# kmf.cumulative_density_
[38]:
fig, ax = plt.subplots(figsize=(10, 5))
_ = kmf.plot(ax=ax, title='Kaplan-Meier Survival Estimation')
1.10.2. Cox Proportional Hazard
Now we can use survival regression with CPH to see how the covariates influence survival. Note that the optimization used to estimate these coefficients runs into convergence issues, so the results are not to be trusted.
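For reference, CPH models the hazard as

$$h(t \mid x) = h_0(t)\,\exp(\beta^\top x)$$

so that exp(beta_j), the exp(coef) column in the summary below, is the hazard ratio associated with a one-unit increase in covariate x_j.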
[39]:
d = data[['age', 'sex', 'chronic_disease', 'symptoms',
'date_onset_symptoms', 'date_admission_hospital', 'date_confirmation', 'date_death_or_discharge', 'outcome',
'lives_in_Wuhan', 'travel_history_dates']].dropna(how='all').copy(deep=True)
print(d.shape, 'original')
d.outcome = d.outcome.apply(lambda v: map_outcome(v))
d = d[pd.notna(d.outcome)]
d['E'] = d.outcome.apply(lambda v: 1 if v == 'death' else None if v is None else 0)
print(d.shape, d.E.sum(), 'outcome')
# impute the average age, if needed
avg_age = d.age.apply(lambda x: get_single_age_group(x))
avg_age = np.array([int(a.split('_')[0]) for a in avg_age if pd.notna(a)])
avg_age = avg_age.mean()
d.age = d.age.apply(lambda x: get_single_age_group(x))
d.age = d.age.apply(lambda a: int(a.split('_')[0]) if pd.notna(a) else avg_age)
print(d.shape, d.E.sum(), 'age')
d.sex = d.sex.apply(lambda v: 1 if v == 'male' else 0)
print(d.shape, d.E.sum(), 'sex')
# chronic diseases
d.chronic_disease = d.chronic_disease.apply(lambda s: tokenize(s))
d.chronic_disease = d.chronic_disease.apply(lambda tokens: [map_to_disease(d) for d in tokens] if tokens is not None else None)
print(d.shape, d.E.sum(), 'chronic disease')
d['hypertension'] = d.chronic_disease.apply(lambda arr: 0 if arr is None else 1 if 'hypertension' in arr else 0)
d['chd'] = d.chronic_disease.apply(lambda arr: 0 if arr is None else 1 if 'coronary_hearth_disease' in arr else 0)
d['diabetes'] = d.chronic_disease.apply(lambda arr: 0 if arr is None else 1 if 'diabetes' in arr else 0)
print(d.shape, d.E.sum(), 'chronic disease 2')
# symptoms
d.symptoms = d.symptoms.apply(lambda s: tokenize(s))
d.symptoms = d.symptoms.apply(lambda tokens: [map_to_symptoms(s) for s in tokens] if tokens is not None else None)
d.symptoms = d.symptoms.apply(lambda arrs: None if arrs is None else list(itertools.chain(*arrs)))
print(d.shape, d.E.sum(), 'symptoms')
for s in symptom_map.keys():
    d[s] = d.symptoms.apply(lambda arr: 0 if arr is None else 1 if s in arr else 0)
print(d.shape, d.E.sum(), 'symptoms 2')
d['wuhan'] = d.lives_in_Wuhan.apply(lambda v: 1 if map_wuhan_residency(v) == 'yes' else 0)
d['travel'] = d.travel_history_dates.apply(lambda v: map_travel(v))
print(d.shape, d.E.sum(), 'wuhan, travel')
# compute the duration
# assumption: negative durations are turned into absolute value
d['start'] = d.apply(lambda r: get_start_date(r), axis=1)
d = d[pd.notna(d.start)]
d.start = d.start.apply(lambda s: parse_date(s))
print(d.shape, d.E.sum(), 'start')
d['stop'] = d.date_death_or_discharge
d = d[pd.notna(d.stop)]
d.stop = d.stop.apply(lambda s: parse_date(s))
print(d.shape, d.E.sum(), 'stop')
d['T'] = d.apply(lambda r: int((r.stop - r.start).days), axis=1)
d['T'] = d['T'].apply(lambda t: abs(t))
d = d[d['T'] >= 0]
print(d.shape, d.E.sum(), 'duration')
d = d.drop(['date_onset_symptoms', 'date_admission_hospital', 'date_confirmation', 'date_death_or_discharge',
'outcome', 'start', 'stop', 'lives_in_Wuhan', 'chronic_disease', 'symptoms',
'travel_history_dates'], axis=1)
print(d.shape, d.E.sum(), 'final')
(46251, 11) original
(310, 12) 17 outcome
(310, 12) 17 age
(310, 12) 17 sex
(310, 12) 17 chronic disease
(310, 15) 17 chronic disease 2
(310, 15) 17 symptoms
(310, 47) 17 symptoms 2
(310, 49) 17 wuhan, travel
(310, 50) 17 start
(210, 51) 17 stop
(210, 52) 17 duration
(210, 41) 17 final
[40]:
s = d.sum()
remove_cols = [c for c, v in zip(s.index, s.values) if v <= 0 and c != 'E']
d = d.drop(remove_cols, axis=1)
print(d.shape, d.E.sum(), 'remove zero cols')
(210, 28) 17 remove zero cols
[41]:
from lifelines import CoxPHFitter
import warnings
warnings.filterwarnings('ignore')
cph = CoxPHFitter()
cph.fit(d, duration_col='T', event_col='E')
cph.print_summary()
model | lifelines.CoxPHFitter |
---|---|
duration col | 'T' |
event col | 'E' |
baseline estimation | breslow |
number of observations | 210 |
number of events observed | 17 |
partial log-likelihood | -61.08 |
time fit was run | 2020-03-27 00:30:35 UTC |
coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | z | p | -log2(p) | |
---|---|---|---|---|---|---|---|---|---|---|
age | 0.07 | 1.07 | 0.02 | 0.04 | 0.10 | 1.04 | 1.11 | 4.50 | <0.005 | 17.20 |
sex | 0.41 | 1.51 | 0.60 | -0.76 | 1.59 | 0.47 | 4.89 | 0.69 | 0.49 | 1.03 |
hypertension | -34.77 | 0.00 | 9070.25 | -17812.14 | 17742.60 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
chd | -4.05 | 0.02 | 14542.30 | -28506.44 | 28498.34 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
diabetes | 18.83 | 1.51e+08 | 5619.11 | -10994.41 | 11032.07 | 0.00 | inf | 0.00 | 1.00 | 0.00 |
sputum | 0.74 | 2.09 | 22961.32 | -45002.62 | 45004.09 | 0.00 | inf | 0.00 | 1.00 | 0.00 |
muscle | -3.14 | 0.04 | 56755.10 | -111241.10 | 111234.82 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
chills | -15.93 | 0.00 | 55930.73 | -109638.15 | 109606.29 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
fever | -0.40 | 0.67 | 1.16 | -2.68 | 1.88 | 0.07 | 6.56 | -0.34 | 0.73 | 0.45 |
fatigue | -17.99 | 0.00 | 8432.34 | -16545.07 | 16509.10 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
headache | -15.63 | 0.00 | 55073.47 | -107957.64 | 107926.38 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
sore_throat | -15.78 | 0.00 | 18207.89 | -35702.60 | 35671.04 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
dry_cough | -18.43 | 0.00 | 5619.11 | -11031.67 | 10994.82 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
diarrhoea | -37.03 | 0.00 | 17713.93 | -34755.69 | 34681.62 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
dyspnea | 1.74 | 5.68 | 1.12 | -0.47 | 3.94 | 0.63 | 51.45 | 1.55 | 0.12 | 3.03 |
nausea | 31.75 | 6.14e+13 | 77312.72 | -151498.40 | 151561.89 | 0.00 | inf | 0.00 | 1.00 | 0.00 |
running_nose | -0.84 | 0.43 | 52675.52 | -103242.95 | 103241.27 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
respiratory | -20.33 | 0.00 | 39733.65 | -77896.86 | 77856.19 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
malaise | -17.17 | 0.00 | 81042.45 | -158857.46 | 158823.12 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
anorexia | 20.35 | 6.85e+08 | 8432.34 | -16506.74 | 16547.43 | 0.00 | inf | 0.00 | 1.00 | 0.00 |
chest | 3.82 | 45.72 | 10171.63 | -19932.21 | 19939.86 | 0.00 | inf | 0.00 | 1.00 | 0.00 |
eye | 17.90 | 5.94e+07 | 68830.34 | -134887.08 | 134922.88 | 0.00 | inf | 0.00 | 1.00 | 0.00 |
dizziness | -20.00 | 0.00 | 73440.12 | -143959.99 | 143919.99 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
myelofibrosis | -20.54 | 0.00 | 81042.45 | -158860.83 | 158819.74 | 0.00 | inf | -0.00 | 1.00 | 0.00 |
wuhan | -0.63 | 0.53 | 1.19 | -2.96 | 1.70 | 0.05 | 5.47 | -0.53 | 0.60 | 0.75 |
travel | -0.94 | 0.39 | 1.10 | -3.09 | 1.21 | 0.05 | 3.35 | -0.86 | 0.39 | 1.35 |
Log-likelihood ratio test | 44.71 on 26 df, -log2(p)=6.30 |
---|