19. One-Way ANOVA

One-way Analysis of Variance (ANOVA) is a multiple comparison technique testing whether the means of 3 or more groups are different. The hypotheses are as follows.

  • \(H_0\) (null): There is no difference between the means; \(\mu_1 = \mu_2 = \ldots = \mu_n\)

  • \(H_a\) (alternative): There is at least one group mean that differs from the overall mean.

ANOVA uses the F-test for testing statistical significance, which measures the ratio of the variance between groups to the variance within groups.

When there are only 2 groups, a t-test may be used instead. However, with comparing 3 or more groups, ANOVA should be used since it corrects for the family-wise error rate FWER holding the probability of a Type I error steady no matter the number of comparisons.

In this notebook, let’s see how a 1-way ANOVA relates to regression problems.

19.1. Data

Let’s generate 50 samples from 3 different normal distributions. These normal distributions are defined as follows.

  • \(A \sim \mathcal{N} (0, 1)\)

  • \(B \sim \mathcal{N} (1, 1)\)

  • \(C \sim \mathcal{N} (1.5, 1)\)

These groups are often called treatments.

[1]:
import numpy as np
import pandas as pd

np.random.seed(37)

n = 50

df = pd.concat([
    pd.DataFrame({'group': 'A', 'y': pd.Series(np.random.normal(0, 1, n))}),
    pd.DataFrame({'group': 'B', 'y': pd.Series(np.random.normal(1, 1, n))}),
    pd.DataFrame({'group': 'C', 'y': pd.Series(np.random.normal(1.5, 1, n))})
])
[2]:
df.head()
[2]:
group y
0 A -0.054464
1 A 0.674308
2 A 0.346647
3 A -1.300346
4 A 1.518512

19.2. Means

You can see that the empirical means and standard deviations are close to the real ones, but not exactly.

[3]:
df.groupby(['group']).agg(['mean', 'std'])
[3]:
y
mean std
group
A 0.139065 0.998983
B 0.982454 1.057929
C 1.552792 1.052197
[4]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(1, 2, figsize=(7, 3))

sns.boxplot(df, x='group', y='y', ax=ax[0])
sns.kdeplot(df, hue='group', x='y', ax=ax[1])

fig.tight_layout()
_images/anova_6_0.png

19.3. t-test

Let’s see what happens when we conduct all pairwise t-tests. You can see that group A is very different from B and C, but groups B and C are comparatively close.

[5]:
from scipy import stats
from scipy import special
import itertools

def do_test(pair):
    x, y = pair

    x = df[df['group']==x]['y']
    y = df[df['group']==y]['y']

    t = stats.ttest_ind(x, y)
    s = t.statistic
    p = t.pvalue
    a = 0.05 / special.comb(3, 2) # bonferroni's correction, however, losing power

    return {
        'x': pair[0],
        'y': pair[1],
        'statistic': s,
        'pvalue': p,
        'alpha': a,
        'is_sig': p < a
    }

t_test = itertools.combinations(df['group'].unique(), 2)
t_test = map(lambda tup: do_test(tup), t_test)
t_test = pd.DataFrame(t_test)
t_test
[5]:
x y statistic pvalue alpha is_sig
0 A B -4.098587 8.576614e-05 0.016667 True
1 A C -6.889943 5.417835e-10 0.016667 True
2 B C -2.702849 8.104302e-03 0.016667 True

19.4. 1-way ANOVA

Now we will apply a 1-way ANOVA. The report F statistic and p-value are shown. The test would lead us to reject the null hypothesis.

[6]:
F, p = stats.f_oneway(*[df[df['group']==g]['y'] for g in ['A', 'B', 'C']])
[7]:
F
[7]:
23.533902104682596
[8]:
p
[8]:
1.3590719185638985e-09

19.5. Tukey’s HSD

Here is a post-hoc test using Tukey’s Honestly Significant Difference test. This test will see which two pair of groups are different.

[9]:
hsd = stats.tukey_hsd(*[df[df['group']==g]['y'] for g in ['A', 'B', 'C']])
[10]:
hsd.statistic
[10]:
array([[ 0.        , -0.84338966, -1.41372695],
       [ 0.84338966,  0.        , -0.57033729],
       [ 1.41372695,  0.57033729,  0.        ]])
[11]:
hsd.pvalue
[11]:
array([[1.00000000e+00, 2.26787268e-04, 6.66437017e-10],
       [2.26787268e-04, 1.00000000e+00, 1.82679298e-02],
       [6.66437017e-10, 1.82679298e-02, 1.00000000e+00]])
[12]:
hsd._nobs
[12]:
3
[13]:
hsd._ntreatments
[13]:
150
[14]:
hsd._stand_err
[14]:
0.14661287403767

19.6. Xy, means

Before we move onto regression, keep in mind the group means and overall mean.

[15]:
df.groupby(['group']).agg(['mean', 'std'])
[15]:
y
mean std
group
A 0.139065 0.998983
B 0.982454 1.057929
C 1.552792 1.052197
[16]:
df['y'].mean()
[16]:
0.8914368828670105

19.7. Regression

Here, we regress y ~ g but with g dummy-encoded (one-hot encoded).

[17]:
X = df.drop(columns=['y'])
y = df['y']
[18]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit(X)
[18]:
OneHotEncoder()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
[19]:
X = pd.DataFrame(
    ohe.transform(X).todense(),
    columns=ohe.get_feature_names_out()
)
X
[19]:
group_A group_B group_C
0 1.0 0.0 0.0
1 1.0 0.0 0.0
2 1.0 0.0 0.0
3 1.0 0.0 0.0
4 1.0 0.0 0.0
... ... ... ...
145 0.0 0.0 1.0
146 0.0 0.0 1.0
147 0.0 0.0 1.0
148 0.0 0.0 1.0
149 0.0 0.0 1.0

150 rows × 3 columns

[20]:
from sklearn.linear_model import LinearRegression

m = LinearRegression()
m.fit(X.drop(columns=['group_A']), y)
[20]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
[21]:
m.intercept_, m.coef_
[21]:
(0.13906468043625642, array([0.84338966, 1.41372695]))
[22]:
m.coef_
[22]:
array([0.84338966, 1.41372695])

19.8. Means vs coefficients

If you notice, the intercept of the regression model is the mean of A. If you add the coefficients to the intercept, you recover the means of B and C!

[23]:
df.groupby(['group']).mean()
[23]:
y
group
A 0.139065
B 0.982454
C 1.552792
[24]:
m.intercept_, m.coef_ + m.intercept_
[24]:
(0.13906468043625642, array([0.98245434, 1.55279163]))

19.9. Test statistics vs coefficients

The coefficients are the test statistics Tukey’s HSD!

[25]:
hsd.statistic
[25]:
array([[ 0.        , -0.84338966, -1.41372695],
       [ 0.84338966,  0.        , -0.57033729],
       [ 1.41372695,  0.57033729,  0.        ]])
[26]:
m.coef_
[26]:
array([0.84338966, 1.41372695])

When using Tukey’s HSD, the p-values are given.

[27]:
hsd.pvalue
[27]:
array([[1.00000000e+00, 2.26787268e-04, 6.66437017e-10],
       [2.26787268e-04, 1.00000000e+00, 1.82679298e-02],
       [6.66437017e-10, 1.82679298e-02, 1.00000000e+00]])

We can also compute the p-values as below. The main complication is with computing the standard error SE.

[28]:
def get_se(df):
    '''
    Computes standard error.
    Adapted from https://github.com/scipy/scipy/blob/v1.10.1/scipy/stats/_hypotests.py.
    When groups are unequal, this code should be adjusted too.
    '''
    _v = np.array([np.var(df[df['group']==g]['y'], ddof=1) for g in df['group'].unique()])
    _s = np.array([df[df['group']==g].shape[0] - 1 for g in df['group'].unique()])
    _n = df.shape[0] - df['group'].unique().shape[0]
    _mse = np.sum(_v * _s) / _n
    _norm = 2 / df[df['group']=='A'].shape[0]
    _se = np.sqrt(_norm * _mse / 2)

    return _se

def get_pvalue(m_0, m_1, se, k, dof):
    diff = np.abs(m_0 - m_1)
    t = diff / se
    print(f'{t=}, {k=}, {dof=}')
    return stats.distributions.studentized_range.sf(t, k, dof)
[29]:
_mean = df.groupby(['group']).mean()['y'].to_numpy()
_se = get_se(df)
_k = df['group'].unique().shape[0]
_dof = df.shape[0] - _k
[30]:
_mean
[30]:
array([0.13906468, 0.98245434, 1.55279163])
[31]:
_se
[31]:
0.14661287403767
[32]:
_k
[32]:
3
[33]:
_dof
[33]:
147
[34]:
get_pvalue(_mean[0], _mean[1], _se, _k, _dof)
t=5.752493850354061, k=3, dof=147
[34]:
0.00022678726812552785
[35]:
get_pvalue(_mean[0], _mean[2], _se, _k, _dof)
t=9.64258398375435, k=3, dof=147
[35]:
6.664370166831191e-10
[36]:
get_pvalue(_mean[1], _mean[2], _se, _k, _dof)
t=3.8900901334002893, k=3, dof=147
[36]:
0.01826792981617653

Here’s yet another way to compute the p-values.

[37]:
1 - stats.studentized_range.cdf(np.abs(_mean[0] - _mean[1]) / _se, _k, _dof)
[37]:
0.00022678726812552785
[38]:
1 - stats.studentized_range.cdf(np.abs(_mean[0] - _mean[2]) / _se, _k, _dof)
[38]:
6.664370166831191e-10
[39]:
1 - stats.studentized_range.cdf(np.abs(_mean[1] - _mean[2]) / _se, _k, _dof)
[39]:
0.01826792981617653