1. Correlation vs Regression Coefficient
For two variables \(X\) and \(Y\), we can compute the correlation and a regression coefficient when regressing \(Y \sim X\). Let’s see the similarities and differences between the correlation and regression coefficient. In general, both provide
direction (sign, positive or negative), and
magnitude (value).
Typically, correlation, as computed using Pearson's correlation coefficient, falls in the range \([-1, 1]\), where a value of
\(-1\) indicates perfect negative correlation (as one variable goes up, the other goes down),
\(0\) indicates no linear correlation (as one variable goes up or down, the other is not affected), and
\(1\) indicates perfect positive correlation (as one variable goes up, the other goes up as well).
For the regression coefficient \(\beta_X\) on \(X\) when performing the regression \(Y \sim X\), the direction can likewise be positive or negative (with the same interpretation as for correlation); however, the magnitude is not bounded in \([-1, 1]\) and can be any value in \((-\infty, \infty)\). The regression coefficient \(\beta_X\) is a weight on \(X\): it tells us how much we can expect the baseline (expected/average) value of \(Y\) to change when \(X\) changes by one unit.
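For example, with a hypothetical intercept \(\beta_0 = 5\) and coefficient \(\beta_X = 3\) (the same values used in the simulation below), the expected value of \(Y\) shifts by exactly \(\beta_X\) for each one-unit increase in \(X\):
\(E[Y \mid X = x + 1] - E[Y \mid X = x] = (5 + 3(x + 1)) - (5 + 3x) = 3\)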
The regression coefficient is useful for prediction problems, while the correlation is useful for summarizing the association between two variables. However, they are related conceptually and computationally.
1.1. Simulation
Let’s simulate the following data.
\(X \sim \mathcal{N}(0, 3)\)
\(Y \sim 5 + 3 X + e\)
\(e \sim \mathcal{N}(0, 1)\)
[1]:
import pandas as pd
import numpy as np

np.random.seed(37)

n = 1_000
x = np.random.normal(0, 3, n)               # X ~ N(0, 3)
y = 5 + 3 * x + np.random.normal(0, 1, n)   # Y = 5 + 3X + e, e ~ N(0, 1)

df = pd.DataFrame({
    'x': x,
    'y': y
})
1.2. Correlation
Computing the correlation is very easy using Pandas. As you can see, \(X\) and \(Y\) are nearly perfectly correlated in a positive way.
[2]:
df.corr()
[2]:
|   | x        | y        |
|---|----------|----------|
| x | 1.000000 | 0.993783 |
| y | 0.993783 | 1.000000 |
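As a quick cross-check (a sketch, not part of the original analysis), the same Pearson correlation can also be computed with scipy.stats.pearsonr.
[ ]:
from scipy.stats import pearsonr

# pearsonr returns (correlation, p-value); the first element should match df.corr()
pearsonr(df['x'], df['y'])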
1.3. Regression coefficient
Now, let’s regress \(Y \sim X\). You can see that the intercept is about 4.98, which is not far from the truth, 5.0 (and because \(X\) is centered at 0, the intercept is also close to the average value of \(Y\)). The coefficient \(\beta_X\) for \(X\) is 2.9987 (not far from the truth, 3.0). When \(X\) changes by one unit, we can expect the baseline value of \(Y\) to change by 2.9987.
[3]:
from sklearn.linear_model import LinearRegression
m = LinearRegression()
m.fit(df[['x']], df['y'])
m.intercept_, m.coef_
[3]:
(4.983964548516603, array([2.99871521]))
1.4. Correlation from regression coefficient
We can compute the correlation from the regression coefficient as follows.
\(r = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y} = \beta_X \dfrac{\sigma_X}{\sigma_Y}\)
[4]:
b = m.coef_[0]
s_x = df['x'].std()
s_y = df['y'].std()
r = b * (s_x / s_y)
r
[4]:
0.993782985850263
1.5. Regression coefficient from correlation
We can compute the regression coefficient from correlation as follows.
\(\beta_X = \dfrac{\sigma_{XY}}{\sigma^2_X} = r \dfrac{\sigma_Y}{\sigma_X}\)
[5]:
v_xy = df.cov().iloc[0, 1]
v_x = df['x'].var()
b = v_xy / v_x
b
[5]:
2.9987152098184437
[6]:
b = r * (s_y / s_x)
b
[6]:
2.998715209818446
1.6. Regressing X ~ Y
Now let’s regress \(X \sim Y\) and see what the coefficient \(\beta_Y\) on \(Y\) is. According to the equations above, we do not actually need to fit a regression to get this coefficient.
\(\beta_Y = \dfrac{\sigma_{XY}}{\sigma^2_Y} = r \dfrac{\sigma_X}{\sigma_Y}\)
[7]:
v_xy = df.cov().iloc[0, 1]
v_y = df['y'].var()
b = v_xy / v_y
b
[7]:
0.3293425863622632
[8]:
b = r * (s_x / s_y)
b
[8]:
0.3293425863622633
But, let’s do the regression anyways. You can see that \(\beta_Y\) is the same in all cases.
Bonus: Why is the intercept -1.6 though (and not the average value of X)?
[9]:
m = LinearRegression()
m.fit(df[['y']], df['x'])
m.intercept_, m.coef_
[9]:
(-1.6409565959948018, array([0.32934259]))
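A hint for the bonus question: in simple least squares the fitted intercept is the target mean minus the coefficient times the predictor mean, so for \(X \sim Y\) it is \(\bar{X} - \beta_Y \bar{Y} \approx 0 - 0.33 \times 5 \approx -1.6\). A quick check (a sketch using the quantities already computed):
[ ]:
# intercept of X ~ Y is mean(x) - beta_Y * mean(y); should be about -1.64
df['x'].mean() - b * df['y'].mean()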
The correlation can then be derived from \(\beta_Y\).
\(r = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y} = \beta_Y \dfrac{\sigma_Y}{\sigma_X}\)
[10]:
b * (s_y / s_x)
[10]:
0.993782985850263
1.7. z-scores
Notice how \(\beta_X\) and \(\beta_Y\) are not (necessarily) the same; the regression coefficients are asymmetric (they depend on which variable is the target), while the correlation is always symmetric. However, if we transform \(X\) and \(Y\) into their z-scores, look at what happens to the coefficients of the two regression models as well as the correlation.
\(X_z \sim Y_z\)
\(Y_z \sim X_z\)
It seems both models are now indistinguishable and the coefficients are the same.
[11]:
from scipy.stats import zscore
Z = df.apply(zscore)
m.fit(Z[['x']], Z['y'])
m.intercept_, m.coef_
[11]:
(3.2912297889129536e-17, array([0.99378299]))
[12]:
m.fit(Z[['y']], Z['x'])
m.intercept_, m.coef_
[12]:
(-3.260859790997736e-17, array([0.99378299]))
Now, observe the correlation between \(X_z\) and \(Y_z\).
[13]:
Z.corr()
[13]:
|   | x        | y        |
|---|----------|----------|
| x | 1.000000 | 0.993783 |
| y | 0.993783 | 1.000000 |
It even seems that the coefficients and the correlation are the same! In fact, in z-score space they are all equal to each other, and equal to the correlation in the original space. When the data is standardized (z-scored), the following holds.
\(r = r_z = \beta_{X_z} = \beta_{Y_z}\)
In fact, standardizing the data is a very common way to make regression coefficients comparable. When a dataset is standardized, the variables lose their units (they are unitless), and a coefficient is interpreted not as the change in the target per unit change in a predictor, but as the change per standard deviation change in that predictor. The identities above only hold when there are two variables at play; with multiple independent variables, the situation changes.
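We can also verify the relationship without refitting in z-score space: the standardized coefficient is just the raw coefficient rescaled by the ratio of standard deviations (a sketch using the quantities computed earlier).
[ ]:
# beta in z-score space equals the raw beta times s_x / s_y, which is also r
m_raw = LinearRegression().fit(df[['x']], df['y'])
m_raw.coef_[0] * (s_x / s_y)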
1.8. House value, simulation
Let’s simulate data as follows.
\(I \sim \mathcal{N}(30000, 5000)\)
\(A \sim \mathcal{N}(45, 5)\)
\(e \sim \mathcal{N}(0, 1)\)
\(y = H = 250000 + 3 I - 0.5 A + e\)
where,
\(I\) is the yearly income of a person,
\(A\) is the age of a person,
\(e\) is random error,
\(y\) or \(H\) is the worth of a person’s home
[20]:
n = 1_000
income = np.random.normal(30_000, 5_000, n)   # I ~ N(30000, 5000)
age = np.random.normal(45, 5, n)              # A ~ N(45, 5)
house = 250_000 + (3 * income) - (0.5 * age) + np.random.normal(0, 1, n)

df = pd.DataFrame({
    'income': income,
    'age': age,
    'y': house
})
[24]:
Z = df.apply(zscore)
As you can see, the correlations are the same, whether we are in unit (raw) or unitless (z-score) space.
[21]:
df.corr()
[21]:
|        | income   | age      | y        |
|--------|----------|----------|----------|
| income | 1.000000 | 0.006481 | 1.000000 |
| age    | 0.006481 | 1.000000 | 0.006309 |
| y      | 1.000000 | 0.006309 | 1.000000 |
[25]:
Z.corr()
[25]:
|        | income   | age      | y        |
|--------|----------|----------|----------|
| income | 1.000000 | 0.006481 | 1.000000 |
| age    | 0.006481 | 1.000000 | 0.006309 |
| y      | 1.000000 | 0.006309 | 1.000000 |
Now look at the regression parameters in unit and unitless space; they look very different from the correlations.
The coefficients of the regression with units are easy to interpret:
a one-unit (dollar) increase in income adds 3 to the expected/baseline house value, and
a one-unit (year) increase in age subtracts 0.5 from the expected/baseline house value.
[22]:
# unit space
m = LinearRegression()
m.fit(df[['income', 'age']], df['y'])
m.intercept_, m.coef_
[22]:
(249999.95252662763, array([ 3.0000031 , -0.50079101]))
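To sanity-check the per-unit interpretation of the income coefficient, we can bump income by one unit while holding age fixed and look at the change in the predicted house value (a sketch using the fitted model above).
[ ]:
# increase income by 1 for a single row, holding age fixed;
# the change in the prediction should be about 3 (the income coefficient)
row = df[['income', 'age']].iloc[[0]]
bumped = row.copy()
bumped['income'] = bumped['income'] + 1
m.predict(bumped) - m.predict(row)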
The coefficients of the regression without units are harder to interpret:
a one standard deviation change in income adds one standard deviation to the expected/baseline house value, and
a one standard deviation change in age adds essentially no standard deviation change to the expected/baseline house value.
[26]:
# unitless space
m.fit(Z[['income', 'age']], Z['y'])
m.intercept_, m.coef_
[26]:
(-2.348538114834866e-15, array([ 1.00000110e+00, -1.72047727e-04]))
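These standardized coefficients can also be recovered directly from the raw coefficients: each raw coefficient is rescaled by the predictor's standard deviation divided by the target's standard deviation (a sketch; ddof=0 matches scipy.stats.zscore, though the ratio is the same either way).
[ ]:
# standardized coefficient = raw coefficient * (s_predictor / s_target)
raw = LinearRegression().fit(df[['income', 'age']], df['y'])
raw.coef_ * df[['income', 'age']].std(ddof=0).values / df['y'].std(ddof=0)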
For this mixture of unit and unitless features/variables, what’s the interpretation?
[28]:
# independent variables are unitless, dependent variable has unit
m.fit(Z[['income', 'age']], df['y'])
m.intercept_, m.coef_
[28]:
(339455.06343383784, array([ 1.50604056e+04, -2.59110570e+00]))
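One way to read this mixed setup: the predictors are measured in standard deviations while the target keeps its units (dollars), so each coefficient is the change in house value per one-standard-deviation change in that predictor, which is the raw coefficient times that predictor's standard deviation. The intercept is now simply the average house value, since the standardized predictors are centered at zero. A quick check (a sketch):
[ ]:
# each mixed-space coefficient = raw coefficient * s_predictor (about 3 * 5,000 and -0.5 * 5);
# the intercept should be close to the mean house value
raw = LinearRegression().fit(df[['income', 'age']], df['y'])
raw.coef_ * df[['income', 'age']].std(ddof=0).values, df['y'].mean()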