2. Massey’s Method
Massey’s Method refers to Kenneth Massey’s method for ranking sports teams. Denote the following.
\(n\) is the number of teams
\(m\) is the number of games played (only 2 teams can play in a game)
\(X\) is an \(m\) x \(n\) matrix where columns correspond to teams and rows correspond to games
Each row of \(X\) is sparse (mostly zeros)
For each row, 1 is placed in the column corresponding to the winning team
For each row, -1 is placed in the column corresponding to the losing team
\(r\) is an \(n\) x 1 vector corresponding to the rating of each team; \(r\) is what will be estimated
\(y\) is an \(m\) x 1 vector corresponding to the margins of victory
Then, we are trying to solve
\(Xr = y\)
Furthermore, denote the coefficient matrix \(M\) as \(M = X^TX\). Thus, \(M\) will be \(n\) x \(n\) and
the diagonal elements \(M_{ii}\) of \(M\) will be the number of games played by the i-th team,
the off-diagonal elements \(M_{ij}\) will be the negation of the number of times the i-th team played the j-th team.
Instead of solving \(Xr = y\) directly, we can multiply both sides by \(X^T\) and solve the normal equations,
\(Mr = p\),
where \(p = X^Ty\), so that \(p_i\) is the total point differential of the i-th team.
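The definitions above can be sketched on a tiny example. The three teams and three results below are hypothetical (they are not part of the data used later), and the variable names are chosen only for illustration. The sketch builds \(X\) and \(y\) and then checks the structure of \(M = X^TX\) and \(p = X^Ty\).

```python
import numpy as np

# Hypothetical toy season (illustration only):
# A beat B by 7, B beat C by 3, A beat C by 10.
teams = ["A", "B", "C"]
games = [("A", "B", 7), ("B", "C", 3), ("A", "C", 10)]  # (winner, loser, margin)

idx = {t: i for i, t in enumerate(teams)}

# X is m x n: one row per game, one column per team.
X = np.zeros((len(games), len(teams)), dtype=int)
# y is m x 1: the margin of victory of each game.
y = np.zeros(len(games), dtype=int)

for g, (winner, loser, margin) in enumerate(games):
    X[g, idx[winner]] = 1
    X[g, idx[loser]] = -1
    y[g] = margin

M = X.T @ X  # diagonal: games played; off-diagonal: minus the number of meetings
p = X.T @ y  # total point differential of each team

print(M)
print(p)
```

Each team played two games and met every other team once, so \(M\) has 2 on the diagonal and -1 elsewhere, and the entries of \(p\) sum to zero because every point won by one team is a point lost by another.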
2.1. Data
Let’s illustrate Massey’s Method using data from the NCAAF ACC conference for the year 2005. Below, t1 and t2 are the 2 teams playing and s1 and s2 are the scores corresponding to t1 and t2. There is no information on who was at home or away.
[1]:
import pandas as pd
f_df = pd.read_csv('./ranking/acc-2005-ncaaf.csv')
f_df
[1]:
| | t1 | t2 | s1 | s2 |
---|---|---|---|---|
0 | Duke | Miami | 7 | 52 |
1 | Duke | UNC | 21 | 24 |
2 | Duke | UVA | 7 | 38 |
3 | Duke | VT | 0 | 45 |
4 | Miami | UNC | 34 | 16 |
5 | Miami | UVA | 25 | 17 |
6 | Miami | VT | 27 | 7 |
7 | UNC | UVA | 7 | 5 |
8 | UNC | VT | 3 | 30 |
9 | UVA | VT | 14 | 52 |
We can swap the teams and scores so that every game appears once from each team’s point of view, giving a view like the following.
[2]:
r_df = pd.DataFrame([{'t1': r.t2, 't2': r.t1, 's1': r.s2, 's2': r.s1} for _, r in f_df.iterrows()])
game_df = pd.concat([f_df, r_df]).reset_index(drop=True)
game_df['differential'] = game_df.s1 - game_df.s2
game_df.sort_values(['t1', 't2'])
[2]:
| | t1 | t2 | s1 | s2 | differential |
---|---|---|---|---|---|
0 | Duke | Miami | 7 | 52 | -45 |
1 | Duke | UNC | 21 | 24 | -3 |
2 | Duke | UVA | 7 | 38 | -31 |
3 | Duke | VT | 0 | 45 | -45 |
10 | Miami | Duke | 52 | 7 | 45 |
4 | Miami | UNC | 34 | 16 | 18 |
5 | Miami | UVA | 25 | 17 | 8 |
6 | Miami | VT | 27 | 7 | 20 |
11 | UNC | Duke | 24 | 21 | 3 |
14 | UNC | Miami | 16 | 34 | -18 |
7 | UNC | UVA | 7 | 5 | 2 |
8 | UNC | VT | 3 | 30 | -27 |
12 | UVA | Duke | 38 | 7 | 31 |
15 | UVA | Miami | 17 | 25 | -8 |
17 | UVA | UNC | 5 | 7 | -2 |
9 | UVA | VT | 14 | 52 | -38 |
13 | VT | Duke | 45 | 0 | 45 |
16 | VT | Miami | 7 | 27 | -20 |
18 | VT | UNC | 30 | 3 | 27 |
19 | VT | UVA | 52 | 14 | 38 |
[3]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(10, 4))
_ = sns.boxplot(x='t1', y='differential', data=game_df, ax=ax)
_ = ax.set_title('Box plots of differentials by team')
_ = ax.set_xlabel('Team')
_ = ax.set_ylabel('Differential')
plt.tight_layout()
2.2. Ranking from ratings
An \(n\) x \(n\) matrix of point differentials between each pair of teams can be created as follows. In the matrix below, UNC
beat Duke by 3 points,
lost to Miami by 18 points,
beat UVA by 2 points, and
lost to VT by 27 points.
When the team in the row wins, the number in this matrix is positive, and when it loses, the number is negative.
[4]:
import numpy as np

def get_differential(t1, t2):
    if t1 == t2:
        return np.nan
    s = game_df[(game_df.t1 == t1) & (game_df.t2 == t2)].iloc[0]
    return s.s1 - s.s2

teams = sorted(list(set(game_df.t1) | set(game_df.t2)))
differentials = [[get_differential(t1, t2) for t2 in teams] for t1 in teams]
diff_df = pd.DataFrame(differentials, index=teams, columns=teams)
diff_df
[4]:
| | Duke | Miami | UNC | UVA | VT |
---|---|---|---|---|---|
Duke | NaN | -45.0 | -3.0 | -31.0 | -45.0 |
Miami | 45.0 | NaN | 18.0 | 8.0 | 20.0 |
UNC | 3.0 | -18.0 | NaN | 2.0 | -27.0 |
UVA | 31.0 | -8.0 | -2.0 | NaN | -38.0 |
VT | 45.0 | -20.0 | 27.0 | 38.0 | NaN |
From this point differential matrix, we can then compute the total point differential (sum across the columns) and the number of wins (positive scores) and losses (negative scores).
[5]:
stat_df = pd.DataFrame({
    'differential': diff_df.sum(axis=1),
    'wins': diff_df.apply(lambda r: len(r[r > 0]), axis=1),
    'losses': diff_df.apply(lambda r: len(r[r < 0]), axis=1)
})
stat_df
[5]:
| | differential | wins | losses |
---|---|---|---|
Duke | -124.0 | 0 | 4 |
Miami | 91.0 | 4 | 0 |
UNC | -40.0 | 2 | 2 |
UVA | -17.0 | 1 | 3 |
VT | 90.0 | 3 | 1 |
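The same per-team statistics can also be computed without the pairwise lookup, by stacking each game from both teams' points of view and grouping by team. The following is a minimal sketch under that assumption; the game results are copied from the table in Section 2.1, while names such as `long_df` and `pts_for` are introduced here only for illustration.

```python
import pandas as pd

# Game results copied from the ACC 2005 table in Section 2.1.
rows = [
    ("Duke", "Miami", 7, 52), ("Duke", "UNC", 21, 24),
    ("Duke", "UVA", 7, 38), ("Duke", "VT", 0, 45),
    ("Miami", "UNC", 34, 16), ("Miami", "UVA", 25, 17),
    ("Miami", "VT", 27, 7), ("UNC", "UVA", 7, 5),
    ("UNC", "VT", 3, 30), ("UVA", "VT", 14, 52),
]
f_df = pd.DataFrame(rows, columns=["t1", "t2", "s1", "s2"])

# One row per (team, game): first from t1's perspective, then from t2's.
cols = ["team", "pts_for", "pts_against"]
a = f_df.rename(columns={"t1": "team", "s1": "pts_for", "s2": "pts_against"})[cols]
b = f_df.rename(columns={"t2": "team", "s2": "pts_for", "s1": "pts_against"})[cols]
long_df = pd.concat([a, b], ignore_index=True)

long_df["differential"] = long_df.pts_for - long_df.pts_against
long_df["wins"] = long_df.differential > 0
long_df["losses"] = long_df.differential < 0

# Summing booleans counts the True values.
stat_df = long_df.groupby("team")[["differential", "wins", "losses"]].sum()
print(stat_df)
```

This avoids one data frame filter per pair of teams, which matters little for 5 teams but grows quadratically with the number of teams.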
If we let the sum of the point differentials be the ratings, then the ranking produced is as follows where Miami is first and Duke is last.
[6]:
stat_df.differential.sort_values(ascending=False)
[6]:
Miami 91.0
VT 90.0
UVA -17.0
UNC -40.0
Duke -124.0
Name: differential, dtype: float64
We can also let the number of wins be the ratings, and Miami is still first and Duke is last.
[7]:
stat_df.wins.sort_values(ascending=False)
[7]:
Miami 4
VT 3
UNC 2
UVA 1
Duke 0
Name: wins, dtype: int64
Let’s compare the ranking by point differential and wins. Notice that the rankings are nearly identical, except that UVA and UNC switch places.
[8]:
pd.DataFrame({
    'by_differential': stat_df.differential.sort_values(ascending=False).index,
    'by_wins': stat_df.wins.sort_values(ascending=False).index
})
[8]:
| | by_differential | by_wins |
---|---|---|
0 | Miami | Miami |
1 | VT | VT |
2 | UVA | UNC |
3 | UNC | UVA |
4 | Duke | Duke |
2.3. Xr = y
Let’s build our \(X\) matrix.
[9]:
def get_vector(r):
    v = [0 for i in range(len(teams))]
    if r.s1 > r.s2:
        v[t2i[r.t1]] = 1
        v[t2i[r.t2]] = -1
    else:
        v[t2i[r.t1]] = -1
        v[t2i[r.t2]] = 1
    return v

t2i = {t: i for i, t in enumerate(teams)}
X = pd.DataFrame([get_vector(r) for _, r in f_df.iterrows()], columns=teams)
X
[9]:
| | Duke | Miami | UNC | UVA | VT |
---|---|---|---|---|---|
0 | -1 | 1 | 0 | 0 | 0 |
1 | -1 | 0 | 1 | 0 | 0 |
2 | -1 | 0 | 0 | 1 | 0 |
3 | -1 | 0 | 0 | 0 | 1 |
4 | 0 | 1 | -1 | 0 | 0 |
5 | 0 | 1 | 0 | -1 | 0 |
6 | 0 | 1 | 0 | 0 | -1 |
7 | 0 | 0 | 1 | -1 | 0 |
8 | 0 | 0 | -1 | 0 | 1 |
9 | 0 | 0 | 0 | -1 | 1 |
Notice that \(X^TX = M\).
[10]:
X.T.dot(X)
[10]:
| | Duke | Miami | UNC | UVA | VT |
---|---|---|---|---|---|
Duke | 4 | -1 | -1 | -1 | -1 |
Miami | -1 | 4 | -1 | -1 | -1 |
UNC | -1 | -1 | 4 | -1 | -1 |
UVA | -1 | -1 | -1 | 4 | -1 |
VT | -1 | -1 | -1 | -1 | 4 |
Our \(y\) will be the point differential, computed as s1 - s2 (so it is negative whenever t2 wins).
[11]:
y = f_df.s1 - f_df.s2
y
[11]:
0 -45
1 -3
2 -31
3 -45
4 18
5 8
6 20
7 2
8 -27
9 -38
dtype: int64
We will now attempt to learn the ratings using linear regression.
[12]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
model.intercept_, model.coef_
[12]:
(4.100000000000035, array([ 28.08, -3.08, 1.6 , 1.04, -27.64]))
The ratings are the coefficients (see below where we negate them). However, because \(m > n\) (and typically \(m \gg n\)), the linear system is overdetermined and generally inconsistent, so the fit is only a least-squares approximation.
[13]:
sorted(list(zip(teams, -model.coef_)), key=lambda tup: tup[1], reverse=True)
[13]:
[('VT', 27.640000000000022),
('Miami', 3.080000000000026),
('UVA', -1.0400000000000234),
('UNC', -1.5999999999999974),
('Duke', -28.08000000000003)]
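Note how Miami is undefeated, but still in second place. Total point differential does not explain this, since it is nearly equal for the two teams (91 for Miami and 90 for VT). A more likely cause is a sign mismatch: each row of \(X\) has 1 for the winner and -1 for the loser, while \(y\) is computed as s1 - s2, which is negative whenever t2 wins, so those games pull the fit in the opposite direction. Separately, in many sports ranking systems, point differentials are taken out of consideration because teams with a large lead may stop scoring points as a matter of sportsmanship.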
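The definitions in Section 2 put 1 in the winner's column and use the margin of victory for \(y\), whereas the cells above compute \(y\) as s1 - s2. As a point of comparison, here is a minimal sketch of the least-squares fit when both \(X\) and \(y\) follow the winner-minus-loser convention. It assumes no intercept term and uses `numpy.linalg.lstsq`, which returns the minimum-norm solution when \(X\) is rank deficient; it is not the cell that produced the output above.

```python
import numpy as np

teams = ["Duke", "Miami", "UNC", "UVA", "VT"]
# Game results copied from the ACC 2005 table in Section 2.1.
games = [
    ("Duke", "Miami", 7, 52), ("Duke", "UNC", 21, 24),
    ("Duke", "UVA", 7, 38), ("Duke", "VT", 0, 45),
    ("Miami", "UNC", 34, 16), ("Miami", "UVA", 25, 17),
    ("Miami", "VT", 27, 7), ("UNC", "UVA", 7, 5),
    ("UNC", "VT", 3, 30), ("UVA", "VT", 14, 52),
]
idx = {t: i for i, t in enumerate(teams)}

X = np.zeros((len(games), len(teams)))
y = np.zeros(len(games))
for g, (t1, t2, s1, s2) in enumerate(games):
    winner, loser = (t1, t2) if s1 > s2 else (t2, t1)
    X[g, idx[winner]] = 1
    X[g, idx[loser]] = -1
    y[g] = abs(s1 - s2)  # margin of victory, always non-negative

# Minimum-norm least-squares solution of X r = y (ratings sum to zero).
r, *_ = np.linalg.lstsq(X, y, rcond=None)

for team, rating in sorted(zip(teams, r), key=lambda tr: tr[1], reverse=True):
    print(f"{team:6s} {rating:7.2f}")
```

Under these assumptions the ratings come out as each team's total point differential divided by 5, with Miami (18.2) narrowly ahead of VT (18.0), and no negation of the coefficients is needed.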
2.4. Mr = p
Here, we will estimate the ratings, and thus the ranking, using the coefficient matrix \(M\).
[14]:
def get_games_played(t1, t2):
    if t1 == t2:
        return f_df[(f_df.t1 == t1) | (f_df.t2 == t2)].shape[0]
    else:
        q1 = (f_df.t1 == t1) & (f_df.t2 == t2)
        q2 = (f_df.t1 == t2) & (f_df.t2 == t1)
        q = q1 | q2
        return -f_df[q].shape[0]

mat = [[get_games_played(t1, t2) for t2 in teams] for t1 in teams]
mat
[14]:
[[4, -1, -1, -1, -1],
[-1, 4, -1, -1, -1],
[-1, -1, 4, -1, -1],
[-1, -1, -1, 4, -1],
[-1, -1, -1, -1, 4]]
Each row of \(M\) sums to zero, so \(M\) is singular. To ensure that \(M\) has full rank, we will convert the last row to all 1’s.
[15]:
mat[-1] = [1 for _ in range(len(mat[-1]))]
mat
[15]:
[[4, -1, -1, -1, -1],
[-1, 4, -1, -1, -1],
[-1, -1, 4, -1, -1],
[-1, -1, -1, 4, -1],
[1, 1, 1, 1, 1]]
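The effect of this row replacement on the rank can be checked numerically. The sketch below rebuilds the 5 x 5 matrix shown above (4 on the diagonal, -1 elsewhere) with NumPy rather than reusing `mat`, so that it runs on its own.

```python
import numpy as np

n = 5
# Same matrix as above: 4 on the diagonal, -1 everywhere else.
M = 5 * np.eye(n) - np.ones((n, n))

# Every row sums to zero, so the all-ones vector is in the null space.
print(M @ np.ones(n))
rank_before = np.linalg.matrix_rank(M)

# Replace the last row with ones, as in the cell above.
M_fixed = M.copy()
M_fixed[-1, :] = 1
rank_after = np.linalg.matrix_rank(M_fixed)

print(rank_before, rank_after)  # 4 5
```

The all-ones row is not a combination of the remaining rows (which each sum to zero), which is why the replacement restores full rank.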
Our final \(M\) looks like the following.
[16]:
M = pd.DataFrame(mat, index=teams, columns=teams)
M
[16]:
| | Duke | Miami | UNC | UVA | VT |
---|---|---|---|---|---|
Duke | 4 | -1 | -1 | -1 | -1 |
Miami | -1 | 4 | -1 | -1 | -1 |
UNC | -1 | -1 | 4 | -1 | -1 |
UVA | -1 | -1 | -1 | 4 | -1 |
VT | 1 | 1 | 1 | 1 | 1 |
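Our \(p\), the total point differential of each team, looks like the following. Massey’s formulation also sets the last entry of \(p\) to 0 so that the ratings sum to zero; here it is left unchanged, which shifts every rating by the same constant and does not affect the ranking.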
[17]:
p = stat_df.differential
p
[17]:
Duke -124.0
Miami 91.0
UNC -40.0
UVA -17.0
VT 90.0
Name: differential, dtype: float64
Applying linear regression and sorting the ratings (coefficients), we get the final ranking below.
[18]:
model = LinearRegression()
model.fit(M, p)
model.intercept_, model.coef_
[18]:
(8.881784197001252e-15, array([-6.8, 36.2, 10. , 14.6, 36. ]))
[19]:
sorted(list(zip(teams, model.coef_)), key=lambda tup: tup[1], reverse=True)
[19]:
[('Miami', 36.2),
('VT', 36.00000000000002),
('UVA', 14.599999999999998),
('UNC', 9.999999999999998),
('Duke', -6.799999999999999)]
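As a cross-check, the square system can also be solved directly. The sketch below uses `numpy.linalg.solve` instead of `LinearRegression` (a substitution made here for illustration) and compares two right-hand sides: \(p\) with its last entry set to 0, as in Massey's formulation, and \(p\) left unchanged, as in the cells above. The values of \(M\) and \(p\) are copied from the outputs shown earlier.

```python
import numpy as np

teams = ["Duke", "Miami", "UNC", "UVA", "VT"]

# M with the last row replaced by ones, as shown above.
M = 5 * np.eye(5) - np.ones((5, 5))
M[-1, :] = 1

# Total point differentials from the table above.
p = np.array([-124.0, 91.0, -40.0, -17.0, 90.0])

# Massey's formulation: last entry of p set to 0, so ratings sum to zero.
p_zeroed = p.copy()
p_zeroed[-1] = 0.0
r_zero_sum = np.linalg.solve(M, p_zeroed)

# Last entry left unchanged, as in the cells above.
r_unchanged = np.linalg.solve(M, p)

print(dict(zip(teams, np.round(r_zero_sum, 2))))
print(np.round(r_unchanged - r_zero_sum, 2))  # constant shift of 18 for every team
```

Both versions give the same ordering (Miami, VT, UVA, UNC, Duke); the unchanged \(p\) reproduces the ratings above, which are the zero-sum ratings plus 18.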
2.5. NBA 2021
Let’s apply Massey’s Method to the NBA for the 2021 season and games up to Thanksgiving.
[20]:
def get_nba():
    # Load NBA games, rename columns to the t1/t2/s1/s2 convention,
    # and drop preseason games.
    x = pd.read_csv('./nba/2021.csv')\
        .rename(columns={
            'a_team': 't1',
            'h_team': 't2',
            'a_score': 's1',
            'h_score': 's2'})
    x = x[x.preseason == False]\
        .drop(columns=['preseason'])\
        .reset_index(drop=True)
    return x

def get_nfl():
    # Load NFL games, rename columns, and strip whitespace from team names.
    x = pd.read_csv('./nfl/2021.csv')\
        .rename(columns={
            'team1': 't1',
            'team2': 't2',
            'score1': 's1',
            'score2': 's2'})\
        .drop(columns=['week'])
    x['t1'] = x['t1'].apply(lambda s: s.strip())
    x['t2'] = x['t2'].apply(lambda s: s.strip())
    return x

def get_X(df):
    # Despite the name, this returns M = X^T X with the last row set to ones.
    def get_vector(r):
        v = [0 for i in range(len(teams))]
        if r.s1 > r.s2:
            v[t2i[r.t1]] = 1
            v[t2i[r.t2]] = -1
        else:
            # Note: a tie (s1 == s2) also falls into this branch.
            v[t2i[r.t1]] = -1
            v[t2i[r.t2]] = 1
        return v

    teams = sorted(list(set(df.t1) | set(df.t2)))
    t2i = {t: i for i, t in enumerate(teams)}
    X = pd.DataFrame([get_vector(r) for _, r in df.iterrows()], columns=teams)
    X = X.T.dot(X)
    X.iloc[-1,:] = 1
    return X

def get_y(df):
    # Total point differential per team (the vector p).
    def get_diff(t):
        a = df[df.t1 == t]
        b = df[df.t2 == t]
        c = a.s1 - a.s2
        d = b.s2 - b.s1
        return c.sum() + d.sum()

    teams = sorted(list(set(df.t1) | set(df.t2)))
    diffs = [get_diff(t) for t in teams]
    return pd.Series(diffs, index=teams)

def get_Xy(df):
    return get_X(df), get_y(df)
[21]:
X, y = get_Xy(get_nba())
model = LinearRegression()
model.fit(X, y)
pd.DataFrame(zip(X.index, model.coef_), columns=['Team', 'Rating'])\
    .sort_values('Rating', ascending=False)
[21]:
| | Team | Rating |
---|---|---|
28 | Warriors | 12.899850 |
10 | Jazz | 8.549796 |
24 | Suns | 6.744632 |
8 | Heat | 6.488263 |
16 | Nets | 4.254646 |
2 | Bulls | 4.235154 |
5 | Clippers | 3.792290 |
7 | Hawks | 3.338752 |
18 | Pacers | 3.138730 |
27 | Trail Blazers | 2.456361 |
0 | 76ers | 2.100543 |
1 | Bucks | 1.999356 |
3 | Cavaliers | 1.429301 |
26 | Timberwolves | 1.276979 |
9 | Hornets | 1.227789 |
29 | Wizards | 1.180180 |
21 | Raptors | 1.055674 |
17 | Nuggets | 0.755934 |
4 | Celtics | 0.579776 |
11 | Kings | 0.089514 |
12 | Knicks | -0.306992 |
15 | Mavericks | -0.688728 |
13 | Lakers | -3.357641 |
23 | Spurs | -3.433295 |
6 | Grizzlies | -4.525625 |
19 | Pelicans | -4.916126 |
25 | Thunder | -5.150785 |
20 | Pistons | -7.552551 |
22 | Rockets | -8.788710 |
14 | Magic | -9.818479 |
2.6. NFL
Let’s apply Massey’s Method to the NFL for the 2021 season and games up to Thanksgiving.
[22]:
X, y = get_Xy(get_nfl())
model = LinearRegression()
model.fit(X, y)
pd.DataFrame(zip(X.index, model.coef_), columns=['Team', 'Rating'])\
    .sort_values('Rating', ascending=False)
[22]:
| | Team | Rating |
---|---|---|
3 | Bills | 8.425259 |
7 | Cardinals | 8.235687 |
21 | Patriots | 6.349719 |
6 | Buccaneers | 6.321022 |
11 | Cowboys | 5.819441 |
10 | Colts | 3.115332 |
9 | Chiefs | 2.706593 |
29 | Titans | 2.584503 |
23 | Rams | 2.472228 |
13 | Eagles | 1.696797 |
0 | 49ers | 1.565063 |
19 | Packers | 1.443518 |
25 | Saints | 0.808172 |
30 | Vikings | 0.737758 |
8 | Chargers | -0.551756 |
2 | Bengals | -0.607977 |
26 | Seahawks | -0.880815 |
24 | Ravens | -1.608072 |
20 | Panthers | -1.742287 |
4 | Broncos | -2.431469 |
5 | Browns | -2.723871 |
31 | Washington | -3.649900 |
27 | Steelers | -3.707054 |
22 | Raiders | -5.414601 |
15 | Giants | -6.524430 |
12 | Dolphins | -8.305769 |
1 | Bears | -9.322540 |
16 | Jaguars | -10.338981 |
28 | Texans | -11.365255 |
18 | Lions | -12.281915 |
14 | Falcons | -12.409125 |
17 | Jets | -14.139647 |