2. Massey’s Method

Massey’s Method refers to Kenneth Massey’s method for ranking sport teams. Denote the following.

  • \(n\) is the number of teams

  • \(m\) is the number games played (only 2 teams can play in a game)

  • \(X\) is a \(n\) x \(m\) matrix where columns corresponding to teams and rows correspond to games

    • Each row of \(X\) is sparse (mostly zeros)

    • For each row, 1 is placed in the column corresponding to the winning team

    • For each row, -1 is placed in the column corresponding to the losing team

  • \(r\) is a \(n\) x 1 vector corresponding to the rating of each team

    • \(r\) is what will be estimated

  • \(y\) is a \(n\) x 1 vector corresponding to the margins of victory

Then, we are trying to solve

\(Xr = y\)

Futhermore, denote coefficient matrix \(M\) as \(M = X^TX\). Thus, \(M\) will be \(n\) x \(n\) and

  • the diagonal elements \(M_{ii}\) of \(M\) will be the number of games played by the i-th team,

  • the off-diagonal elements \(M_{ij}\) will be the negation of the number of times the i-th team played the j-th team.

Instead of solving \(Xr = y\), then we can solve the following,

\(Mr = p\),

where \(p = X^Ty\).

2.1. Data

Let’s illustrate Massey’s Method using data from the NCAAF ACC conference for the year 2005. Below, t1 and t2 are the 2 teams playing and s1 and s2 are the scores corresponding to t1 and t2. There is no information on who was at home or away.

[1]:
import pandas as pd

f_df = pd.read_csv('./ranking/acc-2005-ncaaf.csv')
f_df
[1]:
t1 t2 s1 s2
0 Duke Miami 7 52
1 Duke UNC 21 24
2 Duke UVA 7 38
3 Duke VT 0 45
4 Miami UNC 34 16
5 Miami UVA 25 17
6 Miami VT 27 7
7 UNC UVA 7 5
8 UNC VT 3 30
9 UVA VT 14 52

We can swap the teams and scores and get a view like the following.

[2]:
r_df = pd.DataFrame([{'t1': r.t2, 't2': r.t1, 's1': r.s2, 's2': r.s1} for _, r in f_df.iterrows()])

game_df = pd.concat([f_df, r_df]).reset_index(drop=True)
game_df['differential'] = game_df.s1 - game_df.s2
game_df.sort_values(['t1', 't2'])
[2]:
t1 t2 s1 s2 differential
0 Duke Miami 7 52 -45
1 Duke UNC 21 24 -3
2 Duke UVA 7 38 -31
3 Duke VT 0 45 -45
10 Miami Duke 52 7 45
4 Miami UNC 34 16 18
5 Miami UVA 25 17 8
6 Miami VT 27 7 20
11 UNC Duke 24 21 3
14 UNC Miami 16 34 -18
7 UNC UVA 7 5 2
8 UNC VT 3 30 -27
12 UVA Duke 38 7 31
15 UVA Miami 17 25 -8
17 UVA UNC 5 7 -2
9 UVA VT 14 52 -38
13 VT Duke 45 0 45
16 VT Miami 7 27 -20
18 VT UNC 30 3 27
19 VT UVA 52 14 38
[3]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

fig, ax = plt.subplots(figsize=(10, 4))

_ = sns.boxplot(x='t1', y='differential', data=game_df, ax=ax)
_ = ax.set_title('Box plots of differentials by team')
_ = ax.set_xlabel('Team')
_ = ax.set_ylabel('Differential')

plt.tight_layout()
_images/massey-method_5_0.png

2.2. Ranking from ratings

A \(n\) x \(n\) matrix of point differential for each team can be created as follows. In the matrix below, UNC

  • won Duke by 3 points,

  • lost to Miami by 18 points,

  • win UVA by 2 points, and

  • lost to VT by 27 points.

When a team wins, the number in this matrix is positive, and when a team loses, the number in this matrix is negative.

[4]:
import numpy as np

def get_differential(t1, t2):
    if t1 == t2:
        return np.nan
    s = game_df[(game_df.t1 == t1) & (game_df.t2 == t2)].iloc[0]
    return s.s1 - s.s2

teams = sorted(list(set(game_df.t1) | set(game_df.t2)))
differentials = [[get_differential(t1, t2) for t2 in teams] for t1 in teams]

diff_df = pd.DataFrame(differentials, index=teams, columns=teams)
diff_df
[4]:
Duke Miami UNC UVA VT
Duke NaN -45.0 -3.0 -31.0 -45.0
Miami 45.0 NaN 18.0 8.0 20.0
UNC 3.0 -18.0 NaN 2.0 -27.0
UVA 31.0 -8.0 -2.0 NaN -38.0
VT 45.0 -20.0 27.0 38.0 NaN

From this point differential matrix, we can then compute the total point differential (sum across the columns) and the number of wins (positive scores) and losses (negative scores).

[5]:
stat_df = pd.DataFrame({
    'differential': diff_df.sum(axis=1),
    'wins': diff_df.apply(lambda r: len(r[r > 0]), axis=1),
    'losses': diff_df.apply(lambda r: len(r[r < 0]), axis=1)
})

stat_df
[5]:
differential wins losses
Duke -124.0 0 4
Miami 91.0 4 0
UNC -40.0 2 2
UVA -17.0 1 3
VT 90.0 3 1

If we let the sum of the point differentials be the ratings, then the ranking produced is as follows where Miami is first and Duke is last.

[6]:
stat_df.differential.sort_values(ascending=False)
[6]:
Miami     91.0
VT        90.0
UVA      -17.0
UNC      -40.0
Duke    -124.0
Name: differential, dtype: float64

We can also let the number of wins be the ratings, and Miami is still first and Duke is last.

[7]:
stat_df.wins.sort_values(ascending=False)
[7]:
Miami    4
VT       3
UNC      2
UVA      1
Duke     0
Name: wins, dtype: int64

Let’s compare the ranking by point differential and wins. Notice that the rankings are nearly identical, except that UVA and UNC switch places?

[8]:
pd.DataFrame({
    'by_differential': stat_df.differential.sort_values(ascending=False).index,
    'by_wins': stat_df.wins.sort_values(ascending=False).index
})
[8]:
by_differential by_wins
0 Miami Miami
1 VT VT
2 UVA UNC
3 UNC UVA
4 Duke Duke

2.3. Xr = y

Let’s build or \(X\) matrix.

[9]:
def get_vector(r):
    v = [0 for i in range(len(teams))]

    if r.s1 > r.s2:
        v[t2i[r.t1]] = 1
        v[t2i[r.t2]] = -1
    else:
        v[t2i[r.t1]] = -1
        v[t2i[r.t2]] = 1
    return v

t2i = {t: i for i, t in enumerate(teams)}

X = pd.DataFrame([get_vector(r) for _, r in f_df.iterrows()], columns=teams)
X
[9]:
Duke Miami UNC UVA VT
0 -1 1 0 0 0
1 -1 0 1 0 0
2 -1 0 0 1 0
3 -1 0 0 0 1
4 0 1 -1 0 0
5 0 1 0 -1 0
6 0 1 0 0 -1
7 0 0 1 -1 0
8 0 0 -1 0 1
9 0 0 0 -1 1

Notice that \(X^TX = M\).

[10]:
X.T.dot(X)
[10]:
Duke Miami UNC UVA VT
Duke 4 -1 -1 -1 -1
Miami -1 4 -1 -1 -1
UNC -1 -1 4 -1 -1
UVA -1 -1 -1 4 -1
VT -1 -1 -1 -1 4

Our \(y\) will be the point differential.

[11]:
y = f_df.s1 - f_df.s2
y
[11]:
0   -45
1    -3
2   -31
3   -45
4    18
5     8
6    20
7     2
8   -27
9   -38
dtype: int64

We will not attempt to learn the ratings using linear regression.

[12]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
model.intercept_, model.coef_
[12]:
(4.100000000000035, array([ 28.08,  -3.08,   1.6 ,   1.04, -27.64]))

The ratings are the coefficients (see below where we negate them). However, because \(m >> n\), the linear system is highly overdetermined and inconsistent.

[13]:
sorted(list(zip(teams, -model.coef_)), key=lambda tup: tup[1], reverse=True)
[13]:
[('VT', 27.640000000000022),
 ('Miami', 3.080000000000026),
 ('UVA', -1.0400000000000234),
 ('UNC', -1.5999999999999974),
 ('Duke', -28.08000000000003)]

Note how Miami is undefeated, but still in second place? It is likely VT came in first place since their point differentials average higher. In most sports ranking system, point differentials are take out of consideration because teams with a high lead may stop scoring points as a matter of sportsmanship.

2.4. Mr = p

Here, we will estimating the ratings and thus the ranking using the coefficient matrix \(M\).

[14]:
def get_games_played(t1, t2):
    if t1 == t2:
        return f_df[(f_df.t1 == t1) | (f_df.t2 == t2)].shape[0]
    else:
        q1 = (f_df.t1 == t1) & (f_df.t2 == t2)
        q2 = (f_df.t1 == t2) & (f_df.t2 == t1)
        q = q1 | q2
        return -f_df[q].shape[0]

mat = [[get_games_played(t1, t2) for t2 in teams] for t1 in teams]
mat
[14]:
[[4, -1, -1, -1, -1],
 [-1, 4, -1, -1, -1],
 [-1, -1, 4, -1, -1],
 [-1, -1, -1, 4, -1],
 [-1, -1, -1, -1, 4]]

To ensure that \(M\) has full rank, we will convert the last row to all 1’s.

[15]:
mat[-1] = [1 for _ in range(len(mat[-1]))]
mat
[15]:
[[4, -1, -1, -1, -1],
 [-1, 4, -1, -1, -1],
 [-1, -1, 4, -1, -1],
 [-1, -1, -1, 4, -1],
 [1, 1, 1, 1, 1]]

Our final \(M\) looks like the following.

[16]:
M = pd.DataFrame(mat, index=teams, columns=teams)
M
[16]:
Duke Miami UNC UVA VT
Duke 4 -1 -1 -1 -1
Miami -1 4 -1 -1 -1
UNC -1 -1 4 -1 -1
UVA -1 -1 -1 4 -1
VT 1 1 1 1 1

Our \(p\) looks like the following.

[17]:
p = stat_df.differential
p
[17]:
Duke    -124.0
Miami     91.0
UNC      -40.0
UVA      -17.0
VT        90.0
Name: differential, dtype: float64

Applying linear regression and sorting the ratings (coefficients), we get the final ranking below.

[18]:
model = LinearRegression()
model.fit(M, p)
model.intercept_, model.coef_
[18]:
(8.881784197001252e-15, array([-6.8, 36.2, 10. , 14.6, 36. ]))
[19]:
sorted(list(zip(teams, model.coef_)), key=lambda tup: tup[1], reverse=True)
[19]:
[('Miami', 36.2),
 ('VT', 36.00000000000002),
 ('UVA', 14.599999999999998),
 ('UNC', 9.999999999999998),
 ('Duke', -6.799999999999999)]

2.5. NBA 2021

Let’s apply Massey’s Method to the NBA for the 2021 season and games up to Thanksgiving.

[20]:
def get_nba():
    x = pd.read_csv('./nba/2021.csv')\
        .rename(columns={
            'a_team': 't1',
            'h_team': 't2',
            'a_score': 's1',
            'h_score': 's2'})
    x = x[x.preseason == False]\
        .drop(columns=['preseason'])\
        .reset_index(drop=True)
    return x

def get_nfl():
    x = pd.read_csv('./nfl/2021.csv')\
        .rename(columns={
            'team1': 't1',
            'team2': 't2',
            'score1': 's1',
            'score2': 's2'})\
        .drop(columns=['week'])
    x['t1'] = x['t1'].apply(lambda s: s.strip())
    x['t2'] = x['t2'].apply(lambda s: s.strip())

    return x

def get_X(df):
    def get_vector(r):
        v = [0 for i in range(len(teams))]

        if r.s1 > r.s2:
            v[t2i[r.t1]] = 1
            v[t2i[r.t2]] = -1
        else:
            v[t2i[r.t1]] = -1
            v[t2i[r.t2]] = 1
        return v

    teams = sorted(list(set(df.t1) | set(df.t2)))
    t2i = {t: i for i, t in enumerate(teams)}

    X = pd.DataFrame([get_vector(r) for _, r in df.iterrows()], columns=teams)
    X = X.T.dot(X)
    X.iloc[-1,:] = 1

    return X

def get_y(df):
    def get_diff(t):
        a = df[df.t1 == t]
        b = df[df.t2 == t]

        c = a.s1 - a.s2
        d = b.s2 - b.s1

        return c.sum() + d.sum()

    teams = sorted(list(set(df.t1) | set(df.t2)))
    diffs = [get_diff(t) for t in teams]

    return pd.Series(diffs, index=teams)

def get_Xy(df):
    return get_X(df), get_y(df)
[21]:
X, y = get_Xy(get_nba())

model = LinearRegression()
model.fit(X, y)

pd.DataFrame(zip(X.index, model.coef_), columns=['Team', 'Rating'])\
    .sort_values('Rating', ascending=False)
[21]:
Team Rating
28 Warriors 12.899850
10 Jazz 8.549796
24 Suns 6.744632
8 Heat 6.488263
16 Nets 4.254646
2 Bulls 4.235154
5 Clippers 3.792290
7 Hawks 3.338752
18 Pacers 3.138730
27 Trail Blazers 2.456361
0 76ers 2.100543
1 Bucks 1.999356
3 Cavaliers 1.429301
26 Timberwolves 1.276979
9 Hornets 1.227789
29 Wizards 1.180180
21 Raptors 1.055674
17 Nuggets 0.755934
4 Celtics 0.579776
11 Kings 0.089514
12 Knicks -0.306992
15 Mavericks -0.688728
13 Lakers -3.357641
23 Spurs -3.433295
6 Grizzlies -4.525625
19 Pelicans -4.916126
25 Thunder -5.150785
20 Pistons -7.552551
22 Rockets -8.788710
14 Magic -9.818479

2.6. NFL

Let’s apply Massey’s Method to the NFL for the 2021 season and games up to Thanksgiving.

[22]:
X, y = get_Xy(get_nfl())

model = LinearRegression()
model.fit(X, y)

pd.DataFrame(zip(X.index, model.coef_), columns=['Team', 'Rating'])\
    .sort_values('Rating', ascending=False)
[22]:
Team Rating
3 Bills 8.425259
7 Cardinals 8.235687
21 Patriots 6.349719
6 Buccaneers 6.321022
11 Cowboys 5.819441
10 Colts 3.115332
9 Chiefs 2.706593
29 Titans 2.584503
23 Rams 2.472228
13 Eagles 1.696797
0 49ers 1.565063
19 Packers 1.443518
25 Saints 0.808172
30 Vikings 0.737758
8 Chargers -0.551756
2 Bengals -0.607977
26 Seahawks -0.880815
24 Ravens -1.608072
20 Panthers -1.742287
4 Broncos -2.431469
5 Browns -2.723871
31 Washington -3.649900
27 Steelers -3.707054
22 Raiders -5.414601
15 Giants -6.524430
12 Dolphins -8.305769
1 Bears -9.322540
16 Jaguars -10.338981
28 Texans -11.365255
18 Lions -12.281915
14 Falcons -12.409125
17 Jets -14.139647