4. Colley’s Method
Massey’s Method of rating is defined as
\(Mr = p\),
where,
\(M\) is an \(n \times n\) matrix,
each \(M_{ii}\) is the number of games played by the i-th team,
each \(M_{ij}\) (for \(i \neq j\)) is the negative of the number of games played by the i-th team against the j-th team,
\(r\) is the vector of ratings we are trying to estimate, and
\(p\) is the vector of each team’s cumulative point differential across the games it played.
Colley’s Method is another method of rating defined as
\(Cr = b\),
where,
\(C = 2I + M\), and
\(b = 1 + \frac{1}{2}(w - l)\).
\(w\) is the vector of each team’s number of wins, and
\(l\) is the vector of each team’s number of losses (the \(1\) in \(b\) is a vector of ones).
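To make the definitions concrete, here is a minimal sketch that builds \(M\), \(C\), and \(b\) for a made-up three-team round robin and solves \(Cr = b\) directly with NumPy. The game list is hypothetical and is not taken from any data set used in this chapter.

```python
import numpy as np

# Hypothetical results for three teams, as (winner, loser) index pairs.
games = [(0, 1), (0, 2), (1, 2)]
n = 3

M = np.zeros((n, n))
w = np.zeros(n)
l = np.zeros(n)
for winner, loser in games:
    M[winner, winner] += 1   # diagonal: games played by each team
    M[loser, loser] += 1
    M[winner, loser] -= 1    # off-diagonal: negated head-to-head game counts
    M[loser, winner] -= 1
    w[winner] += 1
    l[loser] += 1

C = 2 * np.eye(n) + M        # Colley matrix
b = 1 + 0.5 * (w - l)        # Colley right-hand side
r = np.linalg.solve(C, b)    # Colley ratings
print(r)
```

The undefeated team gets a rating of about 0.7, the 1-1 team 0.5, and the winless team 0.3, so the ratings average to 0.5.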
The benefits of Colley’s Method are stated as follows.
Colley’s Method is bias-free in the sense that it uses only wins and losses, not point differential. Point differential can distort the ratings because a dominant team may either run up the score against an opponent or deliberately hold its score down so as not to appear to overwhelm the opponent (sportsmanship).
Colley’s Method also considers the ratings jointly; when one team’s rating improves, another’s must suffer.
Colley’s Method is applicable when point differential is unavailable or undesirable.
4.1. NCAAF ACC 2005
Let’s look at the NCAAF ACC 2005 data.
[1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

def get_ncaaf():
    return pd.read_csv('./ranking/acc-2005-ncaaf.csv')

def get_teams(df):
    return sorted(list(set(df.t1) | set(df.t2)))

def get_fap(df):
    # points for, points against (stored as a negative number), and their sum
    def get_f(t):
        return df[df.t1 == t].s1.sum() + df[df.t2 == t].s2.sum()

    def get_a(t):
        return -df[df.t1 == t].s2.sum() + -df[df.t2 == t].s1.sum()

    teams = get_teams(df)
    x = pd.DataFrame([{'for': get_f(t), 'against': get_a(t)} for t in teams], index=teams)
    x['differential'] = x['for'] + x['against']
    return x

def get_wlb(df):
    # wins, losses, and Colley's right-hand side b (tied games count as neither)
    def get_wins(t):
        w1 = df[(df.t1 == t) & (df.s1 > df.s2)].shape[0]
        w2 = df[(df.t2 == t) & (df.s2 > df.s1)].shape[0]
        return w1 + w2

    def get_losses(t):
        l1 = df[(df.t1 == t) & (df.s1 < df.s2)].shape[0]
        l2 = df[(df.t2 == t) & (df.s2 < df.s1)].shape[0]
        return l1 + l2

    teams = get_teams(df)
    x = pd.DataFrame({
        'w': [get_wins(t) for t in teams],
        'l': [get_losses(t) for t in teams]
    }, index=teams)
    x['b'] = 1 + 0.5 * (x.w - x.l)
    return x

def get_M(df):
    def get_games_played(t1, t2):
        if t1 == t2:
            # diagonal: every game in which the team appears
            return df[(df.t1 == t1) | (df.t2 == t2)].shape[0]
        else:
            # off-diagonal: negated number of games between the two teams
            q1 = (df.t1 == t1) & (df.t2 == t2)
            q2 = (df.t1 == t2) & (df.t2 == t1)
            q = q1 | q2
            return -df[q].shape[0]

    teams = get_teams(df)
    mat = [[get_games_played(t1, t2) for t2 in teams] for t1 in teams]
    mat = pd.DataFrame(mat, index=teams, columns=teams)
    return mat

def get_C(df):
    M = get_M(df)
    C = 2 * np.eye(M.shape[0], M.shape[1]) + M
    return C

def get_MTP(df):
    # split M into its diagonal T (games played) and P = T - M (pairwise game counts)
    M = get_M(df)
    teams = get_teams(df)
    T = pd.DataFrame(np.diag(pd.Series(np.diag(M))), index=teams, columns=teams)
    P = T - M
    return M, T, P

def get_massey_r(df):
    M = get_M(df)
    # M is singular (its rows sum to zero), so overwrite the last row with ones
    M.iloc[-1,:] = 1
    p = get_fap(df).differential
    # LinearRegression fits an intercept by default
    model = LinearRegression()
    model.fit(M, p)
    return pd.Series(model.coef_, index=M.index)

def get_colley_r(df):
    C = get_C(df)
    b = get_wlb(df).b
    # LinearRegression fits an intercept by default
    model = LinearRegression()
    model.fit(C, b)
    return pd.Series(model.coef_, index=C.index)

def get_rankings(df):
    return pd.DataFrame({c: df[c].sort_values(ascending=False).index for c in df.columns})
Compare \(M\) to \(C\).
[2]:
M = get_M(get_ncaaf())
M
[2]:
| | Duke | Miami | UNC | UVA | VT |
---|---|---|---|---|---|
Duke | 4 | -1 | -1 | -1 | -1 |
Miami | -1 | 4 | -1 | -1 | -1 |
UNC | -1 | -1 | 4 | -1 | -1 |
UVA | -1 | -1 | -1 | 4 | -1 |
VT | -1 | -1 | -1 | -1 | 4 |
[3]:
C = get_C(get_ncaaf())
C
[3]:
| | Duke | Miami | UNC | UVA | VT |
---|---|---|---|---|---|
Duke | 6.0 | -1.0 | -1.0 | -1.0 | -1.0 |
Miami | -1.0 | 6.0 | -1.0 | -1.0 | -1.0 |
UNC | -1.0 | -1.0 | 6.0 | -1.0 | -1.0 |
UVA | -1.0 | -1.0 | -1.0 | 6.0 | -1.0 |
VT | -1.0 | -1.0 | -1.0 | -1.0 | 6.0 |
Compare \(p\) to \(b\).
[4]:
p = get_fap(get_ncaaf())['differential']
p
[4]:
Duke -124
Miami 91
UNC -40
UVA -17
VT 90
Name: differential, dtype: int64
[5]:
b = get_wlb(get_ncaaf())['b']
b
[5]:
Duke -1.0
Miami 3.0
UNC 1.0
UVA 0.0
VT 2.0
Name: b, dtype: float64
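As a quick check of \(b\) against its definition: each team played 4 games (the diagonal of \(M\)), so, assuming no tied games, Miami’s \(b = 3\) corresponds to a 4-0 record and Duke’s \(b = -1\) to a 0-4 record:

\[ b_{\text{Miami}} = 1 + \tfrac{1}{2}(4 - 0) = 3, \qquad b_{\text{Duke}} = 1 + \tfrac{1}{2}(0 - 4) = -1. \]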
Compare the ratings from Massey’s and Colley’s Methods.
[6]:
get_massey_r(get_ncaaf()).sort_values(ascending=False)
[6]:
Miami 36.2
VT 36.0
UVA 14.6
UNC 10.0
Duke -6.8
dtype: float64
[7]:
get_colley_r(get_ncaaf()).sort_values(ascending=False)
[7]:
Miami 2.857143e-01
VT 1.428571e-01
UNC -2.775558e-17
UVA -1.428571e-01
Duke -2.857143e-01
dtype: float64
The rankings from Massey’s and Colley’s Methods differ only in the positions of UVA and UNC.
[8]:
pd.DataFrame({
    'massey': get_massey_r(get_ncaaf()).sort_values(ascending=False).index,
    'colley': get_colley_r(get_ncaaf()).sort_values(ascending=False).index
})
[8]:
| | massey | colley |
---|---|---|
0 | Miami | Miami |
1 | VT | VT |
2 | UVA | UNC |
3 | UNC | UVA |
4 | Duke | Duke |
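As a cross-check, the two ACC systems can also be solved directly with `np.linalg.solve` instead of `LinearRegression`. The sketch below hardcodes \(M\), \(p\), and \(b\) from the outputs above, and for Massey it assumes the common adjustment of replacing the last equation with the constraint that the ratings sum to zero (last row of \(M\) set to ones, last entry of \(p\) set to 0). Because `get_massey_r` and `get_colley_r` fit `LinearRegression` with its default intercept, the values they print are offset from the direct solutions by a constant on this data (+18 for Massey, -0.5 for Colley), so the rankings agree.

```python
import numpy as np
import pandas as pd

teams = ['Duke', 'Miami', 'UNC', 'UVA', 'VT']
n = len(teams)

# M, p and b copied from the outputs shown above for the ACC 2005 data.
M = 5 * np.eye(n) - np.ones((n, n))   # 4 on the diagonal, -1 elsewhere
p = np.array([-124, 91, -40, -17, 90], dtype=float)
b = np.array([-1, 3, 1, 0, 2], dtype=float)

# Massey: M is singular (rows sum to zero), so replace the last equation
# with the constraint sum(r) = 0.
M_adj = M.copy()
M_adj[-1, :] = 1
p_adj = p.copy()
p_adj[-1] = 0
massey = pd.Series(np.linalg.solve(M_adj, p_adj), index=teams)

# Colley: C = 2I + M is nonsingular and can be solved as is.
C = 2 * np.eye(n) + M
colley = pd.Series(np.linalg.solve(C, b), index=teams)

print(pd.DataFrame({'massey': massey, 'colley': colley}).round(4))
```

For Colley the offset is always a constant, since every row of \(C\) sums to 2. This sketch does not check whether the Massey offset stays constant for schedules that are not round robins.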
4.2. NBA, 2021
Let’s apply these methods to the NBA 2021 season up to Thanksgiving.
[9]:
def get_nba():
    x = pd.read_csv('./nba/2021.csv')\
        .rename(columns={
            'a_team': 't1',
            'h_team': 't2',
            'a_score': 's1',
            'h_score': 's2'})
    x = x[x.preseason == False]\
        .drop(columns=['preseason'])\
        .reset_index(drop=True)
    return x
Here are the ratings.
[10]:
pd.DataFrame({
    'massey': get_massey_r(get_nba()),
    'colley': get_colley_r(get_nba())
})
[10]:
| | massey | colley |
---|---|---|
76ers | 2.100543 | -0.004808 |
Bucks | 1.999356 | 0.037718 |
Bulls | 4.235154 | 0.086708 |
Cavaliers | 1.429301 | 0.051312 |
Celtics | 0.579776 | -0.015193 |
Clippers | 3.792290 | 0.073897 |
Grizzlies | -4.525625 | 0.011843 |
Hawks | 3.338752 | 0.038129 |
Heat | 6.488263 | 0.134159 |
Hornets | 1.227789 | 0.099810 |
Jazz | 8.549796 | 0.085264 |
Kings | 0.089514 | -0.082525 |
Knicks | -0.306992 | 0.009597 |
Lakers | -3.357641 | -0.049940 |
Magic | -9.818479 | -0.263918 |
Mavericks | -0.688728 | 0.042523 |
Nets | 4.254646 | 0.170953 |
Nuggets | 0.755934 | 0.004656 |
Pacers | 3.138730 | -0.064910 |
Pelicans | -4.916126 | -0.201009 |
Pistons | -7.552551 | -0.241295 |
Raptors | 1.055674 | -0.033273 |
Rockets | -8.788710 | -0.281985 |
Spurs | -3.433295 | -0.217864 |
Suns | 6.744632 | 0.285625 |
Thunder | -5.150785 | -0.155606 |
Timberwolves | 1.276979 | -0.016636 |
Trail Blazers | 2.456361 | 0.035358 |
Warriors | 12.899850 | 0.330721 |
Wizards | 1.180180 | 0.130689 |
These are the rankings. If we take point differential out of the picture, then the Jazz fall from second to eighth place!
[11]:
pd.DataFrame({
    'massey': get_massey_r(get_nba()).sort_values(ascending=False).index,
    'colley': get_colley_r(get_nba()).sort_values(ascending=False).index
})
[11]:
| | massey | colley |
---|---|---|
0 | Warriors | Warriors |
1 | Jazz | Suns |
2 | Suns | Nets |
3 | Heat | Heat |
4 | Nets | Wizards |
5 | Bulls | Hornets |
6 | Clippers | Bulls |
7 | Hawks | Jazz |
8 | Pacers | Clippers |
9 | Trail Blazers | Cavaliers |
10 | 76ers | Mavericks |
11 | Bucks | Hawks |
12 | Cavaliers | Bucks |
13 | Timberwolves | Trail Blazers |
14 | Hornets | Grizzlies |
15 | Wizards | Knicks |
16 | Raptors | Nuggets |
17 | Nuggets | 76ers |
18 | Celtics | Celtics |
19 | Kings | Timberwolves |
20 | Knicks | Raptors |
21 | Mavericks | Lakers |
22 | Lakers | Pacers |
23 | Spurs | Kings |
24 | Grizzlies | Thunder |
25 | Pelicans | Pelicans |
26 | Thunder | Spurs |
27 | Pistons | Pistons |
28 | Rockets | Magic |
29 | Magic | Rockets |
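One way to quantify moves like the Jazz’s is to convert both rating columns to ranks and subtract them. The sketch below does this for five teams whose ratings are copied from the NBA ratings table above; the `shift` column name is an assumption made for illustration, and the ranks are relative to this five-team subset, so they differ from the league-wide positions.

```python
import pandas as pd

# A few NBA ratings copied from the ratings table above (not the full league).
ratings = pd.DataFrame({
    'massey': [12.899850, 8.549796, 6.744632, 6.488263, 4.254646],
    'colley': [0.330721, 0.085264, 0.285625, 0.134159, 0.170953],
}, index=['Warriors', 'Jazz', 'Suns', 'Heat', 'Nets'])

# rank(ascending=False) assigns 1 to the highest rating in each column.
ranks = ratings.rank(ascending=False).astype(int)

# Positive shift: the team ranks lower (worse) under Colley than under Massey.
ranks['shift'] = ranks['colley'] - ranks['massey']
print(ranks.sort_values('shift', ascending=False))
```

Within this subset the Jazz drop three places when point differential is ignored, while the Nets gain two.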
4.3. NFL, 2021
Let’s apply these methods to the NFL 2021 season up to Thanksgiving.
[12]:
def get_nfl():
    x = pd.read_csv('./nfl/2021.csv')\
        .rename(columns={
            'team1': 't1',
            'team2': 't2',
            'score1': 's1',
            'score2': 's2'})\
        .drop(columns=['week'])
    x['t1'] = x['t1'].apply(lambda s: s.strip())
    x['t2'] = x['t2'].apply(lambda s: s.strip())
    return x
Here are the ratings.
[13]:
pd.DataFrame({
    'massey': get_massey_r(get_nfl()),
    'colley': get_colley_r(get_nfl())
})
[13]:
| | massey | colley |
---|---|---|
49ers | 1.565063 | 0.001040 |
Bears | -9.322540 | -0.108254 |
Bengals | -0.607977 | 0.046633 |
Bills | 8.425259 | 0.065165 |
Broncos | -2.431469 | -0.024019 |
Browns | -2.723871 | 0.042118 |
Buccaneers | 6.321022 | 0.144495 |
Cardinals | 8.235687 | 0.264187 |
Chargers | -0.551756 | 0.134031 |
Chiefs | 2.706593 | 0.170924 |
Colts | 3.115332 | 0.019958 |
Cowboys | 5.819441 | 0.133962 |
Dolphins | -8.305769 | -0.134007 |
Eagles | 1.696797 | -0.034021 |
Falcons | -12.409125 | -0.112008 |
Giants | -6.524430 | -0.136574 |
Jaguars | -10.338981 | -0.243478 |
Jets | -14.139647 | -0.240181 |
Lions | -12.281915 | -0.354927 |
Packers | 1.443518 | 0.183655 |
Panthers | -1.742287 | -0.063501 |
Patriots | 6.349719 | 0.070305 |
Raiders | -5.414601 | 0.056728 |
Rams | 2.472228 | 0.139103 |
Ravens | -1.608072 | 0.153518 |
Saints | 0.808172 | -0.028999 |
Seahawks | -0.880815 | -0.117927 |
Steelers | -3.707054 | 0.035267 |
Texans | -11.365255 | -0.221763 |
Titans | 2.584503 | 0.178996 |
Vikings | 0.737758 | 0.035146 |
Washington | -3.649900 | -0.055569 |
Here are the rankings. If we take point differential out of the picture, then the Bills fall from first place all the way to 11th!
[14]:
pd.DataFrame({
    'massey': get_massey_r(get_nfl()).sort_values(ascending=False).index,
    'colley': get_colley_r(get_nfl()).sort_values(ascending=False).index
})
[14]:
| | massey | colley |
---|---|---|
0 | Bills | Cardinals |
1 | Cardinals | Packers |
2 | Patriots | Titans |
3 | Buccaneers | Chiefs |
4 | Cowboys | Ravens |
5 | Colts | Buccaneers |
6 | Chiefs | Rams |
7 | Titans | Chargers |
8 | Rams | Cowboys |
9 | Eagles | Patriots |
10 | 49ers | Bills |
11 | Packers | Raiders |
12 | Saints | Bengals |
13 | Vikings | Browns |
14 | Chargers | Steelers |
15 | Bengals | Vikings |
16 | Seahawks | Colts |
17 | Ravens | 49ers |
18 | Panthers | Broncos |
19 | Broncos | Saints |
20 | Browns | Eagles |
21 | Washington | Washington |
22 | Steelers | Panthers |
23 | Raiders | Bears |
24 | Giants | Falcons |
25 | Dolphins | Seahawks |
26 | Bears | Dolphins |
27 | Jaguars | Giants |
28 | Texans | Texans |
29 | Lions | Jets |
30 | Falcons | Jaguars |
31 | Jets | Lions |
4.4. Movie ratings
Let’s look at sample movie rating data where 6 users rate 4 movies. Ratings range from 1 to 5, and a 0 indicates no rating.
[15]:
def get_movie():
    return pd.DataFrame([
        [5, 4, 3, 0],
        [5, 5, 3, 1],
        [0, 0, 0, 5],
        [0, 0, 2, 0],
        [4, 0, 0, 3],
        [1, 0, 0, 4]
    ], columns=[f'Movie {i}' for i in range(1, 5)], index=[f'User {i}' for i in range(1, 7)])

df = get_movie()
df
[15]:
| | Movie 1 | Movie 2 | Movie 3 | Movie 4 |
---|---|---|---|---|
User 1 | 5 | 4 | 3 | 0 |
User 2 | 5 | 5 | 3 | 1 |
User 3 | 0 | 0 | 0 | 5 |
User 4 | 0 | 0 | 2 | 0 |
User 5 | 4 | 0 | 0 | 3 |
User 6 | 1 | 0 | 0 | 4 |
We can flatten this matrix into pairwise “games”: for each user, every pair of movies that the user rated becomes one matchup, with the two ratings as the scores.
[16]:
from itertools import combinations, chain

# every unordered pair of movies
pairs = list(combinations(df.columns, 2))

def get_pairwise_ratings(r):
    # one "game" per pair of movies; keep only pairs where the user rated both
    ratings = [{'t1': m1, 't2': m2, 's1': r[m1], 's2': r[m2]} for m1, m2 in pairs]
    ratings = [d for d in ratings if d['s1'] > 0 and d['s2'] > 0]
    return ratings

movie_df = pd.DataFrame(chain(*[get_pairwise_ratings(r) for _, r in df.iterrows()]))
movie_df
[16]:
| | t1 | t2 | s1 | s2 |
---|---|---|---|---|
0 | Movie 1 | Movie 2 | 5 | 4 |
1 | Movie 1 | Movie 3 | 5 | 3 |
2 | Movie 2 | Movie 3 | 4 | 3 |
3 | Movie 1 | Movie 2 | 5 | 5 |
4 | Movie 1 | Movie 3 | 5 | 3 |
5 | Movie 1 | Movie 4 | 5 | 1 |
6 | Movie 2 | Movie 3 | 5 | 3 |
7 | Movie 2 | Movie 4 | 5 | 1 |
8 | Movie 3 | Movie 4 | 3 | 1 |
9 | Movie 1 | Movie 4 | 4 | 3 |
10 | Movie 1 | Movie 4 | 1 | 4 |
Here are the ratings.
[17]:
pd.DataFrame({
    'massey': get_massey_r(movie_df),
    'colley': get_colley_r(movie_df)
})
[17]:
| | massey | colley |
---|---|---|
Movie 1 | -0.634021 | 0.168605 |
Movie 2 | 0.084683 | 0.127261 |
Movie 3 | -1.486745 | -0.150517 |
Movie 4 | -3.649485 | -0.145349 |
Here are the rankings.
[18]:
pd.DataFrame({
    'massey': get_massey_r(movie_df).sort_values(ascending=False).index,
    'colley': get_colley_r(movie_df).sort_values(ascending=False).index
})
[18]:
| | massey | colley |
---|---|---|
0 | Movie 2 | Movie 1 |
1 | Movie 1 | Movie 2 |
2 | Movie 3 | Movie 4 |
3 | Movie 4 | Movie 3 |
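What if we removed ties? Only one pair is tied (User 2 rated both Movie 1 and Movie 2 a 5). Dropping it shifts the ratings slightly but leaves both rankings unchanged.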
[19]:
pd.DataFrame({
    'massey': get_massey_r(movie_df[movie_df.s1 != movie_df.s2]),
    'colley': get_colley_r(movie_df[movie_df.s1 != movie_df.s2])
})
[19]:
| | massey | colley |
---|---|---|
Movie 1 | -0.792651 | 0.173453 |
Movie 2 | 0.149606 | 0.121336 |
Movie 3 | -1.556430 | -0.150651 |
Movie 4 | -3.648294 | -0.144137 |
[20]:
pd.DataFrame({
    'massey': get_massey_r(movie_df[movie_df.s1 != movie_df.s2]).sort_values(ascending=False).index,
    'colley': get_colley_r(movie_df[movie_df.s1 != movie_df.s2]).sort_values(ascending=False).index
})
[20]:
| | massey | colley |
---|---|---|
0 | Movie 2 | Movie 1 |
1 | Movie 1 | Movie 2 |
2 | Movie 3 | Movie 4 |
3 | Movie 4 | Movie 3 |