4. Colley’s Method

Massey’s Method of rating is defined as

\(Mr = p\),

where,

\(M\) is an \(n \times n\) matrix,
- each \(M_{ii}\) is the number of games played by the i-th team
- each \(M_{ij}\) (\(i \neq j\)) is the negative of the number of games played by the i-th team against the j-th team
\(r\) is the vector of ratings we are trying to estimate, and
\(p\) is the vector of each team’s cumulative point differential across the games played.

Colley’s Method is another method of rating defined as

\(Cr = b\),

where,

\(C = 2I + M\), and
\(b = 1 + \frac{1}{2}(w - l)\), computed per team, where
- \(w\) is the vector of each team’s number of wins
- \(l\) is the vector of each team’s number of losses
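To make the definitions concrete, here is a minimal sketch that builds \(M\), \(C\) and \(b\) for a toy schedule and solves \(Cr = b\) directly. The three teams and their results are hypothetical, invented only for illustration, and `np.linalg.solve` is used in place of any particular solver.

```python
import numpy as np

# Hypothetical results for a 3-team round robin: (winner, loser).
games = [("A", "B"), ("A", "C"), ("B", "C")]
teams = sorted({t for g in games for t in g})
idx = {t: i for i, t in enumerate(teams)}
n = len(teams)

M = np.zeros((n, n))
w = np.zeros(n)  # wins per team
l = np.zeros(n)  # losses per team
for winner, loser in games:
    i, j = idx[winner], idx[loser]
    M[i, i] += 1  # diagonal: games played by each team
    M[j, j] += 1
    M[i, j] -= 1  # off-diagonal: negative of games between the pair
    M[j, i] -= 1
    w[i] += 1
    l[j] += 1

C = 2 * np.eye(n) + M
b = 1 + 0.5 * (w - l)
r = np.linalg.solve(C, b)

toy_ratings = {t: round(float(x), 3) for t, x in zip(teams, r)}
print(toy_ratings)  # {'A': 0.7, 'B': 0.5, 'C': 0.3}
```

The undefeated team ends up above 0.5, the winless team below it, and the team with an even record exactly at 0.5.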

The benefits of Colley’s Method are stated as follows.

Colley’s Method is bias-free in the sense that it uses only wins and losses, not point differential. Point differential can distort ratings: a dominant team may run up the score against a weak opponent, or may deliberately hold its score down out of sportsmanship.
Colley’s Method also rates the teams jointly; when one team’s rating improves, another’s must fall.
Colley’s Method is applicable when point differential is unavailable or undesirable.
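The claim that one team’s gain is another’s loss can be checked from the definitions. The short derivation below is a sketch that assumes every game is played between two of the \(n\) rated teams, so total wins equal total losses (a tie counts as neither). Here \(\mathbf{1}\) denotes the all-ones vector, a symbol introduced only for this derivation.

```latex
\begin{aligned}
\mathbf{1}^{T} M &= 0 \quad \text{(each column of } M \text{ sums to zero)} \\
\mathbf{1}^{T} C r &= \mathbf{1}^{T} (2I + M)\, r = 2 \sum_{i} r_i \\
\mathbf{1}^{T} b &= n + \tfrac{1}{2} \sum_{i} (w_i - l_i) = n \\
\Rightarrow \quad \sum_{i} r_i &= \tfrac{n}{2}
\end{aligned}
```

Under these assumptions the ratings that solve \(Cr = b\) always average \(\frac{1}{2}\), so any increase in one rating has to be offset elsewhere.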

4.1. NCAAF ACC 2005

Let’s look at the NCAAF ACC 2005 data.

[1]:

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

def get_ncaaf():
    return pd.read_csv('./ranking/acc-2005-ncaaf.csv')

def get_teams(df):
    return sorted(list(set(df.t1) | set(df.t2)))

def get_fap(df):
    def get_f(t):
        return df[df.t1 == t].s1.sum() + df[df.t2 == t].s2.sum()

    def get_a(t):
        return -df[df.t1 == t].s2.sum() + -df[df.t2 == t].s1.sum()

    teams = get_teams(df)
    x = pd.DataFrame([{'for': get_f(t), 'against': get_a(t)} for t in teams], index=teams)
    x['differential'] = x['for'] + x['against']
    return x

def get_wlb(df):
    def get_wins(t):
        w1 = df[(df.t1 == t) & (df.s1 > df.s2)].shape[0]
        w2 = df[(df.t2 == t) & (df.s2 > df.s1)].shape[0]
        return w1 + w2

    def get_losses(t):
        l1 = df[(df.t1 == t) & (df.s1 < df.s2)].shape[0]
        l2 = df[(df.t2 == t) & (df.s2 < df.s1)].shape[0]
        return l1 + l2

    teams = get_teams(df)

    x = pd.DataFrame({
        'w': [get_wins(t) for t in teams],
        'l': [get_losses(t) for t in teams]
    }, index=teams)
    x['b'] = 1 + 0.5 * (x.w - x.l)
    return x

def get_M(df):
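    def get_games_played(t1, t2):
        if t1 == t2:
            # diagonal: total number of games the team played (as t1 or t2)
            return df[(df.t1 == t1) | (df.t2 == t1)].shape[0]
        else:
            # off-diagonal: negative of the number of games between the two teams
            q1 = (df.t1 == t1) & (df.t2 == t2)
            q2 = (df.t1 == t2) & (df.t2 == t1)
            q = q1 | q2
            return -df[q].shape[0]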

    teams = get_teams(df)
    mat = [[get_games_played(t1, t2) for t2 in teams] for t1 in teams]
    mat = pd.DataFrame(mat, index=teams, columns=teams)
    return mat

def get_C(df):
    M = get_M(df)
    C = 2 * np.eye(M.shape[0], M.shape[1]) + M
    return C

def get_MTP(df):
    M = get_M(df)

    teams = get_teams(df)
    T = pd.DataFrame(np.diag(pd.Series(np.diag(M))), index=teams, columns=teams)
    P = T - M
    return M, T, P

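def get_massey_r(df):
    M = get_M(df)
    # M is singular (its rows sum to zero), so the last row is replaced with ones
    M.iloc[-1,:] = 1

    p = get_fap(df).differential

    # LinearRegression is used as a linear solver here; it also fits an intercept
    # and p is left unchanged, so the coefficients need not sum to zero
    model = LinearRegression()
    model.fit(M, p)

    return pd.Series(model.coef_, index=M.index)

def get_colley_r(df):
    C = get_C(df)
    b = get_wlb(df).b

    # the fitted intercept shifts the coefficients by a constant relative to the
    # solution of Cr = b (by -0.5 when total wins equal total losses);
    # rankings are unaffected
    model = LinearRegression()
    model.fit(C, b)

    return pd.Series(model.coef_, index=C.index)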

def get_rankings(df):
    return pd.DataFrame({c: df[c].sort_values(ascending=False).index for c in df.columns})

Compare \(M\) to \(C\).

[2]:

M = get_M(get_ncaaf())
M

[2]:

	Duke	Miami	UNC	UVA	VT
Duke	4	-1	-1	-1	-1
Miami	-1	4	-1	-1	-1
UNC	-1	-1	4	-1	-1
UVA	-1	-1	-1	4	-1
VT	-1	-1	-1	-1	4

[3]:

C = get_C(get_ncaaf())
C

[3]:

	Duke	Miami	UNC	UVA	VT
Duke	6.0	-1.0	-1.0	-1.0	-1.0
Miami	-1.0	6.0	-1.0	-1.0	-1.0
UNC	-1.0	-1.0	6.0	-1.0	-1.0
UVA	-1.0	-1.0	-1.0	6.0	-1.0
VT	-1.0	-1.0	-1.0	-1.0	6.0

Compare \(p\) to \(b\).

[4]:

p = get_fap(get_ncaaf())['differential']
p

[4]:

Duke    -124
Miami     91
UNC      -40
UVA      -17
VT        90
Name: differential, dtype: int64

[5]:

b = get_wlb(get_ncaaf())['b']
b

[5]:

Duke    -1.0
Miami    3.0
UNC      1.0
UVA      0.0
VT       2.0
Name: b, dtype: float64

Compare the ratings between Massey and Colley’s Methods.

[6]:

get_massey_r(get_ncaaf()).sort_values(ascending=False)

[6]:

Miami    36.2
VT       36.0
UVA      14.6
UNC      10.0
Duke     -6.8
dtype: float64

[7]:

get_colley_r(get_ncaaf()).sort_values(ascending=False)

[7]:

Miami    2.857143e-01
VT       1.428571e-01
UNC     -2.775558e-17
UVA     -1.428571e-01
Duke    -2.857143e-01
dtype: float64
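As a cross-check, the two systems can also be solved directly with `np.linalg.solve`, using the \(M\), \(C\), \(p\) and \(b\) values printed above. This is a sketch, not the notebook’s own method: for Massey it assumes the common convention of replacing the last row of \(M\) with ones and the last entry of \(p\) with 0, so that the ratings sum to zero.

```python
import numpy as np

teams = ["Duke", "Miami", "UNC", "UVA", "VT"]

# Values copied from the outputs above.
M = 5.0 * np.eye(5) - np.ones((5, 5))   # 4 on the diagonal, -1 elsewhere
C = 2.0 * np.eye(5) + M                 # 6 on the diagonal, -1 elsewhere
p = np.array([-124.0, 91.0, -40.0, -17.0, 90.0])
b = np.array([-1.0, 3.0, 1.0, 0.0, 2.0])

# Massey: M is singular, so force the ratings to sum to zero
# (assumed convention: last row of ones, last entry of p set to 0).
M_adj, p_adj = M.copy(), p.copy()
M_adj[-1, :] = 1.0
p_adj[-1] = 0.0
massey = np.linalg.solve(M_adj, p_adj)

# Colley: C is nonsingular, so it can be solved as is.
colley = np.linalg.solve(C, b)

for t, m, c in zip(teams, massey, colley):
    print(f"{t:6s} massey={m:6.1f} colley={c:.3f}")
```

For this data set the direct solutions differ from the outputs above only by constant offsets (18 for Massey, 0.5 for Colley), which is why the orderings agree.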

The Massey and Colley rankings differ only in the positions of UVA and UNC, which swap third and fourth place.

[8]:

pd.DataFrame({
    'massey': get_massey_r(get_ncaaf()).sort_values(ascending=False).index,
    'colley': get_colley_r(get_ncaaf()).sort_values(ascending=False).index
})

[8]:

	massey	colley
0	Miami	Miami
1	VT	VT
2	UVA	UNC
3	UNC	UVA
4	Duke	Duke
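For larger leagues it is easier to compute each team’s rank position than to compare two sorted columns by eye. The sketch below does this with `DataFrame.rank`, using the ACC ratings printed above (rounded). The `shift` column is a name introduced here for illustration.

```python
import pandas as pd

# Ratings copied (rounded) from the Massey and Colley outputs above.
ratings = pd.DataFrame({
    "massey": {"Duke": -6.8, "Miami": 36.2, "UNC": 10.0, "UVA": 14.6, "VT": 36.0},
    "colley": {"Duke": -0.2857, "Miami": 0.2857, "UNC": 0.0, "UVA": -0.1429, "VT": 0.1429},
})

# Rank 1 is the highest rating; there are no tied ratings in this data.
ranks = ratings.rank(ascending=False).astype(int)

# Positive shift: the team ranks lower under Colley than under Massey.
ranks["shift"] = ranks["colley"] - ranks["massey"]

moved = ranks[ranks["shift"] != 0]
print(moved)
```

Only UNC and UVA appear in `moved`, each shifting by one place.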

4.2. NBA, 2021

Let’s apply these methods to the NBA 2021 season up to Thanksgiving.

[9]:

def get_nba():
    x = pd.read_csv('./nba/2021.csv')\
        .rename(columns={
            'a_team': 't1',
            'h_team': 't2',
            'a_score': 's1',
            'h_score': 's2'})
    x = x[x.preseason == False]\
        .drop(columns=['preseason'])\
        .reset_index(drop=True)
    return x

Here are the ratings.

[10]:

pd.DataFrame({
    'massey': get_massey_r(get_nba()),
    'colley': get_colley_r(get_nba())
})

[10]:

	massey	colley
76ers	2.100543	-0.004808
Bucks	1.999356	0.037718
Bulls	4.235154	0.086708
Cavaliers	1.429301	0.051312
Celtics	0.579776	-0.015193
Clippers	3.792290	0.073897
Grizzlies	-4.525625	0.011843
Hawks	3.338752	0.038129
Heat	6.488263	0.134159
Hornets	1.227789	0.099810
Jazz	8.549796	0.085264
Kings	0.089514	-0.082525
Knicks	-0.306992	0.009597
Lakers	-3.357641	-0.049940
Magic	-9.818479	-0.263918
Mavericks	-0.688728	0.042523
Nets	4.254646	0.170953
Nuggets	0.755934	0.004656
Pacers	3.138730	-0.064910
Pelicans	-4.916126	-0.201009
Pistons	-7.552551	-0.241295
Raptors	1.055674	-0.033273
Rockets	-8.788710	-0.281985
Spurs	-3.433295	-0.217864
Suns	6.744632	0.285625
Thunder	-5.150785	-0.155606
Timberwolves	1.276979	-0.016636
Trail Blazers	2.456361	0.035358
Warriors	12.899850	0.330721
Wizards	1.180180	0.130689

These are the rankings. If we take point differential out of the picture, then the Jazz fall from second to eighth place!

[11]:

pd.DataFrame({
    'massey': get_massey_r(get_nba()).sort_values(ascending=False).index,
    'colley': get_colley_r(get_nba()).sort_values(ascending=False).index
})

[11]:

	massey	colley
0	Warriors	Warriors
1	Jazz	Suns
2	Suns	Nets
3	Heat	Heat
4	Nets	Wizards
5	Bulls	Hornets
6	Clippers	Bulls
7	Hawks	Jazz
8	Pacers	Clippers
9	Trail Blazers	Cavaliers
10	76ers	Mavericks
11	Bucks	Hawks
12	Cavaliers	Bucks
13	Timberwolves	Trail Blazers
14	Hornets	Grizzlies
15	Wizards	Knicks
16	Raptors	Nuggets
17	Nuggets	76ers
18	Celtics	Celtics
19	Kings	Timberwolves
20	Knicks	Raptors
21	Mavericks	Lakers
22	Lakers	Pacers
23	Spurs	Kings
24	Grizzlies	Thunder
25	Pelicans	Pelicans
26	Thunder	Spurs
27	Pistons	Pistons
28	Rockets	Magic
29	Magic	Rockets

4.3. NFL, 2021

Let’s apply these methods to the NFL 2021 season up to Thanksgiving.

[12]:

def get_nfl():
    x = pd.read_csv('./nfl/2021.csv')\
        .rename(columns={
            'team1': 't1',
            'team2': 't2',
            'score1': 's1',
            'score2': 's2'})\
        .drop(columns=['week'])
    x['t1'] = x['t1'].apply(lambda s: s.strip())
    x['t2'] = x['t2'].apply(lambda s: s.strip())

    return x

Here are the ratings.

[13]:

pd.DataFrame({
    'massey': get_massey_r(get_nfl()),
    'colley': get_colley_r(get_nfl())
})

[13]:

	massey	colley
49ers	1.565063	0.001040
Bears	-9.322540	-0.108254
Bengals	-0.607977	0.046633
Bills	8.425259	0.065165
Broncos	-2.431469	-0.024019
Browns	-2.723871	0.042118
Buccaneers	6.321022	0.144495
Cardinals	8.235687	0.264187
Chargers	-0.551756	0.134031
Chiefs	2.706593	0.170924
Colts	3.115332	0.019958
Cowboys	5.819441	0.133962
Dolphins	-8.305769	-0.134007
Eagles	1.696797	-0.034021
Falcons	-12.409125	-0.112008
Giants	-6.524430	-0.136574
Jaguars	-10.338981	-0.243478
Jets	-14.139647	-0.240181
Lions	-12.281915	-0.354927
Packers	1.443518	0.183655
Panthers	-1.742287	-0.063501
Patriots	6.349719	0.070305
Raiders	-5.414601	0.056728
Rams	2.472228	0.139103
Ravens	-1.608072	0.153518
Saints	0.808172	-0.028999
Seahawks	-0.880815	-0.117927
Steelers	-3.707054	0.035267
Texans	-11.365255	-0.221763
Titans	2.584503	0.178996
Vikings	0.737758	0.035146
Washington	-3.649900	-0.055569

Here are the rankings. If we take point differential out of the picture, the Bills fall from first place to 11th!

[14]:

pd.DataFrame({
    'massey': get_massey_r(get_nfl()).sort_values(ascending=False).index,
    'colley': get_colley_r(get_nfl()).sort_values(ascending=False).index
})

[14]:

	massey	colley
0	Bills	Cardinals
1	Cardinals	Packers
2	Patriots	Titans
3	Buccaneers	Chiefs
4	Cowboys	Ravens
5	Colts	Buccaneers
6	Chiefs	Rams
7	Titans	Chargers
8	Rams	Cowboys
9	Eagles	Patriots
10	49ers	Bills
11	Packers	Raiders
12	Saints	Bengals
13	Vikings	Browns
14	Chargers	Steelers
15	Bengals	Vikings
16	Seahawks	Colts
17	Ravens	49ers
18	Panthers	Broncos
19	Broncos	Saints
20	Browns	Eagles
21	Washington	Washington
22	Steelers	Panthers
23	Raiders	Bears
24	Giants	Falcons
25	Dolphins	Seahawks
26	Bears	Dolphins
27	Jaguars	Giants
28	Texans	Texans
29	Lions	Jets
30	Falcons	Jaguars
31	Jets	Lions

4.4. Movie ratings

Let’s look at sample movie rating data where there are 6 users rating 4 movies. The ratings are from \([1, 5]\) and a 0 indicates no rating.

[15]:

def get_movie():
    return pd.DataFrame([
        [5, 4, 3, 0],
        [5, 5, 3, 1],
        [0, 0, 0, 5],
        [0, 0, 2, 0],
        [4, 0, 0, 3],
        [1, 0, 0, 4]
    ], columns=[f'Movie {i}' for i in range(1, 5)], index=[f'User {i}' for i in range(1, 7)])

df = get_movie()
df

[15]:

	Movie 1	Movie 2	Movie 3	Movie 4
User 1	5	4	3	0
User 2	5	5	3	1
User 3	0	0	0	5
User 4	0	0	2	0
User 5	4	0	0	3
User 6	1	0	0	4

We can flatten this matrix by taking every pair of movies rated by the same user and treating the pair as a game between the two movies, with the ratings as scores.

[16]:

from itertools import combinations, chain

def get_pairwise_ratings(r, pairs):
    # one "game" per pair of movies; r is one user's row of ratings
    ratings = [{'t1': m1, 't2': m2, 's1': r[m1], 's2': r[m2]} for m1, m2 in pairs]
    # keep only pairs where the user rated both movies (0 means no rating)
    ratings = [d for d in ratings if d['s1'] > 0 and d['s2'] > 0]
    return ratings

pairs = list(combinations(df.columns, 2))
movie_df = pd.DataFrame(chain(*[get_pairwise_ratings(r, pairs) for _, r in df.iterrows()]))
movie_df

[16]:

	t1	t2	s1	s2
0	Movie 1	Movie 2	5	4
1	Movie 1	Movie 3	5	3
2	Movie 2	Movie 3	4	3
3	Movie 1	Movie 2	5	5
4	Movie 1	Movie 3	5	3
5	Movie 1	Movie 4	5	1
6	Movie 2	Movie 3	5	3
7	Movie 2	Movie 4	5	1
8	Movie 3	Movie 4	3	1
9	Movie 1	Movie 4	4	3
10	Movie 1	Movie 4	1	4
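To see how these pairs act as games, the sketch below tallies wins and losses per movie from the table above and forms Colley’s \(b\). The rows are copied from that table. The names `pairs_df` and `tally` are introduced only for this illustration.

```python
import pandas as pd

# Rows copied from the pairwise table above: (t1, t2, s1, s2).
rows = [
    ("Movie 1", "Movie 2", 5, 4), ("Movie 1", "Movie 3", 5, 3),
    ("Movie 2", "Movie 3", 4, 3), ("Movie 1", "Movie 2", 5, 5),
    ("Movie 1", "Movie 3", 5, 3), ("Movie 1", "Movie 4", 5, 1),
    ("Movie 2", "Movie 3", 5, 3), ("Movie 2", "Movie 4", 5, 1),
    ("Movie 3", "Movie 4", 3, 1), ("Movie 1", "Movie 4", 4, 3),
    ("Movie 1", "Movie 4", 1, 4),
]
pairs_df = pd.DataFrame(rows, columns=["t1", "t2", "s1", "s2"])

# A pair is a win for the higher-rated movie; equal ratings count as neither.
t1_wins = pairs_df.s1 > pairs_df.s2
t2_wins = pairs_df.s2 > pairs_df.s1
winners = pd.concat([pairs_df.t1[t1_wins], pairs_df.t2[t2_wins]])
losers = pd.concat([pairs_df.t2[t1_wins], pairs_df.t1[t2_wins]])

movies = sorted(set(pairs_df.t1) | set(pairs_df.t2))
tally = pd.DataFrame({
    "w": winners.value_counts().reindex(movies, fill_value=0),
    "l": losers.value_counts().reindex(movies, fill_value=0),
})
tally["b"] = 1 + 0.5 * (tally.w - tally.l)
print(tally)
```

Movie 1 wins five of its seven comparisons and Movie 2 three of its five, while Movies 3 and 4 each win one and lose four, so Colley’s \(b\) gives the latter two the same value.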

Here are the ratings.

[17]:

pd.DataFrame({
    'massey': get_massey_r(movie_df),
    'colley': get_colley_r(movie_df)
})

[17]:

	massey	colley
Movie 1	-0.634021	0.168605
Movie 2	0.084683	0.127261
Movie 3	-1.486745	-0.150517
Movie 4	-3.649485	-0.145349

Here are the rankings.

[18]:

pd.DataFrame({
    'massey': get_massey_r(movie_df).sort_values(ascending=False).index,
    'colley': get_colley_r(movie_df).sort_values(ascending=False).index
})

[18]:

	massey	colley
0	Movie 2	Movie 1
1	Movie 1	Movie 2
2	Movie 3	Movie 4
3	Movie 4	Movie 3

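What if we removed ties? Dropping the single tied pair (Movie 1 vs. Movie 2, both rated 5) shifts the ratings slightly but leaves both rankings unchanged.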

[19]:

pd.DataFrame({
    'massey': get_massey_r(movie_df[movie_df.s1 != movie_df.s2]),
    'colley': get_colley_r(movie_df[movie_df.s1 != movie_df.s2])
})

[19]:

	massey	colley
Movie 1	-0.792651	0.173453
Movie 2	0.149606	0.121336
Movie 3	-1.556430	-0.150651
Movie 4	-3.648294	-0.144137

[20]:

pd.DataFrame({
    'massey': get_massey_r(movie_df[movie_df.s1 != movie_df.s2]).sort_values(ascending=False).index,
    'colley': get_colley_r(movie_df[movie_df.s1 != movie_df.s2]).sort_values(ascending=False).index
})

[20]:

	massey	colley
0	Movie 2	Movie 1
1	Movie 1	Movie 2
2	Movie 3	Movie 4
3	Movie 4	Movie 3