6. Markov Method

The Markov Method is a way of producing a ranking through Markov chains. The main idea starts with the probability matrix or transition matrix, which is square (\(n\) x \(n\)), where \(n\) is the number of players/teams that we are trying to rank. The elements of this matrix starts off as votes, where each vote \(v_{ij}\) is the i-th team voting in favor of the j-th team. For example, \(v_{ij}\) can be

the number of times the i-th team has lost to the j-th team,
the number of points the i-th team allowed the j-th team to score,
the number of yards the i-th team allowed the j-th team,
the number of turnovers the i-th team commited when playing the j-th team, and
the total possession time the i-th team allowed the j-th team.

As you can see, \(v_{ij}\) is a negative statistics against the i-th team when playing the j-th team. After this voting matrix is created, then it needs to be normalized into probabilities such that each row sums to 1. Where a row sums to 0, then each element in that row should be filled with equal probabilities. Denote this square matrix of probabilities as \(S\).

There must be a vector \(r\) such that \(Sr = r\), and such vector is the vector of ratings, also called the stationary probability. To find \(r\), we can

conduct sampling on \(S\),
apply the Power Method on \(S\), or
decompose \(S\) to find the eigenvalues and eigenvectors.

If we apply the Power Method (e.g. \(S^q\)), then the diagonal of the resulting matrix is \(r\). Additionally, if we find the eigenvalues and eigenvectors, then the eigenvector whose eigenvalue is closest to 1 is \(r\). Let’s see how the Markov Method can be used to induce a ratings and rankings. We will find the ratings and rankings based on finding the eigenvalues and eigenvectors.

6.1. NCAAF ACC 2005

The data comes from the ACC (NCAAF) during the 2005 season. We can transform the data into a \(n\) x \(n\) matrix easily, where each element can be points, yards, turnovers or possessions.

[1]:

import pandas as pd
import numpy as np

def get_teams(df):
    return sorted(list(set(df.t1) | set(df.t2)))

def get_stats(df, f1='s1', f2='s2'):
    def get_scores(t1, t2):
        x = df[(df.t1 == t1) & (df.t2 == t2)]
        if x.shape[0] > 0:
            return x.iloc[0][f1], x.iloc[0][f2]

        x = df[(df.t1 == t2) & (df.t2 == t1)]
        if x.shape[0] > 0:
            return x.iloc[0][f2], x.iloc[0][f1]

        return np.nan

    teams = get_teams(df)
    return pd.DataFrame([[get_scores(t1, t2) for t2 in teams] for t1 in teams], index=teams, columns=teams)

fpath = './ranking/acc-2005-ncaaf.csv'
df = get_stats(pd.read_csv(fpath))
df

[1]:

	Duke	Miami	UNC	UVA	VT
Duke	NaN	(7, 52)	(21, 24)	(7, 38)	(0, 45)
Miami	(52, 7)	NaN	(34, 16)	(25, 17)	(27, 7)
UNC	(24, 21)	(16, 34)	NaN	(7, 5)	(3, 30)
UVA	(38, 7)	(17, 25)	(5, 7)	NaN	(14, 52)
VT	(45, 0)	(7, 27)	(30, 3)	(52, 14)	NaN

[2]:

def adjust(df):
    for r, s in enumerate(df.sum(axis=1)):
        if s == 0:
            df.iloc[r,:] = 1
    return df

def normalize(df):
    return pd.DataFrame([df.iloc[r,:] * s
                         for r, s in enumerate(1 / df.sum(axis=1))],
                        index=df.index, columns=df.columns)

def vote_by_loss(df):
    def get_vote(v):
        if pd.isna(v):
            return 0
        s1, s2 = v
        return 1 if s1 < s2 else 0

    return pd.DataFrame([[get_vote(r[c]) for c in df.columns]
                         for _, r in df.iterrows()],
                        index=df.index, columns=df.columns)

def vote_by_point_differential(df):
    def get_point(v):
        if pd.isna(v):
            return 0
        s1, s2 = v
        diff = s2 - s1
        return 0 if diff < 0 else diff

    return pd.DataFrame([[get_point(r[c]) for c in df.columns]
                         for _, r in df.iterrows()],
                        index=df.index, columns=df.columns)

def vote_by_opponent_stats(df):
    def get_stats(v):
        if pd.isna(v):
            return 0
        _, s2 = v
        return s2

    return pd.DataFrame([[get_stats(r[c]) for c in df.columns]
                         for _, r in df.iterrows()],
                        index=df.index, columns=df.columns)

def vote_by_team_stats(df):
    def get_stats(v):
        if pd.isna(v):
            return 0
        s1, _ = v
        return s1

    return pd.DataFrame([[get_stats(r[c]) for c in df.columns]
                         for _, r in df.iterrows()],
                        index=df.index, columns=df.columns)

def get_stationary_p(X):
    S, U = np.linalg.eig(X.T)
    r = (U[:,np.isclose(S, 1)][:,0] / U[:,np.isclose(S, 1)][:,0].sum()).real
    return pd.Series(r, index=X.index)

Here is the ranking based on losses.

[3]:

get_stationary_p(normalize(adjust(vote_by_loss(df)))).sort_values(ascending=False)

[3]:

Miami    0.437956
VT       0.218978
UNC      0.145985
UVA      0.109489
Duke     0.087591
dtype: float64

Here is the ranking based on point differential.

[4]:

get_stationary_p(normalize(adjust(vote_by_point_differential(df)))).sort_values(ascending=False)

[4]:

Miami    0.441519
VT       0.264757
UVA      0.110380
UNC      0.095039
Duke     0.088304
dtype: float64

[5]:

S_p = normalize(adjust(vote_by_opponent_stats(df)))

S_y = normalize(
        adjust(
            vote_by_opponent_stats(
                get_stats(pd.read_csv(fpath), f1='y1', f2='y2'))))

S_t = normalize(
        adjust(
            vote_by_team_stats(
                get_stats(pd.read_csv(fpath), f1='to1', f2='to2'))))

S_poss = normalize(
        adjust(
            vote_by_opponent_stats(
                get_stats(pd.read_csv(fpath), f1='p1', f2='p2'))))

Here is the ranking based on points given up.

[6]:

get_stationary_p(S_p).sort_values(ascending=False)

[6]:

Miami    0.296301
VT       0.243967
UVA      0.215845
UNC      0.148504
Duke     0.095384
dtype: float64

Here is the ranking by yards given up.

[7]:

get_stationary_p(S_y).sort_values(ascending=False)

[7]:

UVA      0.259757
Miami    0.248631
VT       0.217161
UNC      0.169855
Duke     0.104596
dtype: float64

Here is the ranking by turnovers.

[8]:

get_stationary_p(S_t).sort_values(ascending=False)

[8]:

Miami    0.241413
VT       0.232275
UNC      0.212132
Duke     0.189286
UVA      0.124894
dtype: float64

Here is the ranking by possesion time of the opponent.

[9]:

get_stationary_p(S_poss).sort_values(ascending=False)

[9]:

UVA      0.211133
Miami    0.204502
Duke     0.199181
VT       0.193825
UNC      0.191358
dtype: float64

6.2. Advanced modeling

We can combine all the stochastic matrices above and then induce ratings and a ranking on the combined matrix. In this case, the final stochastic matrix \(S\) is defined as follows,

\(S = \alpha_1 S_p + \alpha_2 S_y + \alpha_3 S_t + \alpha_4 S_{\mathrm{poss}}\),

where

\(\sum \alpha_i = 1\),
\(S_p\) is the matrix based on points given up,
\(S_y\) is the matrix based on yards given up,
\(S_t\) is the matrix on turnovers commited, and
\(S_{\mathrm{poss}}\) is the matrix on possession time allowed by the opponent.

Each \(a_i\) must be given some thoughts, but in this case, we set each \(a_i = \dfrac{1}{4}\) (equal weight is given to all matrices).

[10]:

a = 1 / 4
S = (a * S_p) + (a * S_y) + (a * S_t) + (a * S_poss)
S

[10]:

	Duke	Miami	UNC	UVA	VT
Duke	0.000000	0.246439	0.197926	0.234021	0.321614
Miami	0.211919	0.000000	0.298939	0.305075	0.184067
UNC	0.238747	0.347531	0.000000	0.159557	0.254165
UVA	0.167831	0.222127	0.176976	0.000000	0.433066
VT	0.094886	0.442535	0.185634	0.276945	0.000000

[11]:

get_stationary_p(S).sort_values(ascending=False)

[11]:

Miami    0.243905
VT       0.225703
UVA      0.200676
UNC      0.179965
Duke     0.149750
dtype: float64