13. Estimating Conditional Probabilities
For two variables \(X\) and \(Y\), the conditional probability of \(Y\) given \(X\) is written as follows.
\(P(Y=y|X=x)\)
To compute \(P(Y=y|X=x)\), we divide the joint distribution of \(X\) and \(Y\) by the marginal distribution of \(X\) as follows,
\(P(Y=y|X=x) = \dfrac{P(X=x, Y=y)}{P(X=x)}\),
where
\(P(X=x, Y=y) = \dfrac{N_{xy}}{N}\),
\(P(X=x) = \dfrac{N_x}{N}\),
\(N_{xy}\) is the count of how many times \(X=x, Y=y\) has occurred,
\(N_x\) is the count of how many times \(X=x\) has occurred, and
\(N\) is the total number of records.
Using substitution, we can rewrite \(P(Y=y|X=x)\) in terms of counts as follows.
\(P(Y=y|X=x) = \dfrac{N_{xy}}{N_x}\)
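As a quick illustration, here is a minimal sketch of the count-based estimate using pandas; the tiny toy DataFrame is hypothetical and only serves to show the arithmetic.
import pandas as pd
# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'x': [0, 0, 1, 1, 1, 0],
    'y': [0, 1, 1, 1, 0, 0]
})
n_xy = ((toy['x'] == 1) & (toy['y'] == 1)).sum()  # N_xy: number of records with X=1, Y=1
n_x = (toy['x'] == 1).sum()                       # N_x: number of records with X=1
p_y_given_x = n_xy / n_x                          # P(Y=1|X=1) = N_xy / N_x
print(p_y_given_x)                                # 2/3 for this toy data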
In Bayesian Belief Networks (BBNs), the parameters of the model are defined by a joint distribution \(P(U) = P(X_1, X_2, \ldots, X_n)\). Due to the Markov condition, which states that a node is conditionally independent of its nondescendants given its parents, and using the chain rule, we can decompose \(P(U)\) as follows,
\(P(U) = \prod P(X|\pi_X)\),
where
\(\pi_X\) are the parents of \(X\).
\(P(U)\) is effectively decomposed into “local probability models.” Typically, these local probability models are sets of conditional probabilities called conditional probability tables (CPTs). One way to learn a CPT is to simply issue as many queries as required to compute the corresponding joint and marginal probabilities. However, this approach may be intractable; for a binary node with 5 parents (also binary), it would require \(2^5 = 32\) queries, one per instantiation of the parents. In general, computing a full CPT for a binary node with \(n\) binary parents requires \(2^n\) (exponential) queries. We need another, less computationally expensive way to estimate the parameters of the local probability models.
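To make the exponential blow-up concrete, the following minimal sketch (the parent names are made up) enumerates every instantiation of 5 binary parents; each instantiation would correspond to one query.
import itertools
# hypothetical binary parents of a binary node
parents = ['p_0', 'p_1', 'p_2', 'p_3', 'p_4']
# one query per instantiation of the parents
instantiations = list(itertools.product([0, 1], repeat=len(parents)))
print(len(instantiations))  # 2**5 = 32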
In this notebook, we show how to use logistic regression and random forest to estimate CPTs. When using gradient descent for logistic regression, the running time complexity is close to \(O(n^2)\). For random forest, the running time complexity is close to \(O(n \log(n))\). Clearly, \(O(n \log(n)) \leq O(n^2) \leq O(k^n)\).
13.1. Binary variables
Let’s assume 3 binary variables \(X_0\), \(X_1\) and \(Y\) with the structure \(X_0 \rightarrow Y \leftarrow X_1\). The local probability models are binomial distributions parameterized as follows.
\(X_0 \sim \mathcal{B}(1, 0.1)\)
\(X_1 \sim \mathcal{B}(1, 0.5)\)
\(Y \sim \mathcal{B}(1, \dfrac{1}{1 + e^{-(0.1 + 2.3 X_0 - 3.4 X_1)}})\)
For all instantiations of \(X_0\), \(X_1\) and \(Y\), we want to estimate from the data \(P(Y=y | X_0=x_0, X_1=x_1)\).
[1]:
import pandas as pd
import numpy as np
from scipy.stats import binom
np.random.seed(37)
N = 1_000
x_0 = np.random.binomial(1, 0.1, N)
x_1 = np.random.binomial(1, 0.5, N)
y = binom.rvs(1, 1 / (1 + np.exp(-(0.1 + 2.3 * x_0 - 3.4 * x_1))))
Xy = pd.DataFrame({
    'x_0': x_0,
    'x_1': x_1,
    'y': y
})
X, y = Xy[['x_0', 'x_1']], Xy['y']
X.shape, y.shape
[1]:
((1000, 2), (1000,))
Now we will learn logistic regression and random forest models.
[2]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
M = {
    'L': LogisticRegression(solver='saga', random_state=37, n_jobs=-1).fit(X, y),
    'R': RandomForestClassifier(random_state=37, n_jobs=-1, n_estimators=20).fit(X, y)
}
[3]:
def get_conditional_probs(m):
    # empirical P(Y=y | X_0=x_0, X_1=x_1) computed from counts in the data
    def p(x_0, x_1, y):
        n = (Xy['x_0']==x_0) & (Xy['x_1']==x_1) & (Xy['y']==y)
        d = (Xy['x_0']==x_0) & (Xy['x_1']==x_1)
        return Xy[n].shape[0] / Xy[d].shape[0]

    # conditional probabilities estimated by the model for each instantiation of the parents
    m_pred = pd.DataFrame({'x_0': [0, 0, 1, 1], 'x_1': [0, 1, 0, 1]}) \
        .assign(
            p=lambda d: [r for r in m.predict_proba(d)],
            m_p_0=lambda d: d.apply(lambda r: r['p'][0], axis=1),
            m_p_1=lambda d: d.apply(lambda r: r['p'][1], axis=1),
            y_0=0,
            y_1=1
        ) \
        .drop(columns=['p']) \
        [['x_0', 'x_1', 'y_0', 'y_1', 'm_p_0', 'm_p_1']] \
        .set_index(['x_0', 'x_1', 'y_0', 'y_1'])

    # conditional probabilities estimated directly from the data
    d_pred = pd.DataFrame({
            'x_0': [0, 0, 1, 1],
            'x_1': [0, 1, 0, 1],
            'y_0': [0, 0, 0, 0],
            'y_1': [1, 1, 1, 1]
        }) \
        .assign(
            d_p_0=lambda d: d.apply(lambda r: p(r.x_0, r.x_1, r.y_0), axis=1),
            d_p_1=lambda d: d.apply(lambda r: p(r.x_0, r.x_1, r.y_1), axis=1)
        ) \
        .set_index(['x_0', 'x_1', 'y_0', 'y_1'])

    return m_pred.join(d_pred)[['m_p_0', 'd_p_0', 'm_p_1', 'd_p_1']]
This table compares the conditional probabilities estimated by the logistic regression model with the empirical conditional probabilities computed from the data. The columns are denoted as follows.
m_p_0: the model’s estimate of \(P(Y=0 | X_0=x_0, X_1=x_1)\)
m_p_1: the model’s estimate of \(P(Y=1 | X_0=x_0, X_1=x_1)\)
d_p_0: the data (empirical) estimate of \(P(Y=0 | X_0=x_0, X_1=x_1)\)
d_p_1: the data (empirical) estimate of \(P(Y=1 | X_0=x_0, X_1=x_1)\)
[4]:
get_conditional_probs(M['L'])
[4]:
| x_0 | x_1 | y_0 | y_1 | m_p_0    | d_p_0    | m_p_1    | d_p_1    |
|-----|-----|-----|-----|----------|----------|----------|----------|
| 0   | 0   | 0   | 1   | 0.451874 | 0.448661 | 0.548126 | 0.551339 |
| 0   | 1   | 0   | 1   | 0.936365 | 0.944700 | 0.063635 | 0.055300 |
| 1   | 0   | 0   | 1   | 0.092490 | 0.068966 | 0.907510 | 0.931034 |
| 1   | 1   | 0   | 1   | 0.645277 | 0.633333 | 0.354723 | 0.366667 |
This next table shows the CPT learned by the random forest model.
[5]:
get_conditional_probs(M['R'])
[5]:
| x_0 | x_1 | y_0 | y_1 | m_p_0    | d_p_0    | m_p_1    | d_p_1    |
|-----|-----|-----|-----|----------|----------|----------|----------|
| 0   | 0   | 0   | 1   | 0.443748 | 0.448661 | 0.556252 | 0.551339 |
| 0   | 1   | 0   | 1   | 0.943741 | 0.944700 | 0.056259 | 0.055300 |
| 1   | 0   | 0   | 1   | 0.070870 | 0.068966 | 0.929130 | 0.931034 |
| 1   | 1   | 0   | 1   | 0.627743 | 0.633333 | 0.372257 | 0.366667 |
Both models do very well at recovering the local probability model for \(Y\).
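Since the data was simulated, we can also compare these recovered CPTs against the true conditional probabilities implied by the generating model for \(Y\). The following is a minimal sketch that plugs each instantiation of \(X_0\) and \(X_1\) into the logistic function with the simulation’s coefficients \(0.1 + 2.3 X_0 - 3.4 X_1\).
import itertools
# true P(Y=0|x_0, x_1) and P(Y=1|x_0, x_1) from the generating model used in the simulation
for x_0, x_1 in itertools.product([0, 1], [0, 1]):
    p_1 = 1 / (1 + np.exp(-(0.1 + 2.3 * x_0 - 3.4 * x_1)))
    print(x_0, x_1, 1 - p_1, p_1)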
13.2. Multinomial variables
Let’s figure out how to estimate CPTs beyond binary variables, for variables that may take 3 or more values (multinomial variables).
13.2.1. One-Hot Encoding
Let’s assume 3 variables \(X_0\), \(X_1\) and \(Y\) with the structure \(X_0 \rightarrow Y \leftarrow X_1\). Note that \(X_0\) and \(X_1\) each take 3 values, \(a, b, c\), and \(Y\) is binary. The local probability models for \(X_0\) and \(X_1\) are multinomial and the one for \(Y\) is binomial; they are parameterized and simulated as follows.
\(X_0 \sim \mathcal{M}(1, [0.25, 0.35, 0.4])\)
\(X_1 \sim \mathcal{M}(1, [0.35, 0.4, 0.25])\)
\(Y \sim \mathcal{B}(1, \dfrac{1}{1 + e^{-(0.5 - x_0^a + 0.5 x_0^b + 0.37 x_0^c - 0.4 x_1^a + 0.8 x_1^b + 0.1 x_1^c)}})\)
[6]:
x_0 = np.random.multinomial(1, [0.25, 0.35, 0.4], N)
x_1 = np.random.multinomial(1, [0.35, 0.4, 0.25], N)
y = binom.rvs(1, 1 / (1 + np.exp(-(0.5 + np.hstack([x_0, x_1]).dot(np.array([-1, 0.5, 0.37, -0.4, 0.8, 0.1]))))))
x_0 = pd.Series(np.argmax(x_0, axis=1)).map({0: 'a', 1: 'b', 2: 'c'})
x_1 = pd.Series(np.argmax(x_1, axis=1)).map({0: 'a', 1: 'b', 2: 'c'})
Xy = pd.DataFrame({
    'x_0': x_0,
    'x_1': x_1,
    'y': y
})
X, y = Xy[['x_0', 'x_1']], Xy['y']
X.shape, y.shape
[6]:
((1000, 2), (1000,))
A typical way to handle categorical variables with 3 or more values is to one-hot encode (OHE) them. The pd.get_dummies()
function is used to OHE \(X_0\) and \(X_1\). We can learn a random forest model and see whether it makes sense to estimate a CPT for \(Y\) this way.
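Before fitting, it may help to peek at what the encoded features look like; this one-liner is just an illustrative sketch.
# each of x_0 and x_1 is expanded into three indicator (0/1) columns: x_0_a, x_0_b, x_0_c, x_1_a, x_1_b, x_1_c
pd.get_dummies(X).head()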
[7]:
m = RandomForestClassifier(random_state=37, n_jobs=-1, n_estimators=20)
m.fit(pd.get_dummies(X), y)
[7]:
RandomForestClassifier(n_estimators=20, n_jobs=-1, random_state=37)
We immediately run into trouble using OHE and a random forest to compute the CPT for \(Y\). As you can see from the table below, it is hard to translate some of the rows back into conditional probabilities. For example, how do we interpret \(P(Y=0|X_0^a=0, X_0^b=0, X_0^c=0)\)? When \(X_0^a=0, X_0^b=0, X_0^c=0\), this instantiation means \(X_0\) is not even one of the 3 values \(a, b, c\)! Let’s abandon this route.
[8]:
import itertools
_X = pd.DataFrame(list(itertools.product(*[[0, 1] for _ in range(6)])), columns=pd.get_dummies(X).columns)
_y = m.predict_proba(_X)
_X.assign(y_0=_y[:,0], y_1=_y[:,1]).head(10)
[8]:
|   | x_0_a | x_0_b | x_0_c | x_1_a | x_1_b | x_1_c | y_0      | y_1      |
|---|-------|-------|-------|-------|-------|-------|----------|----------|
| 0 | 0     | 0     | 0     | 0     | 0     | 0     | 0.368178 | 0.631822 |
| 1 | 0     | 0     | 0     | 0     | 0     | 1     | 0.380901 | 0.619099 |
| 2 | 0     | 0     | 0     | 0     | 1     | 0     | 0.240887 | 0.759113 |
| 3 | 0     | 0     | 0     | 0     | 1     | 1     | 0.275771 | 0.724229 |
| 4 | 0     | 0     | 0     | 1     | 0     | 0     | 0.518981 | 0.481019 |
| 5 | 0     | 0     | 0     | 1     | 0     | 1     | 0.477193 | 0.522807 |
| 6 | 0     | 0     | 0     | 1     | 1     | 0     | 0.445838 | 0.554162 |
| 7 | 0     | 0     | 0     | 1     | 1     | 1     | 0.426210 | 0.573790 |
| 8 | 0     | 0     | 1     | 0     | 0     | 0     | 0.300497 | 0.699503 |
| 9 | 0     | 0     | 1     | 0     | 0     | 1     | 0.302161 | 0.697839 |
13.2.2. Frequency Encoding
A promising way to encode categorical values as numeric ones is frequency encoding. In frequency encoding, we represent each value by its relative frequency in the data. You can see below that the values of \(X_0\) are frequency encoded as follows.
a: 0.250
b: 0.358
c: 0.392
Likewise, the values for \(X_1\) are frequency encoded as follows.
a: 0.375
b: 0.382
c: 0.243
Just be careful about what to do if two values have the same frequency, since they would then be encoded to the same number! But, as you will see, we have a solution for this problem a bit later.
[9]:
enc_0 = (X['x_0'].value_counts().sort_index() / X.shape[0]).to_dict()
enc_1 = (X['x_1'].value_counts().sort_index() / X.shape[0]).to_dict()
dec_0 = {v: k for k, v in enc_0.items()}
dec_1 = {v: k for k, v in enc_1.items()}
enc_0, enc_1
[9]:
({'a': 0.25, 'b': 0.358, 'c': 0.392}, {'a': 0.375, 'b': 0.382, 'c': 0.243})
The data Xy is now frequency encoded to Ab.
[10]:
A = X.assign(x_0=lambda d: d['x_0'].map(enc_0), x_1=lambda d: d['x_1'].map(enc_1))
b = y
Ab = A.assign(y=b)
A random forest model is learned from Ab.
[11]:
m = RandomForestClassifier(random_state=37, n_jobs=-1, n_estimators=20)
m.fit(A, b)
[11]:
RandomForestClassifier(n_estimators=20, n_jobs=-1, random_state=37)
You can see from the table below that frequency encoding definitely helps the random forest model to recover the CPT!
[12]:
def get_conditional_probs(m):
    # empirical P(Y=y | X_0=x_0, X_1=x_1) from counts over the encoded data, guarding against empty cells
    def p(x_0, x_1, y):
        n = (Ab['x_0']==x_0) & (Ab['x_1']==x_1) & (Ab['y']==y)
        d = (Ab['x_0']==x_0) & (Ab['x_1']==x_1)
        n = Ab[n].shape[0]
        d = Ab[d].shape[0]
        return 0 if d == 0 else n / d

    # model-estimated conditional probabilities for every combination of encoded values
    m_pred = pd.DataFrame(list(itertools.product(*[list(dec_0.keys()), list(dec_1.keys())])), columns=['x_0', 'x_1']) \
        .assign(
            p=lambda d: [r for r in m.predict_proba(d)],
            m_p_0=lambda d: d.apply(lambda r: r['p'][0], axis=1),
            m_p_1=lambda d: d.apply(lambda r: r['p'][1], axis=1),
            y_0=0,
            y_1=1
        ) \
        .drop(columns=['p']) \
        [['x_0', 'x_1', 'y_0', 'y_1', 'm_p_0', 'm_p_1']] \
        .set_index(['x_0', 'x_1', 'y_0', 'y_1'])

    # data-estimated conditional probabilities for the same combinations
    d_pred = pd.DataFrame(list(itertools.product(*[list(dec_0.keys()), list(dec_1.keys())])), columns=['x_0', 'x_1']) \
        .assign(
            y_0=lambda d: [0 for _ in range(d.shape[0])],
            y_1=lambda d: [1 for _ in range(d.shape[0])],
            d_p_0=lambda d: d.apply(lambda r: p(r.x_0, r.x_1, r.y_0), axis=1),
            d_p_1=lambda d: d.apply(lambda r: p(r.x_0, r.x_1, r.y_1), axis=1)
        ) \
        .set_index(['x_0', 'x_1', 'y_0', 'y_1'])

    return m_pred.join(d_pred)[['m_p_0', 'd_p_0', 'm_p_1', 'd_p_1']]
get_conditional_probs(m)
[12]:
| x_0   | x_1   | y_0 | y_1 | m_p_0    | d_p_0    | m_p_1    | d_p_1    |
|-------|-------|-----|-----|----------|----------|----------|----------|
| 0.250 | 0.375 | 0   | 1   | 0.765424 | 0.768421 | 0.234576 | 0.231579 |
| 0.250 | 0.382 | 0   | 1   | 0.351099 | 0.350515 | 0.648901 | 0.649485 |
| 0.250 | 0.243 | 0   | 1   | 0.505376 | 0.534483 | 0.494624 | 0.465517 |
| 0.358 | 0.375 | 0   | 1   | 0.429723 | 0.430657 | 0.570277 | 0.569343 |
| 0.358 | 0.382 | 0   | 1   | 0.116522 | 0.125000 | 0.883478 | 0.875000 |
| 0.358 | 0.243 | 0   | 1   | 0.249348 | 0.247059 | 0.750652 | 0.752941 |
| 0.392 | 0.375 | 0   | 1   | 0.378901 | 0.384615 | 0.621099 | 0.615385 |
| 0.392 | 0.382 | 0   | 1   | 0.153403 | 0.154362 | 0.846597 | 0.845638 |
| 0.392 | 0.243 | 0   | 1   | 0.302161 | 0.310000 | 0.697839 | 0.690000 |
13.2.3. Ordinal encoding
If the values of the categorical variable are ordinal, we may convert them to integers. Here, \(X_0\) and \(X_1\) are ordinal encoded as follows.
a: 0
b: 1
c: 2
Again, you see that ordinal encoding also helps the random forest to learn the CPT quite well.
[13]:
enc_0, enc_1 = {'a': 0, 'b': 1, 'c': 2}, {'a': 0, 'b': 1, 'c': 2}
dec_0 = {v: k for k, v in enc_0.items()}
dec_1 = {v: k for k, v in enc_1.items()}
A = X.assign(x_0=lambda d: d['x_0'].map(enc_0), x_1=lambda d: d['x_1'].map(enc_1))
b = y
Ab = A.assign(y=b)
m = RandomForestClassifier(random_state=37, n_jobs=-1, n_estimators=20)
m.fit(A, b)
get_conditional_probs(m)
[13]:
| x_0 | x_1 | y_0 | y_1 | m_p_0    | d_p_0    | m_p_1    | d_p_1    |
|-----|-----|-----|-----|----------|----------|----------|----------|
| 0   | 0   | 0   | 1   | 0.765424 | 0.768421 | 0.234576 | 0.231579 |
| 0   | 1   | 0   | 1   | 0.351099 | 0.350515 | 0.648901 | 0.649485 |
| 0   | 2   | 0   | 1   | 0.505376 | 0.534483 | 0.494624 | 0.465517 |
| 1   | 0   | 0   | 1   | 0.429723 | 0.430657 | 0.570277 | 0.569343 |
| 1   | 1   | 0   | 1   | 0.116522 | 0.125000 | 0.883478 | 0.875000 |
| 1   | 2   | 0   | 1   | 0.249348 | 0.247059 | 0.750652 | 0.752941 |
| 2   | 0   | 0   | 1   | 0.378901 | 0.384615 | 0.621099 | 0.615385 |
| 2   | 1   | 0   | 1   | 0.153403 | 0.154362 | 0.846597 | 0.845638 |
| 2   | 2   | 0   | 1   | 0.302161 | 0.310000 | 0.697839 | 0.690000 |
13.2.4. Any encoding
With “any encoding”, we simply re-encode the categorical values to arbitrary integer values. \(X_0\) is encoded as follows.
a: 88
b: 19
c: 2
\(X_1\) is encoded as follows.
a: 44
b: -1
c: 10
Again, the CPT is recovered quite well. What these findings tell us is that we can simply pick arbitrary integers to represent the categorical values, and we will still be able to recover the CPT!
[14]:
enc_0, enc_1 = {'a': 88, 'b': 19, 'c': 2}, {'a': 44, 'b': -1, 'c': 10}
dec_0 = {v: k for k, v in enc_0.items()}
dec_1 = {v: k for k, v in enc_1.items()}
A = X.assign(x_0=lambda d: d['x_0'].map(enc_0), x_1=lambda d: d['x_1'].map(enc_1))
b = y
Ab = A.assign(y=b)
m = RandomForestClassifier(random_state=37, n_jobs=-1, n_estimators=20)
m.fit(A, b)
get_conditional_probs(m)
[14]:
| x_0 | x_1 | y_0 | y_1 | m_p_0    | d_p_0    | m_p_1    | d_p_1    |
|-----|-----|-----|-----|----------|----------|----------|----------|
| 88  | 44  | 0   | 1   | 0.765424 | 0.768421 | 0.234576 | 0.231579 |
| 88  | -1  | 0   | 1   | 0.351099 | 0.350515 | 0.648901 | 0.649485 |
| 88  | 10  | 0   | 1   | 0.505376 | 0.534483 | 0.494624 | 0.465517 |
| 19  | 44  | 0   | 1   | 0.429723 | 0.430657 | 0.570277 | 0.569343 |
| 19  | -1  | 0   | 1   | 0.116522 | 0.125000 | 0.883478 | 0.875000 |
| 19  | 10  | 0   | 1   | 0.249348 | 0.247059 | 0.750652 | 0.752941 |
| 2   | 44  | 0   | 1   | 0.378901 | 0.384615 | 0.621099 | 0.615385 |
| 2   | -1  | 0   | 1   | 0.153403 | 0.154362 | 0.846597 | 0.845638 |
| 2   | 10  | 0   | 1   | 0.302161 | 0.310000 | 0.697839 | 0.690000 |
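As with the binary case, we can sanity-check the recovered CPTs against the true conditional probabilities implied by the generating model for \(Y\) in the multinomial example. The sketch below plugs each instantiation of \(X_0\) and \(X_1\) into the logistic function with the coefficients used in the simulation.
# coefficients used in the simulation: intercept 0.5, then the weights for x_0 in {a, b, c} and x_1 in {a, b, c}
w_0 = {'a': -1.0, 'b': 0.5, 'c': 0.37}
w_1 = {'a': -0.4, 'b': 0.8, 'c': 0.1}
# true P(Y=0 | X_0, X_1) and P(Y=1 | X_0, X_1) for every instantiation
for v_0, v_1 in itertools.product(['a', 'b', 'c'], ['a', 'b', 'c']):
    p_1 = 1 / (1 + np.exp(-(0.5 + w_0[v_0] + w_1[v_1])))
    print(v_0, v_1, 1 - p_1, p_1)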