8. Inference in Gaussian Networks
The junction tree algorithm (JTA) is a widely used algorithm for exact inference in Bayesian Belief Networks (BBNs). A great paper for learning the mechanics of the JTA is the one authored by Huang and Darwiche; however, that paper focuses only on discrete variables and leaves a lot to be desired if one has continuous variables.

The JTA may also be applied to continuous variables, and in particular, to a set of Gaussian variables (a multivariate Gaussian). The key idea in the JTA is to transform a directed acyclic graph (DAG) into a junction tree (JT). Once a JT is created, probabilistic inference to extract marginal probabilities (with or without evidence) is possible. Just as a BBN is composed of a graph (structure), \(G\), and a joint probability (parameters), \(P\), a JT also has this type of duality. The graph of a JT is composed of nodes (*cliques* and *separation sets*) and edges (structure), and associated with each node is a *potential* (parameters).
8.1. Potentials, discrete variables
Potentials are the key to inference in the JTA. Inference is conducted in the JTA by passing *messages* (potentials) around. The representation of potentials is one of the two key differences between using the JTA for discrete versus continuous variables (the other being how evidence is injected into the JT). For the JTA with discrete variables, a potential is essentially a table. Below, we have a clique associated with the binary variables \(A\) and \(B\). The potential associated with this clique is the cross-product of the domains, with a value (typically in \([0, 1]\)) associated with each combination of instantiations.
| A | B | value |
|-----|-----|-------|
| on | on | 0.2 |
| on | off | 0.3 |
| off | on | 0.4 |
| off | off | 0.8 |
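To make this concrete, here is a minimal Python sketch (our own illustration, not anything prescribed above) of the same clique potential as a mapping from instantiations to values.

```python
# A discrete clique potential over binary variables A and B: the
# cross-product of the domains, each instantiation mapped to a value.
potential_AB = {
    ('on', 'on'): 0.2,
    ('on', 'off'): 0.3,
    ('off', 'on'): 0.4,
    ('off', 'off'): 0.8,
}

print(potential_AB[('on', 'off')])  # 0.3
```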
8.2. Potentials, Gaussian variables
The representation of potentials in a JT for Gaussian variables is entirely different, and it is based on the multivariate Gaussian density function. Let's look at that density function first.
\(f_X(x_1, x_2, \ldots, x_n)=\cfrac{\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right)}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\)
Where

- \(n\) is the number of Gaussian variables,
- \(\Sigma\) is the \(n \times n\) covariance matrix,
- \(\mathbf{\mu}\) is an \(n\)-dimensional vector of means, and
- \(\mathbf{x}\) is an \(n\)-dimensional vector of values.
This form of the multivariate Gaussian is called the *covariance form*. In expanded form, the density function may be rewritten as follows.
\(\begin{equation} \begin{split} f_X(x_1, x_2, \ldots, x_n) & = \cfrac{\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right)}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}} \\ & = \exp\left(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x} + \mathbf{\mu}^T\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu} - \log\left((2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}\right)\right) \end{split} \end{equation}\)
Let’s denote the following.
\(K = \Sigma^{-1}\)
\(\mathbf{h} = \Sigma^{-1}\mathbf{\mu}\)
\(g = -\frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu} - \log\left((2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}\right)\)
\(K\) is called the *precision matrix* or *information matrix* [Kol09, Mur12, Ros], and it is the inverse of the covariance matrix. \(\mathbf{h}\) also has a special name and, depending on the context, is called the *information vector*, *observation vector*, or even *shift vector*. Note that \(\mathbf{h}\) is an \(n\)-dimensional vector and \(g\) evaluates to a scalar (a single number). Substituting \(K\), \(\mathbf{h}\), and \(g\) into the covariance form, we can rewrite the multivariate density function as follows.
\(\begin{equation} \begin{split} f_X(x_1, x_2, \ldots, x_n) & = \cfrac{\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right)}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}} \\ & = \exp\left(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x} + \mathbf{\mu}^T\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu} - \log\left((2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}\right)\right) \\ & = \exp\left(-\frac{1}{2}\mathbf{x}^TK\mathbf{x} + \mathbf{h}^T\mathbf{x} + g\right) \end{split} \end{equation}\)
This last form of the multivariate Gaussian is the potential, \(\phi\), of a node in a JT of all Gaussian variables. This form is also called the *canonical form* or *information form* of the multivariate Gaussian.
\(\phi(\mathbf{x}; K, \mathbf{h}, g) = \exp\left(-\frac{1}{2}\mathbf{x}^TK\mathbf{x} + \mathbf{h}^T\mathbf{x} + g\right)\)
When marginalizing over a multivariate Gaussian distribution, it is easier to do so using the *covariance form*. When conditioning a multivariate Gaussian distribution on evidence, it is easier to do so using the *information form* [Kol09].
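As a sketch of how the pieces above fit together, the following Python snippet (using NumPy; the function names are our own choice, not prescribed here) converts the covariance form \((\mathbf{\mu}, \Sigma)\) into the canonical-form parameters \((K, \mathbf{h}, g)\) and evaluates \(\phi\).

```python
import numpy as np

def to_canonical(mu, sigma):
    """Convert the covariance form (mu, sigma) of a multivariate
    Gaussian to the canonical-form parameters (K, h, g)."""
    n = mu.shape[0]
    K = np.linalg.inv(sigma)   # precision (information) matrix
    h = K @ mu                 # information vector
    g = (-0.5 * mu @ K @ mu
         - np.log((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))))
    return K, h, g

def phi(x, K, h, g):
    """Evaluate the canonical-form potential at x."""
    return np.exp(-0.5 * x @ K @ x + h @ x + g)

# Example: a bivariate Gaussian.
mu = np.array([1.0, 2.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
K, h, g = to_canonical(mu, sigma)

# At x = mu, the quadratic and linear terms cancel against the terms in g,
# and the potential equals the density at the mean, 1 / ((2*pi)^(n/2) |Sigma|^(1/2)).
print(phi(mu, K, h, g))
```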
8.3. Operations on \(\phi\)
The three main operations we would like to perform on \(\phi\) are

- multiplication,
- division, and
- marginalization.
8.3.1. Multiplication
We may multiply two potentials as follows (omitting the reference to \(\mathbf{x}\)).
\(\phi(K_1,\mathbf{h}_1, g_1) \cdot \phi(K_2,\mathbf{h}_2, g_2) = \phi(K_1 + K_2, \mathbf{h}_1 + \mathbf{h}_2, g_1 + g_2)\)
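Continuing the sketch, and representing a potential as a \((K, \mathbf{h}, g)\) triple of NumPy arrays (again our own convention), multiplication is just addition of the parameters, assuming both potentials are defined over the same, identically ordered variables.

```python
def multiply(K1, h1, g1, K2, h2, g2):
    """Multiply two canonical-form potentials defined over the same,
    identically ordered variables: the parameters simply add."""
    return K1 + K2, h1 + h2, g1 + g2
```

In practice, potentials that live on different cliques are first extended to a common set of variables by zero-padding \(K\) and \(\mathbf{h}\) so that the sums line up; one way to do this appears in the message-passing sketch later.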
8.3.2. Division
We may divide two potentials as follows.
\(\cfrac{\phi(K_1,\mathbf{h}_1, g_1)}{\phi(K_2,\mathbf{h}_2, g_2)} = \phi(K_1 - K_2, \mathbf{h}_1 - \mathbf{h}_2, g_1 - g_2)\)
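Division, which appears later in absorption, mirrors multiplication; again a sketch under the same assumptions.

```python
def divide(K1, h1, g1, K2, h2, g2):
    """Divide two canonical-form potentials defined over the same,
    identically ordered variables: the parameters simply subtract."""
    return K1 - K2, h1 - h2, g1 - g2
```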
8.3.3. Marginalization
Marginalization over \(\phi\) is integration and is used in message passing. Assume two disjoint sets of variables \(X\) and \(Y\); then the potential is written as \(\phi(X,Y; K, \mathbf{h}, g)\). We may integrate \(\phi(X,Y; K, \mathbf{h}, g)\) over \(Y\) to give us \(\phi(X; K', \mathbf{h}', g')\).
\(\begin{equation} \begin{split} \int \phi(X,Y; K, \mathbf{h}, g) dY & = \phi(X; K', \mathbf{h}', g') \end{split} \end{equation}\)
Where
\(K' = K_{XX} - K_{XY}K_{YY}^{-1}K_{YX}\)
\(\mathbf{h}' = \mathbf{h}_{X} - K_{XY}K_{YY}^{-1}\mathbf{h}_{Y}\)
\(g' = g + \frac{1}{2}\left(\log|2\pi K_{YY}^{-1}| + \mathbf{h}_Y^TK_{YY}^{-1}\mathbf{h}_Y\right)\)
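A sketch of this marginalization with NumPy, under the assumption that a potential is stored as a \((K, \mathbf{h}, g)\) triple and that `x_idx`/`y_idx` (names of our own choosing) give the positions of the retained and integrated-out variables.

```python
import numpy as np

def marginalize(K, h, g, x_idx, y_idx):
    """Integrate a canonical-form potential over the variables indexed
    by y_idx, keeping those indexed by x_idx."""
    Kxx = K[np.ix_(x_idx, x_idx)]
    Kxy = K[np.ix_(x_idx, y_idx)]
    Kyx = K[np.ix_(y_idx, x_idx)]
    Kyy_inv = np.linalg.inv(K[np.ix_(y_idx, y_idx)])
    hx, hy = h[x_idx], h[y_idx]

    K_new = Kxx - Kxy @ Kyy_inv @ Kyx
    h_new = hx - Kxy @ Kyy_inv @ hy
    g_new = g + 0.5 * (np.log(np.linalg.det(2 * np.pi * Kyy_inv))
                       + hy @ Kyy_inv @ hy)
    return K_new, h_new, g_new
```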
8.3.4. Evidence
When there is *evidence*, meaning the variables \(Y\) are observed to have the values \(\mathbf{y}\), \(\phi(X,Y; K, \mathbf{h}, g)\) may be reduced to \(\phi(X; K', \mathbf{h}', g')\), where
\(K' = K_{XX}\)
\(\mathbf{h}' = \mathbf{h}_X - K_{XY}\mathbf{y}\)
\(g' = g + \mathbf{h}_Y^T \mathbf{y} - \frac{1}{2} \mathbf{y}^T K_{YY} \mathbf{y}\)
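A corresponding sketch for evidence reduction, with `y` holding the observed values of the variables indexed by `y_idx` (same assumed representation as before).

```python
import numpy as np

def reduce_evidence(K, h, g, x_idx, y_idx, y):
    """Reduce a canonical-form potential given observed values y for
    the variables indexed by y_idx."""
    Kxx = K[np.ix_(x_idx, x_idx)]
    Kxy = K[np.ix_(x_idx, y_idx)]
    Kyy = K[np.ix_(y_idx, y_idx)]
    hx, hy = h[x_idx], h[y_idx]

    K_new = Kxx
    h_new = hx - Kxy @ y
    g_new = g + hy @ y - 0.5 * y @ Kyy @ y
    return K_new, h_new, g_new
```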
8.4. Message passing
For two cliques \(X\) and \(Y\) with a separation set \(R\) between them, message passing from \(X\) to \(Y\) through \(R\) consists of *projection* and *absorption*. The cliques and separation set have potentials denoted \(\phi_X\), \(\phi_Y\), and \(\phi_R\), respectively. Projection involves saving \(\phi_R\) and replacing it with the marginal of \(\phi_X\) over the separation set's variables (for Gaussian potentials, the summation below denotes the integration defined earlier).
\(\phi_R^{0} \leftarrow \phi_R\)
\(\phi_R \leftarrow \displaystyle \sum_{X \setminus R} \phi_X\)
Absorption involves multiplication and division of the potentials.
\(\phi_Y \leftarrow \phi_Y \cfrac{\phi_R}{\phi_R^0}\)
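Putting the operations together, here is a sketch of a single message pass from \(X\) to \(Y\) through \(R\). It reuses the hypothetical `multiply`, `divide`, and `marginalize` helpers sketched above, adds an `extend` helper for zero-padding a separator potential up to a clique's variables, and assumes `r_in_X` and `r_in_Y` list the separator's variables in the same order.

```python
import numpy as np

# `multiply`, `divide`, and `marginalize` are the helpers sketched above.

def extend(K, h, g, var_idx, n_total):
    """Zero-pad a potential over a subset of variables (at positions
    var_idx) so that it is defined over n_total variables."""
    K_ext = np.zeros((n_total, n_total))
    h_ext = np.zeros(n_total)
    K_ext[np.ix_(var_idx, var_idx)] = K
    h_ext[var_idx] = h
    return K_ext, h_ext, g

def pass_message(phi_X, phi_R, phi_Y, r_in_X, r_in_Y, n_Y):
    """One message pass from clique X to clique Y through separator R.

    phi_* are (K, h, g) triples; r_in_X and r_in_Y give the positions
    of R's variables within X's and Y's variable orderings."""
    # Projection: save the old separator potential and replace it by
    # marginalizing the X clique potential down to the separator.
    phi_R0 = phi_R
    x_only = [i for i in range(phi_X[0].shape[0]) if i not in r_in_X]
    phi_R = marginalize(*phi_X, x_idx=r_in_X, y_idx=x_only)

    # Absorption: Y multiplies in the ratio of the new to the old
    # separator potential, extended to Y's full set of variables.
    K_r, h_r, g_r = divide(*phi_R, *phi_R0)
    phi_Y = multiply(*phi_Y, *extend(K_r, h_r, g_r, r_in_Y, n_Y))
    return phi_R, phi_Y
```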
In the JTA, *global propagation* uses message passing in two phases, *collect-evidence* and *distribute-evidence*, to make the potentials locally consistent. In collect-evidence, we start by picking an initial clique (not a separation set) in the JT and recursively travel to the terminal cliques, marking each clique along the way; once a terminal clique is reached, messages are passed back toward the clique that came before it. In distribute-evidence, we start with the same initial clique, pass messages to its neighboring cliques, and then recursively call distribute-evidence on those neighbors.
After global propagation, for any variable whose density function we are interested in, we pick a clique containing that variable and marginalize over the other variables.
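Finally, a sketch of global propagation as two recursive traversals; the `jt.neighbors` and `jt.pass_message` interfaces are hypothetical stand-ins for however the junction tree and the message pass above are actually wired together.

```python
def collect_evidence(jt, clique, marked):
    """Recurse away from `clique`, then pass messages back toward it."""
    marked.add(clique)
    for separator, neighbor in jt.neighbors(clique):
        if neighbor not in marked:
            collect_evidence(jt, neighbor, marked)
            jt.pass_message(source=neighbor, separator=separator, target=clique)

def distribute_evidence(jt, clique, marked):
    """Pass messages outward from `clique`, then recurse into neighbors."""
    marked.add(clique)
    for separator, neighbor in jt.neighbors(clique):
        if neighbor not in marked:
            jt.pass_message(source=clique, separator=separator, target=neighbor)
            distribute_evidence(jt, neighbor, marked)

def global_propagation(jt, start_clique):
    """Make the clique and separator potentials locally consistent."""
    collect_evidence(jt, start_clique, marked=set())
    distribute_evidence(jt, start_clique, marked=set())
```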