8. Inference in Gaussian Networks
The junction tree algorithm (JTA) is a widely used algorithm for exact inference in Bayesian Belief Networks (BBNs). A great paper for learning the mechanics of the JTA is the one authored by Huang and Darwiche; however, that paper focuses only on discrete variables and leaves a lot to be desired if one has continuous variables.

The JTA may also be applied to continuous variables, and in particular, to a set of Gaussian variables (a multivariate Gaussian). The key idea in the JTA is to transform a directed acyclic graph (DAG) into a junction tree (JT). Once a JT is created, probabilistic inference to extract marginal probabilities (with or without evidence) is possible. Just as a BBN is composed of a graph (structure), \(G\), and a joint probability (parameters), \(P\), a JT also has this type of duality. The graph of a JT is composed of nodes (*cliques* and *separation sets*) and edges (structure), and associated with each node is a *potential* (parameters).
8.1. Potentials, discrete variables
Potentials are the key to inference in the JTA. Inference is conducted in the JTA by passing *messages* (potentials) around. The representation of potentials is one of the two key differences between using the JTA for discrete versus continuous variables (the other being how evidence is injected into the JT). For the JTA with discrete variables, a potential is essentially a table. Below, we have a clique associated with the binary variables \(A\) and \(B\). The potential associated with this clique is the cross-product of the domains, with a value (typically in \([0, 1]\)) associated with each combination of instantiations.
| A | B | value |
|-----|-----|-------|
| on | on | 0.2 |
| on | off | 0.3 |
| off | on | 0.4 |
| off | off | 0.8 |
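To make this concrete, here is a minimal Python sketch (our own illustration, not anything prescribed above) of the same clique potential as a mapping from instantiations to values.

```python
# A discrete clique potential over binary variables A and B: the
# cross-product of the domains, each instantiation mapped to a value.
potential_AB = {
    ('on', 'on'): 0.2,
    ('on', 'off'): 0.3,
    ('off', 'on'): 0.4,
    ('off', 'off'): 0.8,
}

print(potential_AB[('on', 'off')])  # 0.3
```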
8.2. Potentials, Gaussian variables
The representation of potentials in a JT for Gaussian variables is entirely different, and it is based on the multivariate Gaussian density function. Let's look at that density function first.
\(f_X(x_1, x_2, \ldots, x_n)=\cfrac{\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right)}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\)
Where

- \(n\) is the number of Gaussian variables,
- \(\Sigma\) is the \(n \times n\) covariance matrix,
- \(\mathbf{\mu}\) is an \(n\)-dimensional vector of means, and
- \(\mathbf{x}\) is an \(n\)-dimensional vector of values.
This form of the multivariate Gaussian is called the *covariance form*. In expanded form, the density function may be rewritten as follows.
\(\begin{equation} \begin{split} f_X(x_1, x_2, \ldots, x_n) & = \cfrac{\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right)}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}} \\ & = \exp\left(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x} + \mathbf{\mu}^T\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu} - \log\left((2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}\right)\right) \end{split} \end{equation}\)
Let’s denote the following.
\(K = \Sigma^{-1}\)
\(\mathbf{h} = \Sigma^{-1}\mathbf{\mu}\)
\(g = -\frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu} - \log\left((2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}\right)\)
\(K\) is called the *precision matrix* or *information matrix* [Kol09, Mur12, Ros], and it is the inverse of the covariance matrix. \(\mathbf{h}\) also has a special name and, depending on the context, is called the *information vector*, *observation vector*, or even *shift vector*. Note that \(\mathbf{h}\) is an \(n\)-dimensional vector and \(g\) evaluates to a scalar (a single number). Substituting \(K\), \(\mathbf{h}\), and \(g\) into the covariance form, we can rewrite the multivariate density function as follows.
\(\begin{equation} \begin{split} f_X(x_1, x_2, \ldots, x_n) & = \cfrac{\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right)}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}} \\ & = \exp\left(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x} + \mathbf{\mu}^T\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu} - \log\left((2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}\right)\right) \\ & = \exp\left(-\frac{1}{2}\mathbf{x}^TK\mathbf{x} + \mathbf{h}^T\mathbf{x} + g\right) \end{split} \end{equation}\)
This last form of the multivariate Gaussian is the potential, \(\phi\), of a node in a JT of all Gaussian variables. This form is also called the *canonical form* or *information form* of the multivariate Gaussian.
\(\phi(\mathbf{x}; K, \mathbf{h}, g) = \exp\left(-\frac{1}{2}\mathbf{x}^TK\mathbf{x} + \mathbf{h}^T\mathbf{x} + g\right)\)
When marginalizing over a multivariate Gaussian distribution, it is easier to do so using the *covariance form*. When conditioning a multivariate Gaussian distribution on evidence, it is easier to do so using the *information form* [Kol09].
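As a sketch of how the pieces above fit together, the following Python snippet (using NumPy; the function names are our own choice, not prescribed here) converts the covariance form \((\mathbf{\mu}, \Sigma)\) into the canonical-form parameters \((K, \mathbf{h}, g)\) and evaluates \(\phi\).

```python
import numpy as np

def to_canonical(mu, sigma):
    """Convert the covariance form (mu, sigma) of a multivariate
    Gaussian to the canonical-form parameters (K, h, g)."""
    n = mu.shape[0]
    K = np.linalg.inv(sigma)   # precision (information) matrix
    h = K @ mu                 # information vector
    g = (-0.5 * mu @ K @ mu
         - np.log((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))))
    return K, h, g

def phi(x, K, h, g):
    """Evaluate the canonical-form potential at x."""
    return np.exp(-0.5 * x @ K @ x + h @ x + g)

# Example: a bivariate Gaussian.
mu = np.array([1.0, 2.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
K, h, g = to_canonical(mu, sigma)

# At x = mu, the quadratic and linear terms cancel against the terms in g,
# and the potential equals the density at the mean, 1 / ((2*pi)^(n/2) |Sigma|^(1/2)).
print(phi(mu, K, h, g))
```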
8.3. Operations on \(\phi\)
The three main operations we would like to perform on \(\phi\) are

- multiplication,
- division, and
- marginalization.
8.3.1. Multiplication
We may multiply two potentials as follows (omitting the reference to \(\mathbf{x}\)).
\(\phi(K_1,\mathbf{h}_1, g_1) \cdot \phi(K_2,\mathbf{h}_2, g_2) = \phi(K_1 + K_2, \mathbf{h}_1 + \mathbf{h}_2, g_1 + g_2)\)
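Continuing the sketch, and representing a potential as a \((K, \mathbf{h}, g)\) triple of NumPy arrays (again our own convention), multiplication is just addition of the parameters, assuming both potentials are defined over the same, identically ordered variables.

```python
def multiply(K1, h1, g1, K2, h2, g2):
    """Multiply two canonical-form potentials defined over the same,
    identically ordered variables: the parameters simply add."""
    return K1 + K2, h1 + h2, g1 + g2
```

In practice, potentials that live on different cliques are first extended to a common set of variables by zero-padding \(K\) and \(\mathbf{h}\) so that the sums line up; one way to do this appears in the message-passing sketch later.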
8.3.2. Division
We may divide two potentials as follows.
\(\cfrac{\phi(K_1,\mathbf{h}_1, g_1)}{\phi(K_2,\mathbf{h}_2, g_2)} = \phi(K_1 - K_2, \mathbf{h}_1 - \mathbf{h}_2, g_1 - g_2)\)
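Division, which appears later in absorption, mirrors multiplication; again a sketch under the same assumptions.

```python
def divide(K1, h1, g1, K2, h2, g2):
    """Divide two canonical-form potentials defined over the same,
    identically ordered variables: the parameters simply subtract."""
    return K1 - K2, h1 - h2, g1 - g2
```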
8.3.3. Marginalization
Marginalization over \(\phi\) is integration and is used in message passing. Assume two disjoint sets of variables \(X\) and \(Y\); then the potential is written as \(\phi(X,Y; K, \mathbf{h}, g)\). We may integrate \(\phi(X,Y; K, \mathbf{h}, g)\) over \(Y\) to give us \(\phi(X; K', \mathbf{h}', g')\).
\(\begin{equation} \begin{split} \int \phi(X,Y; K, \mathbf{h}, g) dY & = \phi(X; K', \mathbf{h}', g') \end{split} \end{equation}\)
Where
\(K' = K_{XX} - K_{XY}K_{YY}^{-1}K_{YX}\)
\(\mathbf{h}' = \mathbf{h}_{X} - K_{XY}K_{YY}^{-1}\mathbf{h}_{Y}\)
\(g' = g + \frac{1}{2}\left(\log|2\pi K_{YY}^{-1}| + \mathbf{h}_Y^TK_{YY}^{-1}\mathbf{h}_Y\right)\)
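A sketch of this marginalization with NumPy, under the assumption that a potential is stored as a \((K, \mathbf{h}, g)\) triple and that `x_idx`/`y_idx` (names of our own choosing) give the positions of the retained and integrated-out variables.

```python
import numpy as np

def marginalize(K, h, g, x_idx, y_idx):
    """Integrate a canonical-form potential over the variables indexed
    by y_idx, keeping those indexed by x_idx."""
    Kxx = K[np.ix_(x_idx, x_idx)]
    Kxy = K[np.ix_(x_idx, y_idx)]
    Kyx = K[np.ix_(y_idx, x_idx)]
    Kyy_inv = np.linalg.inv(K[np.ix_(y_idx, y_idx)])
    hx, hy = h[x_idx], h[y_idx]

    K_new = Kxx - Kxy @ Kyy_inv @ Kyx
    h_new = hx - Kxy @ Kyy_inv @ hy
    g_new = g + 0.5 * (np.log(np.linalg.det(2 * np.pi * Kyy_inv))
                       + hy @ Kyy_inv @ hy)
    return K_new, h_new, g_new
```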
8.3.4. Evidence
When there is *evidence*, meaning the variables \(Y\) are observed to have the values \(\mathbf{y}\), \(\phi(X,Y; K, \mathbf{h}, g)\) may be reduced to \(\phi(X; K', \mathbf{h}', g')\), where
\(K' = K_{XX}\)
\(\mathbf{h}' = \mathbf{h}_X - K_{XY}\mathbf{y}\)
\(g' = g + \mathbf{h}_Y^T \mathbf{y} - \frac{1}{2} \mathbf{y}^T K_{YY} \mathbf{y}\)
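A corresponding sketch for evidence reduction, with `y` holding the observed values of the variables indexed by `y_idx` (same assumed representation as before).

```python
import numpy as np

def reduce_evidence(K, h, g, x_idx, y_idx, y):
    """Reduce a canonical-form potential given observed values y for
    the variables indexed by y_idx."""
    Kxx = K[np.ix_(x_idx, x_idx)]
    Kxy = K[np.ix_(x_idx, y_idx)]
    Kyy = K[np.ix_(y_idx, y_idx)]
    hx, hy = h[x_idx], h[y_idx]

    K_new = Kxx
    h_new = hx - Kxy @ y
    g_new = g + hy @ y - 0.5 * y @ Kyy @ y
    return K_new, h_new, g_new
```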
8.4. Message passing
For two cliques \(X\) and \(Y\) with a separation set \(R\) between them, message passing from \(X\) to \(Y\) through \(R\) consists of *projection* and *absorption*. The cliques and separation set have potentials denoted \(\phi_X\), \(\phi_Y\), and \(\phi_R\), respectively. Projection involves saving \(\phi_R\) and replacing it with the marginal of \(\phi_X\) over the separation set's variables (for Gaussian potentials, the summation below denotes the integration defined earlier).
\(\phi_R^{0} \leftarrow \phi_R\)
\(\phi_R \leftarrow \displaystyle \sum_{X \setminus R} \phi_X\)
Absorption involves multiplication and division of the potentials.
\(\phi_Y \leftarrow \phi_Y \cfrac{\phi_R}{\phi_R^0}\)
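Putting the operations together, here is a sketch of a single message pass from \(X\) to \(Y\) through \(R\). It reuses the hypothetical `multiply`, `divide`, and `marginalize` helpers sketched above, adds an `extend` helper for zero-padding a separator potential up to a clique's variables, and assumes `r_in_X` and `r_in_Y` list the separator's variables in the same order.

```python
import numpy as np

# `multiply`, `divide`, and `marginalize` are the helpers sketched above.

def extend(K, h, g, var_idx, n_total):
    """Zero-pad a potential over a subset of variables (at positions
    var_idx) so that it is defined over n_total variables."""
    K_ext = np.zeros((n_total, n_total))
    h_ext = np.zeros(n_total)
    K_ext[np.ix_(var_idx, var_idx)] = K
    h_ext[var_idx] = h
    return K_ext, h_ext, g

def pass_message(phi_X, phi_R, phi_Y, r_in_X, r_in_Y, n_Y):
    """One message pass from clique X to clique Y through separator R.

    phi_* are (K, h, g) triples; r_in_X and r_in_Y give the positions
    of R's variables within X's and Y's variable orderings."""
    # Projection: save the old separator potential and replace it by
    # marginalizing the X clique potential down to the separator.
    phi_R0 = phi_R
    x_only = [i for i in range(phi_X[0].shape[0]) if i not in r_in_X]
    phi_R = marginalize(*phi_X, x_idx=r_in_X, y_idx=x_only)

    # Absorption: Y multiplies in the ratio of the new to the old
    # separator potential, extended to Y's full set of variables.
    K_r, h_r, g_r = divide(*phi_R, *phi_R0)
    phi_Y = multiply(*phi_Y, *extend(K_r, h_r, g_r, r_in_Y, n_Y))
    return phi_R, phi_Y
```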
In the JTA, *global propagation* uses message passing in two phases, *collect-evidence* and *distribute-evidence*, to make the potentials locally consistent. In collect-evidence, we start by picking an initial clique (not a separation set) in the JT and recursively travel to the terminal cliques, marking each clique along the way; once a terminal clique is reached, messages are passed back toward the clique that came before it. In distribute-evidence, we start with the same initial clique, pass messages to its neighboring cliques, and then recursively call distribute-evidence on those neighbors.
After global propagation, for any variable whose density function we are interested in, we pick a clique containing that variable and marginalize over the other variables.
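Finally, a sketch of global propagation as two recursive traversals; the `jt.neighbors` and `jt.pass_message` interfaces are hypothetical stand-ins for however the junction tree and the message pass above are actually wired together.

```python
def collect_evidence(jt, clique, marked):
    """Recurse away from `clique`, then pass messages back toward it."""
    marked.add(clique)
    for separator, neighbor in jt.neighbors(clique):
        if neighbor not in marked:
            collect_evidence(jt, neighbor, marked)
            jt.pass_message(source=neighbor, separator=separator, target=clique)

def distribute_evidence(jt, clique, marked):
    """Pass messages outward from `clique`, then recurse into neighbors."""
    marked.add(clique)
    for separator, neighbor in jt.neighbors(clique):
        if neighbor not in marked:
            jt.pass_message(source=clique, separator=separator, target=neighbor)
            distribute_evidence(jt, neighbor, marked)

def global_propagation(jt, start_clique):
    """Make the clique and separator potentials locally consistent."""
    collect_evidence(jt, start_clique, marked=set())
    distribute_evidence(jt, start_clique, marked=set())
```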