Part 6: Learning Parameters of a Markov Network

By: DJ Rich

Posted: October 01, 2018 Updated: September 25, 2023

In Parts 2-4, we assumed a PGM, whether it’s a Markov Network (‘MN’) or a Bayesian Network (‘BN’), was given with fitted parameters. Now we’ll review the theory governing how a MN’s parameters can be learned from data. Metaphorically, it’s an unfriendly government, as we’ll discover:

In the general case, learning parameters of a Markov Network is hard.

If there’s one thing to understand, it’s this fact. It’ll explain the otherwise oddly strong graph constraints or aggressive approximations algorithms may apply when tackling this problem. At the same time, the mathematics is satisfying. We’ll import the exponential function, which will produce results that confirm our intuitions. In total, we’ll see MN parameter learning seeks an almost self-evident condition, yet arriving at it requires often prohibitively heavy computations.

Before proceeding, it may help to review the Notation Guide, the Part 1 summary and the Part 2 summary.

The Starting Point and Goal

The goal is to use data to determine the factors, functions mapping assignments of a subset of variables to nonnegative numbers, of a MN with an assumed graph \(\mathcal{H}\). As will be discussed in Part 7, knowing the graph isn’t sufficient for determining the count of factors and the subset of variables they map from. Here, we will assume we have this additional information. Further, letting \(\mathbf{X}=\mathcal{X}\), we assume we have \(w\) observations of data, \(\mathcal{D}=\{\mathbf{x}^{(i)}\}_{i=1}^w\).

As a consequence of the Hammerseley-Clifford theorem, the goal is substantially simplified if it is restricted slightly. Markov Networks cannot well represent distributions with factors that assign zero probability to some assignments, so we’ll omit them. That is, we’ll only consider factors which map assignments to positive numbers. With this, all factors may be re-expressed using the exponential function:

\[\phi_j(\mathbf{D}_j) = \exp(f_j(\mathbf{D}_j))\]

where \(f_j(\cdot)\) is called a feature, a function mapping assignments to real numbers. It’s merely an alternative expression of a positive factor. The Gibbs Rule becomes:

\[\begin{align} P_M(X_1,\cdots,X_n) & = \frac{1}{Z} \prod_{j = 1}^m \exp(f_j(\mathbf{D}_j)) \\ & = \frac{1}{Z} \exp\big(\sum_{j = 1}^m f_j(\mathbf{D}_j)\big) \\ \end{align}\]

Further, we’ll begin assuming the data is complete, meaning there are no missing entries in the observations of \(\mathbf{X}\).

With the above, we can state the parameter learning objective:

Assuming \(m\) and the sets \(\{\mathbf{D}_j\}_{j=1}^m\) are known, we seek to determine the functions \(\{f_j(\mathbf{D}_j)\}_{j=1}^m\).

Functions to Parameters

The goal may be further simplified by turning the task of learning functions into one of learning parameters. Specifically, we’d like to represent \(f_j(\mathbf{D}_j)\) using coefficients. This can be done with a basis expansion, whereby each basis function is an indicator function selecting one assignment of \(\mathbf{D}_j\). That is, if \(\boldsymbol{\theta}_j\) is a length \(\vert \textrm{Val}(\mathbf{D}_j) \vert\) vector and \(b_j^k(\mathbf{D}_j)\) is an indicator for the \(k\)-th assignment of \(\mathbf{D}_j\), then any \(f_j(\mathbf{D}_j)\) may be written as:

\[f_j(\mathbf{D}_j) = \sum_{k=1}^{\vert \textrm{Val}(\mathbf{D}_j) \vert} \theta_j^k b_j^k(\mathbf{D}_j)\]

where \(\theta_j^k\) is the output of \(f_j(\cdot)\) for the \(k\)-th assignment of \(\mathbf{D}_j\). And so:

\[P_M(X_1,\cdots,X_n) = \frac{1}{Z} \exp\big(\sum_{j = 1}^m \sum_{k=1}^{\vert \textrm{Val}(\mathbf{D}_j) \vert} \theta_j^k b_j^k(\mathbf{D}_j)\big)\]

But this introduces heavy notation for a simple point. We may in fact write it:

\[\sum_{j = 1}^m \sum_{k=1}^{\vert \textrm{Val}(\mathbf{D}_j) \vert} \theta_j^k b_j^k(\mathbf{D}_j) = \sum_{j = 1}^{m'} \theta'_j f'_j(\mathbf{D}'_j)\]

where \(m'\), \(\theta'_j\), \(f'_j(\cdot)\) and \(\mathbf{D}'_j\) are defined as necessary. For example, \(m' = \sum_{j=1}^m \vert \textrm{Val}(\mathbf{D}_j) \vert\) and the \(f'_j(\cdot)\)’s are indicators. In fact, we’ll drop the apostrophes and re-interpret the following as though they’re there:

\[P_M(X_1,\cdots,X_n) = \frac{1}{Z} \exp\big(\sum_{j = 1}^m \theta_j f_j(\mathbf{D}_j)\big)\]

According to the re-definitions, \(\{f_j(\cdot)\}_{i=1}^m\) encodes both the assumed graph \(\mathcal{H}\) and original specifications of \(\mathbf{D}_j\). This is called the log-linear form of a Markov Network.

In summary:

The goal is to determine the \(\theta_j\)’s, collectively labeled \(\boldsymbol{\theta}\), given data \(\mathcal{D}\) and a set of indicator functions \(\{f_j(\mathbf{D}_j)\}_{i=1}^m\).¹

It should be noted the above form, relying on indicators for all possible assignments, is useful for the following theoretical analysis. It is not practically useful; the extreme number of terms would, in most cases, be unmanageable and impossible to estimate.

The Likelihood Function

In Part 5, we made a relevant statement:

The approach will rely on the likelihood function, a function of \(\boldsymbol{\theta}\) that quantifies how likely \(\mathcal{D}\) is according to \(\boldsymbol{\theta}\). It is labeled as:
\[\mathcal{L}(\boldsymbol{\theta})\]
Regularization considerations aside, we assume the best parameters are those which maximize the likelihood. Such parameters, labeled \(\hat{\boldsymbol{\theta}}\), are called the Maximum Likelihood Estimate (‘MLE’):
\[\hat{\boldsymbol{\theta}} = \textrm{argmax}_\boldsymbol{\theta} \log \mathcal{L}(\boldsymbol{\theta})\]
We may restate our goal as finding \(\hat{\boldsymbol{\theta}}\).

In the case of a BN, we saw a fortunate decomposition. We could optimize pieces of \(\boldsymbol{\theta}\) independently and paste them together to produce the MLE. This followed from the log transform of the Chain Rule, whereby the MLE for one variable only depends on the observations of it and its parents.

In the case of a MN, there is no analogous decomposition. The entirity of \(\hat{\boldsymbol{\theta}}\) must be determined jointly. This coupling follows from the normalizer, \(Z\), which is in fact a function of \(\boldsymbol{\theta}\):

\[Z(\boldsymbol{\theta}) = \sum_{\mathbf{x}\in \textrm{Val}(\mathbf{X})} \exp\big(\sum_{j = 1}^m \theta_j f_j(\mathbf{D}_j)\big)\]

This term is involved in calculating the likelihood for every variable, preventing us from separating the task into smaller optimizations.

The MLE Condition

Since it’ll assist the interpretation, we consider the average log-likelihood:

\[\begin{align} \frac{1}{w} \log \mathcal{L}(\boldsymbol{\theta}) & = \frac{1}{w} \log \big( \prod_{i=1}^w P_M(\mathbf{x}^{(i)}) \big) \\ & = \frac{1}{w} \log \big( \prod_{i=1}^w \frac{1}{Z(\boldsymbol{\theta})} \exp\big(\sum_{j = 1}^m \theta_j f_j(\mathbf{d}_j^{(i)})\big) \big) \\ & = \frac{1}{w} \big( \sum_{i=1}^w \sum_{j = 1}^m \theta_j f_j(\mathbf{d}_j^{(i)}) \big) - \log Z(\boldsymbol{\theta}) \\ & = \sum_{j = 1}^m \theta_j \mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)] - \log Z(\boldsymbol{\theta}) \end{align}\]

where \(\mathbf{d}_j^{(i)}\) are the \(\mathbf{D}_j\)-elements of \(\mathbf{x}^{(i)}\) and \(\mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)]\) is the empirical expectation of \(f_j(\mathbf{D}_j)\). That is, \(\mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)]\) is the proportion of times \(f_j(\mathbf{D}_j) = 1\) in \(\mathcal{D}\).

Considering most optimizations will rely on gradients, we consider the derivative:

\[\frac{\partial }{\partial \theta_j} \frac{1}{w} \log \mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)] - \frac{\partial }{\partial \theta_j} \log Z(\boldsymbol{\theta})\]

Next, because a positive MN qualifies as a distribution from the exponential family, we have the following result, stated without proof:

\[\frac{\partial }{\partial \theta_j} \log Z(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]\]

The right term is the expected value of \(f_j(\mathbf{D}_j)\) when \(\mathbf{D}_j\) is distributed per \(\boldsymbol{\theta}\). So:

\[\frac{\partial }{\partial \theta_j} \frac{1}{w} \log \mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)] - \mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]\]

The left side is the objective’s sensitivity to \(\theta_j\) at \(\boldsymbol{\theta}\). We see it’s a difference of:

\(\mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)]\): The proportion of times \(f_j(\mathbf{D}_j) = 1\) in the data.
\(\mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]\): The proportion of times \(f_j(\mathbf{D}_j) = 1\) is expected if \(\mathbf{X}\) were generated according to \(P_M\) with parameters \(\boldsymbol{\theta}\).

Next, there’s another result we can pull from the facts of an exponential family:

\[\log \mathcal{L}(\boldsymbol{\theta})\textrm{ is concave in } \boldsymbol{\theta}\]

Therefore, \(\log \mathcal{L}(\boldsymbol{\theta})\) has no local optimum and so the gradient condition:

\[\nabla_{\boldsymbol{\theta}} \log \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}\]

holds only for a global optimum \(\hat{\boldsymbol{\theta}}\).

This is equivalent to:
\[\mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)] = \mathbb{E}_{\hat{\boldsymbol{\theta}}}[f_j(\mathbf{D}_j)]\quad\quad \textrm{for } j = 1, \cdots, m\]
In words, the MLE \(\hat{\boldsymbol{\theta}}\) is that which makes the expected values of \(f_j(\mathbf{D}_j)\) equal to their empirical averages observed in the data, \(\mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)]\).

This condition should match our intuitions. \(\hat{\boldsymbol{\theta}}\) is that which maximizes agreement with the data. A MN determines probabilities only via \(f_j(\cdot)\)-measurements. A \(\boldsymbol{\theta}\) implying expected measurements different from those observed in the data surely could be improved by bringing those measurements into agreement. If those measurements were equated, there’s no remaining criteria to justify changing \(\hat{\boldsymbol{\theta}}\).

Optimizing \(\log \mathcal{L}(\boldsymbol{\theta})\) is Hard

Assuming a gradient based algorithmer is used, we must compute one of the gradient’s terms:
\[\mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]\]
In general, this is hard.

To see this, it may be written:

\[\mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)] = \sum_{\mathbf{d}_j \in \textrm{Val}(\mathbf{D}_j)} P_M(\mathbf{d}_j) f_j(\mathbf{d}_j)\]

where the effect of \(\boldsymbol{\theta}\) is via \(P_M(\mathbf{d}_j)\).

We see \(P_M(\mathbf{d}_j)\) must be computed, which means we must perform inference, a sometimes-exponentially-difficult task discussed in Part 2. Further, this is necessary for a single gradient step, many of which will be needed. So we’re faced with many instances of an intensive computation. There are approaches to tackle this, but it is frequently prohibitive.

Missing Data

We consider the case where \(\mathcal{D}=\{\mathbf{x}^{(i)}\}_{i=1}^w\) and we face observations like:

\[\mathbf{x}^{(1)} = [x_1^0,\ ?,\ x_3^1, ?,\ x_5^3]\]

To keep the presentation short , we assume entries are missing completely at random, meaning the pattern of missingness does not depend on the variables.

To form the likelihood, we need to compute \(P_M(\mathbf{x}^{(1)})\), which will involve summing probabilities over all possible combinations of the second and forth entries. This summation is what gives the marginal probability of the assignment \(X_1=x_1^0\), \(X_3=x_3^1\) and \(X_5=x_5^3\). More precisely, let \(h(i)\) be the number possible assignments to the missing variables at observations \(i\). If \(\vert \textrm{Val}(X_2) \vert = 4\) and \(\vert \textrm{Val}(X_4) \vert = 5\), there are \(h(1)=4 \times 5 = 20\) ways to assign \(X_2\) and \(X_4\). Further, let \(k\) index over the joint assignments and let \(\mathbf{x}^{(i),k}\) be \(\mathbf{x}^{(i)}\) but with the \(k\)-th assignment of the missing values filled in. With that, we can say:

\[P_M(\mathbf{x}^{(i)}) = \sum_{k=1}^{h(i)} P_M(\mathbf{x}^{(i),k})\]

Next, we revisit the average log likelihood:

\[\begin{align} \frac{1}{w} \log \mathcal{L}(\boldsymbol{\theta}) & = \frac{1}{w} \log \big( \prod_{i=1}^w P_M(\mathbf{x}^{(i)}) \big) \\ & = \frac{1}{w} \sum_{i=1}^w \log \big( \sum_{k=1}^{h(i)} P_M(\mathbf{x}^{(i),k}) \big) \\ & = \frac{1}{w} \sum_{i=1}^w \log \big( \sum_{k=1}^{h(i)} \exp\big(\sum_{j = 1}^m \theta_j f_j(\mathbf{d}_j^{(i),k}) \big) -\log Z(\boldsymbol{\theta})\\ \end{align}\]

This is a harder objective to work with; we have sums intertwined with logs and exponents. One consequence is we have lost concavity:

With missing data, \(\mathcal{L}(\boldsymbol{\theta})\) may have local optima.

Another issue follows from the derivative, which is now:

\[\frac{\partial }{\partial \theta_j} \log \mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{\mathcal{D},\boldsymbol{\theta}}[f_j(\mathbf{D}_j)] - \mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]\]

What’s new is \(\mathbb{E}_{\mathcal{D},\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]\), which may be expressed as:

\[\mathbb{E}_{\mathcal{D},\boldsymbol{\theta}}[f_j(\mathbf{D}_j)] = \frac{1}{w}\sum_i^m \mathbb{E}_\boldsymbol{\theta}[f_j(\mathbf{D}_j) \vert \mathbf{d}_j^{(i)}]\]

That is, it’s the empirical average of the expectation of \(f_j(\mathbf{D}_j)\) according to \(\boldsymbol{\theta}\) given the non-missing observations. The expectation is over the missing entries. Stated differently, each expectation is that of \(f_j(\mathbf{D}_j)\) where some entries of \(\mathbf{D}_j\) are ‘clamped’ to the observed values of a \(\mathbf{d}_j^{(i)}\). This expectation creates a problem:

Missing data creates additional inferences tasks, which can be computationally expensive.

In one sense, it’s not as burdensome as computing \(\mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]\), since the observed values shrink the space to be summed over to form the expectation. On the other hand, we are burdened by conditioning the expectation on every observation, which presents other challenges, depending on the model. The distribution of hidden variables is likely to depend on what’s observed and incorporating this information everywhere may be no easy task. This is further complicated if the missing-complete-at-random assumption, often an unrealistic one, does not hold.

Approaches

At this point, it seems only the futility of MN parameter learning has been presented. This is not representative, as there are approximate approaches, such as:

Constrain the graph: For example, if the graph is triangulated, then there exists a closed form expression for the MLE. See section 6.1 of Wainwright et al. (2008) for description of this approach.
Use approximate inference in the inference step: Approximate inference, discussed in Part 3 and Part 4, allows one to trade accuracy for speed, so one may use a fast-enough method such that gradient descent becomes manageable. There is an inaccuracy suffered, but sometimes it’s not so significant to prevent learning a performant model.
Use a different objective that approximates the likelihood function: One example is to use the pseudolikelihood, where we accept the approximation:
\[P_M(\mathbf{x}) \approx \prod_{i=1}^n P_M(x_i \vert \mathbf{x}_{-i})\]
where \(\mathbf{x}_{-i}\) refers to all the elements of \(\mathbf{x}\) except \(i\). This allows us to completely eliminate \(Z(\boldsymbol{\theta})\).
For missing data, use the expectation-maximization algorithm, which is derived for the general exponential family case in section 6.2 of Wainwright et al. (2008). An approximate variational form is also presented.

Regularization

One form of regularization for a MN amounts to expanding the graph to include \(\boldsymbol{\theta}\) as random variables. They are never observed; they are hidden variables. Parameters governing hidden variables are typically exogenously decided, such that they are biased towards more simple value, such as those with smaller absolute values. The design of these variables is a deep topic we can’t explore in this brief tour.

The Next Step

Only one task remains:

Part 7: Structure Learning

References

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press. 2009.
M. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning. 2008.

Something to Add?

If you see an error, egregious omission, something confusing or something worth adding, please email dj@truetheta.io with your suggestion. If it’s substantive, you’ll be credited. Thank you in advance!

Footnotes

For emphasis and clarity, \(m\) has changed its meaning in this context. Before it was the number of factors. Now it’s the sum of all joint assignments across those factors. ↩

Share on

Twitter Facebook LinkedIn

Writing