Latent Variable Models, Expectation Maximization, and Variational Inference


This is the second post in my series: From KL Divergence to Variational Autoencoder in PyTorch. The previous post in the series is A Quick Primer on KL Divergence and the next post is Variational Autoencoder Theory.


Latent variable models are a powerful form of [typically unsupervised] machine learning used for a variety of tasks such as clustering, dimensionality reduction, data generation, and topic modeling. The basic premise is that there is some latent, unobserved variable $z$ that causes the observed data point $x$. In the graphical model (or Bayesian network) representation, $z$ is a parent of the observed variable: $z \rightarrow x$.

Latent variable models model the joint probability distribution:

$$p(x, z) = p(x \mid z)\, p(z)$$

and are trained by maximizing the marginal likelihood:

$$p(x) = \int p(x \mid z)\, p(z)\, dz$$

The introduction of latent variables allows us to model the data more accurately and to discover valuable insights from the latent variables themselves. In the topic modeling case we know beforehand that each document in a corpus tends to focus on a particular topic or subset of topics. For example, articles in a newspaper typically address topics such as politics, business, or sports. Real-world corpora encountered in industry, such as customer support transcripts, product reviews, or legal contracts, can be more complex and ambiguous, but the concept still applies. By structuring a model to incorporate this knowledge we are able to more accurately calculate the probability of a document and, perhaps more importantly, discover the topics being discussed in a corpus and provide topic assignments to individual documents.

The learned probability distributions such as $p(z)$, $p(x \mid z)$, and $p(z \mid x)$ can be used directly for tasks like anomaly detection or data generation. More commonly though, the main contribution is inference of the latent variables themselves from these distributions. In the Gaussian mixture model (GMM) the latent variables are the cluster assignments. In latent Dirichlet allocation (LDA) the latent variables are the topic assignments. In the variational autoencoder (VAE) the latent variables are the compressed representations of the data.

Marginal likelihood training

Latent variable models are trained by maximizing the marginal likelihood of the observed data under the model parameters $\theta$. Since the logarithm is a monotonically increasing function, maximizing the marginal log likelihood is equivalent, and the logarithm simplifies the computation.

Given a training dataset consisting of data points $X = \{x_1, \ldots, x_N\}$, the marginal log likelihood is expressed as

$$\log p(X; \theta) = \sum_{i=1}^{N} \log p(x_i; \theta) = \sum_{i=1}^{N} \log \int p(x_i, z; \theta)\, dz \tag{1}$$

Ideally we would maximize this expression directly, but the integral is typically intractable. For example, if $z$ is high dimensional, the integral takes the form $\int \cdots \int p(x, z_1, \ldots, z_k)\, dz_1 \cdots dz_k$.
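When $z$ is discrete and low dimensional, however, the marginalization is easy: the integral collapses to a small sum. Here is a minimal sketch for a hypothetical two-component Gaussian mixture (all parameter values and data points below are made up for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical mixture: p(z=k) = pi[k], p(x | z=k) = N(mu[k], sigma[k]^2)
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.5])
sigma = np.array([0.5, 1.0])

x = np.array([-1.8, 0.0, 2.1])  # toy "dataset"

# Marginal likelihood p(x_i) = sum_k p(z=k) p(x_i | z=k); the sum over two
# discrete states plays the role of the integral over z
px = (pi * gaussian_pdf(x[:, None], mu, sigma)).sum(axis=1)
log_likelihood = np.log(px).sum()
print(log_likelihood)
```

With a continuous, high-dimensional $z$ this exhaustive enumeration is no longer available, which is exactly what makes the integral above intractable.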

As previously discussed, we must also be able to compute the posterior of the latent variables in order to gain utility from these models:

$$p(z \mid x; \theta) = \frac{p(x \mid z; \theta)\, p(z; \theta)}{p(x; \theta)}$$

Again, this calculation is typically intractable because the marginal $p(x; \theta)$, with its intractable integral, appears in the denominator. There are two main approaches to handling this issue: Monte Carlo sampling and variational inference. We will be focusing on variational inference in this post.

Derivation of Variational Lower Bound

To start, let’s assume that the posterior $p(z \mid x)$ is intractable. To deal with this we will introduce a new distribution $q(z)$. We would like $q(z)$ to closely approximate $p(z \mid x)$, and we are free to choose any form we like for $q$. For example, we could choose $q$ to be static or conditional on $x$ in some way (as you might guess, $q$ is typically conditioned on $x$). A good approximation can be seen as one that minimizes the KL divergence between the two distributions (for a primer on KL divergence, see this post):

$$\mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big) = \sum_z q(z) \log \frac{q(z)}{p(z \mid x)}$$

Note: For simplicity I will be using summations instead of integrals in the derivation.

Now, substituting $p(z \mid x) = \frac{p(x, z)}{p(x)}$ using Bayes’ rule and arranging variables in a convenient way:

$$\mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big) = \sum_z q(z) \log \frac{q(z)\, p(x)}{p(x, z)} = \sum_z q(z) \log \frac{q(z)}{p(x, z)} + \sum_z q(z) \log p(x)$$

Note that in the second term $\log p(x)$ is constant w.r.t. the summation, so it can be moved outside, leaving $\log p(x) \sum_z q(z)$. By definition of a probability distribution, $\sum_z q(z) = 1$, so the term ultimately simplifies to $\log p(x)$. So, we are left with:

$$\mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big) = \sum_z q(z) \log \frac{q(z)}{p(x, z)} + \log p(x)$$

Rearranging for clarity:

$$\log p(x) = \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big) + \sum_z q(z) \log \frac{p(x, z)}{q(z)} \tag{2}$$

Now, let’s circle back to Eq. 1. Notice that we have derived an expression for the marginal log likelihood composed of two terms. The first term is the KL divergence between our variational distribution $q(z)$ and the intractable posterior $p(z \mid x)$. The second term is called the variational lower bound or evidence lower bound (the acronym ELBO is frequently used in the literature).

Looking at Eq. 2 and noting that KL divergence is non-negative, you can see that the ELBO must be a lower bound for the marginal log likelihood: $\mathrm{ELBO}(q) \leq \log p(x)$. Variational inference methods focus on the tractable task of maximizing the ELBO instead of maximizing the likelihood directly.
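We can check the bound numerically. The sketch below builds a toy model with one observed $x$ and a discrete latent (all probability values are made up), evaluates the ELBO $\sum_z q(z)\log\frac{p(x,z)}{q(z)}$ for random choices of $q$, and confirms it never exceeds $\log p(x)$, with equality when $q$ is the exact posterior:

```python
import numpy as np

# Toy model with one observed x and three discrete latent states
# (all probability values below are made up for illustration)
p_z = np.array([0.2, 0.5, 0.3])          # prior p(z)
p_x_given_z = np.array([0.1, 0.7, 0.4])  # likelihood p(x | z) at the fixed x
p_xz = p_z * p_x_given_z                 # joint p(x, z)
log_px = np.log(p_xz.sum())              # exact marginal log likelihood

def elbo(q):
    # ELBO = sum_z q(z) * [log p(x, z) - log q(z)]
    return np.sum(q * (np.log(p_xz) - np.log(q)))

rng = np.random.default_rng(0)
for _ in range(100):
    q = rng.dirichlet(np.ones(3))        # a random variational distribution
    assert elbo(q) <= log_px + 1e-9      # the bound holds for every q

posterior = p_xz / p_xz.sum()            # q = p(z | x) makes the KL term zero...
print(np.isclose(elbo(posterior), log_px))  # ...so the bound is tight: prints True
```

The gap between $\log p(x)$ and the ELBO at each $q$ is exactly the KL term in Eq. 2, which is why the bound closes only at the true posterior.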

Expectation Maximization

In the simplest case, when the posterior $p(z \mid x; \theta)$ is tractable (e.g., GMMs), the expectation maximization (EM) algorithm can be applied. First, the parameters $\theta$ are randomly initialized. EM then exploits the tractable posterior by holding $\theta$ fixed and updating $q$ by simply setting $q(z) = p(z \mid x; \theta)$ in the E-step.

Notice that since we are holding $\theta$ fixed, the left hand side of Eq. 2 is a constant during this step, and the update $q(z) = p(z \mid x; \theta)$ sets the KL term to zero. This means the ELBO term is equal to the marginal log likelihood, which is the best possible optimization step. It’s interesting because, in this interpretation, the EM algorithm does not bother with the ELBO directly in the E-step and instead maximizes it indirectly by minimizing the KL term.
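For a GMM the E-step has a closed form: the responsibility of component $k$ for point $x_i$ is just Bayes’ rule applied to the current parameters. A small sketch (the component parameters and data points are made up):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def e_step(x, pi, mu, sigma):
    # Unnormalized posterior p(z=k) p(x_i | z=k) for every point/component pair
    weighted = pi * gaussian_pdf(x[:, None], mu, sigma)
    # Dividing by p(x_i) = sum_k sets q(z) to the exact posterior p(z | x_i)
    return weighted / weighted.sum(axis=1, keepdims=True)

x = np.array([-2.1, -1.9, 1.0, 1.2])
resp = e_step(x, np.array([0.5, 0.5]), np.array([-2.0, 1.0]), np.array([1.0, 1.0]))
print(resp.sum(axis=1))  # each row is a valid distribution over z: [1. 1. 1. 1.]
```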

In the M-step, $q$ is fixed and $\theta$ is updated by maximizing the ELBO. Isolating the terms that depend on $\theta$:

$$\sum_z q(z) \log \frac{p(x, z; \theta)}{q(z)} = \sum_z q(z) \log p(x, z; \theta) - \sum_z q(z) \log q(z)$$

Since the second term does not depend on $\theta$, we see that the M-step is maximizing the expected joint likelihood of the data:

$$\theta^{\text{new}} = \arg\max_{\theta} \sum_z q(z) \log p(x, z; \theta) = \arg\max_{\theta}\, \mathbb{E}_{q(z)}\big[\log p(x, z; \theta)\big]$$

Although I won’t prove it here, EM has some nice convergence guarantees; it always converges to a local maximum or a saddle point of the marginal likelihood.
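The convergence property is easy to observe empirically. Below is a minimal end-to-end EM sketch for a 1-D, two-component GMM on synthetic data (the data-generating parameters and initialization are arbitrary choices for illustration); the marginal log likelihood is recorded after every iteration and never decreases:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def log_marginal(x, pi, mu, sigma):
    # log p(X; theta) = sum_i log sum_k pi_k N(x_i; mu_k, sigma_k^2)
    return np.log((pi * gaussian_pdf(x[:, None], mu, sigma)).sum(axis=1)).sum()

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.8, 100)])

# arbitrary initialization of theta = (pi, mu, sigma)
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-0.5, 0.5]), np.array([1.0, 1.0])
lls = []
for _ in range(30):
    # E-step: q(z) = p(z | x; theta_old), the per-point responsibilities
    w = pi * gaussian_pdf(x[:, None], mu, sigma)
    r = w / w.sum(axis=1, keepdims=True)
    # M-step: closed-form argmax of E_q[log p(x, z; theta)]
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    lls.append(log_marginal(x, pi, mu, sigma))

# the marginal log likelihood never decreases across EM iterations
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```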

Conclusion

In this post we introduced latent variable models and provided some insight into their utility in real-world scenarios. Maximum likelihood training is typically intractable, so we derived the variational lower bound (or ELBO), which is maximized instead.

In the simplest case, when the posterior $p(z \mid x; \theta)$ is tractable (e.g., GMMs), we showed how the expectation maximization algorithm can be applied.

However, there are plenty of cases where the posterior is not tractable. A more recent approach to solving this problem is to use deep neural networks to jointly learn $q(z \mid x)$ and $p(x \mid z)$ with an ELBO loss function, such as in the variational autoencoder. For more on this see my post on variational autoencoder theory, where we will further refine the theory presented here to form the basis for the variational autoencoder.

Resources

[1] Volodymyr Kuleshov, Stefano Ermon, Learning in latent variable models

[2] Ali Ghodsi, Lecture: Deep Learning, Variational Autoencoder, Oct 12 2017 [Lect 6.2]

[3] Daniil Polykovskiy, Alexander Novikov, National Research University Higher School of Economics, Coursera, Bayesian Methods for Machine Learning