Variational Inference III 📦
In this post I'm going to continue explaining concepts introduced in my previous posts (1 and 2) about Variational Inference (VI). The algorithms presented so far have scalability problems: each iteration requires a pass over all the data, which is impracticable for massive volumes of data. An alternative that solves this problem is Stochastic Variational Inference. This version uses a batch (a subset of points) of the data in each iteration. In this way, after more iterations than the conventional solutions need, the solution tends to a local optimum. The main advantage of this mechanism is that it doesn't require keeping the whole dataset in memory, removing the bottleneck that appears when using VI with very big datasets.
Stochastic Variational Inference
Stochastic optimization
This technique obtains estimates of the real gradient of an objective function. Thus we obtain an algorithm that iterates over batches and adjusts the hidden structure of the model based only on the current batch of data. Stochastic optimization finds a relative maximum or minimum of a function using estimates of the real gradient. The expectation of the estimates, E[∇ℒ(λ, x)], where x is a batch of data, is equal to the real gradient ∇ℒ(λ, X), where X is the full dataset.
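As a toy illustration (not from the post), the following Python sketch checks this unbiasedness on a simple quadratic objective: minibatch gradients, rescaled by N divided by the batch size, average out to the full-data gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, size=1000)   # full dataset
lam = 0.0                            # current parameter (scalar toy example)

def full_gradient(lam, X):
    # gradient of the toy objective L(lam) = -0.5 * sum_i (lam - x_i)^2
    return np.sum(X - lam)

def minibatch_gradient(lam, X, batch_size, rng):
    # unbiased estimate: rescale the batch gradient by N / batch_size
    batch = rng.choice(X, size=batch_size, replace=False)
    return (len(X) / batch_size) * np.sum(batch - lam)

# Averaging many noisy estimates recovers the true gradient (unbiasedness)
estimates = [minibatch_gradient(lam, X, 10, rng) for _ in range(20000)]
print(full_gradient(lam, X), np.mean(estimates))
```

Each individual estimate is noisy, but its expectation matches the full-data gradient, which is all stochastic optimization needs.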

Under ideal conditions these stochastic algorithms converge to a local optimum of the function if the step sizes ρ meet the Robbins-Monro conditions:
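For reference, the Robbins-Monro conditions require the step sizes ρt to decay neither too fast nor too slowly:

```latex
\sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty
```

A common schedule that satisfies both is ρt = (t + τ)^(−κ), with delay τ ≥ 0 and forgetting rate κ ∈ (0.5, 1].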

The use of this technique results in the algorithm Stochastic Gradient Ascent Variational Inference (SGAVI), or Stochastic Natural Gradient Ascent Variational Inference (SNGAVI) if natural gradients are used for the estimates. These algorithms, thanks to the noise in the gradient estimates, can escape saddle points.

Algorithm
In VI, the function to be optimized is called the ELBO. If the variational parameters are updated with closed-form analytical formulas, the algorithm is known as Stochastic Coordinate Ascent Variational Inference (SCAVI), while if we use stochastic optimization the algorithm is known as SGAVI. This version applies a correction term that combines the new estimate with calculations from previous iterations. The generic algorithm would be as follows:
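A minimal runnable sketch of such a loop, assuming a toy scalar objective and the usual (t + τ)^(−κ) step-size schedule (the objective and all constants are illustrative, not the ELBO of a particular model):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, size=1000)   # full dataset
lam = 0.0                            # variational parameter (toy scalar)
tau, kappa = 1.0, 0.7                # delay and forgetting rate

for t in range(1, 2001):
    rho = (t + tau) ** (-kappa)      # Robbins-Monro step size
    batch = rng.choice(X, size=10)   # one minibatch per iteration
    # noisy gradient of the toy objective L(lam) = -0.5 * sum_i (lam - x_i)^2,
    # rescaled so its expectation matches the full-data gradient
    grad = (len(X) / len(batch)) * np.sum(batch - lam)
    lam = lam + rho * grad / len(X)  # normalized gradient ascent step
print(lam)  # drifts toward the optimum (the mean of X)
```

Only one small batch is in play at each step, which is what removes the memory bottleneck mentioned at the start of the post.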

Black Box Variational Inference
Starting from the ELBO formula that was reached in the previous post:

The main idea of Black Box Variational Inference (BBVI) is to sample from the variational model q(θ|λ) in order to obtain an approximation of the expectations in the formula. Computing such expectations with respect to the variational model can be computationally expensive and can become a bottleneck at the memory level.
Score Gradients
Now we apply gradients and some algebraic transformations to the analytic ELBO formula (the complete derivation is not shown):

After these transformations you can see that it is not necessary to obtain gradients of the complete ELBO; it is enough to differentiate the variational model q(θ|λ).
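The result of the derivation is the score-function gradient from the BBVI paper, which only involves the gradient of log q(θ|λ):

```latex
\nabla_\lambda \mathcal{L}
  = \mathbb{E}_{q(\theta|\lambda)}\!\left[
      \nabla_\lambda \log q(\theta|\lambda)\,
      \big(\log p(x, \theta) - \log q(\theta|\lambda)\big)
    \right]
```

Since the model p only appears inside the expectation as a log-density evaluation, the estimator treats it as a black box: no model-specific derivations are needed.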
Monte-Carlo Integration
Monte-Carlo integration is a mechanism to approximate integrals. It consists of sampling the variable with respect to which we are integrating and summing the function values at those samples. The more samples are taken, the more exact the approximation will be. In our case we are going to use this mechanism to approximate the ELBO integral, which integrates with respect to the q(θ|λ) distribution. A set of samples (θs) obtained from the q(θ|λ) distribution then allows us to get an approximation of the integral. The formula would be as follows:
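A quick self-contained example of Monte-Carlo integration on a toy expectation (not the ELBO itself), where the exact answer is known:

```python
import numpy as np

rng = np.random.default_rng(2)

# Approximate E_q[f(theta)] for q = N(0, 1) and f(theta) = theta**2.
# The exact value is the variance of q, i.e. 1.0.
S = 100_000
samples = rng.normal(size=S)        # theta_s ~ q(theta)
estimate = np.mean(samples ** 2)    # (1/S) * sum_s f(theta_s)
print(estimate)                     # close to 1.0
```

The error shrinks as 1 over the square root of S, regardless of the dimension of the integral, which is why the same recipe scales to the ELBO.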

Algorithm
Below is the BBVI algorithm:
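As an illustration, here is a toy Python sketch of a BBVI loop on a small conjugate model (a Gaussian prior and likelihood, with q(θ|m) = N(m, 1)), combining the score-function gradient with Monte-Carlo averaging. All names and constants are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.0, size=100)   # toy observed data

def log_joint(thetas):
    # log p(x, theta): N(0, 10^2) prior on theta, N(theta, 1) likelihood
    log_prior = -0.5 * thetas ** 2 / 100.0
    log_lik = -0.5 * np.sum((x[None, :] - thetas[:, None]) ** 2, axis=1)
    return log_prior + log_lik

m = 0.0          # variational mean; q(theta|m) = N(m, 1)
S = 200          # Monte-Carlo samples per iteration
for t in range(1, 501):
    rho = (t + 1.0) ** (-0.7)                  # Robbins-Monro step size
    thetas = rng.normal(loc=m, scale=1.0, size=S)
    score = thetas - m                         # grad_m log q(theta|m)
    log_q = -0.5 * (thetas - m) ** 2           # log q up to a constant (cancels in expectation)
    grad = np.mean(score * (log_joint(thetas) - log_q))  # score-function estimate
    m = m + rho * grad / len(x)                # normalized ascent step
print(m)  # approaches the posterior mean, close to the mean of x
```

If you print the individual per-sample terms inside the mean, you will see they spread over a wide range: that spread is the high variance of the score-function estimator discussed in the considerations below.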

Considerations
This algorithm is the result of adopting some measures that can compromise the convergence of the VI algorithm: it starts by assuming that q(θ|λ) factorizes (the mean-field assumption) and then approximates the ELBO integral using Monte-Carlo. All this makes the algorithm subject to high variance and, depending on the model, slow convergence. In order to reduce the variance that this method can introduce, some mechanisms have appeared:
- Rao-Blackwellization. This method reduces the variance of a random variable by replacing it with its conditional expectation with respect to a subset of variables.
- Control variates. This method replaces the function inside the Monte-Carlo expectation with another function that has the same expectation but less variance.
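The control-variate idea can be shown on a small standalone example (a toy expectation, not tied to the ELBO): subtract a correlated term with known mean, leaving the expectation unchanged but shrinking the variance.

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(size=100_000)

f = np.exp(z)        # estimate E[e^z] under N(0, 1); exact value is e^0.5
h = z                # control variate with known mean E[h] = 0
a = np.cov(f, h)[0, 1] / np.var(h)   # estimated optimal coefficient

plain = f                  # plain Monte-Carlo estimator terms
controlled = f - a * h     # same expectation, lower variance
print(plain.mean(), controlled.mean())
print(plain.var(), controlled.var())  # controlled variance is smaller
```

In BBVI the score function itself is a natural control variate, since its expectation under q(θ|λ) is exactly zero.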
A positive aspect of this method is that it is not necessary to derive the analytical formulas used to update the variational parameters, nor the analytical ELBO. This opens these algorithms up to people who have less statistical knowledge (like me). There is another approach called Automatic Differentiation Variational Inference (ADVI). This method shares these advantages and improves the convergence of non-conjugate models, which can be a challenge for the rest of the VI variants.
Formerly, only conjugate models could be inferred with VI, since non-conjugate models were not easily differentiable by hand. The arrival of algorithms like ADVI and BBVI allowed inference in this kind of models because they replaced the analytical calculus with an approximate strategy.
Variational inference libraries
The best-known libraries for the use of VI and Markov Chain Monte-Carlo (MCMC) are:
- Edward: It uses Tensorflow for gradient computations and has BBVI, reparameterization BBVI and Metropolis-Hastings implementations.
- Stan: It uses reverse-mode Automatic Differentiation in C++ for gradient computations and has ADVI and HMC implementations.
- PyMC3: It uses Theano for gradient computations and has ADVI, Gibbs sampling and Metropolis-Hastings implementations.
- Bayesflow: A young Google module for VI.
If you want to learn more about Variational Inference and its use in probabilistic model inference, you can take a look at my master's thesis. It focuses on the use of automatic differentiation tools to apply Variational Inference to a Gaussian Mixture Model (GMM). In the repository you can find implementations of the Gaussian Mixture Model with different technologies such as Tensorflow, Python, Edward, Autograd, ... and also other probabilistic models that helped me to learn.
References
- Journal of the American Statistical Association (George E. P. Box)
- Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models (David M. Blei)
- Probabilistic graphical models: principles and techniques (Koller, Daphne, and Nir Friedman)
- Model-based Machine Learning (Christopher M. Bishop)
- Machine Learning. A probabilistic perspective (Kevin P. Murphy)
- An overview of gradient descent optimization algorithms (Sebastian Ruder)
- Variational Bayesian inference (slides) (Kay H. Brodersen)
- Variational Inference (David M. Blei)
- Stochastic Variational Inference (Matthew D. Hoffman, David M. Blei, Chong Wang and John Paisley)
- Black Box Variational Inference (Rajesh Ranganath, Sean Gerrish and David M. Blei)
- The Stan Math Library: Reverse-Mode Automatic Differentiation in C++ (Bob Carpenter, Matthew D. Hoffman, Marcus Brubaker, Daniel Lee, Peter Li and Michael Betancourt)
- Automatic Differentiation Variational Inference (Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman and David M. Blei)
Write a comment! 😀
