In this chapter, we propose one way to perform inference in a semantic Morphable Model for image analysis. We have two aims: to segment the image into semantic regions, and to explain those regions by their corresponding models. Both aims are formulated probabilistically. The segmentation label \(z\) depends on the target image and the model parameters: \[p(z|\tilde{I}, \theta)\]
At the same time, the model parameters depend on the target image and the segmentation label \(z\): \[p(\theta|\tilde{I}, z)\]
Solving the segmentation with given model parameters, or vice versa, would be a straightforward task. However, neither is given in advance. We end up with a hard inference problem, especially since some of the models are quite complex (e.g. the face model includes rendering). Exact inference is intractable: the segmentation label \(z\) leads to an integer optimization problem, and the dimensionality of the involved face model is too high. We therefore opted for an approximate inference technique.
We alternate between updates of the segmentation and of the model parameters. The most commonly used algorithm for such a task is the expectation-maximization (EM) algorithm. It has two steps: an expectation step estimating the segmentation labels, and a maximization step estimating the model parameters. In our application, the expectation step updates the segmentation label \(z\) assuming a given \(\theta\), and the maximization step estimates \(\theta\) assuming a given segmentation. The estimation of \(\theta\) has to be performed for all involved models.
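The alternation can be sketched as a simple loop. This is a minimal sketch, assuming hypothetical callables `segment` (E-step: MRF-based labeling given the model parameters) and `fit_models` (M-step: model adaptation given the labels) are supplied by the surrounding framework:

```python
# Minimal sketch of the EM-like alternation. `segment` and `fit_models`
# are hypothetical placeholders for the E-step and M-step described above.

def em_like_inference(image, z_init, theta_init, segment, fit_models, n_iter=10):
    """Alternate E- and M-steps for a fixed number of iterations."""
    z, theta = z_init, theta_init
    for _ in range(n_iter):
        z = segment(image, theta)      # E-step: update segmentation labels
        theta = fit_models(image, z)   # M-step: update model parameters
    return z, theta
```

In practice one would monitor convergence rather than use a fixed iteration count, but the structure of the alternation stays the same.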
The algorithm combines components you already know. In Chapter 2 we presented a segmentation algorithm based on a Markov random field; we use it to estimate the label \(z\) for a given parameter set \(\theta\). For the model adaptation, we use the probabilistic fitting framework. The EM algorithm is iterative: convergence arises through repetition of both steps, with updated parameters and labels in each iteration. An analysis of the convergence of the algorithm is included in Egger (2017).
Both steps of the algorithm need likelihoods for all involved models. We use the likelihoods presented in Chapter 3, with two exceptions: those likelihoods work fine as long as there is no uncertainty in the conditioning variable. If we search, e.g., for the segmentation label, we assume that \(\theta\) is given and correct. During inference, however, both the segmentation label \(z\) and \(\theta\) are uncertain. This uncertainty leads to bad convergence properties, and inference gets stuck. We therefore adapted the likelihoods to incorporate those uncertainties.
The original face model likelihood looks like this:
\[ \ell_{\text{face}} (\theta; \tilde{I}_i) = \begin{cases} \frac{1}{N} \exp \left ( - \frac{1}{2 \sigma^2} \left \lVert \tilde{I}_i - I_i(\theta) \right \rVert ^2 \right ) & \text{if $i \in \mathcal{F}$}\\ \frac{1}{\delta} h_f(\tilde{I}_i, \theta) & \text{if $i \in \mathcal{B}$}. \end{cases} \]
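As a hedged per-pixel sketch, the case distinction above could be evaluated as follows, assuming the rendering \(I_i(\theta)\) has already been computed at pixel \(i\); `h_f_val` stands for the background term \(h_f\), and `sigma`, `N`, `delta` are illustrative constants, not the tutorial's actual values:

```python
import numpy as np

def face_likelihood(target_px, rendered_px, in_face_region, h_f_val,
                    sigma=0.043, N=1.0, delta=1.0):
    """Original face likelihood: Gaussian color term for foreground pixels
    (i in F), background histogram term for the rest (i in B)."""
    if in_face_region:
        d2 = float(np.sum((np.asarray(target_px) - np.asarray(rendered_px)) ** 2))
        return float(np.exp(-d2 / (2.0 * sigma ** 2)) / N)
    return h_f_val / delta
```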
To incorporate the uncertainty on the current model parameters during segmentation, we adapt this likelihood in the E-step: \[ \ell'_{\text{face}} ( \theta ; \tilde{I}_i) = \frac{1}{N} \exp \bigg(- \frac{1}{2 \sigma^2} \min_{j \in n(i)} \Big\lVert \tilde{I}_i - I_{j}(\theta) \Big\rVert ^2 \bigg). \]
We assume that the current model parameters lead to a face that is slightly translated in image space; therefore, we evaluate the likelihood over a neighboring region \(n(i)\) of the pixel \(i\). In our implementation, we take the minimum over a patch of \(9\times9\) neighboring pixels. This leads to a small tendency to include more pixels in the face region: whenever a neighboring pixel is similar enough, the face model adaptation can then explain the pixel better.
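The patch minimum can be sketched as follows, under the assumption that `target` and `rendered` are H×W×3 float arrays holding the observed image and the rendering \(I(\theta)\); `sigma` and the patch radius are illustrative values:

```python
import numpy as np

def adapted_face_likelihood(target, rendered, i, j, sigma=0.043, radius=4):
    """E-step face likelihood with uncertainty on theta: take the minimum
    squared color distance over a (2*radius+1)^2 patch (9x9 for radius=4),
    up to the normalization constant."""
    h, w = rendered.shape[:2]
    y0, y1 = max(0, i - radius), min(h, i + radius + 1)  # clip patch at borders
    x0, x1 = max(0, j - radius), min(w, j + radius + 1)
    patch = rendered[y0:y1, x0:x1].reshape(-1, rendered.shape[-1])
    d2 = np.sum((patch - target[i, j]) ** 2, axis=1)  # squared color distances
    return float(np.exp(-d2.min() / (2.0 * sigma ** 2)))
```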
For the face model adaptation we also slightly modify the likelihood. The original non-face likelihood was: \[\ell_{\text{non-face}} (\theta ; \tilde{I}_i) = h_{\tilde{I}}(\tilde{I}_{i})\]
However, sometimes there are pixels which are labeled as non-face but could be explained well by the face model. We therefore enable the face model to explain them whenever it does better than the non-face model. In later iterations of the EM procedure, such pixels are then correctly labeled as face:
\[ \ell'_{\text{non-face}} (\theta ; \tilde{I}_i) = \max\Big(\ell_{\text{face}}( \theta ; \tilde{I}_i), h_{\tilde{I}}(\tilde{I}_{i})\Big). \]
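Per pixel, this adaptation is a simple maximum. A tiny sketch, assuming the face-model likelihood and the background color-histogram value have already been computed as scalars (both inputs are hypothetical placeholders):

```python
def adapted_nonface_likelihood(face_lik, hist_lik):
    """Adapted non-face likelihood: the face model may claim a non-face
    pixel whenever it explains the pixel better than the histogram."""
    return max(face_lik, hist_lik)
```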
Before we show you how this is actually implemented within our software framework, we have to think about initialization: the EM algorithm needs an initial guess for both \(z\) and \(\theta\). In the next chapter, you will learn a strategy to obtain a suitable initialization.
Semantic Morphable Models Tutorial | Inference / EM-like algorithm