Motivation

In this tutorial, you will learn all the components of semantic Morphable Models. The basic idea is to add semantics on a per-pixel level to our probabilistic Morphable Models: we have different models explaining different objects or parts of objects in the image, and for each pixel we decide which model to choose. Before starting with theory and practical exercises, we would like to demonstrate the challenges that arise in face image analysis under occlusions in an Analysis-by-Synthesis setting. Occlusions are one of the core motivations behind the idea of explicitly segmenting the different components of the image. In this motivation and throughout the tutorial, we will focus on occlusions. They are a hard challenge for many kinds of image analysis methods. The semantic Morphable Model framework, however, is richer and open to various future extensions.

Probabilistic Fitting

In the previous tutorial, you learned about the concept of probabilistic fitting. The main idea behind this method is an Analysis-by-Synthesis approach to interpret face images. Basically, we try to find a synthetic face image which is close to the observed face image. The synthetic face image is generated by a parametric model with explicit parameters \(\theta\) for facial color and shape, illumination, as well as pose and camera parameters. The previous tutorials were mainly concerned with how to build such models and how to find parameters that match a novel input image. In this tutorial, we target occlusions - they were not handled by the strategy from our tutorial about probabilistic fitting. The fitting script you have seen there leads to undesired results if parts of the face are occluded. In practice, most face images contain some kind of occlusion by glasses, facial hair, or various objects like flags or microphones. So let's have a look at some images containing occlusions:

To explain how those occlusions affect the image explanation, we have to dig into the probabilistic fitting framework (Tutorial, Publication: Schönborn 2017). In an Analysis-by-Synthesis setting, we need some measure to compare the observed image \(\tilde I\) with our synthetic image \(I(\theta)\). In the fitting framework, the measure we proposed for the face region is a simple pixel-wise distance in RGB color space. Measuring distances between images is a challenging and unsolved problem; the pixel-wise measure we use is a common choice, and it is hard to come up with a better distance measure. In the face model adaptation, this error measure is part of the evaluation step of the Metropolis-Hastings algorithm. In the probabilistic setting we build a likelihood which measures the quality of our current synthetic model estimate:

\[ \ell \left (\theta; \tilde I \right ) = \prod_{i \in \mathcal F} \ell_{\text{face}}\left( \theta ; \tilde{I}_i \right) \prod_{j \in \mathcal B} b \left( \tilde{I}_j \right ). \]

The likelihood distinguishes between foreground pixels \(\mathcal F\), which are covered by the face region, and background pixels \(\mathcal B\), which are handled by a background model \(b\). The likelihood \(\ell_{\text{face}}( \theta ; \tilde{I}_i )\) of the foreground pixels reflects our error measure over all pixels covered by the face model.

Let's have a close look at the likelihood of the face pixels:

\[ \ell_{\text{face}} ( \theta; \tilde{I}_i) = \frac{1}{N} \exp \left ( - \frac{1}{2 \sigma^2} \left \lVert \tilde{I}_i - I_i(\theta) \right \rVert ^2 \right ) \]

The rendering function \(I\) generates a face image for given parameters \(\theta\). The pixels \(i\) of the synthesized image \(I(\theta)\) are compared to those of the observed image \(\tilde{I}\). The likelihood model for pixels in the face region \(\mathcal{F}\) assumes per-pixel Gaussian noise.

Those likelihoods are implemented in the IndependentPixelEvaluator class in the scalismo-faces software framework. This is the point in the framework where we will make our core changes - if you feel unsure or want a deeper understanding, please have a closer look at the probabilistic fitting tutorial first (Chapter 05 Face Fitting).
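To make the structure of this likelihood concrete, here is a minimal, self-contained sketch in log-space. It is illustrative only - the names, the noise level, and the constant background model are assumptions and do not reflect the actual IndependentPixelEvaluator API:

```scala
// Minimal sketch of an independent pixel likelihood (log-space), assuming a simple
// RGB type and a constant background model - illustrative only, not the scalismo-faces API.
object IndependentPixelLikelihoodSketch {

  case class RGB(r: Double, g: Double, b: Double) {
    def dist2(other: RGB): Double = {
      val dr = r - other.r; val dg = g - other.g; val db = b - other.b
      dr * dr + dg * dg + db * db
    }
  }

  val sigma = 0.05 // assumed standard deviation of the per-pixel Gaussian noise

  // log l_face(theta; I~_i): Gaussian likelihood of the observed color given the rendered color
  // (the constant normalizer 1/N is dropped; it does not matter when comparing states)
  def logFacePixel(observed: RGB, rendered: RGB): Double =
    -observed.dist2(rendered) / (2.0 * sigma * sigma)

  // a very simple background model: constant likelihood for every pixel outside the face region
  def logBackgroundPixel(observed: RGB): Double = math.log(0.5)

  // full image log-likelihood: the product over foreground and background pixels becomes a sum of logs;
  // `rendered` is None wherever the face model does not cover the pixel (background region B)
  def logLikelihood(observed: IndexedSeq[RGB], rendered: IndexedSeq[Option[RGB]]): Double =
    observed.zip(rendered).map {
      case (obs, Some(syn)) => logFacePixel(obs, syn)   // pixel in the face region F
      case (obs, None)      => logBackgroundPixel(obs)  // pixel handled by the background model
    }.sum
}
```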

The most important observation about this likelihood is that every pixel in the foreground has to be explained by the face model. Let's have a look at some fitting results - first without occlusions:

Those results are close to the target image and the explanation of the face appears correct.

However, for the faces with occlusions, you get different results:

If you search for the reason for those results, you will realize that the likelihood is not robust to the occurring occlusions. This leads to the following effects:

The results we present here are obtained by fixing the position of the landmarks (using a narrow likelihood for the landmarks) - so the corners of the eyes, the nose, and the mouth stay where they are, but the image likelihood still tries to change, e.g., the pose or to explain the occlusion through the color model. If we don't use landmarks during fitting, those shrinking tendencies are even worse.

Robust Likelihood

To overcome the limitation of our model on occluded images, we start with a standard approach - a simple robust error metric. We modify the likelihood slightly to allow the exclusion of pixels in the foreground from the face model explanation. Basically, we would like to include only the best-matching pixels in the face, and the other pixels should be explained by our background model. In practice, we evaluate the above likelihood \(\ell_{\text{face}}\) for all pixels and then account only for the best \(\alpha\%\) of pixels in the evaluation. We thus need to choose a parameter \(\alpha\) which tells us what percentage of the pixels we want to include in the face model explanation. This value is fixed and is not adapted to the image at hand. We implemented this in the TrimmedIndependentPixelEvaluator (a rough sketch of the idea follows further below). This reflects a standard robust technique which is applicable to many applications with outliers - it is especially helpful if you know the percentage of outliers in advance. However, for our Analysis-by-Synthesis setting, it fails - let's have a look at some results for different values of \(\alpha\). We start with \(\alpha = 90\%\), i.e. 90% of the pixels included in the face likelihood (the mask visualizes the selected pixels in white):

Next we show results for \(\alpha= 80\%\):

And with \(\alpha= 70\%\):

When does this approach work? It works if we choose \(\alpha\) correctly and the occluded pixels are actually the ones furthest away from the target image according to our likelihood.
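To make the trimming step concrete, here is a minimal sketch of how the best-matching fraction \(\alpha\) of face pixels could be selected in log-space. This is illustrative only and not the TrimmedIndependentPixelEvaluator implementation; in the full model, the discarded pixels would additionally be handed over to the background model:

```scala
// Sketch of a trimmed ("robust") face likelihood: keep only the best-matching alpha fraction
// of per-pixel log-likelihoods - illustrative only, not the TrimmedIndependentPixelEvaluator.
def trimmedLogLikelihood(perPixelLogLik: IndexedSeq[Double], alpha: Double): Double = {
  require(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]")
  val keep = math.max(1, (alpha * perPixelLogLik.size).toInt)
  // sort descending and sum only the alpha fraction with the highest likelihood
  perPixelLogLik.sorted(Ordering[Double].reverse).take(keep).sum
}
```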

On the other hand, the problems with this approach are:

So this simple approach is not enough to cope with occlusions. To develop an occlusion-aware method, let's first think about what occlusions actually are.

Occlusions

What are Occlusions?

Before we look at how to solve this problem, let us think about the problem itself. Occlusions are a major challenge in the computer vision community. The last decades have witnessed great efforts and major developments in the field of object detection. Almost all objects can be occluded in natural images - and often they are. A lot of object detection algorithms are not designed to handle occlusions but are instead focused on the big challenge of detection itself. However, in a lot of applications, occlusions are actually important and have to be taken into account. For face analysis, there are occlusions from various sources. There are occlusions which are

We refer to everything between the skin surface and the camera as an occlusion of the face.

For our later approach, it also makes sense to think about how we can handle those different categories of occlusions. For makeup, beards, or glasses you could think of a generative approach and model those explicitly; Active Appearance Models, for example, often include these kinds of occlusions. For other occlusions like microphones, hands, or the most random objects you can think of, it would be much harder to model all of them - we would have to generatively model the whole world. So our approach will be a middle course - modeling occlusions explicitly, but with a very general model. Our approach is, however, open to extensions for explicit modeling of face-specific and also non-specific occlusions.

In the literature, you will find a lot of methods targeting a similar topic - outliers. Outliers are related to occlusions, but they are not the same. Outliers are observations that are distant from the others and typically arise through inaccuracies or errors during measurement. Occlusions are image-specific - they are not measurement errors but correct observations of objects in front of the object we would like to analyze. However, a lot of approaches to cope with occlusions can also be applied to outlier handling and vice versa.

The occlusion challenge

Why are those occlusions so hard to handle? For our visual system, it is very easy to tell that there is something occluding the face. However, for a computer vision system, it is a hard job. There are a lot of discriminative approaches out there; to handle occlusions, they perform an explicit segmentation, and the occlusion handling is based on training data. Those approaches can handle what they have seen and generalize to a certain degree. Our Analysis-by-Synthesis approach is currently not well covered in the computer vision community. Compared to the dominating methodologies, our approach is based on a generative model and does not rely on a huge amount of training data. All our results are obtained by a model built from a few (200) high-quality 3D face scans, statistical modeling techniques, as well as a probabilistic approach for model adaptation.

As we discussed before, we could add additional models to handle some of those occlusions, while this seems infeasible in practice for others. So what we need is an algorithm that tells us which parts of the image are part of the face and which parts are better explained by occlusions. If you went through our probabilistic fitting tutorial, this sounds similar to the background model. In contrast to the background, occlusions appear in the face region, and we have to find a way to distinguish between occlusions and parts of the face. So during generative model adaptation, we have to decide which parts of the image are actually part of the face. The most obvious way would be to just measure the distance from the current estimate to the target image. If we do so at the very beginning of model adaptation, we get a result like this:

A darker color means a lower error.
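Such an error map can be computed by a simple per-pixel distance between the target and the current synthetic estimate, roughly as in the following sketch (colors as plain (r, g, b) tuples; illustrative only, not part of the scalismo-faces API):

```scala
// Sketch: per-pixel error map between the target image and the current synthetic estimate;
// pixels outside the rendered face region get no error (illustrative only).
def errorMap(
  target: IndexedSeq[(Double, Double, Double)],
  synthetic: IndexedSeq[Option[(Double, Double, Double)]]
): IndexedSeq[Double] =
  target.zip(synthetic).map {
    case ((tr, tg, tb), Some((sr, sg, sb))) =>
      math.sqrt((tr - sr) * (tr - sr) + (tg - sg) * (tg - sg) + (tb - sb) * (tb - sb))
    case (_, None) => 0.0 // outside the face region we do not measure an error
  }
```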

So there are two kinds of regions where the error is large: those which are actually occlusions, and those which belong to the face but are not (yet) estimated correctly by the model. So let's go for an even harder case, a worst-case scenario:

In this case, the occluded part is even closer to the initialization than the actual face region is to the corresponding face model parts - so we cannot easily decide on occlusions based on this simple metric. In practice, such effects happen through unlucky occlusions or strong illumination effects.

One last thing we should consider is the difference between occlusions and outliers. Outliers are observations which are somehow not consistent with the other observations. In the image analysis case, an outlier can, e.g., be a pixel error in the camera. In contrast to occlusions, such outliers are not correlated with other observations; occlusions, on the other hand, often dominate a whole region of pixels. However, you will see in our approach a method that was classically designed to handle outliers (RANSAC), and we show that it also works well for occlusions (robust illumination estimation).

Semantic Approach for Occlusion-aware Analysis-by-Synthesis

We choose a semantic approach to handle the occlusions in face images. The very basic idea is to segment the image into different semantic regions. Let's look at a possible segmentation of a face image into different regions:

Those regions are then explained by different models. The division itself is flexible and can be more or less detailed - the eye, for example, can be seen as one semantic region or refined further into sclera, iris, and pupil. The segmentation as well as the model parameters of all involved models are a result of the analysis and are not known or given in advance; both have to be found at the same time during the inference process. Our framework is open to include various models; those models can be coupled and can be more or less complex - you already learned about our simple background model in the probabilistic fitting tutorial. In this tutorial, we focus on an instance of a semantic Morphable Model which is able to handle occlusions of the face. A 3D Morphable Face Model is at the core of our framework to explain the face region in a generative way.
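As a minimal sketch of this idea, one can think of a per-pixel label that selects which model has to explain each pixel; the image likelihood then factorizes over these labels, and the labels are estimated jointly with the model parameters (the names below are illustrative, not the framework's API):

```scala
// Sketch of the semantic idea: a per-pixel label selects which model explains each pixel.
// Both the label map and the model parameters are unknown and are estimated jointly.
sealed trait Label
case object FaceLabel extends Label
case object NonFaceLabel extends Label // e.g. occlusion or background

def semanticLogLikelihood(
  labels: IndexedSeq[Label],
  logFace: Int => Double,   // log-likelihood of pixel i under the face model
  logNonFace: Int => Double // log-likelihood of pixel i under the non-face model
): Double =
  labels.zipWithIndex.map {
    case (FaceLabel, i)    => logFace(i)
    case (NonFaceLabel, i) => logNonFace(i)
  }.sum
```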

Overview of this tutorial

Let's give you a short overview of what you can expect from this tutorial. In Chapter 2, we will introduce you to the very basics of image segmentation and show you a simple segmentation algorithm which we use to build our semantic Morphable Model. Chapter 3 focuses on the idea of semantic Morphable Models and presents the Occlusion-aware Morphable Model as an instance of it. In Chapter 4, we will show you how we can perform inference and actually estimate the segmentation as well as the model parameters together. Good initialization is key to ease inference - we already showed that illumination is quite important, and in Chapter 5 we will present our robust illumination estimation algorithm. Chapter 6 brings all methods together and guides you through a fitting script which combines them to analyze face images under occlusions. In the last chapter, we share our thoughts on possible future work in the direction of semantic Morphable Models.

