Initialization / robust illumination estimation

Initialization was for a long time a major limitation for face image analysis based on 3D Morphable Face Models. Initialization in this context means obtaining a rough estimate of the pose and the other camera parameters to start model adaptation. Most approaches rely on manual initialization or initialization using manually annotated landmarks. Recent advances in detection methods have led to different approaches to automate initialization. Our approach takes feature point detections as uncertain input and avoids an early decision on the pose. This is extensively discussed in our tutorial on Probabilistic Fitting and the associated publication.

For an image with strong occlusions we still get quite a good estimate of the facial pose:

The mouth corners, as well as the nose corners, can be detected. The eyes are not detected reliably, and the initialization therefore relies on the model in these regions.

However, for our Semantic Morphable Model we need an additional initialization, namely the initialization of the class label \(z\) per pixel. In our first experiments on occlusion-aware model adaptation we chose a very simple initialization: starting from the pose estimate discussed above, everything inside the face region was labeled as face and everything outside of it as non-face. This led to quite good results for images with an almost ambient illumination in the face region. However, we realized that for more realistic illumination settings (e.g. less light on one side of the face), this was not sufficient. In many cases, such an illumination misled the model adaptation and excluded non-ambiently illuminated parts of the face from the face region. Sometimes occlusions were even included in the face region because they were closer to the facial appearance.
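For illustration, here is a minimal sketch of this naive initialization (not the exact code of our early experiments): render the model at the initial pose and label every covered pixel as face. It uses rendererFace12 and poseInit as loaded further below; encoding the label \(z\) as 1 = face and 0 = non-face is an assumption made purely for this sketch.

// naive initialization of z: everything inside the rendered face region is labeled as face
val naiveRendering = rendererFace12.renderImage(poseInit)
val zNaive = naiveRendering.map(c => if (c.a > 0) 1 else 0) // 1 = face, 0 = non-face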

Our likelihoods are purely driven by pixel intensities - therefore they are very sensitive to large distances in pixel intensities. At the beginning of model adaptation, those distances are typically large, and they are not necessarily larger in non-face regions than in face regions. In our probabilistic fitting framework, we overcome this issue with an explicit illumination estimation step: in the early steps of the fitting, we optimize the illumination, which quickly leads the fit in the correct direction.

Let's have a look at what happens when we perform our standard illumination estimation technique after initializing the pose. We first load the model, prepare the renderer, and read the pose initialization as well as the target image to prepare everything for the illumination estimation:

import java.io.File

import scalismo.faces.color.{RGB, RGBA}
import scalismo.faces.deluminate.SphericalHarmonicsOptimizer
import scalismo.faces.gui.ImagePanel
import scalismo.faces.image.{AccessMode, PixelImage}
import scalismo.faces.io.{MoMoIO, PixelImageIO, RenderParameterIO}
import scalismo.faces.mesh.MeshSurfaceSampling
import scalismo.faces.sampling.face.MoMoRenderer
import scalismo.faces.sampling.face.evaluators.PixelEvaluators.IsotropicGaussianPixelEvaluatorHSV
import scalismo.faces.sampling.face.proposals.SphericalHarmonicsLightProposals.{RobustSHLightSolverProposalWithLabel, SHLightSolverProposal}
import scalismo.utils.Random
import scalismo.faces.gui._
import scalismo.faces.gui.GUIBlock._


scalismo.initialize()
val seed = 1986L
implicit val rnd = Random(seed)


// load the Basel Face Model 2017 and build a cached renderer for it
val modelface12 = MoMoIO.read(new File("data/model2017-1_face12_nomouth.h5")).get
val rendererFace12 = MoMoRenderer(modelface12, RGBA.BlackTransparent).cached(5)

// read the pose initialization and the corresponding target image
val poseInit = RenderParameterIO.read(new File("data/5-lminit.rps")).get
val target = PixelImageIO.read[RGBA](new File("data/5.png")).get

ImagePanel(target).displayIn("Target Image")

If you have problems executing this code, make sure the data directory is in the same folder from which you are executing the jar file, and make sure to download the Basel Face Model 2017 (model2017-1_face12_nomouth.h5) and add it to the data directory.

Now we can run our standard illumination estimation on this image:

val shOpt = SphericalHarmonicsOptimizer(rendererFace12, target)
val shOptimizerProposal = SHLightSolverProposal(shOpt, MeshSurfaceSampling.sampleUniformlyOnSurface(100))
val nonRobust = shOptimizerProposal.propose(poseInit)

To judge the result we visualize the target image, the pose initialization, and our illumination estimation:

val poseInitImg = rendererFace12.renderImage(poseInit)
val nonRobustImg = rendererFace12.renderImage(nonRobust)

shelf(
  ImagePanel(target),
  ImagePanel(poseInitImg),
  ImagePanel(nonRobustImg)
).displayIn("Nonrobust Illumination Estimation")

This illumination estimation is misled by the occluded part of the face. Let's have a closer look at what happens in the illumination estimation:

  1. A number of points (in our case 100) are sampled uniformly at random on the face surface.
  2. For all those points, the albedo and normal are taken from the current set of model parameters \(\theta\). At the beginning of model adaptation, the facial shape and albedo are simply the mean shape and mean color of our model.
  3. The spherical harmonics illumination coefficients are estimated in closed form by solving a linear system (least squares), as sketched below. Details can be found in Schönborn 2017 (Equation 31 in Appendix 1).
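Schematically (and slightly simplified compared to the exact formulation in Schönborn 2017), the linear system arises as follows: for every sampled surface point \(i\) with albedo \(a_i\), normal \(n_i\) and observed target color \(c_i\), the spherical harmonics illumination model predicts \(c_i \approx a_i \odot \sum_k l_k Y_k(n_i)\), where the \(Y_k\) are the spherical harmonics basis functions evaluated at the normal, \(l_k\) are the unknown illumination coefficients, and \(\odot\) denotes the channel-wise product. Stacking these equations for all sampled points yields an overdetermined linear system in the coefficients \(l_k\), which is solved per color channel in the least-squares sense.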

However, this illumination estimation itself includes samples in the occluded region and is not robust - we therefore have to replace it with a robust illumination estimation technique.

We proposed a robust illumination estimation technique in Egger 2017 - we will present the key components in the following. The basis of the technique is the RANSAC algorithm. This algorithm is well known for model estimation under outliers - we showed that it also works in the context of occlusions. The algorithm is iterative and follows multiple steps:

  1. Sample \(n\) random points \(p\).
  2. Estimate model parameters on \(p\).
  3. Evaluate how many of all points are consistent with the estimate. The consistent points form the consensus set.
  4. Repeat 1-3 and always keep the biggest consensus set.

The algorithm is easy to implement and is used in many applications. For our application, an implementation is straightforward. We start with the first step: sampling random points to estimate the model. The number \(n\) of points is not a free parameter: to profit from the theoretical guarantees of the algorithm, you should choose \(n\) as the minimal number of points needed to estimate all parameters. To estimate our 29 illumination parameters we need 30 points.

val shOpt = SphericalHarmonicsOptimizer(rendererFace12, target)
val shOptimizerProposal = SHLightSolverProposal(shOpt, MeshSurfaceSampling.sampleUniformlyOnSurface(30))

We then estimate in the second step the model parameters based on 30 points and visualize it:

val nonRobust = shOptimizerProposal.propose(poseInit)
val nonRobustImg = rendererFace12.renderImage(nonRobust)
shelf(
  ImagePanel(target),
  ImagePanel(nonRobustImg)
).displayIn("Nonrobust Illumination Estimation")

If you execute the above lines of code multiple times, you can observe two effects: the prediction is in most cases even worse than the original one, and it varies more between runs. This random behavior is key to achieving robust performance - it will lead to both better and worse estimates. To decide on the quality of the resulting estimates, we need to measure how well the prediction fits the target image. This measure is used to select the best set of points. In practice, a distance metric is used together with a threshold. So let's measure how well our prediction fits with a simple color likelihood and threshold out the values that are more than two standard deviations away from our assumed quality standard. If you execute the following code you will get the target image, the current estimate, the image difference, and the thresholded consensus set.

val sdev = 0.043f
val pixelEvaluator = IsotropicGaussianPixelEvaluatorHSV(sdev)
val howMuchSigmasAway = 2

val normalizer = pixelEvaluator.logValue(RGB.White, RGB.White)
// cutoff on the normalized likelihood corresponding to howMuchSigmasAway standard deviations
val threshold: Double = Math.exp(-0.5 * Math.pow(howMuchSigmasAway, 2))

val nonRobust = shOptimizerProposal.propose(poseInit)
val nonRobustImg = rendererFace12.renderImage(nonRobust)


var counter = 0
val diff = (x: Int, y: Int) => {
    val c = nonRobustImg(x,y)
    if (c.a > 0  ) {
        counter = counter + 1
        pixelEvaluator.logValue(c.toRGB, target(x, y).toRGB)
    }
    else
    // pixels outside the rendered face region get a value just above the threshold
        Math.log(threshold) + normalizer + 1
}


val differenceImage = PixelImage(target.domain, diff, AccessMode.Strict[Double])
val thresholded = differenceImage.map(d => if (Math.exp(d - normalizer) > threshold) 1 else 0)
shelf(
  ImagePanel(target),
  ImagePanel(nonRobustImg),
  ImagePanel(differenceImage.map(RGB(_))),
  ImagePanel(thresholded.map(RGB(_)))
).displayIn("Nonrobust Illumination Estimation")

If you execute the code multiple times, you can see the random behavior and the different consensus sets. In the end, the RANSAC procedure just executes this estimation multiple times and selects the model with the largest consensus set. For a more reliable estimate, it estimates the final model not only on the \(n\) samples but on all pixels in the consensus set.
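To make the overall procedure explicit, here is a minimal, generic sketch of such a RANSAC loop in plain Scala. It is not the library implementation; the parameters estimate and isConsistent are placeholders for the spherical harmonics solver and the thresholded color likelihood used above.

// generic RANSAC sketch (illustration only, not the library implementation):
// `estimate` fits model parameters from a set of data points,
// `isConsistent` decides whether a single point agrees with an estimated model
def ransac[D, M](data: IndexedSeq[D],
                 sampleSize: Int,
                 iterations: Int,
                 estimate: Seq[D] => M,
                 isConsistent: (M, D) => Boolean): (M, Seq[D]) = {
  val candidates = for (_ <- 0 until iterations) yield {
    val sample = scala.util.Random.shuffle(data).take(sampleSize) // 1. minimal random sample
    val model = estimate(sample)                                  // 2. estimate on the sample
    val consensus = data.filter(p => isConsistent(model, p))      // 3. collect the consensus set
    (model, consensus)
  }
  val (_, bestConsensus) = candidates.maxBy(_._2.size)            // 4. keep the largest consensus set
  (estimate(bestConsensus), bestConsensus)                        // refit on the full consensus set
}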

The full algorithm is implemented in RobustSHLightSolverProposalWithLabel - all the default values for the RANSAC parameters are set to the values used in our publication, Egger 2017. If you need more details, the implementation is verbosely commented. To use the algorithm, we simply build a proposal similar to the non-robust version and draw a sample. We perform 100 iterations, which takes about a minute (to speed this up, use fewer iterations or a lower-resolution face mesh).

val robustShOptimizerProposal = RobustSHLightSolverProposalWithLabel(rendererFace12, shOpt, target, iterations = 100)
val dummyImg = target.map(_ => 0)
val robust = robustShOptimizerProposal.propose(poseInit, dummyImg)

val poseInitImg = rendererFace12.renderImage(poseInit)
val robustImg = rendererFace12.renderImage(robust._1)
val consensusSet = robust._2.map(RGB(_))

shelf(
  ImagePanel(target),
  ImagePanel(poseInitImg),
  ImagePanel(robustImg),
  ImagePanel(consensusSet)
).displayIn("Robust Illumination Estimation")

This algorithm has two outputs which are useful for us. The first output is the illumination parameters, which are part of our model parameters \(\theta\). We use them to initialize not only the pose but also the illumination condition. The second output is the consensus set of the RANSAC algorithm. It tells us which pixels are consistent with the estimated illumination. The consensus set still excludes a lot of pixels which are part of the face, but it only rarely includes pixels which are not part of the face. Therefore, the consensus set of our robust illumination estimation technique provides a strong initialization for our segmentation label \(z\).
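As a quick sanity check of this initialization, the consensus set can be visualized directly on the target image. The following sketch assumes that the returned label image robust._2 contains 1 for consensus pixels and 0 otherwise; it reuses the PixelImage construction pattern from above.

// overlay the consensus set on the target image to inspect the initial label z
// (assumption: robust._2 holds 1 for consensus pixels and 0 otherwise)
val zInit = robust._2
val overlayFun = (x: Int, y: Int) => {
  val c = target(x, y).toRGB
  if (zInit(x, y) == 1) RGBA(c.r, c.g, c.b, 1.0)   // keep consensus pixels unchanged
  else RGBA(0.3 * c.r, 0.3 * c.g, 0.3 * c.b, 1.0)  // darken pixels outside the consensus set
}
val overlay = PixelImage(target.domain, overlayFun, AccessMode.Strict[RGBA])
ImagePanel(overlay).displayIn("Initial label z from the consensus set")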

If you look closer at the result, you will realize that a lot of pixels which are actually part of the face are excluded from the model estimation. Everything that does not fit the current shape and albedo is excluded. For some parts, like the eyebrows, we would need better parameters \(\theta\), whilst other parts, like wrinkles, currently cannot be explained by our face model. The robust illumination estimation is therefore only suitable as an initialization - there is still a need for the full inference and for simultaneous model adaptation and segmentation.

This robust illumination estimation was also used to derive the Basel Illumination Prior. If you want to see some results and an evaluation, or if you want to learn more about this illumination estimation, you will find them in Chapter 5 of Egger 2017.

