| Face colour under varying illumination - analysis and applications |
|---|
| Chapter 6. Skin locus in face tracking |
The skin locus constraint can also be implemented as a colour distribution adaptation module for other tracking algorithms (Paper VIII), such as mean-shift (Comaniciu et al. 2000). Appendix 3 contains a detailed description of this algorithm. Basically, the mean-shift algorithm searches for the nearest place in which the difference between the computed colour distribution and the defined object colour distribution is smaller than some fixed threshold. To maintain good tracking performance, either the defined distribution should be adapted to environmental changes or the changes should be cancelled. Cancellation is possible with a colour constancy algorithm or by restricting tracking to environments where the illumination does not change. Because the performance of colour constancy algorithms has not been satisfactory so far, and robust methods for normal, unrestricted environments are sought, adapting the object colour distribution is a very attractive solution.
The skin locus is now modelled in rb-space with three straight lines (Figure 38). The limits are modelled by giving the range of chromaticity b:
in which the chromaticity r can vary between 0.34 and 0.71.
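As an illustration only, a locus membership test in rb-space might be sketched as follows. The chromaticities are taken to be the usual normalized coordinates r = R/(R+G+B) and b = B/(R+G+B), and the range of r comes from the text. The line coefficients are not reproduced here, so `b_lower` and `b_upper` are hypothetical callables that the caller must supply; the function names are likewise assumptions made for this sketch.

```python
R_MIN, R_MAX = 0.34, 0.71  # range of chromaticity r stated in the text


def rb_chromaticity(red, green, blue):
    """Return normalized chromaticities (r, b) of one RGB pixel, or None for black."""
    total = red + green + blue
    if total == 0:
        return None  # chromaticity is undefined for a black pixel
    return red / total, blue / total


def in_skin_locus(red, green, blue, b_lower, b_upper):
    """True if the pixel's (r, b) lies inside the locus.

    b_lower and b_upper are HYPOTHETICAL callables giving the lower and
    upper limit of b for a given r; the real line coefficients are not
    reproduced in this sketch.
    """
    rb = rb_chromaticity(red, green, blue)
    if rb is None:
        return False
    r, b = rb
    return R_MIN <= r <= R_MAX and b_lower(r) <= b <= b_upper(r)
```

Passing the limits as callables keeps the sketch independent of the particular straight-line parameters, which would have to be taken from Figure 38 or the original equation.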
Initialization for the adaptive mean-shift with skin locus was done by hand, and the initialization coordinates were saved. The manually selected area S was then filtered using the whole skin locus in order to remove non-skin coloured pixels. From the filtered result, the colour distribution for the object was calculated. Unlike with backprojection tracking, only a subsection of the skin locus was used at a time. The range of the subsection was dynamically updated from each segmented image. Instead of using a fixed threshold for excluding those skin chromaticities which appear too often in the background, a video-independent dynamic threshold T was used:
where KC = the size of the bounding box obtained from initialization or algorithm calculations,
KFRAME = the size of the whole frame, and
ξ = a parameter to be set.
The positive parameter ξ defines the shape of the threshold function. If ξ is not near 0 or 1, the thresholding is a nonlinear function of the size ratio. After initialization, the following procedure was repeated until the end of the sequence:
1. A new frame is filtered with a subsection of the skin locus.
2. Threshold T is calculated and applied to the current skin colour model for the object. Those skin chromaticities whose background support exceeds the threshold value are excluded.
3. Mean shift is allowed to localize the bounding box, and a new skin colour model is calculated on the segmented image.
4. A new, updated skin colour model for the object is obtained as an average of the “old” and “new” colour models. Those chromaticities which appear only in the “old” colour model are removed.
5. The skin locus subsection range is updated using the localized bounding box.
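The thresholding and averaging steps of the procedure above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the experiments: the colour model is assumed to be a dictionary mapping a quantized chromaticity bin to a weight, `background_support` is a hypothetical dictionary holding how strongly each bin occurs in the background, and the threshold value is passed in rather than computed. A bin that appears only in the new model is averaged with zero, and no renormalization is performed; both choices are assumptions.

```python
def exclude_background(model, background_support, threshold):
    """Drop chromaticity bins whose background support exceeds the threshold."""
    return {
        chroma: weight
        for chroma, weight in model.items()
        if background_support.get(chroma, 0.0) <= threshold
    }


def update_model(old_model, new_model):
    """Average the old and new models, keeping only bins present in the new one.

    Bins that occur only in the old model are dropped, as described in
    the text. A bin missing from the old model counts as zero there
    (an assumption of this sketch).
    """
    return {
        chroma: 0.5 * (old_model.get(chroma, 0.0) + new_weight)
        for chroma, new_weight in new_model.items()
        if new_weight > 0
    }
```

For example, with `old = {(10, 5): 0.4, (11, 5): 0.6}` and `new = {(11, 5): 0.2, (12, 6): 0.8}`, the bin `(10, 5)` disappears from the updated model because it has no support in the new frame.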
Table 18 shows the localization and segmentation results after applying the previous algorithm to a video. It is not argued here that this is the optimal implementation; the purpose is mainly to show that even with this implementation, the results are better than for the methods not based on the skin locus. The size of the bounding box was frozen after the first frame to make comparison possible with the fixed skin colour model tracking and the adaptive tracking with spatial constraint over the whole sequence (Tables 19 and 20). Otherwise, the performance of these two tracking methods became unstable. The colour model for the static distribution tracking was obtained from the first frame of the video. For the spatially restricted adaptive tracking (after Y. Raja et al. 1998), the pixels for updating the colour model are obtained from a smaller bounding box centred inside the localization bounding box and 1/3 its size.
Quite obviously, locus based tracking cannot separate the face area from the neck. This is also true for the other methods. The segmentation results are best with the skin locus, and they depend on the background. In some frames the segmentation is so good that the size of the face could be reliably determined from it. The worst result is obtained with the fixed colour model. Both the fixed and the spatially adaptive methods suffer a tracking failure. The spatial adaptation method is prone to adapting to something which is not skin, as demonstrated in the results for frame 300 of Table 19. It seems to track better when the illumination field is uniform (the pixels selected for model updating then represent the total colour distribution of the face well) and the background does not contain similar colours. The fixed method works when the illumination field is quite stable.
For numerical evaluation of the tracking results, three different metrics were used. The first one is the localization quality, for which the following equation can be used to express the goodness G of the tracking:
$$G = \frac{A_{GT} \cap A_{C}}{A_{GT} \cup A_{C}}$$
where AGT = area size for the manually selected ground truth bounding box, and
AC = area size for the computed bounding box.
This metric evaluates the tracking goodness as an intersection of the found and ground truth bounding boxes against the total area which the bounding boxes cover.
From the goodness measure G, the error measure E = 1 - G can be calculated. Figure 39 displays examples of possible ground truth and computed bounding boxes at different goodness and error values. The localization error for video 1, tracked with three different methods, is shown in Figure 40. The static model has large errors when the object model is no longer valid. The two adaptive methods produce similar results across the frames, but with the geometrical constraint there are more spurious error spikes. The locus constraint produces the most stable performance.
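The goodness and error measures described above can be computed for axis-aligned bounding boxes along the following lines. The box representation `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for this sketch; the measure itself is the intersection area of the two boxes divided by the total area they cover.

```python
def box_area(box):
    """Area of an axis-aligned box (x_min, y_min, x_max, y_max); zero if empty."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)


def goodness(ground_truth, computed):
    """Intersection area divided by the total area covered by the two boxes."""
    overlap = (
        max(ground_truth[0], computed[0]),
        max(ground_truth[1], computed[1]),
        min(ground_truth[2], computed[2]),
        min(ground_truth[3], computed[3]),
    )
    intersection = box_area(overlap)
    union = box_area(ground_truth) + box_area(computed) - intersection
    return intersection / union if union > 0 else 0.0


def localization_error(ground_truth, computed):
    """Error measure E = 1 - G."""
    return 1.0 - goodness(ground_truth, computed)
```

Two identical boxes give G = 1 and E = 0, while boxes that do not overlap at all give G = 0 and E = 1.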

Figure 40. Localization error of the bounding box for three different tracking methods with a static colour model, an adapted model based on geometrical constraint and an adaptive model based on chromaticity constraint.
The segmentation results were evaluated using the true positive and false positive metrics. The formula of the true positive metric TP is
$$TP = \frac{N_{TP}}{N_{IN}}$$
where NTP = the number of detected face pixels, and
NIN = the number of pixels inside the face polygon.
TP describes how many true object pixels were found in relation to the total number of object pixels. The nearer the value is to one, the more object pixels were found; the value one means that all object pixels were found. Figure 41 shows the TP metric results for video 1. The static model tracking found almost all face pixels when the illumination conditions were close to those in the first frame. Because with the geometrical constraint only pixels inside the face localization are used in the colour model adaptation, not all facial chromaticities are present in the model even though they might belong to skin. This might be the reason for the tracking failure around frame 300: when the illumination field is nonuniform and changing, the selected pixels no longer correspond to the true colour distribution. This also shows the susceptibility of the method to adapting to something which is not skin. The locus based adaptation is the most stable and performs best over the whole sequence, even though the non-skin regions of the face were not removed.
The false positives FP are defined as
$$FP = \frac{N_{FP}}{N_{OUT}}$$
where NFP = the number of detected facially coloured pixels in the background, and
NOUT = the number of pixels outside the area defined by the face polygon.
FP indicates how many background pixels are activated by the colour model. When the FP value is zero or near zero and the TP value is large, the object is well separable from the background, leading to better and more reliable localization. For video 1, the FP values can be seen in Figure 42. The skin locus based method is superior because those chromaticities which occur too often in the background are excluded from the colour model adaptation. Since the static model method assumes that the colour distribution is unique in the scene and that the illumination does not change, it gives the poorest results. The geometrically adaptive model also relies on the assumption that the chromaticities are unique. As can be observed from Figure 42, the uniqueness of the chromaticities is highly dependent on the background and on the uniformity of the illumination field over the face.
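Both segmentation metrics can be computed from two binary masks, as in the sketch below. It assumes that the face polygon has already been rasterized into a mask, that both masks are same-sized nested lists, and that a detected face pixel means a detected pixel lying inside the polygon. The function name and the mask representation are illustrative choices, not taken from the original work.

```python
def segmentation_rates(detected, face):
    """Return (TP, FP) for two same-sized binary masks given as nested lists.

    detected[y][x] is truthy where the colour model fired; face[y][x] is
    truthy inside the ground-truth face polygon (rasterized beforehand).
    """
    n_tp = n_in = n_fp = n_out = 0
    for detected_row, face_row in zip(detected, face):
        for is_detected, is_face in zip(detected_row, face_row):
            if is_face:
                n_in += 1
                n_tp += bool(is_detected)  # detected pixel inside the polygon
            else:
                n_out += 1
                n_fp += bool(is_detected)  # detected pixel in the background
    tp = n_tp / n_in if n_in else 0.0
    fp = n_fp / n_out if n_out else 0.0
    return tp, fp
```

A TP value near one together with an FP value near zero corresponds to the well-separable case discussed above.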
In conclusion, the results presented in Figs. 40-42 show that the adaptive tracking method using the skin locus had the most stable and best performance on a video sequence with true, drastic illumination changes. It was necessary to keep the bounding box size fixed because the static and geometrically adaptive tracking methods behaved unstably.