Real-Time Face Detection and Recognition in Complex Background
1. Introduction
Real-time face detection and facial recognition play an important role in applications such as robot intelligence, smart cameras, security monitoring, and even criminal identification. Conventional algorithms for face detection and facial recognition are designed for still-face images or color images. In color images, the color information increases data complexity by mapping pixels onto a higher-dimensional space, which greatly reduces the processing speed and accuracy of face detection and recognition [1].
There are several approaches to the facial recognition problem. Since faces are usually round or oval and of relatively uniform color, the simplest approach is to detect them by color segmentation. However, color segmentation cannot adapt to changing environments, such as varying lighting conditions. More adaptive and robust methods may not be able to operate in real time since they require more computational power. Such adaptive algorithms usually employ statistical concepts to various degrees, for example template matching [2], Support Vector Machines (SVM) [3], color segmentation [4] or neural networks [5]. More reliable descriptors such as the Histogram of Oriented Gradients (HOG) [6], the Scale-Invariant Feature Transform (SIFT) [7], Local Binary Patterns (LBP) [8], or Haar-like features [9] are used to extract facial features for face detection. Facial recognition is typically based on Principal Component Analysis (PCA) [10], Linear Discriminant Analysis (LDA) [11], holistic matching methods [12] or feature-based methods [7]. For practical applications, faces need to be detected and recognized in real time, often against complex backgrounds.
The algorithms proposed in this paper process gray-scale images to detect and recognize faces in real time with high accuracy. The combination of the AdaBoost algorithm and the cascade classifier [13] improves the detection accuracy. The face detection algorithm uses a cascade classifier based on the $\mathrm{LBP}_{8,2}^{u2}$ descriptor [8], providing a higher processing speed. The eye detection also uses a cascade classifier, but based on the Haar-like descriptor, to ensure a low false-positive face detection rate. The result of the facial recognition training can be improved significantly through efficient pre-processing of the training data. After training, the PCA algorithm is used for facial recognition. The flowchart for real-time face detection and recognition is shown in Figure 1.
The implemented algorithm can be divided into three stages: 1) face and eye detection; 2) facial image normalization and enhancement; and 3) facial recognition and face sample collection. In stage 1, two different cascade classifiers are used to detect the faces and the eyes, respectively. Both classifiers are trained with the AdaBoost algorithm. In stage 2, the faces detected in the previous stage are normalized to a fixed size and orientation; the background is discarded, and the contrast and lighting are enhanced. In stage 3, the algorithm tracks the differences between faces in the detection windows. When a significant difference is found, the algorithm recognizes the face using PCA and collects it to further train the recognition algorithm. With the help of the pre-processing and eye detection modules, the method proposed in this paper operates accurately regardless of the background.
2. Descriptors for Real-Time Detection
Figure 1. Flowchart for real-time face detection and recognition.

2.1. $\mathrm{LBP}_{8,2}^{u2}$ Descriptor

The $\mathrm{LBP}_{8,2}^{u2}$ descriptor is used to extract facial features for face detection. LBP stands for Local Binary Pattern; every pattern of the facial image is encoded and counted to construct a spatially enhanced histogram representing local primitives. The subscript 8,2 indicates that the LBP descriptor uses 8 sampling points within a radius of 2 pixels. The superscript u2 indicates that the descriptor uses uniform patterns, i.e. 8-bit binary codes containing at most two transitions (0 to 1 or 1 to 0). The descriptor uses 58 bins for the 58 uniform patterns and 1 bin for the 198 non-uniform patterns; uniform patterns account for almost 90% of the local primitives [14]. Due to the shorter histogram, the calculation is greatly simplified by using the $\mathrm{LBP}_{8,2}^{u2}$ descriptor. Each sample histogram is compared with the template histogram to find the threshold for each region. The encoding process of the $\mathrm{LBP}_{8,2}^{u2}$ descriptor is shown in Figure 2.

Figure 2. $\mathrm{LBP}_{8,2}^{u2}$ descriptor encoding.
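For illustration, a minimal NumPy sketch of this encoding is given below. The nearest-pixel sampling of the radius-2 circle, the neighbor-versus-center comparison and the ordering of the 59 bins are implementation choices made for the example; they are consistent with the description above but are not taken from the paper's code.

```python
import numpy as np

def transitions(code):
    """Number of 0/1 transitions in an 8-bit circular binary code."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))

# The 58 uniform patterns (at most two transitions) each get their own bin;
# all 198 non-uniform patterns share the last bin (index 58).
UNIFORM_CODES = [c for c in range(256) if transitions(c) <= 2]
UNIFORM_BIN = {code: i for i, code in enumerate(UNIFORM_CODES)}

def lbp_8_2_u2_histogram(region):
    """59-bin uniform LBP histogram (8 points, radius 2) of a gray-scale region."""
    # 8 sampling offsets on a circle of radius 2, rounded to whole pixels.
    offsets = [(int(round(2 * np.cos(2 * np.pi * k / 8))),
                int(round(2 * np.sin(2 * np.pi * k / 8)))) for k in range(8)]
    hist = np.zeros(59, dtype=np.int32)
    h, w = region.shape
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            center = region[y, x]
            code = 0
            for bit, (dx, dy) in enumerate(offsets):
                if region[y + dy, x + dx] >= center:
                    code |= 1 << bit
            hist[UNIFORM_BIN.get(code, 58)] += 1
    return hist
```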
2.2. Haar-Like Descriptor
The Haar-like descriptor is utilized to extract eye features. Each Haar-like feature is composed of several neighboring rectangular regions, as shown in Figure 3. The sum of the pixel values in the black rectangular regions is subtracted from the sum of the pixel values in the white rectangular regions; the result is the value of the Haar-like feature. As a Haar-like feature is moved through the detection window, the area with the minimum value is the best match for this feature.
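In practice, Haar-like feature values are evaluated efficiently with an integral image (summed-area table). The sketch below shows a single hypothetical two-rectangle feature; the rectangle layout is an illustrative assumption, not a feature from the paper's trained classifier.

```python
def haar_two_rect_value(integral, x, y, w, h):
    """Value of a hypothetical two-rectangle Haar-like feature at (x, y):
    sum of the white (left) half minus sum of the black (right) half.
    `integral` is a summed-area table, e.g. integral = cv2.integral(gray)."""
    def rect_sum(x0, y0, rw, rh):
        # Sum of the pixels inside the rectangle from four table lookups.
        return (integral[y0 + rh, x0 + rw] - integral[y0, x0 + rw]
                - integral[y0 + rh, x0] + integral[y0, x0])

    white = rect_sum(x, y, w // 2, h)            # left half (white region)
    black = rect_sum(x + w // 2, y, w // 2, h)   # right half (black region)
    return white - black
```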
3. Face Detection Algorithms
3.1. Face Detection Classifier
The AdaBoost algorithm [15] is used to extract the best features for detecting faces. The best features are chosen as weak classifiers and then combined in a weighted sum to construct a strong classifier, as shown in the following equation:
$$F(x) = \sum_{i=1}^{n} \alpha_i f_i(x) \quad (1)$$

In Equation (1), $f_1(x), \ldots, f_n(x)$ are the n weak classifiers used to construct the strong classifier $F(x)$. The parameters $\alpha_1, \ldots, \alpha_n$ are the weights associated with the n weak classifiers. The strong classifier can be used to detect faces with the following equation:
$$H(x) = \begin{cases} 1, & F(x) \geq \theta \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

In Equation (2), $\theta$ is the threshold used by the strong classifier to detect a face: "1" indicates that a face is present while "0" indicates that no face is detected. In our paper, the trained strong classifier correctly detects faces with a high detection accuracy of 98.8%.
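Equations (1) and (2) translate directly into a weighted vote followed by a threshold test; the weak classifiers, weights and threshold below are placeholders rather than the trained values.

```python
def strong_classifier(x, weak_classifiers, alphas, theta):
    """Equations (1) and (2): weighted vote of weak classifiers, thresholded
    by theta. Each weak classifier maps a sample x to 0 or 1."""
    F = sum(a * f(x) for f, a in zip(weak_classifiers, alphas))  # Eq. (1)
    return 1 if F >= theta else 0                                # Eq. (2)
```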
The cascade classifiers are trained using the AdaBoost algorithm. A cascade classifier consists of a series of tests on the input features, as shown in Figure 4. The selected features are separated into several stages, and each stage is trained to be a strong classifier built from the best weak classifiers. The tested implementation uses 120 LBP and 32 Haar features as weak classifiers. Each stage decides whether the detection window might contain a face or not; the window is discarded immediately once it fails at any stage. As a result of this cascading, areas without faces are discarded in the early stages and therefore processed faster. The number of stages is defined during training and is chosen to achieve a predetermined detection accuracy.
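The early-rejection behaviour of the cascade can be sketched as follows, where each stage is a strong classifier of the form given in Equations (1) and (2); the stage contents are again placeholders.

```python
def cascade_detect(window, stages):
    """Evaluate a cascade of strong classifiers on one detection window.
    Each stage is a (weak_classifiers, alphas, theta) tuple; the window is
    rejected as soon as any stage's weighted vote falls below its threshold."""
    for weak_classifiers, alphas, theta in stages:
        F = sum(a * f(window) for f, a in zip(weak_classifiers, alphas))
        if F < theta:
            return False   # non-face window discarded in an early stage
    return True            # passed every stage: likely a face
```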
The Chi-Squared difference [16] is used by the face detection classifier. It is calculated between the LBP encoded histogram of a face detection region and the LBP encoded histogram of a predefined template image, which is obtained by averaging 2400 facial images. The images used contain faces of various skin colors, sexes and ages, and are all taken from the MIT CBCL face database. The difference is then compared with a predefined threshold for classification. The Chi-Squared difference equation is:
$$\chi^2(S, M) = \sum_{i} \frac{(S_i - M_i)^2}{S_i + M_i} \quad (3)$$

Figure 4. Flowchart of a cascading classifier.

In Equation (3), $S_i$ and $M_i$ are the numbers of features in the i-th bin of the LBP encoded histograms of the detection region and of the template image, respectively. If the Chi-Squared difference is smaller than the threshold, the detection window is considered to contain a face. Results of the face detection in various conditions are shown in Figure 5.
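Equation (3) can be computed directly from the two histograms; the small epsilon below is only added to guard against empty bins and is not part of the original formula.

```python
import numpy as np

def chi_squared_difference(hist_region, hist_template, eps=1e-10):
    """Equation (3): Chi-Squared difference between the LBP histogram of a
    detection region and the histogram of the averaged template image."""
    s = hist_region.astype(np.float64)
    m = hist_template.astype(np.float64)
    return np.sum((s - m) ** 2 / (s + m + eps))

# A window is classified as a face when the difference falls below a
# predefined threshold chosen during training.
```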
3.2. Eyes Detection
The Haar-like descriptor is used to detect both eyes of the face in order to enhance the face detection accuracy. The origin of the coordinate system of the facial image is chosen to be the top-left corner. Two rectangular eye-search regions of the same size are extracted from each facial image at predefined positions. For the left eye, the region extends along the x axis from 10% to 38% of the image width, and along the y axis from 15% to 40% of the image height. Since the right-eye search region is symmetric with respect to the left-eye search region, the same proportions are applied from the other side of the image. Figure 6 shows the result of the eye detection algorithm.
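Under the assumption that the detected face is stored as a gray-scale array indexed as [row, column] with the origin at the top-left corner, the two eye-search regions can be cut out with simple proportional indexing:

```python
def eye_search_regions(face):
    """Return (left_region, right_region) of a gray-scale face image, using
    the proportions given in Section 3.2: x in [10%, 38%] of the width and
    y in [15%, 40%] of the height, mirrored for the right eye."""
    h, w = face.shape[:2]
    y0, y1 = int(0.15 * h), int(0.40 * h)
    left = face[y0:y1, int(0.10 * w):int(0.38 * w)]
    # The right-eye region is symmetric with respect to the vertical axis.
    right = face[y0:y1, int(0.62 * w):int(0.90 * w)]
    return left, right
```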
Figure 6. Eyes detection in eye-search regions.
4. Facial Recognition
4.1. Affine Transformation
An affine transformation [17] is used to rectify the orientation and scale of the detected facial images and improve the recognition accuracy. An affine matrix is used to scale the detected facial image to the desired size and rotate it so that the two eyes are horizontal.
$$A = \begin{bmatrix} s_x \cos\theta & -s_x \sin\theta & t_x \\ s_y \sin\theta & s_y \cos\theta & t_y \end{bmatrix} \quad (4)$$

In Equation (4), $A$ is the affine matrix, $s_x$ and $s_y$ are the scaling ratios in the x and y directions, $t_x$ and $t_y$ are the translation factors in the x and y directions, and $\theta$ is the rotation angle of the image. The position of each pixel of the original facial image is multiplied by the affine matrix to constitute the corrected image, with a resolution of 70 × 70 pixels.
Figure 7 shows a facial image after correction. The two eyes are now horizontal and the image is resized to a standard dimension. The image is cropped to only show the facial features and discard the background.
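One way to realize the correction of Equation (4) with OpenCV is to build the rotation and scaling part with cv2.getRotationMatrix2D around the midpoint between the two detected eyes and then warp to a fixed 70 × 70 output. The target eye spacing and vertical placement used below are illustrative assumptions; the paper does not specify them.

```python
import cv2
import numpy as np

def align_face(gray, left_eye, right_eye, size=70):
    """Rotate, scale and crop a face so both eyes are horizontal and the
    output is size x size pixels (cf. Equation (4)).
    left_eye / right_eye are (x, y) centres found by the eye detector."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))   # current eye-line angle
    eyes_center = ((lx + rx) / 2.0, (ly + ry) / 2.0)

    # Assumed target geometry: eyes horizontal, 40% of the output width
    # apart, with their midpoint placed at 35% of the output height.
    desired_dist = 0.4 * size
    scale = desired_dist / max(np.hypot(rx - lx, ry - ly), 1e-6)

    M = cv2.getRotationMatrix2D(eyes_center, angle, scale)
    # Translate so the eye midpoint lands at (size/2, 0.35*size).
    M[0, 2] += size / 2.0 - eyes_center[0]
    M[1, 2] += 0.35 * size - eyes_center[1]
    return cv2.warpAffine(gray, M, (size, size), flags=cv2.INTER_LINEAR)
```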
4.2. Histogram Equalization
The facial images of the same person can change drastically under various lighting conditions. A histogram equalization algorithm [18] is used to enhance the contrast of the detected facial images. The algorithm consists of replacing the pixel values using a function designed to spread out the histogram. The function is given by the following Equation (5).
$$H(v) = \operatorname{round}\!\left(\frac{\mathrm{CDF}(v) - \mathrm{CDF}_{\min}}{M N - \mathrm{CDF}_{\min}} \times (L - 1)\right) \quad (5)$$

In this equation, CDF(v) is the cumulative distribution function of the pixels with value v, used to calculate the equalized value H(v), and CDF_min is its smallest non-zero value. M and N are the numbers of rows and columns of the facial image, respectively. L is 256 and represents the gray-scale range.
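Equation (5) maps almost directly onto a few lines of NumPy; in practice the same result can be obtained with OpenCV's cv2.equalizeHist, but the explicit form below mirrors the formula.

```python
import numpy as np

def equalize_histogram(gray):
    """Histogram equalization of an 8-bit gray-scale image (Equation (5))."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = np.cumsum(hist)
    cdf_min = cdf[np.nonzero(cdf)][0]            # smallest non-zero CDF value
    m_n = gray.size                              # M * N pixels
    lut = np.clip(np.round((cdf - cdf_min) * 255.0 / max(m_n - cdf_min, 1)),
                  0, 255).astype(np.uint8)
    return lut[gray]                             # map every pixel through H(v)
```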
Figure 8 shows the enhancement of the facial image using the histogram equalization algorithm. In strong lighting conditions, however, one side of the face can be more exposed to the light than the other, resulting in a significant lighting difference between the two sides. Figure 9 shows an alternative
Figure 7. Affine transformation of a facial image.
Figure 8. Histogram equalization in weak lighting condition.
Figure 9. Separated histogram equalization in strong lighting condition.
Figure 10. Improved histogram equalization in strong lighting condition.
processing that applies the histogram equalization separately to each side of the face.
In Figure 9, there is still a large lighting difference between the two sides of the face, which can affect the recognition accuracy. In Figure 10, we propose an improved histogram equalization that reduces this lighting difference by gradually mixing the separated histogram equalization with the whole-face histogram equalization from the left or right edge towards the center. The far left and far right regions therefore use the separated histogram equalization, while the central region smoothly blends the left- or right-equalized values with the whole-face equalized values.
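A possible rendering of this improved equalization is sketched below: the two halves are equalized separately, the whole face is equalized once more, and the two results are blended with a weight that grows linearly towards the center column. The linear blending profile is an assumption, since the exact mixing function is not specified.

```python
import cv2
import numpy as np

def improved_equalization(face):
    """Blend whole-face and per-side histogram equalization (Section 4.2)."""
    h, w = face.shape
    whole = cv2.equalizeHist(face)

    # Separate equalization of the left and right halves.
    left = cv2.equalizeHist(np.ascontiguousarray(face[:, : w // 2]))
    right = cv2.equalizeHist(np.ascontiguousarray(face[:, w // 2 :]))
    side = np.hstack([left, right])

    # Blending weight: 0 at the outer edges (pure per-side result),
    # rising linearly to 1 at the centre column (pure whole-face result).
    x = np.arange(w, dtype=np.float32)
    weight = 1.0 - np.abs(x - (w - 1) / 2.0) / ((w - 1) / 2.0)
    weight = np.tile(weight, (h, 1))

    blended = weight * whole.astype(np.float32) + (1.0 - weight) * side.astype(np.float32)
    return np.clip(blended, 0, 255).astype(np.uint8)
```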
4.3. Gaussian Filter
A Gaussian filter [19] is used to remove noise from the pre-processed facial images and maintain a high facial recognition accuracy. A convolution matrix produced by a Gaussian function is used to smooth the facial images. The 2-D Gaussian function is given in Equation (6).
$$G(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \quad (6)$$

The 3 × 3 normalized convolution matrix $H$, obtained by sampling this function, is adopted for smoothing while preserving edges, as shown in Equation (7):

$$H(i, j) = \frac{G(i, j)}{\sum_{m=-1}^{1} \sum_{n=-1}^{1} G(m, n)}, \qquad i, j \in \{-1, 0, 1\} \quad (7)$$
The convolution process is defined by Equation (8). For each pixel of the output image I′, the pixels of the original image I around this position are multiplied by the coefficients of the matrix H and then summed up. The resulting image is slightly smaller, with a size of 68 × 68 pixels.
$$I'(x, y) = \sum_{m=-1}^{1} \sum_{n=-1}^{1} H(m, n)\, I(x + m, y + n) \quad (8)$$
Figure 11 shows the Gaussian filter removing the high-frequency noise in the pre-processed facial image.
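The smoothing step can be reproduced either with cv2.GaussianBlur or by applying Equations (7) and (8) directly; the "valid" convolution below turns a 70 × 70 input into a 68 × 68 output, matching the size mentioned above. The σ value shown is a placeholder, not the paper's setting.

```python
import numpy as np

def gaussian_smooth(face, sigma=0.8):
    """Valid 3x3 Gaussian convolution (Equations (6)-(8)); a 70x70 input
    yields a 68x68 output. sigma is a placeholder, not the paper's value."""
    # Sample the 2-D Gaussian on a 3x3 grid and normalize (Equations (6)-(7)).
    coords = np.arange(-1, 2)
    xx, yy = np.meshgrid(coords, coords)
    H = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    H /= H.sum()

    h, w = face.shape
    out = np.zeros((h - 2, w - 2), dtype=np.float64)
    # Equation (8): weighted sum of the 3x3 neighbourhood of every pixel.
    for dy in range(3):
        for dx in range(3):
            out += H[dy, dx] * face[dy:dy + h - 2, dx:dx + w - 2]
    return out.astype(np.uint8)

# Equivalent up to border handling, using OpenCV:
# smoothed = cv2.GaussianBlur(face, (3, 3), sigmaX=0.8)[1:-1, 1:-1]
```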
4.4. Principal Component Analysis
The desired facial images are first collected as samples for training the new coordinate system. Every pixel of the image is represented by a variable in one dimension describing facial features, so the features of each desired facial image can be represented by a column vector with 70 × 70 = 4900 dimensions. The PCA algorithm is used to recognize high-dimensional facial images with few principal components. The new base vectors $u_1, u_2, \ldots, u_D$ are obtained by maximizing the sample variance and minimizing the mean squared error.
$$\min_{u_1, \ldots, u_D} \frac{1}{P} \sum_{p=1}^{P} \left\| x_p - \hat{x}_p \right\|^2 \quad (9)$$

$$u_i^{\mathrm{T}} u_j = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \quad (10)$$

In Equation (9), the collected facial sample in the original coordinate system is represented as $x_p$, and the sample reconstructed from the principal components is represented as $\hat{x}_p$. Equation (10) shows that the base vectors are orthogonal to each other. The Lagrange multiplier method is used to find the local minima of the function. The solution is shown in Equations (11) and (12).
$$C = \frac{1}{P} \sum_{p=1}^{P} (x_p - \bar{x})(x_p - \bar{x})^{\mathrm{T}}, \qquad \bar{x} = \frac{1}{P} \sum_{p=1}^{P} x_p \quad (11)$$

$$C\, u_i = \lambda_i u_i \quad (12)$$

$C$ is the N × N covariance matrix of the sample vectors, whose common features are removed by subtracting the average vector of the data. $\bar{x}$ is the average vector. $\lambda_i$ are the eigenvalues of the covariance matrix. N indicates the dimension of each sample vector and P is the number of collected samples. The best base vector $u_1$ is the eigenvector of the covariance matrix with the largest eigenvalue $\lambda_1$. The flowchart of the PCA algorithm for facial recognition is shown in Figure 12. A value of D = 100 is selected as the number of principal components used to represent the collected samples. A new face can be described with only 100 dimensions, since the 100 principal components in the new coordinate system capture most of its features. The projection of each collected facial image onto the 100 principal components forms a 100-dimensional column vector representing that training sample. If the difference between the reconstructed face and the new face is above the threshold T = 0.4, the new face has not been recorded and is displayed as an "unknown face". Otherwise, the new face is identified as the sample face with the closest match.
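To make the recognition stage concrete, the sketch below trains a D = 100 eigenface model with NumPy, using the standard small-matrix trick for the eigen-decomposition of Equation (12), and classifies a new 70 × 70 face. The rejection test follows the description above, comparing the reconstructed face with the new face against T = 0.4; the normalization of that difference is an assumption, since the paper does not give its exact scale.

```python
import numpy as np

class PCAFaceRecognizer:
    """Eigenface recognition (Section 4.4): D = 100 principal components,
    threshold T = 0.4 on the (assumed normalized) reconstruction difference."""

    def __init__(self, n_components=100, threshold=0.4):
        self.D = n_components
        self.T = threshold

    def train(self, samples, labels):
        # samples: P x 4900 matrix, one flattened 70x70 face per row.
        X = np.asarray(samples, dtype=np.float64) / 255.0
        self.labels = list(labels)
        self.mean = X.mean(axis=0)                      # average vector (Eq. 11)
        A = X - self.mean
        # Eigen-decomposition via the small P x P matrix, which shares its
        # non-zero eigenvalues with the 4900 x 4900 covariance matrix (Eq. 12).
        eigvals, V = np.linalg.eigh(A @ A.T)
        order = np.argsort(eigvals)[::-1][: self.D]
        U = A.T @ V[:, order]                           # 4900 x D basis vectors
        self.U = U / np.linalg.norm(U, axis=0)          # normalized columns
        self.weights = A @ self.U                       # P x D training projections

    def recognize(self, face):
        x = np.asarray(face, dtype=np.float64).ravel() / 255.0 - self.mean
        w = x @ self.U                                  # projection on D components
        recon = self.U @ w                              # mean-free reconstruction
        # Relative reconstruction error; the scaling of the paper's threshold
        # T = 0.4 is not specified, so this normalization is an assumption.
        err = np.linalg.norm(x - recon) / (np.linalg.norm(x) + 1e-9)
        if err > self.T:
            return "unknown face"
        dists = np.linalg.norm(self.weights - w, axis=1)
        return self.labels[int(np.argmin(dists))]
```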
Figure 12. Flowchart of facial recognition with PCA algorithm.
5. Results
The sample images used to train the face detector come from the MIT CBCL Face Database [20]. It includes 2492 faces with different identities, skin colors and head poses, and 4548 non-face images. The eye samples were extracted from the detected facial images in order to train the eye detector. The algorithms were run on a computer with an Intel 2.50 GHz Core i7-3537U CPU, at VGA resolution, on a single thread. The processing time for detecting each face is 11.4 ms, and the processing time for detecting each pair of eyes within the facial regions is 15.3 ms. With the help of the cascade classifier, the system eliminates most non-facial regions with little computational work. The resulting system is almost 3 times faster than the Joint Cascade detector [21], which takes 28.6 ms per face on a 2.93 GHz CPU at the same resolution, and about 3000 times faster than the detector of Zhu et al. [1], which detects each face in 33.8 s, also at VGA resolution.

To test the face detection accuracy, 2836 faces and 3121 non-faces were randomly selected from the MIT CBCL Face Database [20] and the NIST Mugshot Identification Database [22] for cross-validation purposes. By combining face normalization and eye detection, the algorithm achieves a 98.8% detection accuracy, higher than other face detection algorithms: 73.68% for the Color Based Segmentation [4] and 97.14% for the Head Hunter [23]. It should be noted that these two methods were tested on different databases, although with similar properties. Table 1 shows the test outcome for face detection: a sensitivity of 99.2%, a specificity of 98.4% and a total accuracy of 98.8%. The Facial Recognition Technology Database [24], containing 3682 face samples of 526 subjects under various viewing conditions, is used to train the facial recognition algorithm and validate its results, yielding a positive recognition rate of 99.2%.
Figure 13 shows that faces can be recognized in different real-world conditions, such as when picking up a cell phone or with occlusions over the hair. Figure 14
Figure 13. Real-time facial recognition in various conditions.
Figure 15. Real-time multi-person facial recognition.
shows that faces can be recognized against various backgrounds. Figure 15 shows that multiple faces can be recognized in real time.
6. Conclusions
Our algorithms detect and recognize faces in real time with high accuracy, and offer a faster detection speed than other detection methods. Eye detection is used to increase the face detection accuracy. The facial recognition performance is also greatly improved by facial component alignment, contrast enhancement and image smoothing. Facial images are collected as training samples in real time and recognized under various conditions, including in the presence of other faces.
Future work involves training new classifiers to extend the facial recognition to a wider range of facial orientations. The head rotation can be estimated so that the algorithm can further correct the facial image and maintain accurate recognition.