Feature Extraction Trends for Intelligent Facial Expression Recognition: A Survey

Human facial expression is an important means of non-verbal communication and conveys considerably more information visually than speech does vocally. Facial expression recognition plays a vital role in human-machine interaction, yet it remains a difficult task for machines such as computers. Face detection, feature extraction and expression classification are the three main stages of Facial Expression Recognition (FER). This survey covers recent work on FER techniques, focusing on the performance, including efficiency and accuracy, of face detection, feature extraction and classification methods.


Introduction
According to social psychology, facial expressions are a means of coordinating conversation and communication. With advances in artificial intelligence and pattern recognition, Facial Expression Recognition (FER) has come to be regarded as one of the most important technologies for intelligent human-interactive interfaces [1]. Despite individual differences, the expressions of different people remain recognizable. Information from the human face mostly provides clues that help depict the user's state of mind, which greatly improves human-computer interaction. Scientists have been working on facial expression classification and recognition for the past few decades. The problem-solving ability and wide applicability of a discipline inspire further exploration and research. The urge to make visual data useful is the motivation behind all image processing and computer vision algorithms, and FER shares this motivation within computer vision. Its applications in Human-Computer Interaction (HCI), which draw on visual appearance, touch, mood, sight and voice at the same time, increase its importance today. Further applications include emotion detection systems for people with disabilities, assistance systems for autistic people [2], detection of pain and stress in psychological studies [3], instructor feedback in intelligent tutoring systems [4], and socially and emotionally intelligent robots [17]. These applications show that facial expression recognition systems are of limited use unless they can operate in real time. The phases involved in FER include face detection and tracking, feature extraction and tracking, and feature reduction and classification. Each phase uses distinct algorithms, and researchers have tried to classify the six basic expressions using them. This survey mainly discusses the algorithms used in these phases.
For face detection and tracking, algorithms such as adaptive skin color [5,6], mean shift [6] and the Stereo Active Appearance Model (STAAM) [7] are used. For feature extraction and tracking, algorithms such as Local Binary Pattern (LBP) [8], Guided Particle Swarm Optimization (GPSO) [9] and Gabor features [8] are used. A few algorithms, such as Principal Component Analysis (PCA) and AdaBoost [10], are used for feature reduction, along with classifiers such as the Support Vector Machine (SVM) [11][12][13] and the Hidden Markov Model (HMM) [14]. Accuracy and efficiency are the two important aspects of FER in a real-time environment. Efficiency covers time complexity, space complexity and computational complexity. Most of the algorithms mentioned are difficult to run in real time because of very high computational complexity (Gabor features and the mean shift algorithm) or space complexity (LBP). Efficiency can be improved at the level of feature extraction, tracking or reduction. Algorithms suited to real-time environments include optical flow calculation [6], Pixel Pattern-Based Texture Feature (PPBTF) [11], AdaBoost [7,10,11], Pyramid LBP [12], the Haar classifier [8,10,13,15] and PCA [10]. This survey mainly covers the performance aspects of the FER domain. We present the current approaches to FER and our views on their limitations with respect to real-time execution.
In one of the surveyed works, the Viola-Jones algorithm is used for face detection and the Bessel transform is used for feature extraction. Thousands of facial features, which represent different facial patterns, are extracted using the Gabor feature extraction technique. To improve classification speed, an AdaBoost-based hypothesis selects a few hundred features from the thousands extracted. A three-layer neural network classifier then processes the selected features. The JAFFE and Yale facial expression databases are used for both training and testing. The combination of Bessel downsampling and AdaBoost is used to reduce the expression dataset. Bessel downsampling had not previously been used for FER, so it is presented as an innovation for improving speed and accuracy. The proposed technique gave an accuracy of 96.83% and an average recognition rate of 92.22% on the JAFFE and Yale databases, and the execution time for 100x100-pixel images is 14.5 ms. The results show that the neutral expression has the weakest accuracy: 92.23% on JAFFE and 86.16% on Yale.
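To make the Gabor feature extraction step concrete, the following is a minimal sketch of a small Gabor filter bank in pure Python. It is not the implementation of the surveyed work: the kernel size, wavelengths, orientations and the helper names `gabor_kernel` and `filter_response` are assumptions chosen only for illustration.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Real part of a Gabor kernel as a size x size list of lists (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope multiplied by a cosine carrier.
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def filter_response(patch, kernel):
    """Inner product of an image patch with a kernel of the same size."""
    return sum(p * k for prow, krow in zip(patch, kernel)
               for p, k in zip(prow, krow))

# Illustrative bank: 2 wavelengths x 4 orientations (values are assumptions).
bank = [gabor_kernel(9, lam, k * math.pi / 4, sigma=3.0)
        for lam in (4.0, 8.0) for k in range(4)]

# One response per kernel gives an 8-dimensional feature for a 9x9 patch.
patch = [[1.0] * 9 for _ in range(9)]
features = [filter_response(patch, kernel) for kernel in bank]
```

Applying such a bank at every pixel and scale is what produces the thousands of features mentioned above, which is why a selection step such as AdaBoost is applied afterwards.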
In [19], Xijian and Tjahjadi extended the spatial pyramid histogram of gradients to three dimensions (PHOG-TOP) for extracting facial features. They captured both spatial and motion information of facial expressions by integrating the extracted features with dense optical flow. A support vector machine with a one-to-one strategy was used for training and testing. Experiments on the CK+ and MMI datasets showed that the integrated framework performs better than the individual descriptors. The contributions of the paper include an integrated framework that captures dynamic information from both the deformation of facial regions and the movement of facial landmarks, the PHOG-TOP facial feature, the fusion of dense optical flow with weighted PHOG-TOP, and an analysis of the framework in terms of the contribution of different facial sub-regions. The Canny edge detector is used to detect edges. To enhance spatial information, the image is then segmented into a number of 3D sub-regions. PHOG-TOP is applied to the whole face in a video sequence and to four sub-regions (forehead, eyes, mouth and nose). Optical flow is computed to extract dynamic information from the video sequence. Dense facial points are distributed evenly over the middle of the face, and the grid size determines the efficiency of the optical flow computation. The average accuracy is 83.7% on CK+ and 73.1% on MMI. Happiness and surprise were observed to be easier to detect than the remaining expressions. The framework has limitations in generalizing to other datasets and in detecting expressions other than happiness and surprise, which is attributed to the nature of the datasets used. Another limitation is that it cannot handle faces with glasses or changed hairstyles.
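The spatial part of the descriptor above can be illustrated with a simplified, single-frame pyramid histogram of oriented gradients. This sketch is an assumption-laden simplification: it omits the Canny edge step and the temporal (TOP) extension, and the function names, number of pyramid levels and number of bins are illustrative rather than taken from [19].

```python
import math

def gradient_maps(img):
    """Central-difference gradient magnitude and unsigned orientation in [0, pi)."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.atan2(gy, gx) % math.pi
    return mag, ang

def phog(img, levels=2, bins=8):
    """Concatenate orientation histograms over a spatial pyramid of cells."""
    mag, ang = gradient_maps(img)
    h, w = len(img), len(img[0])
    feature = []
    for level in range(levels + 1):
        cells = 2 ** level  # level l splits the image into 2^l x 2^l cells
        for cy in range(cells):
            for cx in range(cells):
                hist = [0.0] * bins
                for y in range(cy * h // cells, (cy + 1) * h // cells):
                    for x in range(cx * w // cells, (cx + 1) * w // cells):
                        b = min(int(ang[y][x] / math.pi * bins), bins - 1)
                        hist[b] += mag[y][x]  # magnitude-weighted vote
                feature.extend(hist)
    total = sum(feature) or 1.0
    return [v / total for v in feature]

# Toy 8x8 image with a vertical edge between columns 3 and 4.
image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
descriptor = phog(image)
```

With two pyramid levels and eight bins the descriptor has 8 x (1 + 4 + 16) = 168 entries; finer grids add spatial detail at the cost of a longer vector.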
The technique proposed by Zhang [20] draws on databases of videos and images collected from movies and websites (AFEW, SFEW, HAPPEI, GENKI-4K and QUT FER). Databases used for FER fall into two categories: those collected in a laboratory environment, and those collected from broadcast TV and the World Wide Web. Two lab-based FER databases were used for comparison. The authors first applied a video selection and segmentation process: videos were captured from real-world environments and segmented using video splitter software. The annotators, who were students, were pre-trained; their results were tested and the cycle was repeated until 96% accuracy was achieved. Clips and vectors were classified by the annotators and checked by experienced members. Face detection was performed using Viola-Jones and ASM; the failure rate of both algorithms is 4.7%. In the benchmark FER approach, face detection and tracking are applied to the selected images using the Viola-Jones and ASM algorithms, and texture and geometric features are then extracted using the SIFT, FAP and ASM algorithms. Texture features are selected using mRMR. The selected texture features and the geometric features are then combined by feature-level fusion. Next, an SVM classifier with a Radial Basis Function (RBF) kernel is trained to classify facial expressions. The method recognizes the six basic universal emotions and three emotion categories (positive, negative and neutral). On realistic QUT images, SIFT+FAP gives 70% accuracy for the six basic emotions and 65% for the three categories. On realistic QUT video clips, the accuracy for the six universal emotions is 52.9% for SIFT+FAP, 48% for SIFT and 50.6% for FAP, and the accuracy for the three categories is 62.9% for SIFT+FAP, 61% for SIFT and 56.3% for FAP. On lab-based data, the FEEDTUM and NVIE databases give 61.0% and 82% accuracy. Classification of the three emotion categories is reported to be slightly more difficult than classification of the six universal expressions. Fear and sadness are more strongly affected by the nature of the data than the other expressions.
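Feature-level fusion followed by an RBF-kernel classifier, as described for [20], can be sketched as follows. The per-modality L2 normalisation, the `gamma` value and the function names are assumptions made for illustration; the source does not state how the features were scaled before fusion.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def fuse(texture, geometric):
    """Feature-level fusion: normalise each modality, then concatenate."""
    return l2_normalise(texture) + l2_normalise(geometric)

def rbf_kernel(a, b, gamma=0.1):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2), as used by RBF SVMs."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy texture (e.g. SIFT-like) and geometric (e.g. FAP-like) vectors.
sample_a = fuse([3.0, 4.0], [0.0, 5.0])
sample_b = fuse([4.0, 3.0], [5.0, 0.0])
similarity = rbf_kernel(sample_a, sample_b)
```

Normalising each modality before concatenation keeps one feature type from dominating the kernel distance purely because of its numeric range.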
Fang, Hui, et al. [21] address automatic facial expression recognition by extracting salient features from videos without subjective preprocessing and without requiring additional data for frame selection. The proposed technique uses machine learning methods in parallel with human reasoning to better capture dynamic changes in expressions. Facial regions must be detected first, and the Viola-Jones detector is used for this purpose; face detection serves only to initialize group-wise registration. The next step is feature detection, for which key features (eyes, lips and nose) are landmarked. These landmarks are used to align faces in static or dynamic data, eliminating rotation and scaling effects, and they allow deformations in the video to be captured for further feature analysis. A traditional approach selects the neutral face image or the first frame as a template and warps all images onto it. A global shape model, appearance and local texture can then be used as prior knowledge to shorten the search. Machine learning methods (e.g. linear regression, graphical models) can be used to locate the landmarks. For facial sequences, group-wise registration is applied. After landmark tracking, the displacement between landmarks and other relative measurements such as lip curvature and eye size are used for expression recognition, and geometric features can be extracted from the result. A Gabor filter is applied to four regions (cheek, eyebrow, outer eye corner wrinkle and forehead) to extract an energy value that provides texture features for learning expressions. These texture and geometric features are computed for each video sequence. Two evaluation techniques are used for classification: a 50% stratified split, with one half for training and the other for testing, and stratified 10x10-fold cross-validation. Six classifiers are used (J48, FRNN, VQNN, random forest, SMO-SVM and logistic), and the database used is MMI. The accuracy of the proposed method is 71.56%. Happiness and surprise are easily identified by both automatic classifiers and human participants, but the remaining expressions are difficult to identify.
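The landmark-based alignment and displacement features described for [21] can be sketched as below. This is a generic similarity normalisation using two eye landmarks, not the group-wise registration of the paper; the landmark names and the helpers `align_by_eyes` and `displacement_features` are hypothetical.

```python
import math

def align_by_eyes(landmarks, left_eye="left_eye", right_eye="right_eye"):
    """Move the eye midpoint to the origin, make the eye line horizontal
    and scale to unit inter-ocular distance (removes rotation and scale)."""
    (lx, ly), (rx, ry) = landmarks[left_eye], landmarks[right_eye]
    cx, cy = (lx + rx) / 2.0, (ly + ry) / 2.0
    angle = math.atan2(ry - ly, rx - lx)
    scale = math.hypot(rx - lx, ry - ly)
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    aligned = {}
    for name, (x, y) in landmarks.items():
        dx, dy = x - cx, y - cy
        aligned[name] = ((dx * cos_a - dy * sin_a) / scale,
                         (dx * sin_a + dy * cos_a) / scale)
    return aligned

def displacement_features(neutral, expressive):
    """Per-landmark displacement between two aligned frames."""
    return {name: (expressive[name][0] - neutral[name][0],
                   expressive[name][1] - neutral[name][1])
            for name in neutral}

# The same toy face, once upright and once rotated by 90 degrees and doubled.
upright = {"left_eye": (0.0, 0.0), "right_eye": (4.0, 0.0), "mouth": (2.0, 4.0)}
rotated = {"left_eye": (0.0, 0.0), "right_eye": (0.0, 8.0), "mouth": (-8.0, 4.0)}
a, b = align_by_eyes(upright), align_by_eyes(rotated)
shift = displacement_features(a, b)
```

After alignment both versions of the face map to the same coordinates, so any remaining displacement between a neutral and an expressive frame reflects facial deformation and not head pose or image scale.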
Zang Wang et al. [22] note that a face image is usually represented as a data point in a high-dimensional space. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are used for dimension reduction; both can reveal the global Euclidean structure but not the manifold structure. Various manifold learning-based methods have been developed to obtain discriminant features for image detection and classification. Local Fisher Discriminant Analysis (LFDA) is one such dimension reduction method but, according to the authors, it is not sparse like other methods and lacks discriminant information, so a new method, Sparse LFDA (SLFDA), was introduced. The sparse solution is obtained as the minimum L1-norm solution of LFDA. The L1 minimization problem is solved using the Bregman method, which yields a sparse projection vector. SLFDA can control the weights of the original features. Moreover, SLFDA works well on multimodal problems and enhances the discriminating power of LFDA. Experiments on the JAFFE and Yale B databases show that SLFDA achieves performance competitive with other dimension reduction methods, with a best recognition rate of 77.92%. The strength of the proposed method lies in dimension reduction and in giving a reasonable interpretation of the extracted features. However, the authors focused only on supervised learning, so the approach could be extended to a semi-supervised learning framework, and a fast numerical algorithm for the L1-norm problem could be sought.
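To illustrate how a Bregman-type iteration produces a sparse solution, the following sketch runs a generic linearized Bregman iteration on a toy problem. It is not the SLFDA algorithm of [22]: the matrix, the parameters `mu` and `delta`, and the function names are assumptions. Strictly, the iteration minimizes mu*||u||_1 + ||u||^2/(2*delta) subject to A u = b, which coincides with the minimum L1-norm solution when mu is large enough.

```python
def shrink(value, mu):
    """Soft-thresholding operator: the proximal map of mu * |x|."""
    if value > mu:
        return value - mu
    if value < -mu:
        return value + mu
    return 0.0

def linearized_bregman(A, b, mu, delta, iterations=200):
    """Approximate the minimum L1-norm solution of A u = b."""
    rows, cols = len(A), len(A[0])
    u = [0.0] * cols
    v = [0.0] * cols
    for _ in range(iterations):
        # Residual of the constraint for the current iterate.
        residual = [b[i] - sum(A[i][j] * u[j] for j in range(cols))
                    for i in range(rows)]
        # Accumulate A^T * residual, then shrink to promote sparsity.
        for j in range(cols):
            v[j] += sum(A[i][j] * residual[i] for i in range(rows))
        u = [delta * shrink(vj, mu) for vj in v]
    return u

# Toy problem: u1 + 2*u2 = 2. The minimum L1-norm solution is (0, 1).
solution = linearized_bregman([[1.0, 2.0]], [2.0], mu=20.0, delta=0.1)
```

On this toy problem the first coefficient stays exactly zero while the second converges to 1, which is the sparsest way to satisfy the constraint.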
Performance improves when both appearance and geometric features are used, but most existing algorithms are based on geometric features; such algorithms track facial components such as the eyes, lips and corners as well as the shape and size of the face. Happy et al. [23] argue that the major issue is landmark selection, for which they describe a relative geometric distance-based approach to detect landmarks. Deformable models have also become popular for landmark detection, but their high computational cost is a hurdle in real-time applications. The proposed approach is learning-free: it detects the eyes, nose, lips, etc. in a face image and marks the required regions. Salient patches are extracted in the training stage, and for each pair of expressions the features with maximum variation are selected. A multi-class classifier then assigns these selected features to the six basic expression classes (fear, sadness, happiness, disgust, surprise and anger). For near-frontal images, the facial landmark system gives results similar to state-of-the-art methods at lower computational complexity. Accurate emotion recognition in low-resolution images is an ultimate goal of affective computing, and this facial landmark detection technique combined with the salient-patch FER framework performs well across different image resolutions. The accuracy is 91.8% on the JAFFE database and 94.1% on the Cohn-Kanade (CK+) database, which is satisfactory, but the method considers only a few facial patches instead of the whole face and analyzes facial features without considering facial hair. Performance could be improved for partially occluded images and by using different appearance features. Moreover, execution time was measured with un-optimized MATLAB code, so an optimized implementation is required to reduce the computational cost and achieve real-time expression recognition with good accuracy.
Ying Tong et al. [24] discussed human facial expressions and a feature extraction method. Gabor wavelets and LBP are also used for feature extraction, but they are time-consuming and significantly increase the feature dimension. In [24], the authors apply binary coding to each block of the image, which yields an LGC statistical histogram per block, and concatenate the resulting histograms to form the identification feature. The LGC operator is further optimized to obtain the LGC-HD operator (LGC based on the Horizontal and Diagonal gradient prior principle). This reduces the computational complexity and the feature dimension without losing the main expression information in the face texture. On the JAFFE database, the average recognition time with the LGC-HD operator is reported as 90 seconds for an 8x8 block size. Compared with LBP, uniform-pattern LBP and the Gabor filter, LGC-HD achieves an average recognition rate of 90%, which is higher than the others. The experimental results also reveal a weakness: a block that is either too large or too small affects the recognition rate. When the blocks are very small, the sub-block features cannot be extracted accurately, and redundancy in the LGC-HD features affects classification, giving inaccurate expression characteristics over large areas.
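The block-histogram idea behind LGC-HD can be sketched as follows. The coding used here is an assumed formulation: three horizontal and two diagonal comparisons within the 3x3 neighbourhood, giving a 5-bit code. The bit ordering, the use of `>=` as the threshold and the function names are assumptions, not details confirmed by the survey.

```python
def lgc_hd_code(img, y, x):
    """5-bit code from horizontal and diagonal comparisons around (y, x)."""
    g1, g3 = img[y - 1][x - 1], img[y - 1][x + 1]   # top-left, top-right
    g4, g5 = img[y][x - 1], img[y][x + 1]           # left, right
    g6, g8 = img[y + 1][x - 1], img[y + 1][x + 1]   # bottom-left, bottom-right
    bits = [g1 >= g3, g4 >= g5, g6 >= g8,           # horizontal gradients
            g1 >= g8, g3 >= g6]                     # diagonal gradients
    code = 0
    for bit in bits:
        code = (code << 1) | int(bit)
    return code

def block_histograms(img, blocks=2):
    """Concatenate one 32-bin code histogram per block (border pixels skipped)."""
    h, w = len(img), len(img[0])
    feature = []
    for by in range(blocks):
        for bx in range(blocks):
            hist = [0] * 32
            for y in range(max(1, by * h // blocks), min(h - 1, (by + 1) * h // blocks)):
                for x in range(max(1, bx * w // blocks), min(w - 1, (bx + 1) * w // blocks)):
                    hist[lgc_hd_code(img, y, x)] += 1
            feature.extend(hist)
    return feature

flat = [[5] * 6 for _ in range(6)]              # constant image
ramp = [[x for x in range(6)] for _ in range(6)]  # brightness increases to the right
flat_feature = block_histograms(flat)
ramp_code = lgc_hd_code(ramp, 1, 1)
```

A 5-bit code needs only 32 histogram bins per block, compared with 256 for an 8-bit code such as basic LBP, which is the source of the reduction in feature dimension noted above.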
The common universal facial expressions allow recognition of cross-cultural facial expressions, including Japanese, Chinese, European and American. Ali et al. [25] argue that, instead of the whole face, only facial components (eyebrows, eyes, etc.) should be considered, since much work has been done on these and satisfactory results have been obtained with this approach. They also propose combining the decisions of multiple classifiers instead of relying on a single classifier, and build a neural network-based ensemble classifier to improve accuracy. A multi-objective genetic algorithm is also used. Acquisition and representation of a multicultural dataset are the major problems in multicultural facial expression classification; three databases (JAFFE, TFED and Radboud) are used to address them. To check for the presence of an expression, KNN, Naive Bayes (NB) and SVM classifiers are used. The authors further used an established dataset verification strategy to evaluate system performance. Four types of classifier were considered for expression classification: BNN, KNN, SVM and Naive Bayes. The best experimental result, 93.75%, was obtained with the neural network ensemble (NNE) combined with the NB predictor and the HOG descriptor, which is satisfactory for multicultural FER. However, future work is still needed on a few easily confused facial expressions because of differences in visual representation, facial structure and number of samples.
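The idea of combining several classifier decisions can be illustrated with plain majority voting. This is a deliberately simple stand-in, not the neural network ensemble of [25]; the three lambda "classifiers" and the tie-breaking rule are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label; ties go to the earliest prediction."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

def ensemble_predict(classifiers, sample):
    """Collect one decision per classifier and fuse them by majority vote."""
    return majority_vote([classify(sample) for classify in classifiers])

# Hypothetical stand-ins for trained KNN, Naive Bayes and SVM models.
classifiers = [lambda s: "happy", lambda s: "sad", lambda s: "happy"]
decision = ensemble_predict(classifiers, sample=None)
```

Voting lets the errors of one classifier be outvoted by the others, which is the motivation for preferring multiple classifier decisions over a single one.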
In recognizing facial expressions, theoretical accounts of face processing often highlight feature shape as the primary visual signal. However, changes in facial expression also affect the surface characteristics of the face. Mladen Sormaz et al. [26] examined whether this surface information can also be used in the recognition of facial expressions. First, participants identified facial expressions from images with distinct shape and surface characteristics. The authors report that different expressions depend on shape and surface properties, and that facial expression categorization is feasible in any type of image. To evaluate the relative contributions of surface and shape information to categorization, they used a complementary method in which the shape properties and the surface properties were taken from different expressions. The experimental results show that the categorization of facial expressions in these hybrid images depends on both the surface and the shape properties of the image. Taken together, the data demonstrate that both surface and shape properties contribute significantly to the recognition of facial expressions.
Andres Hernandez-Matamoros et al. [27] focus on a facial expression recognition algorithm that first detects the face in a color image and segments it into two Regions of Interest (ROIs), the forehead and the mouth. Both regions are further segmented into non-overlapping NxM blocks. The resulting matrix is passed to the Principal Component Analysis (PCA) module for dimension reduction, and the reduced matrix yields the feature vectors. These vectors are fed into a low-complexity classifier based on clustering and fuzzy logic techniques. This classifier gives a recognition rate similar to that of high-performance classifiers at lower computational complexity. Results show that the recognition rate is about 97% when the feature vector from only one ROI is used, and rises to about 99% when the feature vectors of both ROIs are used. This means that a recognition rate of about 97% can still be achieved when one ROI is occluded.
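The block segmentation and PCA steps can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the pipeline of [27]: it keeps only the first principal component, finds it by power iteration, and the helper names `split_blocks`, `first_principal_component` and `project` are hypothetical.

```python
import math

def split_blocks(roi, n, m):
    """Split a 2D list into non-overlapping n x m blocks, each flattened."""
    h, w = len(roi), len(roi[0])
    return [[roi[top + i][left + j] for i in range(n) for j in range(m)]
            for top in range(0, h - n + 1, n)
            for left in range(0, w - m + 1, m)]

def first_principal_component(rows, iterations=100):
    """Mean and leading eigenvector of the covariance, by power iteration.
    The all-ones start vector must not be orthogonal to that eigenvector."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centred = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    vec = [1.0] * dim
    for _ in range(iterations):
        # Multiply by X^T X without forming the covariance matrix.
        proj = [sum(c[j] * vec[j] for j in range(dim)) for c in centred]
        new = [sum(centred[i][j] * proj[i] for i in range(len(rows)))
               for j in range(dim)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:
            break
        vec = [v / norm for v in new]
    return mean, vec

def project(row, mean, vec):
    """Coordinate of one block along the principal direction."""
    return sum((row[j] - mean[j]) * vec[j] for j in range(len(row)))

roi = [[4 * y + x for x in range(4)] for y in range(4)]  # toy 4x4 region
blocks = split_blocks(roi, 2, 2)
mean, component = first_principal_component([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
```

Projecting each block onto a few such components replaces the raw pixel values with a much shorter feature vector, which is what keeps the downstream classifier cheap.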
In another work, Khan et al. [28] highlighted the importance of local descriptors such as the Weber Local Descriptor (WLD) and LBP, which are robust to illumination and pose changes, for facial expression recognition. They argue that local descriptors on their own cannot retain all the information in a face image. To address this problem, they proposed a framework in which features are first extracted from the face image using WLD and LBP, and both types of features are then fused.
Similarly, a novel framework known as Weber Local Binary Image Cosine Transform (WLBI-CT) is proposed in [29].
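The two local descriptors named above, LBP and WLD, and their fusion can be sketched as follows. The sketch uses the basic 8-neighbour LBP code and only the differential excitation component of WLD; the neighbour ordering, the number of excitation bins and the function names are assumptions, and it does not reproduce the frameworks of [28] or [29].

```python
import math

# Clockwise 8-neighbourhood offsets (dy, dx), starting at the top-left pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(img, y, x):
    """Basic LBP: set one bit per neighbour that is >= the centre pixel."""
    centre = img[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(NEIGHBOURS):
        if img[y + dy][x + dx] >= centre:
            code |= 1 << bit
    return code

def wld_excitation(img, y, x, eps=1e-6):
    """WLD differential excitation: arctan of summed differences over the centre."""
    centre = img[y][x]
    diff = sum(img[y + dy][x + dx] - centre for dy, dx in NEIGHBOURS)
    return math.atan(diff / (centre + eps))

def fused_histogram(img, wld_bins=8):
    """Concatenate a 256-bin LBP histogram with a quantised WLD histogram."""
    lbp_hist = [0] * 256
    wld_hist = [0] * wld_bins
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            lbp_hist[lbp_code(img, y, x)] += 1
            xi = wld_excitation(img, y, x)  # lies in (-pi/2, pi/2)
            b = min(int((xi + math.pi / 2) / math.pi * wld_bins), wld_bins - 1)
            wld_hist[b] += 1
    return lbp_hist + wld_hist

patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
feature = fused_histogram(patch)
```

Concatenating the two histograms is the simplest form of fusion; the LBP part encodes the sign pattern of local differences while the WLD part encodes their strength relative to the centre intensity.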

Conclusion and future work
Facial expressions are produced during communication, so images are often obtained under uncontrolled conditions such as occlusion (makeup, glasses, facial hair or a hijab, all of which can affect the recognition rate), varying illumination, posed expressions and variations in expression. This paper presented a survey of current work in the domain of FER. Several feature extraction techniques were explained and compared, which can help other researchers to advance and refine present methods to obtain more accurate results in future.
In future, we intend to investigate local descriptors in the frequency domain for real-world FER.