Patch-Aware Representation Learning for Facial Expression Recognition

Cited: 0
Authors
Wu, Yi [1]
Wang, Shangfei [1]
Chang, Yanan [1]
Institutions
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
National Key R&D Program of China
Keywords
patch-aware; two collaborative streams; facial landmarks; facial expression recognition; JOINT; POSE;
DOI
10.1145/3581783.3612342
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing methods for facial expression recognition (FER) make little use of prior facial knowledge: they focus on expression-related regions but do not explicitly handle expression-independent information. This paper proposes a patch-aware FER method that addresses these issues by using facial keypoints to guide the model and learning precise representations through two collaborative streams. First, facial keypoints are detected with a facial landmark detection algorithm, and the facial image is divided into equal-sized patches by the Patch Embedding Module. A correspondence between the keypoints and the patches is then established through a simplified conversion relationship. Two collaborative streams are introduced, each with its own masking strategy. The first stream masks, with a certain probability, the patches corresponding to the keypoints, excluding those along the facial contour. The resulting image embedding is fed into the Encoder to obtain expression-related features, which are passed through the Decoder to reconstruct the masked patches and through the Classifier to recognize the expression. The second stream masks the patches corresponding to all of the above keypoints; its image embedding is fed through the Encoder and Classifier in turn, and the resulting logits are constrained to approximate a uniform distribution. Through the first stream the Encoder learns features in expression-related regions, while the second stream trains the Encoder to ignore expression-independent information such as the background, facial contours, and hair. Experiments on two benchmark datasets demonstrate that the proposed method outperforms state-of-the-art methods.
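The keypoint-to-patch conversion and the two masking strategies described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the patch size, image size, masking probability, and all function names are assumptions, and the conversion shown is the straightforward coordinate-to-grid mapping implied by "a simplified conversion relationship".

```python
import numpy as np

def keypoints_to_patches(keypoints, img_size=224, patch_size=16):
    """Map (x, y) landmark coordinates to flat patch indices by
    integer division: patch column/row = coordinate // patch_size."""
    grid = img_size // patch_size  # patches per side (14 for 224 / 16)
    cols = np.clip(keypoints[:, 0] // patch_size, 0, grid - 1).astype(int)
    rows = np.clip(keypoints[:, 1] // patch_size, 0, grid - 1).astype(int)
    return np.unique(rows * grid + cols)  # sorted, duplicates removed

def mask_stream1(patch_idx, contour_idx, p=0.5, rng=None):
    """Stream 1: from the keypoint patches, EXCLUDE facial-contour
    patches, then mask each remaining patch with probability p
    (the masked patches are later reconstructed by the Decoder)."""
    rng = rng or np.random.default_rng()
    inner = np.setdiff1d(patch_idx, contour_idx)
    return inner[rng.random(inner.size) < p]

def mask_stream2(patch_idx):
    """Stream 2: mask ALL keypoint patches, so only expression-
    independent content remains; the classifier logits on this input
    are pushed toward a uniform distribution during training."""
    return patch_idx
```

With p = 1.0 the first stream deterministically masks every non-contour keypoint patch, which makes the complementary roles of the two streams easy to see: stream 1 hides expression-related regions for reconstruction, stream 2 hides them entirely so the leftover context carries no class evidence.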
Pages: 6143 - 6151
Page count: 9
Related Papers
50 total
  • [21] Discriminative feature learning-based pixel difference representation for facial expression recognition
    Sun, Zhe
    Hu, Zheng-Ping
    Wang, Meng
    Zhao, Shu-Huan
    IET COMPUTER VISION, 2017, 11 (08) : 675 - 682
  • [22] Multimodal learning for facial expression recognition
    Zhang, Wei
    Zhang, Youmei
    Ma, Lin
    Guan, Jingwei
    Gong, Shijie
    PATTERN RECOGNITION, 2015, 48 (10) : 3191 - 3202
  • [23] Joint Patch and Multi-label Learning for Facial Action Unit and Holistic Expression Recognition
    Zhao, Kaili
    Chu, Wen-Sheng
    De la Torre, Fernando
    Cohn, Jeffrey F.
    Zhang, Honggang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2016, 25 (08) : 3931 - 3946
  • [24] A Comparison of Facial Feature Representation Methods for Automatic Facial Expression Recognition
    Deaney, Waleed
    Venter, Isabella
    Ghaziasgar, Mehrdad
    Dodds, Reg
    SOUTH AFRICAN INSTITUTE OF COMPUTER SCIENTISTS AND INFORMATION TECHNOLOGISTS (SACSIT 2017), 2017, : 85 - 94
  • [25] Facial Expression Recognition Based on Fusion of Sparse Representation
    Ying, Zi-Lu
    Wang, Zhe-Wei
    Huang, Ming-Wei
    ADVANCED INTELLIGENT COMPUTING THEORIES AND APPLICATIONS: WITH ASPECTS OF ARTIFICIAL INTELLIGENCE, 2010, 6216 : 457 - 464
  • [26] Pose-Aware Facial Expression Recognition Assisted by Expression Descriptions
    Wang, Shangfei
    Wu, Yi
    Chang, Yanan
    Li, Guoming
    Mao, Meng
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2024, 15 (01) : 241 - 253
  • [27] A joint learning method with consistency-aware for low-resolution facial expression recognition
    Xie, Yuanlun
    Tian, Wenhong
    Song, Liang
    Xue, Ruini
    Zha, Zhiyuan
    Wen, Bihan
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 244
  • [28] Learning facial expression-aware global-to-local representation for robust action unit detection
    An, Rudong
    Jin, Aobo
    Chen, Wei
    Zhang, Wei
    Zeng, Hao
    Deng, Zhigang
    Ding, Yu
    APPLIED INTELLIGENCE, 2024, 54 (02) : 1405 - 1425
  • [30] Deep Margin-Sensitive Representation Learning for Cross-Domain Facial Expression Recognition
    Li, Yingjian
    Zhang, Zheng
    Chen, Bingzhi
    Lu, Guangming
    Zhang, David
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 1359 - 1373