Surgical mask detection with deep recurrent phonetic models

Cited by: 5
Authors
Klumpp, Philipp [1]
Arias-Vergara, Tomas [1,2]
Vasquez-Correa, Juan Camilo [1,2]
Perez-Toro, Paula Andrea [1,2]
Hoenig, Florian [1]
Noeth, Elmar [1]
Orozco-Arroyave, Juan Rafael [2]
Affiliations
[1] Friedrich Alexander Univ Erlangen Nurnberg, Erlangen, Germany
[2] Univ Antioquia, Medellin, Colombia
Source
INTERSPEECH 2020 | 2020
Keywords
computational paralinguistics; phoneme recognition
DOI
10.21437/Interspeech.2020-1723
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes
100104; 100213
Abstract
To address the task of surgical mask detection from audio recordings in the scope of Interspeech's ComParE challenge, we introduce a phonetic recognizer that is able to differentiate between clear and mask samples. A deep recurrent phoneme recognition model is first trained on spectrograms from a German corpus to learn the spectral properties of different speech sounds. Under the assumption that each phoneme sounds different in clear and mask speech, the model is then used to compute frame-wise phonetic labels for the challenge data, including information about the presence of a surgical mask. These labels serve to train a second phoneme recognition model, which is finally able to differentiate between mask and clear phoneme productions. For a single utterance, we compute a functional representation and train a random forest classifier to detect whether a speech sample was produced with or without a mask. Our method performed better than the baseline methods on both the validation and the test set. Furthermore, we show how wearing a mask influences the speech signal: certain phoneme groups were clearly affected by the obstruction in front of the vocal tract, while others remained almost unaffected.
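The per-utterance step described in the abstract, collapsing frame-wise phonetic posteriors into a fixed-length "functional" vector before classification, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy frame values, the number of classes, and the three chosen statistics (mean, standard deviation, maximum) are assumptions.

```python
import statistics

def functionals(frame_posteriors):
    """Collapse frame-wise phonetic posteriors (a list of T frames, each a
    list of P class probabilities) into one fixed-length utterance vector
    by stacking simple per-class statistics over time."""
    per_class = list(zip(*frame_posteriors))  # transpose to P series of T values
    feats = []
    for series in per_class:
        feats.append(statistics.fmean(series))   # average activation over time
        feats.append(statistics.pstdev(series))  # variability over the utterance
        feats.append(max(series))                # peak activation
    return feats

# Toy example: 4 frames, 2 phoneme classes.
frames = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
vec = functionals(frames)
print(len(vec))  # 3 statistics per class -> 6 features
```

In the paper, vectors of this kind are fed to a random forest classifier to decide mask vs. clear; scikit-learn's `RandomForestClassifier` would be a natural fit for that step, though the paper does not specify an implementation.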
Pages: 2057-2061
Page count: 5
References
23 in total
[1] Afouras, Triantafyllos; Chung, Joon Son; Senior, Andrew; Vinyals, Oriol; Zisserman, Andrew. Deep Audio-Visual Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(12): 8717-8727.
[2] Arias-Vergara, Tomas; Vasquez-Correa, Juan Camilo; Gollwitzer, Sandra; Orozco-Arroyave, Juan Rafael; Schuster, Maria; Noeth, Elmar. Multi-channel Convolutional Neural Networks for Automatic Detection of Speech Deficits in Cochlear Implant Users. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP 2019), 2019, 11896: 679-687.
[3] Cho, K. et al. Proceedings of Empirical Methods in Natural Language Processing, 2014: 1724. DOI: 10.3115/v1/D14-1179.
[4] Dupont, Stephane; Luettin, Juergen. Audio-Visual Speech Modeling for Continuous Speech Recognition. IEEE Transactions on Multimedia, 2000, 2(3): 141-151.
[5] Eyben, F. et al. Proceedings of the 18th ACM International Conference on Multimedia, 2010: 1459.
[6] Fecher, N., 2013, AUDITORY VISUAL SPEE
[7] Fecher, N. 13th Annual Conference of the International Speech Communication Association (INTERSPEECH 2012), Vols 1-3, 2012: 2247.
[8] He, K. et al. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. DOI: 10.1109/CVPR.2016.90.
[9] Houtgast, T. Acustica, 1971, 25: 355.
[10] Howard, A. G., 2017, arXiv.