EarSSR: Silent Speech Recognition via Earphones

Cited by: 2
Authors
Sun, Xue [1]
Xiong, Jie [2]
Feng, Chao [3]
Li, Haoyu [4]
Wu, Yuli [5]
Fang, Dingyi
Chen, Xiaojiang [3, 4, 5]
Affiliations
[1] Northwest Univ, Shaanxi Int Joint Res Ctr Battery Free Internet T, Sch Informat Sci & Technol, Xian 710069, Peoples R China
[2] Microsoft Res Asia, Shanghai 200000, Peoples R China
[3] Northwest Univ, Shaanxi Int Joint Res Ctr Battery Free Internet T, Xian Key Lab Adv Comp & Syst Secur, Xian 710069, Peoples R China
[4] Northwest Univ, Shaanxi Int Joint Res Ctr Battery Free Internet T, Xian 710069, Peoples R China
[5] Northwest Univ, Xian, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech recognition; Irrigation; Ear; Deformation; Mouth; Headphones; Sensors; Acoustic sensing; silent speech recognition; earphone;
DOI
10.1109/TMC.2024.3356719
CLC number (Chinese Library Classification)
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
As the most natural and convenient way for people to communicate, speech is always preferred in human-computer interaction. However, voice-based interaction still has several limitations: it raises privacy concerns in some circumstances, and its accuracy degrades severely in noisy environments. To address these limitations, silent speech recognition (SSR) has been proposed, which leverages inaudible information (e.g., lip movements and throat vibrations) to recognize speech. In this paper, we present EarSSR, an earphone-based silent speech recognition system that enables interaction without the need for vocalization. The key insight is that when people speak, their ear canals exhibit unique deformation patterns, and these patterns correspond to the spoken letters/words even without any vocalization. We utilize the built-in microphone and speaker of an earphone to capture the ear canal deformation: ultrasound signals are emitted, and the reflected signals are analyzed to extract signal features corresponding to speech-induced ear canal deformation for silent speech recognition. We design a two-channel hierarchical convolutional neural network to achieve fine-grained letter/word recognition. Our extensive experiments show that EarSSR achieves an accuracy of 82% for single alphabetic letter recognition and 93% for word recognition.
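The record does not include the paper's code or architecture details. As a rough illustration of what a "two-channel hierarchical" classifier of this kind could look like, here is a minimal PyTorch sketch; the input shape, layer sizes, the coarse/fine label split, and the class names are all assumptions for illustration, not the authors' design.

```python
# Minimal sketch of a two-channel hierarchical CNN classifier,
# loosely in the spirit of the abstract. All dimensions and the
# coarse-group idea are hypothetical, not taken from the paper.
import torch
import torch.nn as nn

class TwoChannelHierarchicalCNN(nn.Module):
    def __init__(self, n_coarse=6, n_fine=26):
        super().__init__()
        # Shared backbone over a 2-channel feature map (e.g., two
        # time-frequency representations derived from the reflected
        # ultrasound signal -- an assumed input format).
        self.backbone = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Coarse head: predicts a hypothetical group of similar letters.
        self.coarse_head = nn.Linear(32, n_coarse)
        # Fine head: sees backbone features plus the coarse logits and
        # predicts the individual letter (hierarchical refinement).
        self.fine_head = nn.Linear(32 + n_coarse, n_fine)

    def forward(self, x):
        feats = self.backbone(x)          # (B, 32)
        coarse = self.coarse_head(feats)  # (B, n_coarse)
        fine = self.fine_head(torch.cat([feats, coarse], dim=1))
        return coarse, fine

# Usage: a batch of 2-channel, 64x64 spectrogram-like inputs.
model = TwoChannelHierarchicalCNN()
x = torch.randn(8, 2, 64, 64)
coarse_logits, fine_logits = model(x)
```

A hierarchy like this is typically trained with a summed cross-entropy loss over both heads, letting the coarse prediction constrain the fine one; whether EarSSR does exactly this is not stated in the record.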
Pages: 8493-8507
Page count: 15