EarSSR: Silent Speech Recognition via Earphones

Cited by: 2
Authors
Sun, Xue [1 ]
Xiong, Jie [2 ]
Feng, Chao [3 ]
Li, Haoyu [4 ]
Wu, Yuli [5 ]
Fang, Dingyi
Chen, Xiaojiang [3 ,4 ,5 ]
Affiliations
[1] Northwest Univ, Shaanxi Int Joint Res Ctr Battery Free Internet T, Sch Informat Sci & Technol, Xian 710069, Peoples R China
[2] Microsoft Res Asia, Shanghai 200000, Peoples R China
[3] Northwest Univ, Shaanxi Int Joint Res Ctr Battery Free Internet T, Xian Key Lab Adv Comp & Syst Secur, Xian 710069, Peoples R China
[4] Northwest Univ, Shaanxi Int Joint Res Ctr Battery Free Internet T, Xian 710069, Peoples R China
[5] Northwest Univ, Xian, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Speech recognition; Irrigation; Ear; Deformation; Mouth; Headphones; Sensors; Acoustic sensing; silent speech recognition; earphone
DOI
10.1109/TMC.2024.3356719
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
As the most natural and convenient way to communicate with people, speech is always preferred in Human-Computer Interaction. However, voice-based interaction still has several limitations: it raises privacy concerns in some circumstances, and its accuracy degrades severely in noisy environments. To address these limitations, silent speech recognition (SSR) has been proposed, which leverages inaudible information (e.g., lip movements and throat vibration) to recognize speech. In this paper, we present EarSSR, an earphone-based silent speech recognition system that enables interaction without the need for vocalization. The key insight is that when people speak, their ear canals exhibit unique deformation patterns, and these patterns are related to the spoken words/letters even without any vocalization. We utilize the built-in microphone and speaker of an earphone to capture the ear canal deformation: ultrasound signals are emitted, and the reflected signals are analyzed to extract the signal features corresponding to speech-induced ear canal deformation for silent speech recognition. We design a two-channel hierarchical convolutional neural network to achieve fine-grained letter/word recognition. Our extensive experiments show that EarSSR achieves an accuracy of 82% for single alphabetic letter recognition and 93% for word recognition.
Pages: 8493-8507
Number of pages: 15
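To make the pipeline described in the abstract more concrete, below is a minimal, illustrative sketch of a two-channel classifier with a coarse-to-fine (hierarchical) head, in the spirit of the "two-channel hierarchical convolutional neural network" the abstract mentions. The record does not specify the actual architecture; the class name `TwoChannelHierCNN`, the layer sizes, the number of letter groups, and the 64x64 input shape are assumptions for demonstration only, not the authors' implementation.

```python
# Illustrative sketch only (not the EarSSR implementation): two feature channels
# (e.g., time-frequency maps derived from the reflected ultrasound) feed a shared
# CNN backbone, followed by a coarse head (letter group) and a fine head (letter).
import torch
import torch.nn as nn

class TwoChannelHierCNN(nn.Module):
    def __init__(self, num_groups: int = 5, num_letters: int = 26):
        super().__init__()
        # Shared convolutional backbone over the 2-channel input.
        self.backbone = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Coarse head: predicts a group of similar letters (hierarchy level 1).
        self.group_head = nn.Linear(64, num_groups)
        # Fine head: predicts the letter from backbone features plus the
        # coarse prediction (hierarchy level 2).
        self.letter_head = nn.Linear(64 + num_groups, num_letters)

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)                      # (B, 64)
        group_logits = self.group_head(feats)         # (B, num_groups)
        fused = torch.cat([feats, group_logits.softmax(dim=-1)], dim=-1)
        letter_logits = self.letter_head(fused)       # (B, num_letters)
        return group_logits, letter_logits

if __name__ == "__main__":
    # Dummy batch: 8 samples, 2 channels, 64x64 feature maps (assumed shape).
    model = TwoChannelHierCNN()
    group_logits, letter_logits = model(torch.randn(8, 2, 64, 64))
    print(group_logits.shape, letter_logits.shape)    # (8, 5) and (8, 26)
```

The hierarchical split reflects a common coarse-to-fine strategy: the coarse head narrows the prediction to a group of similarly shaped ear-canal deformation patterns, and the fine head disambiguates the individual letter within that group.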