EarSSR: Silent Speech Recognition via Earphones

Cited by: 2
Authors
Sun, Xue [1 ]
Xiong, Jie [2 ]
Feng, Chao [3 ]
Li, Haoyu [4 ]
Wu, Yuli [5 ]
Fang, Dingyi
Chen, Xiaojiang [3 ,4 ,5 ]
Affiliations
[1] Northwest Univ, Shaanxi Int Joint Res Ctr Battery Free Internet T, Sch Informat Sci & Technol, Xian 710069, Peoples R China
[2] Microsoft Res Asia, Shanghai 200000, Peoples R China
[3] Northwest Univ, Shaanxi Int Joint Res Ctr Battery Free Internet T, Xian Key Lab Adv Comp & Syst Secur, Xian 710069, Peoples R China
[4] Northwest Univ, Shaanxi Int Joint Res Ctr Battery Free Internet T, Xian 710069, Peoples R China
[5] Northwest Univ, Xian, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Speech recognition; Irrigation; Ear; Deformation; Mouth; Headphones; Sensors; Acoustic sensing; silent speech recognition; earphone
DOI
10.1109/TMC.2024.3356719
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
As the most natural and convenient way to communicate with people, speech is always preferred in Human-Computer Interaction. However, voice-based interaction still has several limitations: it raises privacy concerns in some circumstances, and its accuracy degrades severely in noisy environments. To address these limitations, silent speech recognition (SSR) has been proposed, which leverages inaudible information (e.g., lip movements and throat vibration) to recognize speech. In this paper, we present EarSSR, an earphone-based silent speech recognition system that enables interaction without the need for vocalization. The key insight is that when people speak, their ear canals exhibit unique deformation patterns, and these patterns are related to the spoken words/letters even without any vocalization. We utilize the built-in microphone and speaker of an earphone to capture the ear canal deformation: ultrasound signals are emitted, and the reflected signals are analyzed to extract the signal features corresponding to speech-induced ear canal deformation for silent speech recognition. We design a two-channel hierarchical convolutional neural network to achieve fine-grained letter/word recognition. Our extensive experiments show that EarSSR achieves an accuracy of 82% for single alphabetic letter recognition and 93% for word recognition.
Pages: 8493-8507
Number of pages: 15
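To make the pipeline described in the abstract more concrete, below is a minimal, illustrative sketch of a two-channel classifier with a coarse-to-fine (hierarchical) head, in the spirit of the "two-channel hierarchical convolutional neural network" the abstract mentions. The record does not specify the actual architecture; the class name `TwoChannelHierCNN`, the layer sizes, the number of letter groups, and the 64x64 input shape are assumptions for demonstration only, not the authors' implementation.

```python
# Illustrative sketch only (not the EarSSR implementation): two feature channels
# (e.g., time-frequency maps derived from the reflected ultrasound) feed a shared
# CNN backbone, followed by a coarse head (letter group) and a fine head (letter).
import torch
import torch.nn as nn

class TwoChannelHierCNN(nn.Module):
    def __init__(self, num_groups: int = 5, num_letters: int = 26):
        super().__init__()
        # Shared convolutional backbone over the 2-channel input.
        self.backbone = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Coarse head: predicts a group of similar letters (hierarchy level 1).
        self.group_head = nn.Linear(64, num_groups)
        # Fine head: predicts the letter from backbone features plus the
        # coarse prediction (hierarchy level 2).
        self.letter_head = nn.Linear(64 + num_groups, num_letters)

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)                      # (B, 64)
        group_logits = self.group_head(feats)         # (B, num_groups)
        fused = torch.cat([feats, group_logits.softmax(dim=-1)], dim=-1)
        letter_logits = self.letter_head(fused)       # (B, num_letters)
        return group_logits, letter_logits

if __name__ == "__main__":
    # Dummy batch: 8 samples, 2 channels, 64x64 feature maps (assumed shape).
    model = TwoChannelHierCNN()
    group_logits, letter_logits = model(torch.randn(8, 2, 64, 64))
    print(group_logits.shape, letter_logits.shape)    # (8, 5) and (8, 26)
```

The hierarchical split reflects a common coarse-to-fine strategy: the coarse head narrows the prediction to a group of similarly shaped ear-canal deformation patterns, and the fine head disambiguates the individual letter within that group.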