Attention and Sequence Modeling for Match-Mismatch Classification of Speech Stimulus and EEG Response

Cited by: 1
Authors
Borsdorf, Marvin [2 ]
Cai, Siqi [2 ]
Pahuja, Saurav [2 ]
De Silva, Dashanka [2 ]
Li, Haizhou [1 ,2 ,3 ]
Schultz, Tanja [4 ]
Affiliations
[1] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen 518172, Peoples R China
[2] Univ Bremen, Machine Listening Lab, D-28359 Bremen, Germany
[3] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore
[4] Univ Bremen, Cognit Syst Lab, D-28359 Bremen, Germany
Source
IEEE OPEN JOURNAL OF SIGNAL PROCESSING | 2024 / Vol. 5
Keywords
Auditory system; EEG decoding; match-mismatch classification; speech envelope; speech stimulus; SPEAKER EXTRACTION; AUDITORY ATTENTION; NEURAL-NETWORK; BRAIN; LSTM; TRANSFORMER; ENVIRONMENT; PERCEPTION;
DOI
10.1109/OJSP.2023.3340063
CLC classification
TM (Electrical Engineering); TN (Electronics & Communication Technology);
Discipline codes
0808; 0809;
Abstract
For the development of neuro-steered hearing aids, it is important to study the relationship between a speech stimulus and the EEG response it elicits in a human listener. The recent Auditory EEG Decoding Challenge 2023 (Signal Processing Grand Challenge, IEEE International Conference on Acoustics, Speech and Signal Processing) addressed this relationship in the context of a match-mismatch classification task: given two candidate speech stimuli, identify the one that elicited a specific EEG response. Participating in the challenge, we adopted the challenge's baseline model and explored an attention encoder to replace the spatial convolution in the EEG processing pipeline, as well as additional sequence modeling methods based on RNN, LSTM, and GRU to preprocess the speech stimuli. We compared speech envelopes and mel-spectrograms as two different types of input speech stimulus and evaluated our models on a test set as well as on held-out stories and held-out subjects benchmark sets. In this work, we show that mel-spectrograms generally yield better results. Replacing the spatial convolution with an attention encoder helps to capture spatial and temporal information in the EEG response more effectively. Additionally, the sequence modeling methods further enhance performance when mel-spectrograms are used. Consequently, both lead to higher performance on the test set and the held-out stories benchmark set. Our best model outperforms the baseline by 1.91% on the test set and 1.35% on the total ranking score. We ranked second in the challenge.
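To make the match-mismatch task concrete, the following is a minimal, illustrative sketch (not the authors' actual model): an EEG segment is passed through a single-head self-attention encoder, standing in for the attention encoder that replaces the baseline's spatial convolution, and the pooled EEG representation is compared against two candidate stimulus representations by cosine similarity. All function names, dimensions, and the pooling/similarity choices here are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_encoder(eeg, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over EEG time steps.

    eeg: array of shape (time, channels). The projection matrices W_q,
    W_k, W_v are illustrative stand-ins for learned parameters.
    """
    Q, K, V = eeg @ W_q, eeg @ W_k, eeg @ W_v
    scores = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    return scores @ V  # (time, d): each step is a weighted mix of all steps

def match_mismatch(eeg_repr, stim_a, stim_b):
    """Pick which of two stimulus representations matches the EEG.

    Mean-pools each sequence over time and compares by cosine
    similarity; returns 0 if stim_a matches better, else 1.
    """
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
    e = eeg_repr.mean(axis=0)
    sims = [cos(e, stim_a.mean(axis=0)), cos(e, stim_b.mean(axis=0))]
    return int(np.argmax(sims))
```

In the real challenge systems, the stimulus branch would itself be a sequence model (e.g. a GRU over mel-spectrogram frames) and all projections would be trained end-to-end; this sketch only shows the comparison structure of the task.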
Pages: 799-809
Page count: 11