35 entries in total
[1]
Baevski A., 2020, wav2vec 2.0: A framework for self-supervised learning of speech representations
[3]
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), P347-356
[4]
Cheng JY, 2021, 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), P2447
[5]
Christ Lukas, 2023, arXiv:2305.03369
[6]
Speed-Robust Keyword Spotting via Soft Self-Attention on Multi-Scale Features, 2022, 2022 IEEE Spoken Language Technology Workshop (SLT), P1014-1021
[7]
Stable Speech Emotion Recognition with Head-k-Pooling Loss, 2023, INTERSPEECH 2023, P661-665
[8]
LETR: A Lightweight and Efficient Transformer for Keyword Spotting, 2022, 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), P7987-7991
[9]
Dosovitskiy A., 2020, ICLR 2021
[10]
Goodfellow Ian J., 2013, Neural Information Processing. 20th International Conference, ICONIP 2013. Proceedings: LNCS 8228, P117, DOI 10.1007/978-3-642-42051-1_16