53 entries in total
[3]
[Anonymous], 2014, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing-Proceedings
[4]
[Anonymous], 2016, The extended ballroom dataset
[5]
Ba J. L., 2015, PROC INT C LEARN REP
[6]
Cano P., 2006, ISMIR 2004 Audio Description Contest
[7]
End-to-End Object Detection with Transformers [J]. COMPUTER VISION - ECCV 2020, PT I, 2020, 12346: 213-229
[8]
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 347-356
[9]
DEVELOPING REAL-TIME STREAMING TRANSFORMER TRANSDUCER FOR SPEECH RECOGNITION ON LARGE-SCALE DATASET [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021: 5904-5908
[10]
Cho K, 2014, P SSST 8 8 WORKSH SY, P103, DOI 10.3115/v1/W14-4012