Large-Scale Automatic Audiobook Creation

Cited by: 0
Authors
Walsh, Brendan [1]
Hamilton, Mark [1, 2]
Newby, Greg [3]
Wang, Xi [1]
Ruan, Serena [1]
Zhao, Sheng [1]
He, Lei [1]
Zhang, Shaofei [1]
Dettinger, Eric [1]
Freeman, William T. [2, 4]
Weimer, Markus [1]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
[2] MIT, Cambridge, MA 02139 USA
[3] Project Gutenberg, Salt Lake City, UT USA
[4] Google, Mountain View, CA 94043 USA
Source
INTERSPEECH 2023 | 2023
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206; 082403
Abstract
An audiobook can dramatically improve a work of literature's accessibility and increase reader engagement. However, audiobooks can take hundreds of hours of human effort to create, edit, and publish. In this work, we present a system that can automatically generate high-quality audiobooks from online e-books. In particular, we leverage recent advances in neural text-to-speech to create and release thousands of human-quality, open-license audiobooks from the Project Gutenberg e-book collection. Our method can identify the proper subset of e-book content to read for a wide collection of diversely structured books and can operate on hundreds of books in parallel. Our system allows users to customize an audiobook's speaking speed, style, and emotional intonation, and can even match a desired voice using a small amount of sample audio. This work contributed over five thousand open-license audiobooks and an interactive demo that allows users to quickly create their own customized audiobooks. To listen to the audiobook collection, visit https://aka.ms/audiobook.
Pages: 3675-3676
Page count: 2