Building a Rich Arabic Speech and Language Corpus Based on the Holy Quran

Cited by: 0

Authors:

Meftah, Ali [1]

Seddiq, Yasser [1,2]

Alotaibi, Yousef [1]

Selouani, Sid-Ahmed [3]

Affiliations:

[1] King Saud Univ, Coll Comp & Informat Sci, Riyadh, Saudi Arabia

[2] KACST, Riyadh, Saudi Arabia

[3] Univ Moncton, LARIHS Lab, Campus Shippagan, Shippegan, NB, Canada

Source:

ARABIC LANGUAGE PROCESSING: FROM THEORY TO PRACTICE | 2018 / Vol. 782

Keywords:

The Holy Quran; Speech corpus; Arabic speech processing

DOI:

10.1007/978-3-319-73500-9_7

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification code:

081203; 0835

Abstract:

This paper pursues the goal of creating a reliable speech corpus based on audio recordings of The Holy Quran (THQ). Achieving that goal involves several major steps and essential requirements. Given the tremendous number of recordings available today, it is of fundamental importance to select those that feature both high audio quality and perfect reciter performance. Since the intended beneficiaries of the corpus are the digital speech processing research community, it is also essential to present the audio corpus and other language material, such as the language model, in an efficient, familiar, and convenient way. Audio recordings of THQ are selected from four sources that meet a high standard of reciter performance. Significant effort is devoted to the phonetic transcription of the audio content, so that the written transcript maps perfectly to the uttered phonemes. Furthermore, the corpus dictionary, which is usually required in fields such as machine learning and data mining, is also created. The first release of the corpus consists of recorded recitations, with the necessary metadata, of three chapters of THQ of different lengths, recited by four reference reciters. These chapters were selected for this phase based on a statistical analysis of the lengths of all chapters and of the frequency of occurrence of the Arabic phonemes across all chapters of THQ.


Pages: 90-101

Page count: 12

