Building English-Punjabi Aligned Parallel Corpora of Nouns from Comparable Corpora

Cited by: 0

Authors:

Kaur, Dilshad [1]

Singh, Satwinder [1]

Affiliation:

[1] Cent Univ Punjab Bathinda, Dept Comp Sci & Technol, Bathinda, Punjab, India

Source:

APPLIED COMPUTER SYSTEMS | 2023 / Vol. 28 / No. 02

Keywords:

Aligned corpora; comparable corpora; English-Punjabi; parallel corpora

DOI:

10.2478/acss-2023-0024

Chinese Library Classification (CLC) number:

TP301 [Theory, methods]

Subject classification code:

081202

Abstract:

Comparable corpora are a suitable resource for extracting parallel data because they are abundantly available. This matters most for language pairs where parallel data are scarce. This study focuses on building parallel data for the Punjabi-English language pair. The raw data were collected from the web content of "Mann Ki Baat", a collection of the texts of speeches by the Prime Minister of India, Mr. Narendra Modi, broadcast on the last Sunday of every month. The data were cleaned and pre-processed using a natural language toolkit. A BERT-based alignment model was built to align the two text files at the sentence level. Noun forms were then extracted with the NLTK library in Python. The resulting noun-aligned dataset for the English-Punjabi language pair has been made available in the Mendeley Data repository.
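
The sentence-level alignment step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that every sentence has already been converted to a vector by some multilingual encoder; this record does not state which BERT variant, similarity measure, or matching strategy the authors used. The function names, the greedy best-match strategy, the toy vectors, and the 0.6 threshold below are all hypothetical and serve illustration only.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_sentences(
    src_vecs: List[Vector],
    tgt_vecs: List[Vector],
    threshold: float = 0.6,  # assumed cut-off, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Greedy alignment: pair each source sentence with its most similar
    target sentence and keep the pair only if the score reaches threshold."""
    if not tgt_vecs:
        return []
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy vectors standing in for encoder output (hypothetical data).
english = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
punjabi = [[0.0, 0.9, 0.1], [0.8, 0.1, 0.0], [1.0, 1.0, 1.0]]

aligned = align_sentences(english, punjabi)
print([(i, j) for i, j, _ in aligned])  # prints [(0, 1), (1, 0)]
```

The third English vector has no target above the threshold, so it is left unaligned, which is the expected behaviour for comparable (not strictly parallel) text. Noun extraction from the aligned pairs is not shown here.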

Pages: 245-251

Page count: 7
