Collection and Automatic Analysis with Natural Language Processing on a Corpus of Andean Oral Literature Implemented on the Web

Cited by: 0
Authors
Soria Solis, Ivan [1]
Castro Buleje, Carlos Yinmel [1]
Silvera Reynaga, Humberto [1]
Mamani Macedo, Mauro Felix [2]
Leon Soncco, Dionicia [2]
Mautino Guillen, Alejandro Giancarlo [3]
Affiliations
[1] Univ Nacl Jose Maria Arguedas, Andahuaylas, Peru
[2] Univ Nacl Mayor San Marcos, Lima, Peru
[3] Univ Nacl Santiago Antunez Mayolo, Huaraz, Peru
Source
INTELLIGENT SYSTEMS AND APPLICATIONS, VOL 4, INTELLISYS 2024 | 2024 / Vol. 1068
Keywords
Andean oral literature; Word embeddings; FastText; Natural language processing
DOI
10.1007/978-3-031-66336-9_32
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Oral literature is transmitted through tradition and is typically neither cataloged nor recorded. Analyzing it is challenging because few computer tools exist for its collection, preservation, and study. Andean oral literature in Quechua is particularly affected, as the computational resources available to support its study and processing are scarce. This study aims to establish a Web-based repository for building a corpus of Andean oral literature texts. We used this corpus to train a natural language processing model that enables automatic analysis and topic classification of the texts. We applied the FastText tool, which offers a pre-trained model for Quechua. Because writing in this language is highly variable, word embeddings were preferred: they represent meaning according to context and can therefore handle similar words. The model was evaluated using the corpus dataset in conjunction with the pre-trained vectors. The collected texts were used to train a classification model, which was assessed on accuracy, precision, recall, and F1. The model performed best when no pre-trained model was used.
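The four metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say how precision, recall, and F1 were averaged across topics, so macro-averaging is an assumption here. The function name classification_metrics and the topic labels ("myth", "tale", "song") are hypothetical and not taken from the paper.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro-averaging is an assumption; the abstract does not state
    which averaging scheme the authors used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Per-label counts of true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical topic labels, for illustration only.
y_true = ["myth", "myth", "tale", "song"]
y_pred = ["myth", "tale", "tale", "song"]
metrics = classification_metrics(y_true, y_pred)
```

In this toy example three of the four predictions are correct, so accuracy is 0.75. One "myth" text is misclassified as "tale", which lowers recall for "myth" and precision for "tale".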
Pages: 449-463
Number of pages: 15