Complex Word Identification for Italian Language: a dictionary-based approach

Cited by: 0
Authors
Occhipinti, Laura [1]
Affiliations
[1] Univ Bologna, Bologna, Italy
Source
PROCEEDINGS OF THE SIXTH INTERNATIONAL CONFERENCE COMPUTATIONAL LINGUISTICS IN BULGARIA, CLIB 2024 | 2024
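Keywords
complex word identification; Italian language; lexical complexity
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract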
Assessing word complexity in Italian is challenging, particularly because no standardized dataset exists. This study introduces the first automatic model designed to identify word complexity for native Italian speakers. A dictionary of simple and complex words was constructed, and various configurations of linguistic features were explored to find the best statistical classifier based on the Random Forest algorithm. Using the probability that a word belongs to each class, the models' predictions were compared with human assessments drawn from a dataset annotated for complexity perception. Finally, the correspondence between the model predictions and the human inter-annotator agreement was analyzed using Spearman correlation. Our findings indicate that a model incorporating both linguistic features and word embeddings performed better than simpler models and correlated with the human judgements at a level similar to the inter-annotator agreement. This study demonstrates the feasibility of an automatic system for detecting complexity in Italian that performs well and is comparably effective to humans in this subjective task.
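The comparison step described in the abstract (class probabilities from the classifier set against human complexity ratings via Spearman correlation) can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption: the function names `average_ranks` and `spearman` are hypothetical, the five probabilities and ratings are invented purely for illustration, and the correlation is computed by hand with the standard library rather than with whatever tooling the author used.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Invented illustration data (NOT from the paper): the probability of the
# "complex" class for five words, and mean human complexity ratings.
model_probs = [0.10, 0.35, 0.80, 0.55, 0.95]
human_scores = [1.0, 2.5, 4.0, 4.5, 5.0]

rho = spearman(model_probs, human_scores)
print(f"Spearman rho between model and human ratings: {rho:.3f}")
```

In this toy data the model swaps the order of only two words relative to the human ranking, so the correlation is high but not perfect. A rank-based measure suits this comparison because probabilities and human ratings sit on different scales, and only their ordering of the words needs to agree.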
Pages: 119 - 129
Number of pages: 11
Related papers
10 in total
  • [1] Text-to-text generative approach for enhanced complex word identification
    Sliwiak, Patrycja
    Shah, Syed Afaq Ali
    NEUROCOMPUTING, 2024, 610
  • [2] Automatic Identification of Domain Terms: An Approach for Italian
    Artese, Maria Teresa
    Gagliardi, Isabella
    DIGITAL PRESENTATION AND PRESERVATION OF CULTURAL AND SCIENTIFIC HERITAGE, 2020, 10 : 251 - 257
  • [3] Cross-Lingual Transfer Learning for Complex Word Identification
    Zaharia, George-Eduard
    Cercel, Dumitru-Clementin
    Dascalu, Mihai
    2020 IEEE 32ND INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI), 2020 : 384 - 390
  • [4] Adaptive Complex Word Identification through False Friend Detection
    Aprosio, Alessio Palmero
    Menini, Stefano
    Tonelli, Sara
    UMAP'20: PROCEEDINGS OF THE 28TH ACM CONFERENCE ON USER MODELING, ADAPTATION AND PERSONALIZATION, 2020 : 192 - 200
  • [5] LegalEc: A New Corpus for Complex Word Identification Research in Law Studies in Ecuatorian Spanish
    Ortiz-Zambrano, Jenny A.
    Espin-Riofrio, Cesar
    Montejo-Raez, Arturo
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2023, (71) : 247 - 259
  • [6] Revisiting the Role of Classical Readability Formulae Parameters in Complex Word Identification (Part 2)
    Venugopal, Gayatri
    Pramod, Dhanya
    Saini, Jatinderkumar R.
    COMPUTER SCIENCE JOURNAL OF MOLDOVA, 2022, 30 (01) : 49 - 63
  • [7] Automated classification of cancer morphology from Italian pathology reports using Natural Language Processing techniques: A rule-based approach
    Hammami, Linda
    Paglialonga, Alessia
    Pruneri, Giancarlo
    Torresani, Michele
    Sant, Milena
    Bono, Carlo
    Caiani, Enrico Gianluca
    Baili, Paolo
    JOURNAL OF BIOMEDICAL INFORMATICS, 2021, 116
  • [8] Italian medical language: A corpus-based study on patient information leaflets
    Nitti, Paolo
    FORUM ITALICUM, 2025
  • [9] Lexicon-Grammar based open information extraction from natural language sentences in Italian
    Guarasci, Raffaele
    Damiano, Emanuele
    Minutolo, Aniello
    Esposito, Massimo
    De Pietro, Giuseppe
    EXPERT SYSTEMS WITH APPLICATIONS, 2020, 143
  • [10] A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records
    Catelli, Rosario
    Gargiulo, Francesco
    Casola, Valentina
    De Pietro, Giuseppe
    Fujita, Hamido
    Esposito, Massimo
    IEEE ACCESS, 2021, 9 : 19097 - 19110