Creation of a large news corpus in Spanish for the diachronic and diatopic analysis of the use of language

Cited by: 0
Authors
Razgovorov, Pavel [1]
Tomas, David [2]
Affiliations
[1] Ingn Informat Empresarial, C Auso & Mona 16, Alicante 03006, Spain
[2] Univ Alicante, C San Vicente del Raspeig S-N, E-03690 Alicante, Spain
Source
PROCESAMIENTO DEL LENGUAJE NATURAL | 2019 / Issue 62
Keywords
Corpus; text mining; diachronic analysis; diatopic analysis
DOI
10.26342/2019-62-3
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This article describes the process carried out to develop a large corpus of news stories in Spanish. The collected texts are located both temporally and geographically, which makes the corpus a useful resource for linguistics, sociology, and data journalism: it supports diachronic and diatopic study of language use and tracking the evolution of specific events. The corpus can be freely downloaded using the software developed as part of this work. The article includes a statistical analysis of the corpus and two case studies that show its potential for event analysis.
Pages: 29-36
Number of pages: 8