Text Classification by Genres Based on Rhythmic Characteristics

Cited by: 2
Authors
Lagutina, K. V. [1 ]
Lagutina, N. S. [1 ]
Boychuk, E. I. [2 ]
Affiliations
[1] Yaroslavl State Univ, Yaroslavl 150003, Russia
[2] Ushinsky Yaroslavl State Pedag Univ, Yaroslavl 150000, Russia
Keywords
stylometry; natural language processing; rhythmic characteristics; genres; text classification; WORDS;
DOI
10.3103/S0146411622070136
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This article considers the rhythm of texts of various genres: fiction novels, advertising, scientific articles, reviews, tweets, and political articles. The authors identify lexical and grammatical devices in the texts, such as anaphora, epiphora, diacope, and aposiopesis, which serve as markers of text rhythm. On this basis, statistical characteristics are calculated that describe these rhythmic devices quantitatively and structurally. The resulting text model is visualized for statistical analysis using boxplots and heatmaps, which show differences in the rhythm of the various genres. The boxplots show that almost all genres differ from each other in the overall density of rhythmic characteristics. The heatmaps show that the genres differ in rhythm structure. Rhythmic characteristics were then successfully used to classify texts into six genres. The high quality of the classification shows that rhythmic characteristics are a good marker for most genres, especially fiction. The experiments were carried out using the ProseRhythmDetector software for Russian and English. The text corpora contain 300 texts for each language.
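As a rough illustration of what a density-style rhythmic characteristic might look like, the sketch below counts two of the devices named in the abstract (anaphora and epiphora) over adjacent sentence pairs. It is not the ProseRhythmDetector algorithm, which this record does not describe. The function names, the naive sentence splitter, and the single-word matching rule are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def words(sentence):
    # Lowercased word tokens; \w is Unicode-aware, so Russian text also works.
    return re.findall(r"\w+", sentence.lower())

def rhythm_densities(text):
    # Hypothetical feature: the share of adjacent sentence pairs that begin
    # with the same word (anaphora) or end with the same word (epiphora).
    sents = [words(s) for s in split_sentences(text)]
    sents = [w for w in sents if w]
    pairs = list(zip(sents, sents[1:]))
    if not pairs:
        return {"anaphora": 0.0, "epiphora": 0.0}
    anaphora = sum(a[0] == b[0] for a, b in pairs)
    epiphora = sum(a[-1] == b[-1] for a, b in pairs)
    return {"anaphora": anaphora / len(pairs),
            "epiphora": epiphora / len(pairs)}

sample = "We came home. We saw the sea. They loved the sea."
print(rhythm_densities(sample))
```

Per-text values of this kind could then be compared across genres or passed to a classifier as features. The record does not specify which features or which classifier the authors used.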
Pages: 735-743
Page count: 9