Robustness of sentence length measures in written texts

Cited by: 7
Authors
Vieira, Denner S. [1 ]
Picoli, Sergio [1 ]
Mendes, Renio S. [1 ]
Affiliation
[1] Univ Estadual Maringa, Dept Fis, Ave Colombo 5790, BR-87020900 Maringa, Parana, Brazil
Keywords
Sentence length; Time series; Linear correlation; Probability distribution; Auto-correlation; LONG-RANGE CORRELATIONS; HUMAN LANGUAGE; TRANSLATION; ENGLISH
DOI
10.1016/j.physa.2018.04.104
Chinese Library Classification
O4 [Physics]
Discipline classification code
0702
Abstract
Hidden structural patterns in written texts have been the subject of considerable research in recent decades. In particular, mapping a text into a time series of sentence lengths is a natural way to investigate text structure. Typically, sentence length has been quantified using measures based on the number of words and the number of characters, but other variations are possible. To quantify the robustness of different sentence length measures, we analyzed a database containing about five hundred books in English. For each book, we extracted six distinct measures of sentence length, including the number of words and the number of characters (taking into account lemmatization and stop-word removal). We compared these six measures for each book by using (i) Pearson's coefficient to investigate linear correlations; (ii) the Kolmogorov-Smirnov test to compare distributions; and (iii) detrended fluctuation analysis (DFA) to quantify auto-correlations. We found that all six measures exhibit very similar behavior, suggesting that sentence length is a robust measure related to text structure. © 2018 Elsevier B.V. All rights reserved.
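The three comparisons named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code. Several things here are assumptions made for illustration: the sentence splitter (a simple split on '.', '!' and '?'), the restriction to two of the six measures (word count and character count, with no lemmatization or stop-word removal), first-order DFA with non-overlapping windows, and every function name.

```python
import math
import re
from bisect import bisect_right


def sentence_lengths(text):
    """Map a text to two sentence-length series: words and characters.

    Assumption: sentences end at '.', '!' or '?'. The abstract does not
    state the segmentation rules actually used.
    """
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [len(s.split()) for s in sentences]
    chars = [sum(len(w) for w in s.split()) for s in sentences]
    return words, chars


def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length series.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic (max gap between empirical CDFs).

    Assumption: series on different scales (words vs. characters) should be
    rescaled before comparison; how the paper does this is not given here.
    """
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in sorted(set(xs) | set(ys)):
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


def dfa_fluctuation(series, n):
    """First-order DFA fluctuation F(n) over non-overlapping windows of size n."""
    mean = sum(series) / len(series)
    profile, running = [], 0.0
    for v in series:  # integrated, mean-subtracted series
        running += v - mean
        profile.append(running)
    t_mean = (n - 1) / 2
    t_var = sum((t - t_mean) ** 2 for t in range(n))
    sq_sum, count = 0.0, 0
    for start in range(0, len(profile) - n + 1, n):
        seg = profile[start:start + n]
        seg_mean = sum(seg) / n
        # Least-squares linear trend within the window.
        slope = sum((t - t_mean) * (s - seg_mean) for t, s in enumerate(seg)) / t_var
        sq_sum += sum((s - seg_mean - slope * (t - t_mean)) ** 2
                      for t, s in enumerate(seg))
        count += n
    return math.sqrt(sq_sum / count)


def dfa_exponent(series, window_sizes=(4, 8, 16, 32)):
    """Slope of log F(n) versus log n; window sizes must be at least 3."""
    log_n = [math.log(n) for n in window_sizes]
    log_f = [math.log(dfa_fluctuation(series, n)) for n in window_sizes]
    mn, mf = sum(log_n) / len(log_n), sum(log_f) / len(log_f)
    num = sum((a - mn) * (b - mf) for a, b in zip(log_n, log_f))
    return num / sum((a - mn) ** 2 for a in log_n)
```

For a single book, one would call `sentence_lengths` on the full text, pass the two resulting series to `pearson`, rescale them before calling `ks_statistic`, and compare `dfa_exponent` across the series. An exponent near 0.5 indicates an uncorrelated series, while larger values point to long-range correlations.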
Pages: 749-754
Number of pages: 6