CH-Bench: a user-oriented benchmark for systems for efficient distant reading (design, performance, and insights)

Authors
Jens Willkomm
Markus Raster
Martin Schäler
Klemens Böhm
Affiliations
[1] Karlsruhe Institute of Technology (KIT)
[2] University of Salzburg
Source
International Journal on Digital Libraries | 2023 / Vol. 24
Keywords
Benchmark design; Text corpus; Distant reading; Query performance; Corpus insights;
Abstract
Data science deals with the discovery of information from large volumes of data. The data studied by scientists in the humanities include large textual corpora. An important objective is to study the ideas and expectations of a society regarding specific concepts, like “freedom” or “democracy,” both for today’s society and, even more so, for societies of the past. Studying the meaning of words using large corpora requires efficient systems for text analysis, so-called distant reading systems. Making such systems efficient calls for a specification of the necessary functionality and clear expectations regarding typical workloads. Currently, however, neither is clearly defined, and there is no benchmark to evaluate distant reading systems. In this article, we propose such a benchmark, with the following innovations. As a first step, we collect and structure various information needs of the target users. We then formalize the notion of word context to facilitate the analysis of specific concepts. Using this notion, we formulate queries in line with the information needs of users. Finally, based on these queries, we propose concrete benchmark queries. To demonstrate the benefit of our benchmark, we conduct an evaluation with two objectives. First, we aim at insights regarding the content of different corpora, i.e., whether and how their size and nature (e.g., popular and broad literature or specific expert literature) affect results. Second, we benchmark different data management technologies, which has allowed us to identify performance bottlenecks.
Pages: 243-261
Page count: 18