Improving Digital Libraries' Provision of Digital Humanities Datasets: A Case Study of HTRC Literature Dataset

Cited by: 2
Authors
Hu, Yuerong [1]
Jiang, Ming [1]
Underwood, Ted [1]
Downie, J. Stephen [1]
Affiliation
[1] Univ Illinois, Sch Informat Sci, Urbana, IL 61801 USA
Source
PROCEEDINGS OF THE ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES IN 2020, JCDL 2020 | 2020
Keywords
digital libraries; digital humanities; cultural analytics; datasets
DOI
10.1145/3383583.3398621
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This paper investigates the limitations and challenges of curated datasets that digital libraries provide in support of digital humanities research. We present a use case based on an English-language literature dataset of 178,381 volumes, curated by the HathiTrust Research Center (HTRC), which we use to measure change in three literary genres. These volumes were selected from over 17 million digitized items in the HathiTrust Digital Library. We demonstrate our methods and workflow for improving the representativeness and scholarly usability of the existing datasets. We analyzed and effectively overcame three common limitations: duplicate volumes, uneven distribution of data, and OCR errors. We suggest that digital library stakeholders flag and address these limitations to make the datasets they provide more usable for digital humanities research.
Pages: 405-408
Page count: 4