Beyond time delays: how web scraping distorts measures of online news consumption

Cited by: 0
Authors
Ulloa, Roberto [1, 2]
Mangold, Frank [1]
Schmidt, Felix [1, 3]
Gilsbach, Judith [1, 4]
Stier, Sebastian [1, 3]
Affiliations
[1] Leibniz Inst Social Sci, Dept Computat Social Sci, GESIS, Cologne, Germany
[2] Univ Konstanz, Cluster Excellence The Polit Inequal, Univ Str 10, D-78464 Constance, Germany
[3] Univ Mannheim, Sch Social Sci, Mannheim, Germany
[4] Univ Konstanz, Grad Sch Social & Behav Sci, Constance, Germany
Keywords
MEDIA; EXPOSURE; INCREASE; PAYWALL
DOI
10.1080/19312458.2025.2482538
Chinese Library Classification (CLC) number
G2 [Information and Knowledge Dissemination]
Discipline classification code
05; 0503
Abstract
As the exploration of digital behavioral data revolutionizes communication research, understanding the nuances of data collection methodologies becomes increasingly pertinent. This study focuses on one prominent data collection approach, web scraping, and specifically on its application in the growing field of research relying on web browsing data. We investigate discrepancies between content obtained directly during user interaction with a website (in-situ) and content scraped using the URLs of participants' logged visits (ex-situ) with various time delays (0, 30, 60, and 90 days). We find substantial disparities between the methodologies, uncovering that errors are not uniformly distributed across news categories regardless of the classification method (domain, URL, or content analysis). These biases compromise the precision of measurements used in the existing literature. The ex-situ collection environment is the primary source of the discrepancies (33.8%), while the time delays in the scraping process play a smaller role (adding approximately 6.5 percentage points over 90 days). Our research emphasizes the need for data collection methods that capture web content directly in the user's environment. However, acknowledging the complexities of such methods, we further explore strategies to mitigate biases in web-scraped browsing histories, offering recommendations for researchers who rely on this method and laying the groundwork for developing error-correction frameworks.
Pages: 22