Predicting Publication of Clinical Trials Using Structured and Unstructured Data: Model Development and Validation Study

Cited by: 5
Authors
Wang, Siyang [1]
Suster, Simon [1,4]
Baldwin, Timothy [1,2]
Verspoor, Karin [3]
Affiliations
[1] Univ Melbourne, Sch Comp & Informat Syst, Melbourne, Australia
[2] Mohamed Bin Zayed Univ Artificial Intelligence, Abu Dhabi, U Arab Emirates
[3] RMIT Univ, Sch Comp Technol, Melbourne, Australia
[4] Univ Melbourne, Sch Comp & Informat Syst, Melbourne 3000, Australia
Funding
Australian Research Council;
Keywords
clinical trials; study characteristics; machine learning; natural language processing; pretrained language models; publication success; UPDATE; FAIL; BIG;
DOI
10.2196/38859
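Chinese Library Classification
R19 [Health care organization and services (health services administration)];
Discipline classification code
Abstract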
Background: Publication of registered clinical trials is a critical step in the timely dissemination of trial findings. However, a significant proportion of completed clinical trials are never published, motivating the need to analyze the factors behind success or failure to publish. This could inform study design, help regulatory decision-making, and improve resource allocation. It could also enhance our understanding of bias in the publication of trials and publication trends based on the research direction or strength of the findings. Although the publication of clinical trials has been addressed in several descriptive studies at an aggregate level, there is a lack of research on the predictive analysis of a trial's publishability given an individual (planned) clinical trial description.

Objective: We aimed to conduct a study that combined structured and unstructured features relevant to publication status in a single predictive approach. Established natural language processing techniques as well as recent pretrained language models enabled us to incorporate information from the textual descriptions of clinical trials into a machine learning approach. We were particularly interested in whether and which textual features could improve the classification accuracy for publication outcomes.

Methods: In this study, we used metadata from ClinicalTrials.gov (a registry of clinical trials) and MEDLINE (a database of academic journal articles) to build a data set of clinical trials (N=76,950) that contained the description of a registered trial and its publication outcome (27,702/76,950, 36% published and 49,248/76,950, 64% unpublished). This is the largest data set of its kind, which we released as part of this work. The publication outcome in the data set was identified from MEDLINE based on clinical trial identifiers. We carried out a descriptive analysis and predicted the publication outcome using 2 approaches: a neural network with a large domain-specific language model and a random forest classifier using a weighted bag-of-words representation of text.

Results: First, our analysis of the newly created data set corroborates several findings from the existing literature regarding attributes associated with a higher publication rate. Second, a crucial observation from our predictive modeling was that the addition of textual features (eg, eligibility criteria) offers consistent improvements over using only structured data (F1-score=0.62-0.64 vs F1-score=0.61 without textual features). Both pretrained language models and more basic word-based representations provide high-utility text representations, with no significant empirical difference between the two.

Conclusions: Different factors affect the publication of a registered clinical trial. Our approach to predictive modeling combines heterogeneous features, both structured and unstructured. We show that methods from natural language processing can provide effective textual features to enable more accurate prediction of publication success, which has not been explored for this task previously.
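The abstract mentions a weighted bag-of-words representation of trial text and reports results as F1-scores. The following is a minimal sketch of both ideas using only the Python standard library. It is not the authors' implementation: the abstract does not say which weighting scheme was used, so TF-IDF with smoothed inverse document frequency is assumed here, and the function names `tokenize`, `tfidf_vectors`, and `f1_score` as well as the example texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document.

    Assumption: weight = term frequency * smoothed inverse document
    frequency. The abstract only says "weighted bag-of-words".
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty text
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical eligibility-criteria snippets, for illustration only.
criteria = ["adults aged 18 years", "adults with diabetes"]
vectors = tfidf_vectors(criteria)

# Hypothetical labels: 1 = published, 0 = unpublished.
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

In this sketch a term that appears in every document (such as "adults") receives a lower weight than a term specific to one document (such as "diabetes"), which is the usual motivation for weighting a bag-of-words representation. Such vectors could then be concatenated with structured trial attributes before being passed to a classifier.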
Pages: 18