Gene name errors: Lessons not learned

Cited by: 14
Authors
Abeysooriya, Mandhri [1]
Soria, Megan [1]
Kasu, Mary Sravya [1]
Ziemann, Mark [1]
Affiliations
[1] Deakin Univ, Sch Life & Environm Sci, Geelong, Vic, Australia
Keywords
REPRODUCIBLE RESEARCH;
DOI
10.1371/journal.pcbi.1008984
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Erroneous conversion of gene names into dates and other data types has been a frustration for computational biologists for years. We hypothesized that such errors in supplementary files might diminish after a report in 2016 highlighting the extent of the problem. To assess this, we performed a scan of supplementary files published in PubMed Central from 2014 to 2020. Overall, gene name errors continued to accumulate unabated in the period after 2016. Improved scanning software that we developed identified gene name errors in 30.9% (3,436/11,117) of articles with supplementary Excel gene lists, a figure significantly higher than previously estimated. This is because gene names are converted not just to dates and floating-point numbers, but also to the internal date format (five-digit numbers). These findings further reinforce that spreadsheets are ill-suited to use with large genomic data.
Author summary
Autocorrection is a feature of modern software, including messaging apps, word processors and spreadsheets. It is designed to avoid data entry errors, but "autocorrect fails" can lead to information being distorted in undesired and sometimes humorous ways. What is not funny, though, is having genomics spreadsheets suffer from auto-conversion of gene names like SEPT8, DEC1 and MARCH3 into dates, a problem first characterised in 2004. A 2016 article on this topic led the Human Gene Name Consortium to change many of these gene names to be less susceptible to autocorrect. Despite this, our work here shows that gene name autocorrect errors continue to accumulate in supplementary genomics spreadsheet files at a rapid pace. To avoid this and other reproducibility problems with spreadsheets, big changes are required in the way genomics scientists analyse and share data. We provide several practical steps researchers can take to avoid gene name errors and reiterate that big genomics data analysis is better suited to Python/R notebooks than to spreadsheets.
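The three kinds of conversion named in the abstract (dates, floating-point numbers, and five-digit internal date numbers) can be illustrated with a small heuristic check over a column of gene identifiers. This is a minimal sketch only, not the authors' scanning software: the regular expressions, the function names `classify_cell` and `suspect_cells`, and the sample values are all assumptions chosen for illustration.

```python
import re
from typing import List, Optional, Tuple

# Heuristic patterns; assumptions for illustration, not the published scanner.
MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"

# Date-like text such as "1-Sep" or "Mar-01".
DATE_RE = re.compile(
    rf"^(\d{{1,2}}[-/ ]({MONTHS})|({MONTHS})[-/ ]\d{{1,2}})$", re.IGNORECASE
)
# A bare five-digit number, as produced by a spreadsheet's internal date format.
SERIAL_RE = re.compile(r"^\d{5}$")
# Scientific notation such as "2.31E+13".
SCI_FLOAT_RE = re.compile(r"^\d(\.\d+)?E[+-]?\d+$", re.IGNORECASE)


def classify_cell(value: str) -> Optional[str]:
    """Return the suspected conversion type for one cell, or None if it looks fine."""
    text = value.strip()
    if DATE_RE.match(text):
        return "date"
    if SERIAL_RE.match(text):
        return "serial_date"
    if SCI_FLOAT_RE.match(text):
        return "float"
    return None


def suspect_cells(column: List[str]) -> List[Tuple[int, str, str]]:
    """List (row index, cell text, suspected type) for every flagged cell."""
    flagged = []
    for index, cell in enumerate(column):
        kind = classify_cell(cell)
        if kind is not None:
            flagged.append((index, cell, kind))
    return flagged


if __name__ == "__main__":
    sample = ["TP53", "1-Sep", "43351", "2.31E+13", "MARCH3"]
    for row in suspect_cells(sample):
        print(row)
```

Intact symbols such as `MARCH3` or `DEC1` pass through because the patterns require a separator between the month and the number. A check like this can only flag suspicious cells; it cannot recover the original gene name, which is why keeping identifiers as text from the start is the safer practice.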
Pages: 13