Bag-of-Words Technique in Natural Language Processing: A Primer for Radiologists

Cited by: 24
Authors
Juluru, Krishna [1 ]
Shih, Hao-Hsin [1 ]
Murthy, Krishna Nand Keshava [1 ]
Elnajjar, Pierre [1 ]
Affiliations
[1] Mem Sloan Kettering Canc Ctr, Dept Radiol, 1275 York Ave,Box 29, New York, NY 10065 USA
Funding
US National Institutes of Health (NIH);
DOI
10.1148/rg.2021210025
Chinese Library Classification (CLC) number
R8 [Special Medicine]; R445 [Diagnostic Imaging];
Subject classification codes
1002 ; 100207 ; 1009 ;
Abstract
Natural language processing (NLP) is a methodology designed to extract concepts and meaning from human-generated unstructured (free-form) text. It is intended to be implemented by using computer algorithms so that it can be run on a corpus of documents quickly and reliably. To enable machine learning (ML) techniques in NLP, free-form text must be converted to a numerical representation. After several stages of preprocessing including tokenization, removal of stop words, token normalization, and creation of a master dictionary, the bag-of-words (BOW) technique can be used to represent each remaining word as a feature of the document. The preprocessing steps simplify the documents but also potentially degrade meaning. The values of the features in BOW can be modified by using techniques such as term count, term frequency, and term frequency-inverse document frequency. Experience and experimentation will guide decisions on which specific techniques will optimize ML performance. These and other NLP techniques are being applied in radiology. Radiologists' understanding of the strengths and limitations of these techniques will help in communication with data scientists and in implementation for specific tasks. © RSNA, 2021
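The pipeline named in the abstract (tokenization, stop-word removal, token normalization, a master dictionary, then term count, term frequency, and term frequency-inverse document frequency) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The stop-word list, the crude plural-stripping rule, the sample report sentences, and all function names are assumptions made for illustration, and the TF-IDF weighting shown (term frequency times log(N / document frequency)) is only one of several common variants.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "in", "of"}


def preprocess(text):
    """Tokenize, drop stop words, and apply a crude normalization."""
    # Tokenization: lowercase the text and keep runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stop-word removal.
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Naive normalization: strip a trailing "s" from tokens longer than
    # three letters (a stand-in for real stemming or lemmatization).
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in kept]


def build_vocabulary(token_lists):
    """Master dictionary: every distinct token in the corpus, sorted."""
    return sorted({t for tokens in token_lists for t in tokens})


def term_counts(tokens, vocab):
    """Bag-of-words vector of raw counts, one entry per dictionary term."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]


def term_frequencies(tokens, vocab):
    """Counts divided by the number of tokens in the document."""
    total = len(tokens) or 1  # guard against empty documents
    return [c / total for c in term_counts(tokens, vocab)]


def tf_idf(token_lists, vocab):
    """Term frequency weighted by log(N / document frequency)."""
    n_docs = len(token_lists)
    doc_freq = {t: sum(1 for tokens in token_lists if t in tokens) for t in vocab}
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
    return [
        [tf * idf[t] for tf, t in zip(term_frequencies(tokens, vocab), vocab)]
        for tokens in token_lists
    ]


if __name__ == "__main__":
    # Made-up sentences in the style of radiology reports.
    reports = [
        "Nodule in the right lung.",
        "No nodules in the lungs.",
        "Fracture of the left femur.",
    ]
    token_lists = [preprocess(r) for r in reports]
    vocab = build_vocabulary(token_lists)
    print(vocab)
    for tokens, weights in zip(token_lists, tf_idf(token_lists, vocab)):
        print(term_counts(tokens, vocab), [round(w, 3) for w in weights])
```

In this sketch a term that appears in every document receives a TF-IDF weight of zero, while a term confined to one document is weighted most heavily. Word order is discarded, which is one way the representation can degrade meaning.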
Pages: 1420-1426
Page count: 7
Related papers
19 items in total
[1]  
[Anonymous], NATURAL LANGUAGE TOO
[2]  
[Anonymous], Natural Language Processing with Python-Analyzing Text with the Natural Language Toolkit
[3]   Natural Language Processing Technologies in Radiology Research and Clinical Applications [J].
Cai, Tianrun ;
Giannopoulos, Andreas A. ;
Yu, Sheng ;
Kelil, Tatiana ;
Ripley, Beth ;
Kumamaru, Kanako K. ;
Rybicki, Frank J. ;
Mitsouras, Dimitrios .
RADIOGRAPHICS, 2016, 36 (01) :176-191
[4]   Deep Learning to Classify Radiology Free-Text Reports [J].
Chen, Matthew C. ;
Ball, Robyn L. ;
Yang, Lingyao ;
Moradzadeh, Nathaniel ;
Chapman, Brian E. ;
Larson, David B. ;
Langlotz, Curtis P. ;
Amrhein, Timothy J. ;
Lungren, Matthew P. .
RADIOLOGY, 2018, 286 (03) :845-852
[5]   Integrating Natural Language Processing and Machine Learning Algorithms to Categorize Oncologic Response in Radiology Reports [J].
Chen, Po-Hao ;
Zafar, Hanna ;
Galperin-Aizenberg, Maya ;
Cook, Tessa .
JOURNAL OF DIGITAL IMAGING, 2018, 31 (02) :178-184
[6]   Algorithmic Stemmers or Morphological Analysis? An Evaluation [J].
Fautsch, Claire ;
Savoy, Jacques .
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2009, 60 (08) :1616-1624
[7]   A universal information theoretic approach to the identification of stopwords [J].
Gerlach, Martin ;
Shi, Hanyu ;
Amaral, Luis A. Nunes .
NATURE MACHINE INTELLIGENCE, 2019, 1 (12) :606-612
[8]  
Grefenstette G., 1999, Tokenization, V9, P117, DOI 10.1007/978-94-015-9273-4_9
[9]   Machine Learning for Automation of Radiology Protocols for Quality and Efficiency Improvement [J].
Kalra, Angad ;
Chakraborty, Amit ;
Fine, Benjamin ;
Reicher, Joshua .
JOURNAL OF THE AMERICAN COLLEGE OF RADIOLOGY, 2020, 17 (09) :1149-1158
[10]   Foundations of statistical natural language processing [J].
Lee, L .
COMPUTATIONAL LINGUISTICS, 2000, 26 (02) :277-279