Context-aware automated quality assessment of textual data

Cited by: 0
Authors
Mylavarapu G. [1]
Viswanathan K.A. [2 ]
Thomas J. [2 ]
Affiliations
[1] Department of Computer Science and Information Systems, Murray State University, Murray
[2] Department of Computer Science, Oklahoma State University, Stillwater, OK
Keywords
automated data quality assessment; context-aware; data accuracy; data consistency; data context; Doc2Vec; lexicon; sentiment analysis; textual data
DOI
10.1504/IJBIDM.2023.130588
Abstract
Data analysis is a crucial process in data science that extracts useful information from any form of data. With the rapid growth of technology, unstructured data such as text and images are being produced in ever larger amounts. Apart from the analytical techniques used, the quality of the data plays a prominent role in accurate analysis. Data quality degrades because of poor maintenance and mediocre data generation strategies employed by amateur users. This problem escalates with the advent of big data. In this paper, we propose a quality assessment model for the textual form of unstructured data (TDQA). The context of data plays an important role in determining its quality. Therefore, we automate the process of context extraction in textual data using natural language processing to identify data errors and assess quality. Copyright 2023 Inderscience Enterprises Ltd.
Pages: 451-469
Page count: 18