Context-aware automated quality assessment of textual data

Cited by: 0
Authors
Mylavarapu G. [1]
Viswanathan K.A. [2]
Thomas J. [2]
Affiliations
[1] Department of Computer Science and Information Systems, Murray State University, Murray
[2] Department of Computer Science, Oklahoma State University, Stillwater, OK
Keywords
automated data quality assessment; context-aware; data accuracy; data consistency; data context; Doc2Vec; lexicon; sentiment analysis; textual data
DOI
10.1504/IJBIDM.2023.130588
Abstract
Data analysis is a crucial process in data science that extracts useful information from data in any form. With the rapid growth of technology, unstructured data such as text and images are being produced in ever larger amounts. Apart from the analytical techniques used, the quality of the data plays a prominent role in accurate analysis. Data quality degrades because of poor maintenance and mediocre data generation strategies employed by amateur users, and the problem escalates with the advent of big data. In this paper, we propose a quality assessment model for the textual form of unstructured data (TDQA). The context of data plays an important role in determining its quality. Therefore, we automate the process of context extraction in textual data using natural language processing to identify data errors and assess quality. Copyright 2023 Inderscience Enterprises Ltd.
Pages: 451-469
Number of pages: 18