Exploring AI-driven approaches for unstructured document analysis and future horizons

Cited by: 4
Authors
Mahadevkar, Supriya V. [1]
Patil, Shruti [2]
Kotecha, Ketan [2]
Soong, Lim Way [3]
Choudhury, Tanupriya [4, 5]
Affiliations
[1] Symbiosis Int, Symbiosis Inst Technol, Pune 412115, India
[2] Symbiosis Int, Symbiosis Inst Technol, Symbiosis Ctr Appl Artificial Intelligence, Pune 412115, India
[3] Multimedia Univ, Fac Engn & Technol, Cyberjaya, Malaysia
[4] Univ Petr & Energy Studies UPES, Sch Comp Sci, Dehra Dun 248002, Uttarakhand, India
[5] Symbiosis Int, Symbiosis Inst Technol, CSE Dept, Pune 412115, Maharashtra, India
Keywords
Artificial intelligence; Unstructured document processing; Printed and handwritten text recognition; Information extraction; Optical character recognition; Semantic segmentation; Robotics process automation; Named entity recognition; Large Language models; CHARACTER-RECOGNITION; PRINTED TEXT; HANDWRITTEN; EXTRACTION; SEGMENTATION; WORDS
DOI
10.1186/s40537-024-00948-z
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Subject classification code
081202
Abstract
In the current industrial landscape, a significant number of sectors are grappling with the challenges posed by unstructured data, which incurs financial losses amounting to millions annually. If harnessed effectively, this data has the potential to substantially boost operational efficiency. Traditional methods for extracting information have their limitations; however, solutions powered by artificial intelligence (AI) could provide a more fitting alternative. There is an evident gap in scholarly research concerning a comprehensive evaluation of AI-driven techniques for the extraction of information from unstructured content. This systematic literature review aims to identify, assess, and deliberate on prospective research directions within the field of unstructured document information extraction. It has been observed that prevailing extraction methods primarily depend on static patterns or rules, often proving inadequate when faced with complex document structures typically encountered in real-world scenarios, such as medical records. Datasets currently available to the public suffer from low quality and are tailored for specific tasks only. This underscores an urgent need for developing new datasets that accurately reflect complex issues encountered in practical settings. The review reveals that AI-based techniques show promise in autonomously extracting information from diverse unstructured documents, encompassing both printed and handwritten text. Challenges arise, however, when dealing with varied document layouts. Proposing a framework through hybrid AI-based approaches, this review envisions processing a high-quality dataset for automatic information extraction from unstructured documents. Additionally, it emphasizes the importance of collaborative efforts between organizations and researchers to address the diverse challenges associated with unstructured data analysis.
Pages: 54