Document Relevance Filtering by Natural Language Processing and Machine Learning: A Multidisciplinary Case Study of Patents

Times cited: 0
Authors
Bridgelall, Raj [1]
Affiliations
[1] North Dakota State Univ, Coll Business, Dept Transportat & Supply Chain, POB 6050, Fargo, ND 58108 USA
Source
APPLIED SCIENCES-BASEL | 2025 / Vol. 15 / Issue 05
Keywords
document search; supervised machine learning; unsupervised machine learning; natural language processing; latent Dirichlet allocation; non-negative matrix factorization; manifold learning; t-distributed stochastic neighbor embedding; term co-occurrence networks; RETRIEVAL
DOI
10.3390/app15052357
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
The exponential growth of patent datasets poses a significant challenge in filtering relevant documents for research and innovation. Traditional semantic search methods based on keywords often fail to capture the complexity and variability in multidisciplinary terminology, leading to inefficiencies. This study addresses the problem by systematically evaluating supervised and unsupervised machine learning (ML) techniques for document relevance filtering across five technology domains: solid-state batteries, electric vehicle chargers, connected vehicles, electric vertical takeoff and landing aircraft, and light detection and ranging (LiDAR) sensors. The contributions include benchmarking the performance of 10 classical models, including extreme gradient boosting, random forest, and support vector machines; a deep artificial neural network; and three natural language processing methods: latent Dirichlet allocation, non-negative matrix factorization, and k-means clustering of a manifold-learned reduced feature dimension. Applying these methods to more than 4200 patents filtered from a database of 9.6 million patents revealed that most supervised ML models outperform the unsupervised methods. An average of seven supervised ML models achieved significantly higher precision, recall, and F1-scores across all technology domains, while unsupervised methods showed variability depending on domain characteristics. These results offer a practical framework for optimizing document relevance filtering, enabling researchers and practitioners to efficiently manage large datasets and enhance innovation.
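The abstract compares models by precision, recall, and F1-score. As a minimal illustration of how those three metrics are computed for a binary relevance filter, the sketch below uses made-up labels for eight documents; the function name `precision_recall_f1` and the sample data are assumptions for illustration only and do not come from the paper.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary relevance labels (1 = relevant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant and kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # irrelevant but kept
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant but dropped
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical ground-truth and predicted relevance for eight patents.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these sample labels there are three true positives, one false positive, and one false negative, so all three metrics equal 0.75.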
Pages: 25