Named entity recognition for construction documents based on fine-tuning of large language models with low-quality datasets

Cited by: 0
Authors
Zhou, Junyu [1 ]
Ma, Zhiliang [1 ]
Affiliations
[1] Tsinghua Univ, Dept Civil Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Construction documents; Large language model; Named entity recognition; Low-quality datasets;
DOI
10.1016/j.autcon.2025.106151
Chinese Library Classification number
TU [Building Science];
Discipline classification code
0813 ;
Abstract
Named Entity Recognition (NER) is a fundamental task in automatically processing and reusing documents. Traditional methods use machine learning models that rely on costly high-quality datasets. This paper proposes an NER method for construction documents based on fine-tuning Large Language Models (LLMs) with low-quality datasets. First, low-quality datasets of three types (generation, tagging, and question-answering) were semi-automatically generated from national standards, qualification textbooks, and lexicons. Then, they were used to fine-tune an LLM for NER of structural elements in order to determine the optimal fine-tuning parameter settings. Next, the results of the optimally fine-tuned LLM were used to iteratively refine the low-quality dataset and improve performance. The F1 score ultimately reached 0.756. Similar results were obtained for two other types of named entities, indicating that the method generalizes. This paper provides a more effective and efficient method for reusing construction documents. Future research should explore whether other methods can achieve better results.
Pages: 15