Named entity recognition for construction documents based on fine-tuning of large language models with low-quality datasets

Cited by: 0

Authors:

Zhou, Junyu [1]

Ma, Zhiliang [1]

Affiliations:

[1] Tsinghua Univ, Dept Civil Engn, Beijing, Peoples R China

Source:

AUTOMATION IN CONSTRUCTION | 2025 / Vol. 174

Funding:

National Natural Science Foundation of China;

Keywords:

Construction documents; Large language model; Named entity recognition; Low-quality datasets;

DOI:

10.1016/j.autcon.2025.106151

Chinese Library Classification number:

TU [Building Science];

Discipline classification code:

0813;

Abstract:

Named Entity Recognition (NER) is a fundamental task for automatically processing and reusing documents. Traditional methods apply machine learning that relies on costly high-quality datasets. This paper proposes an NER method for construction documents based on fine-tuning Large Language Models (LLMs) with low-quality datasets. First, low-quality datasets of three types (generation, tagging, and question-answering) were semi-automatically generated from national standards, qualification textbooks, and lexicons. Then, they were used to fine-tune an LLM for NER of structural elements to obtain the optimal parametric fine-tuning conditions. Next, the results of the optimally fine-tuned LLM were used to iterate on the low-quality dataset and improve performance. The F1 score finally reached 0.756. Similar results were obtained for two other types of named entities, illustrating the method's generalizability. This paper provides a more effective and efficient method for reusing construction documents. Future research should explore how to achieve better results by using other methods.
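
For readers unfamiliar with the reported metric, the following is a minimal sketch of how an entity-level F1 score can be computed. It assumes exact-match comparison between predicted and gold entity sets, which the abstract does not specify; the function name `entity_f1` and the example entities are hypothetical and are not taken from the paper.

```python
def entity_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of entity strings.

    Assumption: an entity counts as correct only on exact string match.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical structural-element entities for one sentence.
gold = {"beam", "column", "shear wall", "slab"}
predicted = {"beam", "column", "foundation"}
precision, recall, f1 = entity_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this convention, a score such as the reported 0.756 reflects the balance between spurious and missed entities rather than either alone.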


Pages: 15
