TBNF: A Transformer-based Noise Filtering Method for Chinese Long-form Text Matching

Cited by: 2
Authors
Gan, Ling [1 ]
Hu, Liuhui [2 ]
Tan, Xiaodong [1 ]
Du, Xinrui [1 ]
Affiliations
[1] Chongqing Univ Posts & Telecommun, Sch Comp, Chongqing 400065, Peoples R China
[2] Chongqing Univ Posts & Telecommun, Sch Software Engn, Chongqing, Peoples R China
Keywords
Long text matching; Noise filtering; Transformer; PageRank
DOI
10.1007/s10489-023-04607-3
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In deep text matching, the large amount of noise in Chinese long-form documents degrades matching performance. Most long-form text matching models consume all of the text indiscriminately and therefore ingest substantial noise; this paper combines the PageRank algorithm with the Transformer to filter that noise out. For sentence-level noise detection, word-overlap rates between sentences are computed as a similarity measure, a sentence-level relationship graph is built from them, and PageRank is run on the graph to filter out low-scoring sentences. For word-level noise detection, a word graph is constructed from the attention scores in the Transformer, and PageRank is executed on this graph, combined with the self-attention weights, to select keywords that highlight topic relevance; noisy words are then filtered out layer by layer across the Transformer layers. In addition, PolyLoss replaces the traditional binary cross-entropy loss during training, reducing the difficulty of hyperparameter tuning. Finally, an improved filtering strategy is proposed and verified experimentally on two Chinese long-form text matching datasets. The results show that a matching model based on the proposed noise filtering strategy filters noise more effectively and captures the matching signal more accurately.
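The graph-based filtering step described above can be made concrete with a short sketch. The following Python snippet is illustrative only: the function names, the keep_ratio parameter, the 0.85 damping factor, and the use of networkx are assumptions, not details from the paper. It builds a sentence graph weighted by word-overlap rate and keeps the highest-PageRank sentences; the word-level step would be analogous, with Transformer attention scores supplying the edge weights instead of overlap rates.

```python
# Hedged sketch of sentence-level noise filtering via word overlap + PageRank.
# All names and thresholds are hypothetical; only the overall recipe
# (overlap-rate similarity -> sentence graph -> PageRank -> keep top sentences)
# follows the abstract.
import networkx as nx

def overlap_rate(a: set, b: set) -> float:
    """Word-overlap rate between two tokenized sentences."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def filter_sentences(sentences: list[list[str]], keep_ratio: float = 0.7) -> list[list[str]]:
    """Keep the top-scoring sentences by PageRank over the overlap graph."""
    token_sets = [set(s) for s in sentences]
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = overlap_rate(token_sets[i], token_sets[j])
            if w > 0:
                g.add_edge(i, j, weight=w)
    scores = nx.pagerank(g, alpha=0.85, weight="weight")
    k = max(1, int(len(sentences) * keep_ratio))
    kept = sorted(sorted(scores, key=scores.get, reverse=True)[:k])
    return [sentences[i] for i in kept]  # preserve original sentence order
```

Keeping the survivors in their original order (the final sort) matters for the downstream Transformer, since sentence position still carries signal after filtering.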
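The loss substitution is also easy to sketch. PolyLoss (Leng et al., 2022) expands cross-entropy in powers of (1 - p_t); its simplest Poly-1 variant adds a single epsilon-weighted correction term, which is why the abstract claims easier hyperparameter tuning. The snippet below is a minimal sketch assuming a PyTorch setup; the function name and the epsilon default are illustrative, not taken from the paper.

```python
# Hedged sketch of the Poly-1 binary loss: BCE + epsilon * (1 - p_t),
# where p_t is the probability assigned to the true class.
import torch
import torch.nn.functional as F

def poly1_bce(logits: torch.Tensor, targets: torch.Tensor, epsilon: float = 1.0) -> torch.Tensor:
    # Standard binary cross-entropy, kept per-example so the Poly-1
    # correction can be added element-wise before averaging.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)  # prob. of the true class
    return (bce + epsilon * (1 - p_t)).mean()

# Usage: two match/no-match predictions (hypothetical values)
logits = torch.tensor([2.0, -1.0])
targets = torch.tensor([1.0, 0.0])
loss = poly1_bce(logits, targets)
```

With epsilon = 0 this reduces exactly to binary cross-entropy, so the single scalar epsilon is the only extra knob to tune.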
Pages: 22313-22327
Number of pages: 15