Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

Cited by: 45
Authors
Yang, Liu [1 ]
Zhang, Mingyang [1 ]
Li, Cheng [1 ]
Bendersky, Michael [1 ]
Najork, Marc [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
Source
CIKM '20: PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT | 2020
DOI
10.1145/3340531.3411908
Chinese Library Classification
TP [automation technology, computer technology];
Subject Classification Code
0812;
Abstract
Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has largely focused on matching between short texts (e.g., question answering), or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications such as news recommendation, related article recommendation, and document clustering, is relatively less explored and needs more research effort. In recent years, self-attention-based models like Transformers [30] and BERT [6] have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input. We propose a transformer-based hierarchical encoder to capture the document structure information. In order to better capture sentence-level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models, including hierarchical attention [34], multi-depth attention-based hierarchical recurrent neural network [14], and BERT. Compared to BERT-based baselines, our model is able to increase the maximum input text length from 512 to 2048. We will open source a Wikipedia-based benchmark dataset, code, and a pre-trained checkpoint to accelerate future research on long-form document matching.
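To illustrate the two-level hierarchy described in the abstract, the following is a minimal PyTorch sketch (not the authors' released implementation): a first transformer encodes each fixed-length sentence block into a vector, a second transformer encodes the sequence of block vectors into a document vector, and the match score for a Siamese document pair is the cosine similarity of the two document vectors. The class name SiameseHierarchicalEncoder, the layer counts, the dimensions, and the pooling choices (64 blocks of 32 tokens giving a 2048-token input) are illustrative assumptions, not details of the SMITH model or its checkpoint.

    # A minimal sketch of the two-level hierarchical encoding idea, assuming
    # PyTorch. Positional embeddings and pre-training losses are omitted.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SiameseHierarchicalEncoder(nn.Module):
        def __init__(self, vocab_size=30522, d_model=256, block_len=32, n_blocks=64):
            super().__init__()
            self.block_len, self.n_blocks = block_len, n_blocks
            self.embed = nn.Embedding(vocab_size, d_model)
            # Level 1: transformer over the tokens inside each sentence block.
            self.block_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2)
            # Level 2: transformer over the resulting block representations.
            self.doc_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2)

        def encode(self, token_ids):
            # token_ids: (batch, n_blocks * block_len), zero-padded.
            b = token_ids.size(0)
            x = self.embed(token_ids).view(b * self.n_blocks, self.block_len, -1)
            x = self.block_encoder(x).mean(dim=1)      # one vector per block
            x = x.view(b, self.n_blocks, -1)
            x = self.doc_encoder(x).mean(dim=1)        # one vector per document
            return F.normalize(x, dim=-1)

        def forward(self, doc_a, doc_b):
            # Siamese setup: the same encoder embeds both documents; the match
            # score is the cosine similarity of the two document vectors.
            return (self.encode(doc_a) * self.encode(doc_b)).sum(dim=-1)

    # Usage: two document pairs, each document 64 blocks x 32 tokens = 2048 tokens.
    model = SiameseHierarchicalEncoder()
    doc_a = torch.randint(0, 30522, (2, 64 * 32))
    doc_b = torch.randint(0, 30522, (2, 64 * 32))
    print(model(doc_a, doc_b).shape)   # torch.Size([2])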
Pages: 1725 - 1734
Number of pages: 10
Related papers
5 records in total
  • [1] Transformer-based Models for Long-Form Document Matching: Challenges and Empirical Analysis
    Jha, Akshita
    Samavedhi, Adithya
    Rakesh, Vineeth
    Chandrashekar, Jaideep
    Reddy, Chandan K.
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 2345 - 2355
  • [2] Transformer-based Hierarchical Encoder for Document Classification
    Sakhrani, Harsh
    Parekh, Saloni
    Ratadiya, Pratik
    21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS ICDMW 2021, 2021, : 852 - 858
  • [3] TBNF: A Transformer-based Noise Filtering Method for Chinese Long-form Text Matching
    Gan, Ling
    Hu, Liuhui
    Tan, Xiaodong
    Du, Xinrui
    APPLIED INTELLIGENCE, 2023, 53 (19) : 22313 - 22327
  • [4] Open-Domain Long-Form Question–Answering Using Transformer-Based Pipeline
    Dash A.
    Awachar M.
    Patel A.
    Rudra B.
    SN Computer Science, 4 (5)
  • [5] Multi-modal long document classification based on Hierarchical Prompt and Multi-modal Transformer
    Liu, Tengfei
    Hu, Yongli
    Gao, Junbin
    Wang, Jiapu
    Sun, Yanfeng
    Yin, Baocai
    NEURAL NETWORKS, 2024, 176