Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

Cited by: 45
Authors
Yang, Liu [1 ]
Zhang, Mingyang [1 ]
Li, Cheng [1 ]
Bendersky, Michael [1 ]
Najork, Marc [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
Source
CIKM '20: PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT | 2020
DOI
10.1145/3340531.3411908
Chinese Library Classification
TP [automation technology, computer technology];
Subject Classification Code
0812;
Abstract
Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has largely focused on matching between short texts (e.g., question answering), or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications such as news recommendation, related article recommendation, and document clustering, is relatively less explored and needs more research effort. In recent years, self-attention-based models like Transformers [30] and BERT [6] have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input. We propose a transformer-based hierarchical encoder to capture the document structure information. In order to better capture sentence-level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models, including hierarchical attention [34], multi-depth attention-based hierarchical recurrent neural network [14], and BERT. Compared to BERT-based baselines, our model is able to increase the maximum input text length from 512 to 2048. We will open source a Wikipedia-based benchmark dataset, code, and a pre-trained checkpoint to accelerate future research on long-form document matching.
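To illustrate the two-level hierarchy described in the abstract, the following is a minimal PyTorch sketch (not the authors' released implementation): a first transformer encodes each fixed-length sentence block into a vector, a second transformer encodes the sequence of block vectors into a document vector, and the match score for a Siamese document pair is the cosine similarity of the two document vectors. The class name SiameseHierarchicalEncoder, the layer counts, the dimensions, and the pooling choices (64 blocks of 32 tokens giving a 2048-token input) are illustrative assumptions, not details of the SMITH model or its checkpoint.

    # A minimal sketch of the two-level hierarchical encoding idea, assuming
    # PyTorch. Positional embeddings and pre-training losses are omitted.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SiameseHierarchicalEncoder(nn.Module):
        def __init__(self, vocab_size=30522, d_model=256, block_len=32, n_blocks=64):
            super().__init__()
            self.block_len, self.n_blocks = block_len, n_blocks
            self.embed = nn.Embedding(vocab_size, d_model)
            # Level 1: transformer over the tokens inside each sentence block.
            self.block_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2)
            # Level 2: transformer over the resulting block representations.
            self.doc_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2)

        def encode(self, token_ids):
            # token_ids: (batch, n_blocks * block_len), zero-padded.
            b = token_ids.size(0)
            x = self.embed(token_ids).view(b * self.n_blocks, self.block_len, -1)
            x = self.block_encoder(x).mean(dim=1)      # one vector per block
            x = x.view(b, self.n_blocks, -1)
            x = self.doc_encoder(x).mean(dim=1)        # one vector per document
            return F.normalize(x, dim=-1)

        def forward(self, doc_a, doc_b):
            # Siamese setup: the same encoder embeds both documents; the match
            # score is the cosine similarity of the two document vectors.
            return (self.encode(doc_a) * self.encode(doc_b)).sum(dim=-1)

    # Usage: two document pairs, each document 64 blocks x 32 tokens = 2048 tokens.
    model = SiameseHierarchicalEncoder()
    doc_a = torch.randint(0, 30522, (2, 64 * 32))
    doc_b = torch.randint(0, 30522, (2, 64 * 32))
    print(model(doc_a, doc_b).shape)   # torch.Size([2])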
Pages: 1725 - 1734
Number of pages: 10
Related papers
5 records in total
  • [1] Transformer-based Models for Long-Form Document Matching: Challenges and Empirical Analysis
    Jha, Akshita
    Samavedhi, Adithya
    Rakesh, Vineeth
    Chandrashekar, Jaideep
    Reddy, Chandan K.
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 2345 - 2355
  • [2] Transformer-based Hierarchical Encoder for Document Classification
    Sakhrani, Harsh
    Parekh, Saloni
    Ratadiya, Pratik
    21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS ICDMW 2021, 2021, : 852 - 858
  • [3] TBNF: A Transformer-based Noise Filtering Method for Chinese Long-form Text Matching
    Gan, Ling
    Hu, Liuhui
    Tan, Xiaodong
    Du, Xinrui
    APPLIED INTELLIGENCE, 2023, 53 (19) : 22313 - 22327
  • [4] Open-Domain Long-Form Question–Answering Using Transformer-Based Pipeline
    Dash A.
    Awachar M.
    Patel A.
    Rudra B.
    SN Computer Science, 4 (5)
  • [5] Multi-modal long document classification based on Hierarchical Prompt and Multi-modal Transformer
    Liu, Tengfei
    Hu, Yongli
    Gao, Junbin
    Wang, Jiapu
    Sun, Yanfeng
    Yin, Baocai
    NEURAL NETWORKS, 2024, 176