Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

Cited by: 79
Authors
Powalski, Rafal [1]
Borchmann, Lukasz [1,2]
Jurkiewicz, Dawid [1,3]
Dwojak, Tomasz [1,3]
Pietruszka, Michal [1,4]
Palka, Gabriela [1,3]
Affiliations
[1] Applica.ai, Warsaw, Poland
[2] Poznan University of Technology, Poznan, Poland
[3] Adam Mickiewicz University, Poznan, Poland
[4] Jagiellonian University, Krakow, Poland
Source
DOCUMENT ANALYSIS AND RECOGNITION - ICDAR 2021, PT II | 2021 / Vol. 12822
Keywords
Natural Language Processing; Transfer learning; Document understanding; Layout analysis; Deep learning; Transformer
DOI
10.1007/978-3-030-86331-9_47
CLC classification
TP [automation technology, computer technology]
Discipline classification code
0812
Abstract
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
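The abstract's central architectural idea, representing layout as an attention bias on top of a pretrained encoder-decoder Transformer, can be illustrated with a short sketch. The PyTorch snippet below is illustrative only and not the authors' implementation: the class and parameter names (Spatial2DBias, num_buckets, max_distance) and the exact bucketing scheme are assumptions. It follows the paper's described recipe of extending a T5-style relative attention bias with learned horizontal and vertical terms computed from pairwise distances between token bounding boxes.

```python
# Illustrative sketch only: layout as a learned 2D attention bias,
# in the spirit of TILT's description. Names and bucketing details
# are assumptions, not the paper's code.
import math
import torch
import torch.nn as nn

class Spatial2DBias(nn.Module):
    """Learned 2D relative-position bias added to self-attention logits."""

    def __init__(self, num_heads: int, num_buckets: int = 32, max_distance: int = 1000):
        super().__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        # Separate bias tables for horizontal and vertical offsets,
        # one scalar per attention head per distance bucket.
        self.h_bias = nn.Embedding(num_buckets, num_heads)
        self.v_bias = nn.Embedding(num_buckets, num_heads)

    def _bucket(self, rel: torch.Tensor) -> torch.Tensor:
        # Symmetric log-scale bucketing of signed offsets (T5-style):
        # the sign picks the half of the table, the magnitude the bucket.
        half = self.num_buckets // 2
        sign = (rel > 0).long() * half
        mag = rel.abs().clamp(min=1).float()
        idx = (torch.log(mag) / math.log(self.max_distance) * (half - 1)).long()
        return sign + idx.clamp(max=half - 1)

    def forward(self, centers: torch.Tensor) -> torch.Tensor:
        # centers: (seq_len, 2) token bounding-box centres (x, y) in page units.
        dx = centers[None, :, 0] - centers[:, None, 0]  # (L, L) horizontal offsets
        dy = centers[None, :, 1] - centers[:, None, 1]  # (L, L) vertical offsets
        bias = self.h_bias(self._bucket(dx)) + self.v_bias(self._bucket(dy))
        return bias.permute(2, 0, 1)  # (num_heads, L, L)

# Example: bias for a 4-token page fragment with simple box centres.
bias = Spatial2DBias(num_heads=8)
centers = torch.tensor([[10.0, 20.0], [120.0, 20.0], [10.0, 300.0], [400.0, 310.0]])
logits_bias = bias(centers)  # added to attention logits before the softmax
```

In a T5-style layer, this (num_heads, L, L) tensor would simply be added to the query-key logits alongside the standard 1D sequential bias, so tokens that are spatially close on the page can attend to each other more strongly regardless of reading order.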
Pages: 732-747
Page count: 16