Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

Cited by: 79
Authors
Powalski, Rafal [1]
Borchmann, Lukasz [1,2]
Jurkiewicz, Dawid [1,3]
Dwojak, Tomasz [1,3]
Pietruszka, Michal [1,4]
Palka, Gabriela [1,3]
Affiliations
[1] Applica.ai, Warsaw, Poland
[2] Poznan University of Technology, Poznan, Poland
[3] Adam Mickiewicz University, Poznan, Poland
[4] Jagiellonian University, Krakow, Poland
Source
DOCUMENT ANALYSIS AND RECOGNITION - ICDAR 2021, PT II | 2021 / Vol. 12822
Keywords
Natural Language Processing; Transfer learning; Document understanding; Layout analysis; Deep learning; Transformer
DOI
10.1007/978-3-030-86331-9_47
CLC classification
TP [automation technology, computer technology]
Discipline classification code
0812
Abstract
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
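The abstract's central architectural idea, representing layout as an attention bias on top of a pretrained encoder-decoder Transformer, can be illustrated with a short sketch. The PyTorch snippet below is illustrative only and not the authors' implementation: the class and parameter names (Spatial2DBias, num_buckets, max_distance) and the exact bucketing scheme are assumptions. It follows the paper's described recipe of extending a T5-style relative attention bias with learned horizontal and vertical terms computed from pairwise distances between token bounding boxes.

```python
# Illustrative sketch only: layout as a learned 2D attention bias,
# in the spirit of TILT's description. Names and bucketing details
# are assumptions, not the paper's code.
import math
import torch
import torch.nn as nn

class Spatial2DBias(nn.Module):
    """Learned 2D relative-position bias added to self-attention logits."""

    def __init__(self, num_heads: int, num_buckets: int = 32, max_distance: int = 1000):
        super().__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        # Separate bias tables for horizontal and vertical offsets,
        # one scalar per attention head per distance bucket.
        self.h_bias = nn.Embedding(num_buckets, num_heads)
        self.v_bias = nn.Embedding(num_buckets, num_heads)

    def _bucket(self, rel: torch.Tensor) -> torch.Tensor:
        # Symmetric log-scale bucketing of signed offsets (T5-style):
        # the sign picks the half of the table, the magnitude the bucket.
        half = self.num_buckets // 2
        sign = (rel > 0).long() * half
        mag = rel.abs().clamp(min=1).float()
        idx = (torch.log(mag) / math.log(self.max_distance) * (half - 1)).long()
        return sign + idx.clamp(max=half - 1)

    def forward(self, centers: torch.Tensor) -> torch.Tensor:
        # centers: (seq_len, 2) token bounding-box centres (x, y) in page units.
        dx = centers[None, :, 0] - centers[:, None, 0]  # (L, L) horizontal offsets
        dy = centers[None, :, 1] - centers[:, None, 1]  # (L, L) vertical offsets
        bias = self.h_bias(self._bucket(dx)) + self.v_bias(self._bucket(dy))
        return bias.permute(2, 0, 1)  # (num_heads, L, L)

# Example: bias for a 4-token page fragment with simple box centres.
bias = Spatial2DBias(num_heads=8)
centers = torch.tensor([[10.0, 20.0], [120.0, 20.0], [10.0, 300.0], [400.0, 310.0]])
logits_bias = bias(centers)  # added to attention logits before the softmax
```

In a T5-style layer, this (num_heads, L, L) tensor would simply be added to the query-key logits alongside the standard 1D sequential bias, so tokens that are spatially close on the page can attend to each other more strongly regardless of reading order.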
Pages: 732-747
Page count: 16