LABT: A Sequence-to-Sequence Model for Mongolian Handwritten Text Recognition with Local Aggregation BiLSTM and Transformer

Cited by: 0

Authors:

Li, Yu [1,2,3]

Wei, Hongxi [1,2,3]

Sun, Shiwen [1,2,3]

Affiliations:

[1] Inner Mongolia Univ, Sch Comp Sci, Hohhot 010010, Peoples R China

[2] Prov Key Lab Mongolian Informat Proc Technol, Hohhot 010010, Peoples R China

[3] Natl & Local Joint Engn Res Ctr Mongolian Informa, Hohhot 010010, Peoples R China

Source:

DOCUMENT ANALYSIS AND RECOGNITION-ICDAR 2024, PT II | 2024 / Vol. 14805

Keywords:

Mongolian handwritten text recognition; BiLSTM; Local aggregation; Transformer; ATTENTION;

DOI:

10.1007/978-3-031-70536-6_21

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Mongolian handwritten text recognition is challenging because of the unique characteristics of Mongolian script, its large vocabulary, and the presence of out-of-vocabulary (OOV) words. This paper proposes a model that uses a local aggregation BiLSTM for sequence modeling of visual features and a Transformer for word prediction. Specifically, we introduce a local aggregation operation into the BiLSTM (Bidirectional Long Short-Term Memory) to improve contextual understanding by aggregating adjacent information at each time step. The improved BiLSTM is able to capture context dependencies and the letter shape changes that occur in different contexts. It addresses the difficulty of accurately identifying variable letters and of generating OOV words without relying on words predefined during training. The contextual features extracted by the BiLSTM are passed through multiple layers of the Transformer's encoder and decoder. At each layer, the representations of the previous layer are accessible, allowing the layered representations to be refined and improved. By using hierarchical representations, accurate predictions can be made even in large-vocabulary text recognition tasks. Our proposed model achieves state-of-the-art performance on two commonly used Mongolian handwritten text recognition datasets.
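
The abstract describes local aggregation only as "aggregating adjacent information at each time step" and does not give the exact operation. As a rough illustration of the idea, the sketch below assumes a plain mean over a symmetric window of neighbouring time steps; the function name `local_aggregate`, the `radius` parameter, and the choice of averaging are all hypothetical and are not taken from the paper. In the model described, features treated this way would feed the BiLSTM, which is not shown here.

```python
from typing import List

Vector = List[float]


def local_aggregate(seq: List[Vector], radius: int = 1) -> List[Vector]:
    """Average each time step with its neighbours within `radius` steps.

    Assumption: mean pooling over a symmetric window. Windows are
    truncated at the sequence boundaries rather than padded.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    out: List[Vector] = []
    for t in range(len(seq)):
        lo = max(0, t - radius)
        hi = min(len(seq), t + radius + 1)
        window = seq[lo:hi]
        dim = len(seq[t])
        # Component-wise mean over the time steps in the window.
        out.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return out


# Toy sequence: 4 time steps, 2-dimensional features per step.
features = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
aggregated = local_aggregate(features, radius=1)
```

With `radius=1`, each interior step is averaged with both of its neighbours, while the first and last steps see only one neighbour, so every output vector mixes in information from adjacent positions.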


Pages: 352-363

Page count: 12
