Improving Handwritten Mathematical Expression Recognition via Integrating Convolutional Neural Network With Transformer and Diffusion-Based Data Augmentation

Cited by: 0

Authors:

Zhang, Yibo [1]

Li, Gaoxu [2]

Affiliations:

[1] Beijing Jiaotong Univ, Sch Phys Sci & Engn, Beijing 100044, Peoples R China

[2] Xian Jiaotong Liverpool Univ, Sch Adv Technol, Suzhou 215123, Jiangsu, Peoples R China

Source:

IEEE Access | 2024 / Vol. 12

Funding:

National Natural Science Foundation of China;

Keywords:

CNN; data augmentation; denoising diffusion probabilistic model; DDPM; handwritten mathematical expression recognition; HMER; Transformer;

DOI:

10.1109/ACCESS.2024.3399919

Chinese Library Classification:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Handwritten mathematical expression recognition (HMER) poses a formidable challenge due to the intricate two-dimensional structures and diverse handwriting styles. This paper introduces a novel approach to improve HMER accuracy by employing an integrated, high-capacity architecture that combines Transformer and Convolutional Neural Network (CNN) models, along with a denoising diffusion probabilistic model (DDPM)-based data augmentation technique. We explore three combination strategies for an attention-based encoder-decoder (AED) HMER model: 1) The "Tandem" strategy, which harnesses CNN features within a Transformer encoder to capture global interdependencies; 2) The "Parallel" strategy, which integrates Transformer encoder outputs with CNN outputs to achieve comprehensive feature fusion; 3) The "Mixing" strategy, which introduces multi-head self-attention (MHSA) at the final stage of the CNN. We evaluate these methods using the CROHME benchmark dataset and conduct a detailed comparative analysis. All three approaches significantly enhance model performance. Notably, the "Tandem" approach achieves expression recognition rates (ExpRate) of 54.85% and 58.56% on the CROHME 2016 and 2019 test sets, respectively, while the "Parallel" method attains 55.63% and 57.39% on the same test sets. Furthermore, we introduce an innovative data augmentation approach that utilizes DDPM to generate synthetic training samples. The DDPM, conditioned on LaTeX-rendered images, bridges the gap between printed and handwritten expressions, enabling the creation of realistic, stylistically diverse handwriting samples. This augmentation boosts the ExpRates of all strategies on both CROHME 2016 and 2019 test sets, yielding improvements of 1.6-4.6% relative to the unaugmented dataset.
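The ExpRate figures quoted in the abstract are usually understood as the percentage of test expressions that are recognized entirely correctly. The sketch below shows that computation under simplifying assumptions: the function name `exp_rate` and the toy data are hypothetical, and comparing whitespace-tokenized LaTeX strings is an assumption for illustration, not necessarily the matching procedure the authors or the CROHME evaluation tools use.

```python
def exp_rate(predictions, references):
    """Percentage of expressions whose predicted tokens exactly match the reference.

    Assumption: both inputs are lists of LaTeX strings with tokens separated
    by whitespace. This is an illustrative stand-in, not the official scorer.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one expression is required")
    # Compare token lists so that differences in spacing are not counted as errors.
    correct = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Made-up toy data: the second prediction has one wrong symbol.
predictions = ["x ^ { 2 } + 1", "\\frac { a } { b }", "a + b"]
references = ["x ^ { 2 } + 1", "\\frac { a } { c }", "a  +  b"]

score = exp_rate(predictions, references)
print(f"ExpRate: {score:.2f}%")
```

Because the metric is all-or-nothing per expression, a single wrong symbol makes the whole expression count as an error, which is one reason reported ExpRates on long expressions tend to be low.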

Citation

Pages: 67945-67956

Page count: 12

References: 53 in total

