Jointly Harnessing Prior Structures and Temporal Consistency for Sign Language Video Generation

Cited by: 9

Authors:

Suo, Yucheng ^{[1]}

Zheng, Zhedong ^{[2,3]}

Wang, Xiaohan ^{[1]}

Zhang, Bang ^{[4]}

Yang, Yi ^{[1]}

Affiliations:

[1] Zhejiang Univ, Coll Comp Sci & Technol, 38 Zheda Rd, Hangzhou 310027, Zhejiang, Peoples R China

[2] Univ Macau, Fac Sci & Technol, Taipa Univ Blvd, Macau 999078, Peoples R China

[3] Univ Macau, Inst Collaborat Innovat, Taipa Univ Blvd, Macau 999078, Peoples R China

[4] Alibaba Grp, DAMO Acad, 969 Wenyi West Rd, Hangzhou 311121, Zhejiang, Peoples R China

Source:

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS | 2024 / Vol. 20 / No. 06

Funding:

National Natural Science Foundation of China;

Keywords:

Sign language; motion transfer; video generation; jointly training; HUMAN POSE ESTIMATION;

DOI:

10.1145/3648368

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Sign language provides a way for differently-abled individuals to express their feelings and emotions. However, learning sign language can be challenging and time consuming. An alternative approach is to animate user photos using sign language videos of specific words, which can be achieved using existing image animation methods. However, the finger motions in the generated videos are often not ideal. To address this issue, we propose the Structure-aware Temporal Consistency Network (STCNet), which jointly optimizes the prior structure of humans with temporal consistency to produce sign language videos. We use a fine-grained skeleton detector to acquire knowledge of body structure and introduce both short- and long-term cycle loss to ensure the continuity of the generated video. The two losses and keypoint detector network are optimized in an end-to-end manner. Quantitative and qualitative evaluations on three widely used datasets, namely LSA64, Phoenix-2014T, and WLASL-2000, demonstrate the effectiveness of the proposed method. It is our hope that this work can contribute to future studies on sign language production.

Pages: 18

