Neural Architecture Search With a Lightweight Transformer for Text-to-Image Synthesis

Cited by: 50
Authors
Li, Wei [1 ]
Wen, Shiping [2]
Shi, Kaibo [3 ]
Yang, Yin [4 ]
Huang, Tingwen [5 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Sichuan, Peoples R China
[2] Univ Technol Sydney, Fac Engn & Informat Technol, Australian AI Inst, Sydney, NSW 2007, Australia
[3] Chengdu Univ, Sch Informat Sci & Engn, Chengdu 611731, Sichuan, Peoples R China
[4] Hamad Bin Khalifa Univ, Coll Sci & Engn, Doha 5855, Qatar
[5] Texas A&M Univ Qatar, Sci Program, Doha 23874, Qatar
Source
IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING | 2022, Vol. 9, Issue 3
Keywords
Transformers; Task analysis; Computer architecture; Generative adversarial networks; Image synthesis; Search problems; Semantics; Generative adversarial network; neural architecture search; text-to-image synthesis; transformer;
DOI
10.1109/TNSE.2022.3147787
CLC number
T [Industrial Technology];
Subject classification code
08;
Abstract
Although the cross-modal text-to-image synthesis task has achieved great success, most recent works in this field build on network architectures proposed by predecessors, such as StackGAN and AttnGAN. As quality requirements for text-to-image synthesis keep rising, these older, multi-stage architectures built from simple convolution operations are no longer adequate, so a novel text-to-image synthesis network incorporating the latest techniques urgently needs to be explored. To tackle this challenge, we propose a unique architecture for text-to-image synthesis, dubbed T2IGAN, which is discovered automatically by neural architecture search (NAS). In addition, given the remarkable capabilities of the transformer in natural language processing and computer vision, a lightweight transformer is included in our search space to efficiently integrate text features and image features. The effectiveness of the searched T2IGAN is demonstrated by experimental evaluation on standard text-to-image synthesis datasets: it achieves an IS of 5.12 and an FID of 10.48 on CUB-200 Birds, an IS of 4.89 and an FID of 13.55 on Oxford-102 Flowers, and an IS of 31.93 and an FID of 26.45 on COCO. Compared with state-of-the-art works, ours performs better on CUB-200 Birds and Oxford-102 Flowers.
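The abstract describes the central mechanism (a lightweight transformer inside the NAS search space that fuses text and image features) only in prose. Below is a minimal PyTorch sketch of one plausible form of such a cross-modal fusion block, in which flattened image features attend over text token embeddings via cross-attention; it is not the authors' implementation, and every name, shape, and hyperparameter in it (LightweightFusionBlock, embed_dim=256, num_heads=4, and so on) is an illustrative assumption.

```python
# Minimal sketch (not the paper's code) of a lightweight transformer block
# fusing text and image features, as the abstract describes at a high level.
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class LightweightFusionBlock(nn.Module):
    """Image queries attend over text tokens; a small MLP then refines them."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 4, mlp_ratio: int = 2):
        super().__init__()
        self.norm_img = nn.LayerNorm(embed_dim)
        self.norm_txt = nn.LayerNorm(embed_dim)
        # batch_first=True -> inputs are (batch, seq_len, embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
        )

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, H*W, D) flattened image feature map (queries)
        # txt_feats: (B, T, D)   text token embeddings (keys and values)
        txt = self.norm_txt(txt_feats)
        attn_out, _ = self.cross_attn(self.norm_img(img_feats), txt, txt)
        x = img_feats + attn_out             # residual connection
        x = x + self.mlp(self.norm_mlp(x))   # position-wise refinement
        return x


if __name__ == "__main__":
    block = LightweightFusionBlock()
    img = torch.randn(2, 16 * 16, 256)  # e.g. a 16x16 feature map, flattened
    txt = torch.randn(2, 18, 256)       # e.g. 18 word embeddings per caption
    print(block(img, txt).shape)        # torch.Size([2, 256, 256])
```

In a differentiable NAS setting such as DARTS, a block like this would typically serve as one candidate operation in the search space alongside convolutions, with learned architecture weights deciding where, if anywhere, it appears in the generator; the paper's actual search strategy and operation set may differ.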
Pages: 1567-1576 (10 pages)