A Lightweight Enhancement Approach for Real-Time Semantic Segmentation by Distilling Rich Knowledge from Pre-Trained Vision-Language Model

Cited by: 0

Authors:

Lin, Chia-Yi [1]

Chen, Jun-Cheng [2]

Wu, Ja-Ling [1,3]

Affiliations:

[1] Natl Taiwan Univ, Dept Comp Sci & Informat Engn, Taipei, Taiwan

[2] Acad Sinica, Res Ctr Informat Technol Innovat, Taipei, Taiwan

[3] Natl Taiwan Univ, Grad Inst Networking & Multimedia, Taipei, Taiwan

Source:

APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING | 2024 / Vol. 13 / Issue 05

Keywords:

CLIP; real-time; semantic segmentation; vision-language pre-training;

DOI:

10.1561/116.20240015

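Chinese Library Classification (CLC):

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification code: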

0808 ; 0809 ;

Abstract:

In this work, we propose a lightweight approach to enhance real-time semantic segmentation by leveraging pre-trained vision-language models, specifically utilizing the text encoder of Contrastive Language-Image Pretraining (CLIP) to generate rich semantic embeddings for text labels. Our method then distills this textual knowledge into the segmentation model, integrating the image and text embeddings to align visual and textual information. Additionally, we implement learnable prompt embeddings for better class-specific semantic comprehension. We propose a two-stage training strategy for efficient learning: the segmentation backbone initially learns from fixed text embeddings and subsequently optimizes prompt embeddings to streamline the learning process. Extensive evaluations and ablation studies validate our approach's ability to effectively improve the semantic segmentation model's performance over the compared methods.

Pages: 26

10 items in total

[1] CLIPose: Category-Level Object Pose Estimation With Pre-Trained Vision-Language Knowledge
Lin, Xiao
Zhu, Minghao
Dang, Ronghao
Zhou, Guangliang
Shu, Shaolong
Lin, Feng
Liu, Chengju
Chen, Qijun
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (10) : 9125 - 9138
[2] X2-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks
Zeng, Yan
Zhang, Xinsong
Li, Hang
Wang, Jiawei
Zhang, Jipeng
Zhou, Wangchunshu
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (05) : 3156 - 3168
[3] CLIP4STR: A Simple Baseline for Scene Text Recognition With Pre-Trained Vision-Language Model
Zhao, Shuai
Quan, Ruijie
Zhu, Linchao
Yang, Yi
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 6893 - 6904
[4] Fine-tuning a pre-trained Convolutional Neural Network Model to translate American Sign Language in Real-time
Cayamcela, Manuel Eugenio Morocho
Lim, Wansu
2019 INTERNATIONAL CONFERENCE ON COMPUTING, NETWORKING AND COMMUNICATIONS (ICNC), 2019, : 100 - 104
[5] Real-time pavement surface crack detection based on lightweight semantic segmentation model
Yu, Huayang
Deng, Yihao
Guo, Feng
TRANSPORTATION GEOTECHNICS, 2024, 48
[6] Improved Real-Time Semantic Segmentation Network Model for Crop Vision Navigation Line Detection
Cao, Maoyong
Tang, Fangfang
Ji, Peng
Ma, Fengying
FRONTIERS IN PLANT SCIENCE, 2022, 13
[7] Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
Cheng, Kanzhi
Song, Wenpo
Ma, Zheng
Zhu, Wenhao
Zhu, Zixuan
Zhang, Jianbing
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5038 - 5047
[8] Improved real-time semantic segmentation network model for crop vision navigation line detection (vol 13, 898131, 2022)
Cao, Maoyong
Tang, Fangfang
Ji, Peng
Ma, Fengying
FRONTIERS IN PLANT SCIENCE, 2023, 14
[9] Real-time monitoring of weld surface morphology with lightweight semantic segmentation model improved by attention mechanism during laser keyhole welding
Cai, Wang
Shu, LeShi
Geng, ShaoNing
Zhou, Qi
Cao, LongChao
OPTICS AND LASER TECHNOLOGY, 2024, 174
[10] Development and evaluation of a deep learning model for real-time ground vehicle semantic segmentation from UAV-based thermal infrared imagery
Masouleh, Mehdi Khoshboresh
Shah-Hosseini, Reza
ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, 2019, 155 : 172 - 186