Open-world driving scene segmentation via multi-stage and multi-modality fusion of vision-language embedding

Cited by: 0
Authors
Niu, Yingjie [1 ]
Ding, Ming [2 ]
Zhang, Yuxiao [1 ]
Ge, Maoning [1 ]
Yang, Hanting [1 ]
Takeda, Kazuya [1 ]
Affiliations
[1] Nagoya Univ, Grad Sch Informat, Nagoya, Aichi, Japan
[2] Nagoya Univ, Inst Innovat Future Soc, Nagoya, Aichi, Japan
Keywords
Open-world segmentation; driving scene; pixel-text alignment; multi-stage multi-modality fusion;
DOI
10.1109/IV55152.2023.10186652
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this study, a pixel-text level, multi-stage, multi-modality fusion segmentation method is proposed to make open-world driving scene segmentation more efficient. It can serve the varied semantic perception needs of autonomous driving in real-world driving situations. The method can finely segment unseen labels without additional corresponding semantic segmentation annotations, using only existing semantic segmentation data. The proposed method consists of four modules. A visual representation embedding module and a segmentation command embedding module extract the driving scene features and the segmentation category command, respectively. A multi-stage multi-modality fusion module fuses the driving scene visual information with the segmentation command text information at the pixel-text level across different feature scales. Finally, a cascade segmentation head grounds the segmentation command text in the driving scene, encouraging the model to generate high-quality semantic segmentation results. In the experiments, we first verify the effectiveness of the method for zero-shot segmentation on a popular driving scene segmentation dataset. We also confirm its effectiveness on synonym unseen labels and hierarchy unseen labels for open-world semantic segmentation.
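The abstract outlines pixel-text fusion across multiple feature scales followed by a cascade segmentation head. The paper itself provides no code, so the following is only a minimal PyTorch-style sketch of that idea: `PixelTextFusion`, `CascadeSegHead`, and the toy dimensions are hypothetical names and sizes chosen for illustration, not the authors' implementation.

```python
# Hypothetical sketch of pixel-text fusion + cascade head (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelTextFusion(nn.Module):
    """One fusion stage: project visual and text features into a shared
    space and compute a per-pixel, per-command cosine-similarity map."""

    def __init__(self, vis_dim: int, txt_dim: int, embed_dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, embed_dim, kernel_size=1)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, vis_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, H, W) image features at one stage
        # txt_feat: (B, K, D) embeddings of K segmentation-command texts
        v = F.normalize(self.vis_proj(vis_feat), dim=1)     # (B, E, H, W)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)    # (B, K, E)
        # Pixel-text alignment: similarity of every pixel to every command.
        return torch.einsum("behw,bke->bkhw", v, t)         # (B, K, H, W)


class CascadeSegHead(nn.Module):
    """Combines stage-wise similarity maps coarse-to-fine: upsample the
    running prediction, add the current stage's map, and lightly refine it."""

    def __init__(self, num_stages: int):
        super().__init__()
        self.refine = nn.ModuleList(
            [nn.Conv2d(1, 1, kernel_size=3, padding=1) for _ in range(num_stages)]
        )

    def forward(self, stage_maps: list) -> torch.Tensor:
        out = stage_maps[0]                                  # coarsest map first
        for conv, m in zip(self.refine, stage_maps):
            out = F.interpolate(out, size=m.shape[-2:],
                                mode="bilinear", align_corners=False)
            b, k, h, w = m.shape
            merged = (out + m).reshape(b * k, 1, h, w)       # refine per class
            out = out + conv(merged).reshape(b, k, h, w)
        return out                                           # (B, K, H_f, W_f) logits


if __name__ == "__main__":
    # Toy check with random tensors standing in for encoder outputs.
    fusions = nn.ModuleList([PixelTextFusion(512, 512), PixelTextFusion(512, 512)])
    head = CascadeSegHead(num_stages=2)
    feats = [torch.randn(1, 512, 16, 32), torch.randn(1, 512, 32, 64)]
    commands = torch.randn(1, 5, 512)                        # 5 category commands
    maps = [f(x, commands) for f, x in zip(fusions, feats)]
    print(head(maps).shape)                                  # torch.Size([1, 5, 32, 64])
```

In the actual method, the visual and text embeddings would presumably come from a pretrained vision-language model, which is what allows unseen category commands to be grounded without additional segmentation labels.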
Pages: 6
Related Papers (7 records)
  • [1] Open-World Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding
    Liu, Quande
    Wen, Youpeng
    Han, Jianhua
    Xu, Chunjing
    Xu, Hang
    Liang, Xiaodan
    COMPUTER VISION, ECCV 2022, PT XX, 2022, 13680 : 275 - 292
  • [2] Multi-Text Guidance Is Important: Multi-Modality Image Fusion via Large Generative Vision-Language Model
    Wang, Zeyu
    Zhao, Libo
    Zhang, Jizheng
    Song, Rui
    Song, Haiyu
    Meng, Jiana
    Wang, Shidong
INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025
  • [3] Navigating an Automated Driving Vehicle via the Early Fusion of Multi-Modality
    Haris, Malik
    Glowacz, Adam
    SENSORS, 2022, 22 (04)
  • [4] MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning
    Li, Zejun
    Fan, Zhihao
    Tou, Huaixiao
    Chen, Jingjing
    Wei, Zhongyu
    Huang, Xuanjing
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4395 - 4405
  • [5] Automated Segmentation of Corticospinal Tract in Diffusion Tensor Images via Multi-modality Multi-atlas Fusion
    Tang, Xiaoying
    Mori, Susumu
    Miller, Michael I.
    MEDICAL IMAGING 2014: BIOMEDICAL APPLICATIONS IN MOLECULAR, STRUCTURAL, AND FUNCTIONAL IMAGING, 2014, 9038
  • [6] Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models
    Ha, Huy
    Song, Shuran
    CONFERENCE ON ROBOT LEARNING, VOL 205, 2022, 205 : 643 - 653
  • [7] Multi-Stage Hough Space Calculation for Lane Markings Detection via IMU and Vision Fusion
    Sun, Yi
    Li, Jian
    Sun, Zhenping
SENSORS, 2019, 19 (10)