Cross-modal Representation Learning for Understanding Manufacturing Procedure

Citations: 0
|
Authors
Hashimoto, Atsushi [1 ]
Nishimura, Taichi [2 ]
Ushiku, Yoshitaka [1 ]
Kameko, Hirotaka [2 ]
Mori, Shinsuke [2 ]
Affiliations
[1] OMRON SINIC X Corp, Tokyo, Japan
[2] Kyoto Univ, Kyoto, Japan
Source
CROSS-CULTURAL DESIGN: APPLICATIONS IN LEARNING, ARTS, CULTURAL HERITAGE, CREATIVE INDUSTRIES, AND VIRTUAL REALITY, CCD 2022, PT II | 2022 / Vol. 13312
Keywords
Procedural text generation; Image captioning; Video captioning; Understanding manufacturing activity;
DOI
10.1007/978-3-031-06047-2_4
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812 ;
Abstract
Assembling, biochemical experiments, and cooking are representative tasks that create new value from multiple materials through multiple processes. If a machine can computationally understand such manufacturing tasks, we will have various options for human-machine collaboration on those tasks, from video scene retrieval to robots that act on behalf of humans. As one form of such understanding, this paper introduces a series of our studies that aim to associate visual observations of the processes with the procedural texts that instruct them. In those studies, captioning is the key task, where the input is an image sequence or video clips, and our methods remain state-of-the-art. Through the explanation of these techniques, we overview machine learning technologies that deal with the contextual information of manufacturing tasks.
Pages: 44-57 (14 pages)