Cross-modal Representation Learning for Understanding Manufacturing Procedure

Citations: 0
|
Authors
Hashimoto, Atsushi [1 ]
Nishimura, Taichi [2 ]
Ushiku, Yoshitaka [1 ]
Kameko, Hirotaka [2 ]
Mori, Shinsuke [2 ]
Affiliations
[1] OMRON SINIC X Corp, Tokyo, Japan
[2] Kyoto Univ, Kyoto, Japan
Source
CROSS-CULTURAL DESIGN: APPLICATIONS IN LEARNING, ARTS, CULTURAL HERITAGE, CREATIVE INDUSTRIES, AND VIRTUAL REALITY, CCD 2022, PT II | 2022 / Vol. 13312
Keywords
Procedural text generation; Image captioning; Video captioning; Understanding manufacturing activity;
DOI
10.1007/978-3-031-06047-2_4
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812 ;
Abstract
Assembling, biochemical experiments, and cooking are representative tasks that create new value from multiple materials through multiple processes. If a machine can computationally understand such manufacturing tasks, we will have various options for human-machine collaboration on those tasks, from video scene retrieval to robots that act on behalf of humans. As one form of such understanding, this paper introduces a series of our studies that aim to associate visual observations of the processes with the procedural texts that instruct them. In those studies, captioning is the key task, where the input is an image sequence or video clips, and our methods remain state-of-the-art. Through the explanation of these techniques, we overview machine learning technologies that deal with the contextual information of manufacturing tasks.
Pages: 44-57 (14 pages)