Media type:
E-Article
Title:
A Multi-instance Multi-label Dual Learning Approach for Video Captioning
Contributor:
Ji, Wanting;
Wang, Ruili
Published:
Association for Computing Machinery (ACM), 2021
Published in:
ACM Transactions on Multimedia Computing, Communications, and Applications, 17 (2021) 2s, pages 1-18
Language:
English
DOI:
10.1145/3446792
ISSN:
1551-6857;
1551-6865
Description:
Video captioning is a challenging task in multimedia processing that aims to generate informative natural language descriptions (captions) of video content. Previous approaches mainly used an encoder-decoder structure that captures the visual information in videos to generate captions. Recently, an encoder-decoder-reconstructor structure was proposed for video captioning, which captures information from both videos and captions. Building on this structure, this article proposes a novel multi-instance multi-label dual learning approach (MIMLDL) for generating video captions. MIMLDL contains two modules: a caption generation module and a video reconstruction module. The caption generation module uses a lexical fully convolutional neural network (Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn a translatable mapping between video regions and lexical labels, from which it generates video captions. The video reconstruction module then synthesizes visual sequences to reproduce the raw videos from the outputs of the caption generation module. A dual learning mechanism fine-tunes the two modules according to the gap between the raw and the reproduced videos. The approach thus narrows the semantic gap between raw videos and generated captions by minimizing the differences between the reproduced and the raw visual sequences. Experimental results on a benchmark dataset demonstrate that MIMLDL can improve the accuracy of video captioning.
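The dual learning idea in the description can be illustrated with a minimal sketch. The abstract does not give the loss functions, so everything below is an assumption made for illustration: the caption term is a negative log-likelihood, the reconstruction term is a mean squared difference between frame features, and the names `caption_loss`, `reconstruction_loss`, `dual_loss` and the parameter `weight` are hypothetical rather than taken from the article.

```python
import math


def caption_loss(word_probs):
    """Assumed caption term: mean negative log-likelihood of the reference words.

    word_probs holds the probability the caption module assigned to each
    reference word (values in (0, 1]).
    """
    return -sum(math.log(p) for p in word_probs) / len(word_probs)


def reconstruction_loss(raw_frames, reproduced_frames):
    """Assumed reconstruction term: mean squared difference between the raw
    and the reproduced frame features (each frame is a list of floats)."""
    total = 0.0
    count = 0
    for raw, reproduced in zip(raw_frames, reproduced_frames):
        for a, b in zip(raw, reproduced):
            total += (a - b) ** 2
            count += 1
    return total / count


def dual_loss(word_probs, raw_frames, reproduced_frames, weight=0.5):
    """Joint objective: caption term plus a weighted reconstruction term.

    `weight` is a hypothetical trade-off parameter, not a value from the article.
    """
    return caption_loss(word_probs) + weight * reconstruction_loss(
        raw_frames, reproduced_frames
    )


if __name__ == "__main__":
    probs = [0.9, 0.8, 0.7]                # toy per-word probabilities
    raw = [[0.0, 1.0], [1.0, 0.0]]         # toy raw frame features
    reproduced = [[0.1, 0.9], [0.8, 0.2]]  # toy reproduced frame features
    print(dual_loss(probs, raw, reproduced))
```

Under these assumptions, lowering the joint value requires both better captions and reproduced frames that are closer to the raw ones, which is the coupling between the two modules that the description attributes to dual learning.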