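IPSJ (Information Processing Society of Japan) 85th National Convention  Dates: March 2-4, 2023  Venue: The University of Electro-Communications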

1Q-08
Research on Video Captioning with a Late Fusion Based Multimodal Transformer Network
○鮑 飛, 石川孝明, 渡辺 裕 (Waseda University)
Video captioning, the task of generating natural language descriptions of a given video, has drawn increasing attention in recent years. Because a video combines several modalities of data, multimodal learning has become central to the task. Most existing methods rely on early fusion, a strategy that simply concatenates the modalities before feeding them into the model. However, this naive operation can cause the model to overlook potentially useful representations, and it is computationally expensive: in a Transformer, the cost grows quadratically with the length of the input. We therefore propose a method that integrates the modalities by late fusion, which reduces computational complexity and improves the CIDEr score by 1.22.
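The quadratic-cost argument can be illustrated with a rough count of self-attention score entries. The sketch below is not the authors' architecture. It assumes that late fusion means each modality is encoded by its own self-attention stack before the outputs are merged, it ignores the cost of the merging step itself, and the token counts and function names are hypothetical.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key score entries in one self-attention layer."""
    return seq_len * seq_len


def early_fusion_pairs(lengths: list[int]) -> int:
    # Early fusion: all modality tokens are concatenated into one sequence,
    # so attention is computed over the summed length.
    return attention_pairs(sum(lengths))


def late_fusion_pairs(lengths: list[int]) -> int:
    # Late fusion (as assumed here): each modality attends only to itself;
    # the cost of the later merging step is not counted.
    return sum(attention_pairs(n) for n in lengths)


# Hypothetical token counts for two modalities (e.g. visual and audio).
lengths = [32, 20]
early = early_fusion_pairs(lengths)  # (32 + 20)^2 = 2704
late = late_fusion_pairs(lengths)    # 32^2 + 20^2 = 1424
print(early, late)
```

For two modalities of lengths n1 and n2, the gap between the two counts is 2 * n1 * n2, which is the cross-modal attention that early fusion pays for in every layer.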