1. School of Computer Science and Technology (School of Artificial Intelligence), Zhejiang Sci-Tech University, Hangzhou 310018, China
2. School of Artificial Intelligence, Jiaxing University, Jiaxing 314001, China
3. Provincial Key Laboratory of Multimodal Perception and Intelligent Systems, Jiaxing University, Jiaxing 314001, China
[ "方豪杰(2000- ),男,浙江理工大学计算机科学与技术学院(人工智能学院)硕士生,主要研究方向为计算机视觉和密集视频描述等。" ]
[ "李永刚(1979- ),男,博士,嘉兴大学人工智能学院、嘉兴大学全省多模态感知与智能系统重点实验室副教授、硕士生导师,主要研究方向为计算机视觉、视频图像处理、机器学习等。" ]
[ "曹宗瑞(2000- ),男,浙江理工大学计算机科学与技术学院(人工智能学院)硕士生,主要研究方向为计算机视觉和密集视频描述等。" ]
[ "叶利华(1978- ),男,博士,嘉兴大学人工智能学院、嘉兴大学全省多模态感知与智能系统重点实验室讲师、硕士生导师,主要研究方向为计算机视觉、视频图像处理等。" ]
Received: 2024-12-30
Revised: 2025-04-15
Published in print: 2025-09-20
FANG Haojie, LI Yonggang, CAO Zongrui, et al. Approach of dense video captioning based on multimodal memory knowledge[J]. Telecommunications Science, 2025, 41(9): 133-151. DOI: 10.11959/j.issn.1000-0801.2025154.
Dense video captioning aims to localize events in an untrimmed video and generate a corresponding caption for each meaningful event. Existing methods mainly rely on the source video input to generate captions and therefore fail to capture the implicit knowledge in the video, i.e., the visual, audio, textual, and other multimodal memory knowledge implicit in the video, where multimodal memory knowledge can be understood as a collection of meaningful words corresponding to the objects, actions, and attributes within the video. To solve this problem, a dense video captioning approach based on multimodal memory knowledge was proposed, which not only utilized the multimodal information of the video itself but also expanded the multimodal memory knowledge related to the video, greatly improving the accuracy of dense video captioning. First, a multimodal memory knowledge base was constructed, and an event localization module based on a modality-shared encoder was designed to achieve deep fusion of the multimodal features of the source video and to generate high-quality event proposals. Then, visual, audio, and textual memory knowledge closely related to the candidate event proposals was retrieved from the multimodal memory knowledge base as prior information for caption generation. Finally, a memory-enhanced decoder effectively integrated the multimodal memory knowledge with the multimodal information of the video to generate detailed dense video captions. Comparison experiments with current mainstream algorithms and ablation experiments on the ActivityNet Captions and YouCook2 datasets demonstrate the effectiveness of the proposed method.
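To make the pipeline concrete, the following is a minimal, illustrative Python (PyTorch) sketch of the two steps the abstract describes: retrieving the memory-knowledge entries most similar to an event proposal by cosine similarity, and letting a decoder layer cross-attend to both the video features and the retrieved memory. The names (retrieve_memory, MemoryEnhancedDecoderLayer), feature dimensions, bank sizes, and the value of k are all assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch, not the paper's actual code: retrieve top-k memory
# entries for one event proposal, then fuse them with video features
# via cross-attention in a decoder layer.
import torch
import torch.nn.functional as F

def retrieve_memory(query, memory_bank, k=8):
    """Return the top-k memory entries most similar to the query.

    query:       (d,) pooled feature of one candidate event proposal
    memory_bank: (n, d) precomputed memory-knowledge embeddings
    """
    q = F.normalize(query, dim=-1)
    m = F.normalize(memory_bank, dim=-1)
    scores = m @ q                     # (n,) cosine similarities
    topk = scores.topk(k).indices      # indices of the k best matches
    return memory_bank[topk]           # (k, d)

class MemoryEnhancedDecoderLayer(torch.nn.Module):
    """One decoder layer attending to both video features and retrieved
    memory knowledge (an illustrative stand-in for the memory-enhanced
    decoder described in the abstract)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.memory_attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(d_model, 4 * d_model),
            torch.nn.ReLU(),
            torch.nn.Linear(4 * d_model, d_model),
        )

    def forward(self, tokens, video_feats, memory_feats):
        tokens = tokens + self.self_attn(tokens, tokens, tokens)[0]
        tokens = tokens + self.video_attn(tokens, video_feats, video_feats)[0]
        tokens = tokens + self.memory_attn(tokens, memory_feats, memory_feats)[0]
        return tokens + self.ffn(tokens)

# Usage: one proposal queried against three hypothetical memory banks.
d = 512
proposal = torch.randn(d)
banks = {name: torch.randn(1000, d) for name in ("visual", "audio", "text")}
retrieved = torch.cat([retrieve_memory(proposal, b) for b in banks.values()])  # (24, d)

layer = MemoryEnhancedDecoderLayer(d_model=d)
caption_tokens = torch.randn(1, 10, d)          # partial caption embeddings
video_feats = torch.randn(1, 60, d)             # frame-level video features
out = layer(caption_tokens, video_feats, retrieved.unsqueeze(0))
print(out.shape)  # torch.Size([1, 10, 512])
```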