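1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2. Longyan Guoke (Beijing) Intelligent Information Technology Co., Ltd., Beijing 100010, China
3. Zhongke Industrial Artificial Intelligence Research Institute, Nanjing, Jiangsu 211100, China
4. University of Chinese Academy of Sciences, Beijing 101408, China
5. Traffic Control Technology Co., Ltd., Beijing 100071, China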
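LIU Jianran, Ph.D. candidate, member of the China Computer Federation (CCF). Main research interests include machine vision coding, image processing, and feature coding. E-mail: liujianran21b@ict.ac.cn
JI Wen (corresponding author), Ph.D., professor, doctoral supervisor, senior member of the China Computer Federation (CCF). Main research interests include vision processors, multimedia systems, and industrial artificial intelligence, covering high-performance vision processors, multimedia device-edge-cloud computing systems, visual coding and transmission, and industrial intelligent chips and systems. E-mail: jiwen@ict.ac.cn
FU Zhe, M.S., associate senior engineer, currently with Traffic Control Technology Co., Ltd. as deputy director of the Digital Twin Research Office of the company's research institute. Main research interests include multi-sensor-fusion-based intelligent platforms for rail transit and autonomous train operation systems. E-mail: fuzhe912@163.com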
Received: 2014-12-01
Revised: 2025-08-08
Accepted: 2025-08-25
Published online: 2026-01-06
LIU Jianran, JI Wen, FU Zhe. Unsupervised video temporal resampling for machine vision[J]. Telecommunications Science, DOI:10.11959/j.issn.1000-0801.2026048.
To address the semantic redundancy caused by nonlinear inter-frame content changes in video temporal resampling, an adaptive temporal resampling method based on self-supervised feature embedding and clustering is proposed. A pre-trained ResNet-18 is fine-tuned to extract frame features, and self-supervised metric learning is used to construct an inter-frame similarity measure, with cosine similarity measuring the similarity of adjacent frames. The loss function is designed so that frames from the same video sequence follow a smooth manifold in the embedding space, while similarity between frames from different videos is suppressed. The embedded frame features are then clustered as time-series data around the equal-division points of the manifold, and the first and last frames of the video are preserved. The resampled sequence is encoded with H.266/VVC, and at the decoder a frame interpolation network reconstructs the original frames. Experiments show average improvements of approximately 2.3% in BDmAP and 19.4% in Pareto mAP, with computational overhead that meets real-time processing requirements. The method balances compression efficiency, vision-task accuracy, and zero-shot compatibility, and offers a new approach to video transmission in machine vision scenarios.
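The resampling step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not spell out how the manifold is divided, so the code below assumes that "equal-division points" means points at equal cumulative cosine distance along the temporal path of frame embeddings. The function names `cosine_similarity` and `resample_indices` and the parameter `num_keep` are hypothetical, and the embeddings are plain Python lists standing in for the fine-tuned ResNet-18 features.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def resample_indices(embeddings, num_keep):
    """Select up to num_keep frame indices, always keeping the first and last frame.

    Assumption (not from the paper): interior frames are chosen nearest to
    points that split the cumulative adjacent-frame cosine distance evenly.
    """
    n = len(embeddings)
    if num_keep < 2:
        raise ValueError("num_keep must be at least 2")
    if n <= num_keep:
        return list(range(n))

    # Cumulative path length, using 1 - cos as the adjacent-frame distance.
    cum = [0.0]
    for i in range(1, n):
        dist = 1.0 - cosine_similarity(embeddings[i - 1], embeddings[i])
        cum.append(cum[-1] + max(dist, 0.0))
    total = cum[-1]

    if total == 0.0:
        # No measurable content change: fall back to uniform temporal sampling.
        return [round(k * (n - 1) / (num_keep - 1)) for k in range(num_keep)]

    picked = [0]
    j = 0
    for k in range(1, num_keep - 1):
        target = total * k / (num_keep - 1)
        # Advance to the first frame whose cumulative length reaches the target.
        while j < n - 1 and cum[j] < target:
            j += 1
        # Take whichever of frames j-1 and j lies closer to the target.
        idx = j if (j == 0 or cum[j] - target <= target - cum[j - 1]) else j - 1
        # Skip duplicates and the last frame, which is appended below.
        if picked[-1] < idx < n - 1:
            picked.append(idx)
    picked.append(n - 1)
    return picked
```

Under this assumption, a static stretch of video contributes almost no path length and so receives few samples, while frames around a rapid content change are kept more densely. Because duplicate picks are skipped, the function may return fewer than `num_keep` indices when the content change is concentrated in a few frames.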