A survey of video behavior recognition based on deep learning

Duoduo ZHAO; Jianwu ZHANG; Chunsheng GUO; Di ZHOU; 穆罕默德·阿卜杜·沙拉夫·哈基米

doi:10.11959/j.issn.1000-0801.2019286

Review | Updated: 2024-06-05

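- A survey of video behavior recognition based on deep learning
- Telecommunications Science, 2019, 35(12): 99-111
- Author affiliations:
  
  1. Hangzhou Dianzi University, Hangzhou 310018, Zhejiang, China
  2. Zhejiang Uniview Technologies Co., Ltd., Hangzhou 310018, Zhejiang, China
- Author biographies:
  
  Duoduo ZHAO (1995- ), female, master's student at the School of Communication Engineering, Hangzhou Dianzi University. Her research interests include image processing and artificial intelligence.
  Jianwu ZHANG (1961- ), male, Ph.D., professor and doctoral supervisor at the School of Communication Engineering, Hangzhou Dianzi University; senior member of the Chinese Institute of Electronics and the China Institute of Communications; executive director of the Zhejiang Institute of Communications. His research interests include mobile communications, multimedia signal processing and artificial intelligence, and communication networks and information security.
  Chunsheng GUO (1971- ), male, Ph.D., associate professor and master's supervisor at the School of Communication Engineering, Hangzhou Dianzi University. His research interests include video analysis and pattern recognition.
  Di ZHOU (1975- ), male, senior engineer at Zhejiang Uniview Technologies Co., Ltd. and president of the Uniview Research Institute. His research interests include video security and artificial intelligence.
  穆罕默德·阿卜杜·沙拉夫·哈基米 (1991- ), male, Ph.D. student at Hangzhou Dianzi University. His research interests include image processing and artificial intelligence.
- Funding:
  
  The National Natural Science Foundation of China (61772162, U1866209); The National Key Research Development Program of China (2018YFC0831503); The Natural Science Foundation of Zhejiang Province of China (LY16F020016); The Key Research Development Program of Zhejiang Province of China (2018C01059, 2019C01062)
- DOI: 10.11959/j.issn.1000-0801.2019286
  CLC number: TP393
- Published online: 2019-12
  
  Published in print: 2019-12-20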
Duoduo ZHAO, Jianwu ZHANG, Chunsheng GUO, et al. A survey of video behavior recognition based on deep learning[J]. Telecommunications Science, 2019, 35(12): 99-111. DOI: 10.11959/j.issn.1000-0801.2019286.

Abstract

In recent years, deep learning methods, which learn features automatically, have been explored extensively in the field of video behavior recognition. This survey first summarizes the commonly used behavior recognition datasets, then outlines traditional behavior recognition methods and the basic principles of deep learning. It focuses on a comprehensive and systematic summary, comparison, and analysis of behavior recognition methods, grouped by input content and by type of deep network. Finally, it summarizes the development of deep learning in behavior recognition and discusses future trends.

