Chapter 6 Summary and Outlook
6.1 Summary of Work
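Detecting and searching for moving objects in video is an important task in computer vision. The main difficulty lies in using deep learning algorithms to determine both the category and the location of each object. In recent years, conventional deep learning algorithms have achieved outstanding results in recognizing simple static images, but they still cannot meet the requirements of video object detection and of natural language search for video objects.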
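This thesis takes the detection and search of objects in video frames as its research goal. It first proposes an object detection algorithm, built on a convolutional neural network model that uses the boundary probabilities of candidate object boxes, to localize and recognize objects in video. It then uses a person action detection algorithm based on a spatiotemporal two-stream model to recognize actions in video. Finally, it uses a recurrent-network-based natural language search model for video objects to retrieve the object that corresponds to a natural language query. The specific research work is as follows: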
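(1) A convolutional neural network model based on the boundary probabilities of candidate object boxes is proposed for localizing and recognizing objects in video frames. The model first computes, over a given search region, the probabilities of the four edges of the candidate bounding box, which yields candidates closer to the manually annotated boxes, and is then fused with the object recognition model in an iterative manner. Experimental results show that the model improves both the object region recall and the object recognition accuracy to some degree, indicating that it detects objects in video more accurately.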
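(2) A convolutional neural network model based on spatiotemporal two-stream features is proposed for detecting person actions in video. The pretrained spatial-stream and temporal-stream action detection networks are first fused at a deep convolutional layer; the fused two-stream model is then used to extract mid-level spatiotemporal features; finally, a 3D convolutional neural network model performs the action detection. In addition, the model was compared against existing models in the experiments. The experimental data show that the proposed action recognition model achieves relatively high recognition accuracy on both the UCF101 and HMDB51 datasets, indicating that a convolutional neural network model based on spatiotemporal two-stream features recognizes person actions in video more accurately.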
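(3) A recurrent neural network model driven by natural language queries is proposed for natural language search of video objects. A convolutional neural network first extracts, in parallel, the features of the local object region and of the global frame; a two-layer GRU recurrent neural network then fuses these two sets of features with the features of the natural language query to perform the object search. In the experiments, the model's search accuracy is higher than that of the existing text search models LRCN and CAFFE-7K, indicating that it retrieves the corresponding object in video from a natural language query more accurately.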
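The boundary-probability refinement summarized in item (1) can be illustrated with a minimal sketch. This chapter does not give the implementation details, so everything concrete here is an assumption made for illustration: the search region is taken to be the candidate box enlarged about its centre by a factor of 1.8, the region is split into equal-width bins, each edge is placed at the centre of its most probable bin, and `predict_edge_probs` is a hypothetical stand-in for the convolutional network that would produce the four edge-probability vectors.

```python
from typing import Callable, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates
# Probability vectors for the left, right, top and bottom edges
# (hypothetical network output, one value per bin of the search region).
EdgeProbs = Tuple[Sequence[float], Sequence[float], Sequence[float], Sequence[float]]


def enlarge(box: Box, scale: float = 1.8) -> Box:
    """Search region: the candidate box scaled about its centre (scale is assumed)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * scale / 2, (y2 - y1) * scale / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def bin_centre(lo: float, hi: float, index: int, n_bins: int) -> float:
    """Image coordinate of the centre of bin `index` when [lo, hi] has n_bins bins."""
    return lo + (index + 0.5) * (hi - lo) / n_bins


def refine_box(region: Box, probs: EdgeProbs) -> Box:
    """Place each edge at the most probable column or row of the search region."""
    p_left, p_right, p_top, p_bottom = probs
    rx1, ry1, rx2, ry2 = region
    left = bin_centre(rx1, rx2, argmax(p_left), len(p_left))
    right = bin_centre(rx1, rx2, argmax(p_right), len(p_right))
    top = bin_centre(ry1, ry2, argmax(p_top), len(p_top))
    bottom = bin_centre(ry1, ry2, argmax(p_bottom), len(p_bottom))
    # Keep the box well formed even if the predicted edges cross.
    return (min(left, right), min(top, bottom), max(left, right), max(top, bottom))


def iterative_refine(
    box: Box,
    predict_edge_probs: Callable[[Box], EdgeProbs],
    steps: int = 3,
) -> Box:
    """Alternate between building a search region and re-estimating the four edges."""
    for _ in range(steps):
        region = enlarge(box)
        box = refine_box(region, predict_edge_probs(region))
    return box
```

In a full pipeline, each refined box would also be rescored by the recognition model between iterations; that step is omitted here because the chapter does not specify how the two models are combined.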
6.2 Future Work
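Detecting and searching for objects in video is one of the important topics in computer vision. By building a localization and detection model for video objects, an action detection model for people in video, and a natural language search model for video objects, this thesis can already perform simple searches for moving objects in video. For the task of detecting and recognizing objects in video, however, the following points require further in-depth research: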
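(1) Improving the robustness of video object detection. The object detection in this thesis does not account for the influence of external environmental factors on objects, for mutual occlusion between video objects, or for the detection of objects without a fixed shape. Weather factors include heavy rain and snow; objects without a fixed shape include the sky and roads. Taking these factors into account would greatly improve video object detection, and the author's follow-up work will focus on object localization and recognition under these conditions.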
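(2) Extending the cues used by the video person action detection model. This thesis does not make full use of information inherent in video such as audio and text, which are important cues for person action detection. If these cues can be extracted and fused effectively, person action detection in video will improve significantly. Follow-up work will focus on person action detection that fuses audio and text information.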
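(3) Extending the natural language search capability for video objects. The natural language object search in this thesis incorporates the global information of the video frame and the location information of the object, but it still cannot search for multiple objects in a video at the same time. A natural language query about multiple objects may describe relationships among them, and this thesis does not extract inter-object relationship information. Achieving multi-object natural language search would bring natural language object search closer to practical everyday use, and follow-up work will focus on this problem.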