Title
基于注意力机制和对比学习的人体行为识别方法

Alternative Title
HUMAN ACTION RECOGNITION METHOD BASED ON ATTENTION AND CONTRASTIVE LEARNING
Name
武韬忠
Name (Pinyin)
WU Taozhong
Student ID
12132150
Degree Type
Master
Degree Discipline
080902 Circuits and Systems
Discipline Category / Professional Degree Category
08 Engineering
Supervisor
林志赟
Supervisor's Affiliation
School of System Design and Intelligent Manufacturing
Thesis Defense Date
2024-05-09
Thesis Submission Date
2024-06-22
Degree-Granting Institution
Southern University of Science and Technology
Place of Degree Conferral
Shenzhen
Abstract
This thesis presents an in-depth study of skeleton-based action recognition. Skeleton data strips away redundant background information and retains only the spatial positions of key human joints, and it has become a mainstream data source for action recognition tasks. However, owing to the inherent diversity of human actions and the intrinsic sparsity of skeleton data, building a neural network model that both captures subtle differences precisely and recognizes actions efficiently remains the central challenge of current research.
Although graph convolutional networks (GCNs) have achieved remarkable results in action recognition, the field still faces several pressing problems. First, as the number of graph convolution layers grows, feature over-smoothing becomes pronounced, making it difficult for the model to distinguish similar action classes precisely. Second, action recognition must account for the temporal evolution of a motion; especially for long action sequences, effectively distilling key features that characterize the whole action from the entire sequence is a major difficulty. Finally, because of inter-subject variability and changes in camera viewpoint, the same action performed by different subjects yields different spatial skeleton distributions, and viewpoint changes distort and deform the captured motion; effectively eliminating inter-subject differences and correcting the data distortion induced by viewpoint changes is another difficulty of this research.
To address over-smoothing in GCNs and the difficulty of modeling long temporal sequences, this thesis proposes a new graph convolutional network architecture. Its main innovations are three new modules: attention feature sampling, a temporal attention mechanism, and a multi-scale graph convolution module. First, in the spatial dimension, statistical methods quantify the motion energy of each joint, and the network's final feature-sampling weights are generated according to the magnitude of this motion energy. Second, a multi-scale graph convolution is designed for the feature-extraction stage, fusing spatial information across multiple levels and scales. Finally, in the temporal dimension, a temporal attention module dynamically adjusts the weight assigned to each time step in the sequence, giving higher weights to the key time steps that are decisive for recognition. Experiments on the NTU RGB+D 60 and NTU RGB+D 120 datasets reach recognition accuracies of 93.4% and 90.1%, respectively.
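As a concrete illustration of the attention feature sampling and temporal attention modules described above, the following is a minimal sketch, not the thesis implementation. It assumes PyTorch, skeleton features shaped (N, C, T, V) = (batch, channels, frames, joints), and hypothetical names (`motion_energy_weights`, `TemporalAttention`); motion energy is rendered here as mean squared frame-to-frame displacement per joint, one plausible reading of the statistical weighting the abstract mentions.

```python
import torch
import torch.nn as nn

def motion_energy_weights(x: torch.Tensor) -> torch.Tensor:
    """Per-joint sampling weights from motion energy (an assumed variant:
    mean squared frame-to-frame displacement, softmax-normalized over joints)."""
    disp = x[:, :, 1:, :] - x[:, :, :-1, :]               # (N, C, T-1, V) displacements
    energy = disp.pow(2).sum(dim=1).mean(dim=1)           # (N, V) energy per joint
    return torch.softmax(energy, dim=-1)                  # higher energy -> larger weight

class TemporalAttention(nn.Module):
    """Dynamically reweight frames so decisive time steps get larger weights."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv1d(channels, 1, kernel_size=1)  # one scalar score per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = x.mean(dim=-1)                           # (N, C, T): pool over joints
        attn = torch.softmax(self.score(pooled), dim=-1)  # (N, 1, T) frame weights
        return x * attn.unsqueeze(-1)                     # broadcast back to (N, C, T, V)

# Usage on dummy data: 2 clips, 3 coordinate channels, 64 frames, 25 joints.
x = torch.randn(2, 3, 64, 25)
joint_weights = motion_energy_weights(x)                  # (2, 25)
reweighted = TemporalAttention(3)(x)                      # same shape as x
```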
To address the data distortion caused by inter-subject variability and viewpoint changes, a contrast-enhanced deep neural network method is proposed on top of the new architecture. The method guides the model to focus on the key discriminative features between action classes, improving the accuracy and robustness of the network on action recognition tasks. Contrastive learning is embedded in the intermediate layers of the network, forming a multi-level, adaptive supervision mechanism that guides the network to extract features that remain invariant under various data augmentation operations. Experiments on the NTU RGB+D 60 and NTU RGB+D 120 datasets reach recognition accuracies of 93.7% and 90.3%, respectively.
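To make the contrastive component concrete, here is a minimal sketch under stated assumptions, not the thesis code: an NT-Xent-style loss (the common SimCLR formulation, which may differ from the loss actually used) applied to intermediate-layer features of two augmented views of the same clips, so that features invariant under augmentation are pulled together while other clips are pushed apart.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z1, z2: (N, D) intermediate features of two augmented views of the same N clips."""
    z = torch.cat([F.normalize(z1, dim=1), F.normalize(z2, dim=1)], dim=0)  # (2N, D)
    sim = z @ z.t() / tau                            # (2N, 2N) scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))                # a view is never its own positive
    n = z1.size(0)
    # The positive for row i is the other view of the same clip, at index (i + n) mod 2n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage: features of two augmentations (e.g., view rotation, joint jitter) of 8 clips.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = contrastive_loss(z1, z2)
```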
Keywords
Language
Chinese
Training Category
Independent Training
Year of Enrollment
2021
Year of Degree Conferral
2024-06
Degree Evaluation Subcommittee
Electronic Science and Technology
Chinese Library Classification (CLC) Number
TP183
Source Repository
Manual Submission
Document Type
Thesis
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/765832
Collection
Southern University of Science and Technology
College of Engineering
College of Engineering_School of System Design and Intelligent Manufacturing
Recommended Citation
GB/T 7714
WU Taozhong. Human Action Recognition Method Based on Attention and Contrastive Learning[D]. Shenzhen: Southern University of Science and Technology, 2024.
Files in This Item
File Name/Size	Document Type	Version	Access	License
12132150-武韬忠-系统设计与智能(11214KB)	--	--	Restricted Access	--