Title

Deep Learning-Based 3D Human Pose Estimation and Its Edge-Side Implementation

Alternative Title
DEEP LEARNING BASED 3D HUMAN POSE ESTIMATION ON EDGE DEVICE
Name
李书玮
Name (Pinyin)
LI Shuwei
Student ID
12031024
Degree Type
Master's
Degree Discipline
1401 Integrated Circuit Science and Engineering
Subject Category / Professional Degree Category
14 Interdisciplinary Studies
Supervisor
余浩
Supervisor's Affiliation
School of Microelectronics (深港微电子学院)
Thesis Defense Date
2024-05-24
Thesis Submission Date
2024-06-26
Degree-Granting Institution
Southern University of Science and Technology
Degree-Granting Location
Shenzhen
Abstract

Human Pose Estimation (HPE) is an important topic in computer vision and machine learning. It studies how to automatically identify and localize the keypoints of the human body (such as joints and skeletal connection points) by analyzing images or videos, supporting downstream problems such as 3D human reconstruction and the understanding of 3D human structure and dynamic motion. HPE research can be applied to human-computer interaction, motion analysis for athletes or patients, public safety surveillance, and motion capture for film and games, and thus has broad application prospects and value. In recent years, the rapid development of deep learning methods in computer vision has also driven progress in HPE, but several problems remain to be solved, including accuracy and robustness, 3D pose estimation, and real-time performance. In addition, deploying models on edge devices to reduce processing latency and computational cost is an important research topic. To address these problems, this thesis optimizes and proposes new pose estimation neural network models, compresses them for edge deployment, and applies them to the concrete problem of fall detection. The work comprises the following three parts:

(1) A 2D pose estimation model is optimized based on the Spatial Transformer Network (STN) to cope with image variations caused by different human actions, differences in body-part size, and varying distances and viewpoints. RGBD data are introduced, and a new 3D pose estimation model based on multi-scale feature fusion and an attention mechanism is proposed, improving model accuracy, robustness, and generality.

(2) The pose estimation model is applied to the problem of elderly fall detection. Combined with a Long Short-Term Memory (LSTM) network, a spatiotemporal model is built to accurately classify the indoor activities of the elderly and thereby detect falls. Model performance is validated on a public fall detection dataset, reaching a high accuracy of 97.58%.

(3) The neural network is compressed using tensor decomposition and model quantization, reducing its parameter count and storage footprint. The model is deployed on an edge Neural Network Processing Unit (NPU) platform for hardware acceleration, meeting the processing-speed requirements of real-time applications. The fall detection algorithm ultimately runs at 17.85 FPS on the edge device.

Other Abstract

Human Pose Estimation (HPE) constitutes a significant research topic in computer vision and machine learning, focusing primarily on automatically identifying and localizing key points of the human body, such as joints and skeletal connection points, through the analysis of images or videos. This underpins subsequent problems like 3D reconstruction of the human form and understanding its three-dimensional structure and dynamic movement. The outcomes of HPE research have broad application prospects and value across various domains, including human-computer interaction, athletic and patient motion analysis, public safety surveillance, and motion capture in the film and gaming industries.

In recent years, the rapid advancement of deep learning methods in the visual domain has propelled progress in human pose estimation techniques; however, several issues remain to be resolved, such as accuracy and robustness, 3D pose estimation, and real-time processing. In addition, deploying models on edge devices to reduce processing latency and computational cost is another important research topic. To tackle these challenges, this thesis optimizes and proposes novel neural network models for pose estimation, compresses them for edge deployment, and applies them to the specific problem of fall detection. The work encompasses the following three components:

(1) An optimized 2D pose estimation model is developed based on the Spatial Transformer Network (STN), addressing image variations caused by different human actions, varying body part sizes, distances, and viewpoints. By incorporating RGBD data, a new 3D pose estimation model is proposed that fuses multi-scale features and employs a global attention mechanism, thereby enhancing model accuracy, robustness, and versatility.
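The core STN idea referenced above can be illustrated with a minimal pure-Python sketch: a 2x3 affine matrix theta warps the input feature map over a normalized [-1, 1] grid, so that a person at an unusual scale or viewpoint is normalized before keypoint prediction. This is a toy nearest-neighbour version, not the thesis implementation; in a real STN, theta is regressed from the image by a small localization network, which is not shown here.

```python
# Toy spatial-transformer sampling: warp a 2D feature map (a list of
# lists) through an affine map over a normalized [-1, 1] grid.
# Nearest-neighbour sampling with zero padding; illustrative only.

def affine_sample(img, theta):
    """Sample img through the 2x3 affine matrix theta."""
    h, w = len(img), len(img[0])           # requires h, w >= 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Target pixel (i, j) in normalized [-1, 1] coordinates.
            y = -1.0 + 2.0 * i / (h - 1)
            x = -1.0 + 2.0 * j / (w - 1)
            # Source location = theta @ [x, y, 1].
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            # Back to pixel indices; out-of-range samples stay zero.
            si = round((ys + 1.0) * (h - 1) / 2.0)
            sj = round((xs + 1.0) * (w - 1) / 2.0)
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = img[si][sj]
    return out

img = [[0, 0, 0, 0],
       [0, 1, 2, 0],
       [0, 3, 4, 0],
       [0, 0, 0, 0]]
identity = [[1, 0, 0], [0, 1, 0]]
assert affine_sample(img, identity) == img  # identity map is a no-op
```

Scaling theta (e.g. `[[0.5, 0, 0], [0, 0.5, 0]]`) zooms into the center of the map, which is how an STN can crop and normalize the person region before the keypoint head.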

(2) The pose estimation model is applied to the elderly fall detection scenario. A spatiotemporal model is constructed by combining it with a Long Short-Term Memory (LSTM) network to accurately classify indoor movements of seniors, thus enabling fall detection. Model performance is verified on a public fall detection dataset, reaching a high accuracy of 97.58%.
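The spatiotemporal pipeline above can be sketched in a few lines: per-frame pose keypoints feed an LSTM cell, and the final hidden state is mapped to a binary "fall" score. This is a minimal pure-Python illustration with random placeholder weights and a toy 4-dimensional keypoint vector, not the trained thesis model or its feature layout.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LSTMCell:
    """A single LSTM cell with the standard i/f/o/g gate equations."""
    def __init__(self, in_dim, hid_dim, seed=0):
        rng = random.Random(seed)
        self.hid = hid_dim
        # One row per (gate, hidden unit) over [input; hidden; bias].
        self.W = [[rng.uniform(-0.1, 0.1)
                   for _ in range(in_dim + hid_dim + 1)]
                  for _ in range(4 * hid_dim)]

    def step(self, x, h, c):
        z = x + h + [1.0]  # concatenated input, previous hidden, bias
        acts = [sum(w * v for w, v in zip(row, z)) for row in self.W]
        n = self.hid
        i = [sigmoid(a) for a in acts[0:n]]          # input gate
        f = [sigmoid(a) for a in acts[n:2 * n]]      # forget gate
        o = [sigmoid(a) for a in acts[2 * n:3 * n]]  # output gate
        g = [math.tanh(a) for a in acts[3 * n:]]     # candidate state
        c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h, c

# Toy sequence: 5 frames, each a 4-dimensional keypoint feature vector.
cell = LSTMCell(in_dim=4, hid_dim=3)
h, c = [0.0] * 3, [0.0] * 3
for frame in [[0.1, 0.2, 0.9, 0.8]] * 5:
    h, c = cell.step(frame, h, c)
score = sigmoid(sum(h))  # placeholder linear head for a binary score
assert 0.0 < score < 1.0
```

The design point is that the pose model handles the spatial part (where the joints are in each frame) while the LSTM handles the temporal part (how the joint configuration evolves), which is what distinguishes a fall from, say, sitting down quickly.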

(3) Leveraging tensor decomposition and model quantization techniques, the neural network is compressed, reducing its parameter count and storage footprint. The compressed model is deployed on an edge Neural Network Processing Unit (NPU) platform, enabling hardware acceleration to meet the speed requirements of real-time applications. The resulting fall detection algorithm runs at 17.85 FPS on the edge device.
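Of the two compression techniques named above, quantization is the simpler to illustrate: symmetric per-tensor int8 quantization replaces each float weight with an 8-bit code plus one shared scale factor, cutting storage to roughly a quarter of float32. The sketch below is a generic illustration with made-up values, not the thesis's specific quantization scheme or the NPU toolchain's.

```python
# Symmetric per-tensor int8 post-training quantization sketch.
# Each weight is stored as an int8 code; one float scale is shared.

def quantize_int8(weights):
    """Map float weights to int8 codes plus a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

weights = [0.8, -1.27, 0.05, 0.002, -0.64]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
max_err = max(abs(a - w) for a, w in zip(approx, weights))
# Rounding error is bounded by half the quantization step (scale / 2).
assert max_err <= scale / 2 + 1e-12
```

Tensor decomposition is complementary: it factors large weight tensors into low-rank components to cut the multiply count, while quantization shrinks each remaining parameter, and int8 arithmetic is what edge NPUs typically accelerate natively.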

Keywords
Other Keywords
Language
Chinese
Training Category
Independent Training
Year of Enrollment
2020
Year Degree Conferred
2024-06

Degree Assessment Subcommittee
Integrated Circuit Science and Engineering
Chinese Library Classification (CLC) Number
TP183
Source Repository
Manually Submitted
Document Type
Degree Thesis
Item Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/766180
Collection
SUSTech-HKUST Shenzhen-Hong Kong Microelectronics School Preparatory Office
Recommended Citation
GB/T 7714
李书玮. 基于深度学习的3D人体姿态估计及其端侧实现[D]. 深圳: 南方科技大学, 2024.
Files in This Item
File Name/Size | Document Type | Version Type | Access Type | License
12031024-李书玮-南方科技大学 (5395 KB), restricted access, request full text