Title

轻量级多人人体姿态实时重建 (Lightweight Real-Time Reconstruction of Multi-Person Human Poses)

Alternative Title
LIGHTWEIGHT REAL-TIME MULTI-PERSON POSE ESTIMATION
Name
吴钰
Name (Pinyin)
WU Yu
Student ID
11930386
Degree Type
Master's
Degree Discipline
0809 Electronic Science and Technology
Discipline Category / Professional Degree Category
08 Engineering
Supervisor
HAO Qi (郝祁)
Supervisor's Affiliation
Department of Computer Science and Engineering
Thesis Defense Date
2022-11
Thesis Submission Date
2022-12-14
Degree-Granting Institution
Southern University of Science and Technology
Place of Degree Conferral
Shenzhen
Abstract

Human pose estimation is an important task in computer vision. Deep-learning-based approaches have achieved excellent performance, but a single inference with current state-of-the-art methods typically demands an enormous amount of computation, making it difficult to meet the real-time requirements of practical applications. For real-time 2D multi-person pose estimation, this thesis first proposes a lightweight convolutional module (the dense inverted-residual module) and an efficient network architecture (the balanced high-resolution network), and uses them to build a lightweight convolutional neural network for this task. Experiments on several datasets show that, compared with existing methods, our approach achieves higher accuracy at the same or lower computational cost.
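The computational saving that lightweight modules of this kind build on (in the spirit of the depthwise-separable and inverted-residual convolutions of MobileNet [17, 18]; the thesis's dense inverted-residual module itself is not reproduced here) can be illustrated with a back-of-the-envelope multiply-accumulate count. The layer sizes below are hypothetical, not the thesis's actual configuration:

```python
# MAC counts for a standard convolution vs. a depthwise-separable one
# (the factorization that lightweight modules build on).
# All layer sizes below are hypothetical.

def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a standard k x k convolution, stride 1, same padding."""
    return h * w * c_in * c_out * k * k

def separable_macs(h, w, c_in, c_out, k):
    """MACs of a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

if __name__ == "__main__":
    h = w = 64                    # feature-map resolution
    c_in, c_out, k = 128, 128, 3  # channels and kernel size
    std = conv_macs(h, w, c_in, c_out, k)
    sep = separable_macs(h, w, c_in, c_out, k)
    print(f"standard: {std:,}  separable: {sep:,}  ratio: {sep / std:.3f}")
```

For a 3×3 layer with 128 output channels the separable variant needs roughly 1/k² + 1/c_out ≈ 12% of the standard convolution's MACs, which is the scale of reduction that makes real-time inference feasible.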

In addition, for real-time 3D multi-person pose estimation, this thesis proposes a method based on projected voxels and 2D convolutional neural networks. We reformulate 3D pose estimation as computing the projections of 3D human poses onto 2D planes: a 2D CNN predicts 2D poses from the projected voxels, and an aggregation algorithm then converts these 2D projected poses back into 3D poses. By reducing the feature dimensionality and avoiding computationally expensive 3D convolutions, our method cuts the floating-point operations of the existing VoxelPose method, which relies on voxel features and 3D CNNs, to roughly 2.5% of the original. Finally, our real-time 3D multi-person method achieves a Percentage of Correct Parts (PCP) above 90% on multiple datasets.
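The project-then-aggregate idea can be sketched in a few lines. This is an illustrative toy under assumed conditions (orthographic projection onto two axis-aligned planes, perfect 2D estimates), not the thesis's actual pipeline, which predicts the 2D projected poses from projected voxels with a 2D CNN before aggregating:

```python
# Toy sketch: decompose a 3D pose into 2D projections and recover the 3D pose
# by recombining them. Orthographic projection onto the xy and xz planes is an
# assumption made for illustration only.

def project(joints3d):
    """Split 3D joints into their xy-plane and xz-plane projections."""
    xy = [(x, y) for x, y, _ in joints3d]
    xz = [(x, z) for x, _, z in joints3d]
    return xy, xz

def aggregate(xy, xz):
    """Recombine the two 2D projections into 3D joints via the shared x axis."""
    return [(x, y, z) for (x, y), (_, z) in zip(xy, xz)]

if __name__ == "__main__":
    pose = [(0.10, 1.70, 0.30), (0.25, 1.40, 0.32)]  # hypothetical joints (metres)
    xy, xz = project(pose)
    print(aggregate(xy, xz) == pose)  # the round trip recovers the 3D pose
```

The appeal of this decomposition is that all learned computation stays in 2D, where convolutions are far cheaper than their 3D counterparts.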

Keywords
Language
Chinese
Training Category
Independently trained
Year of Enrollment
2019
Year of Degree Conferral
2022-12
References

[1] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft coco: Common objects in context[C]// European conference on computer vision. Springer, 2014: 740-755.
[2] WANG H, WANG L. Beyond joints: Learning representations from primitive geometries for skeleton-based action recognition and detection[J]. IEEE Transactions on Image Processing, 2018, 27(9): 4382-4394.
[3] SHI L, ZHANG Y, CHENG J, et al. Skeleton-based action recognition with multi-stream adaptive graph convolutional networks[J]. IEEE Transactions on Image Processing, 2020, 29: 9532-9545.
[4] HERDA L, FUA P, PLÄNKERS R, et al. Using skeleton-based tracking to increase the reliability of optical motion capture[J]. Human movement science, 2001, 20(3): 313-341.
[5] SONG Y, LIU H, HONG F, et al. Syncline reservoir pooling as a general model for coalbed methane (cbm) accumulations: Mechanisms and case studies[J]. Journal of Petroleum Science and Engineering, 2012, 88: 5-12.
[6] VON MARCARD T, HENSCHEL R, BLACK M J, et al. Recovering accurate 3d human pose in the wild using imus and a moving camera[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 601-617.
[7] CHEN Q, ZHANG C, LIU W, et al. Shpd: Surveillance human pose dataset and performance evaluation for coarse-grained pose estimation[C]//2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018: 4088-4092.
[8] SUMA E A, LANGE B, RIZZO A S, et al. Faast: The flexible action and articulated skeleton toolkit[C]//2011 IEEE Virtual Reality Conference. IEEE, 2011: 247-248.
[9] WANG J, QIU K, PENG H, et al. Ai coach: Deep human pose estimation and analysis for personalized athletic training assistance[C]//Proceedings of the 27th ACM International Conference on Multimedia. 2019: 374-382.
[10] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. Imagenet classification with deep convolutional neural networks[J]. Advances in neural information processing systems, 2012, 25: 1097-1105.
[11] SZEGEDY C, TOSHEV A, ERHAN D. Deep neural networks for object detection[J]. Advances in neural information processing systems, 2013, 26.
[12] WANG P, CHEN P, YUAN Y, et al. Understanding convolution for semantic segmentation[C]//2018 IEEE winter conference on applications of computer vision (WACV). IEEE, 2018: 1451-1460.
[13] TOSHEV A, SZEGEDY C. Deeppose: Human pose estimation via deep neural networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2014: 1653-1660.
[14] JOHNSON S, EVERINGHAM M. Clustered pose and nonlinear appearance models for human pose estimation[C]//BMVC: volume 2. Citeseer, 2010: 5.
[15] SAPP B, TASKAR B. Modec: Multimodal decomposable models for human pose estimation [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013: 3674-3681.
[16] CAO Z, SIMON T, WEI S E, et al. Realtime multi-person 2D pose estimation using part affinity fields[C]//CVPR 2017. 7291-7299.
[17] HOWARD A G, ZHU M, CHEN B, et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.
[18] SANDLER M, HOWARD A, ZHU M, et al. Mobilenetv2: Inverted residuals and linear bottlenecks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 4510-4520.
[19] ZHANG X, ZHOU X, LIN M, et al. Shufflenet: An extremely efficient convolutional neural network for mobile devices[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 6848-6856.
[20] ZOPH B, VASUDEVAN V, SHLENS J, et al. Learning transferable architectures for scalable image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 8697-8710.
[21] CHEN Y, WANG Z, PENG Y, et al. Cascaded pyramid network for multi-person pose estimation [C]//CVPR 2018. 7103-7112.
[22] SUN K, XIAO B, LIU D, et al. Deep high-resolution representation learning for human pose estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 5693-5703.
[23] NEWELL A, YANG K, DENG J. Stacked hourglass networks for human pose estimation[C]// European conference on computer vision. Springer, 2016: 483-499.
[24] CHENG B, XIAO B, WANG J, et al. Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 5386-5395.
[25] FANG H S, XIE S, TAI Y W, et al. RMPE: Regional multi-person pose estimation[C]//ICCV 2017. 2334-2343.
[26] XIAO B, WU H, WEI Y. Simple baselines for human pose estimation and tracking[C]// Proceedings of the European conference on computer vision (ECCV). 2018: 466-481.
[27] HE K, GKIOXARI G, DOLLÁR P, et al. Mask r-cnn[C]//ICCV 2017. 2961-2969.
[28] ISKAKOV K, BURKOV E, LEMPITSKY V, et al. Learnable triangulation of human pose[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 7718-7727.
[29] HE Y, YAN R, FRAGKIADAKI K, et al. Epipolar transformers[C]//CVPR 2020. 7776-7785.
[30] QIU H, WANG C, WANG J, et al. Cross view fusion for 3D human pose estimation[C]//ICCV 2019. IEEE: 4341-4350.
[31] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//CVPR 2016. 770-778.
[32] TOMPSON J J, JAIN A, LECUN Y, et al. Joint training of a convolutional network and a graphical model for human pose estimation[C]//Advances in neural information processing systems. 2014: 1799-1807.
[33] JAIN A, TOMPSON J, LECUN Y, et al. Modeep: A deep learning framework using motion features for human pose estimation[C]//Asian conference on computer vision. Springer, 2014: 302-315.
[34] ZHANG F, ZHU X, DAI H, et al. Distribution-aware coordinate representation for human pose estimation[C]//CVPR 2020. 7093-7102.
[35] WEI S E, RAMAKRISHNA V, KANADE T, et al. Convolutional pose machines[C]//CVPR 2016. 4724-4732.
[36] TOMPSON J, GOROSHIN R, JAIN A, et al. Efficient object localization using convolutional networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 648-656.
[37] NEWELL A, HUANG Z, DENG J. Associative embedding: End-to-end learning for joint de- tection and grouping[J]. arXiv preprint arXiv:1611.05424, 2016.
[38] XU Y, ZHANG J, ZHANG Q, et al. Vitpose: Simple vision transformer baselines for human pose estimation[J]. arXiv preprint arXiv:2204.12484, 2022.
[39] MAO W, GE Y, SHEN C, et al. Tfpose: Direct human pose estimation with transformers[J]. arXiv preprint arXiv:2103.15320, 2021.
[40] LI K, WANG S, ZHANG X, et al. Pose recognition with cascade transformers[C]//CVPR 2021. 1944-1953.
[41] YANG S, QUAN Z, NIE M, et al. Transpose: Keypoint localization via transformer[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 11802-11812.
[42] BULAT A, KOSSAIFI J, TZIMIROPOULOS G, et al. Toward fast and accurate human pose estimation via soft-gated skip connections[C]//2020 15th IEEE International Conference on Automatic Face and Gesture Recognition. 2020: 101-108.
[43] RAFI U, LEIBE B, GALL J, et al. An efficient convolutional network for human pose estimation[C]//BMVC: volume 1. 2016: 2.
[44] CHU X, YANG W, OUYANG W, et al. Multi-context attention for human pose estimation[C]// CVPR 2017. 1831-1840.
[45] YANG W, LI S, OUYANG W, et al. Learning feature pyramids for human pose estimation[C]// ICCV 2017. 1281-1290.
[46] PAPANDREOU G, ZHU T, KANAZAWA N, et al. Towards accurate multi-person pose estimation in the wild[C]//CVPR 2017. 4903-4911.
[47] JIN S, LIU W, XIE E, et al. Differentiable hierarchical graph grouping for multi-person pose estimation[C]//European Conference on Computer Vision. Springer, 2020: 718-734.
[48] KOCABAS M, KARAGOZ S, AKBAS E. Multiposenet: Fast multi-person pose estimation using pose residual network[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 417-433.
[49] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[50] OSOKIN D. Real-time 2d multi-person pose estimation on cpu: Lightweight openpose[J]. arXiv preprint arXiv:1811.12004, 2018.
[51] LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection [C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 2117-2125.
[52] REN S, HE K, GIRSHICK R, et al. Faster r-cnn: Towards real-time object detection with region proposal networks[J]. arXiv preprint arXiv:1506.01497, 2015.
[53] PENG C, XIAO T, LI Z, et al. Megdet: A large mini-batch object detector[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2018: 6181-6189.
[54] YU C, XIAO B, GAO C, et al. Lite-hrnet: A lightweight high-resolution network[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 10440-10450.
[55] NEFF C, SHETH A, FURGURSON S, et al. Efficienthrnet: Efficient scaling for lightweight high-resolution multi-person pose estimation[J]. arXiv preprint arXiv:2007.08090, 2020.
[56] IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(7): 1325-1339.
[57] JOO H, SIMON T, LI X, et al. Panoptic studio: A massively multiview system for social inter- action capture[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 41(1): 190-204.
[58] BERGTHOLDT M, KAPPES J, SCHMIDT S, et al. A study of parts-based object class detection using complete graphs[J]. International journal of computer vision, 2010, 87(1-2): 93.
[59] ANDRILUKA M, ROTH S, SCHIELE B. Discriminative appearance models for pictorial struc- tures[J]. International journal of computer vision, 2012, 99(3): 259-280.
[60] ANDREW A M. Multiple view geometry in computer vision[J]. Kybernetes, 2001.
[61] CHEN C, RAMANAN D. 3D human pose estimation = 2D pose estimation + matching[C]// CVPR 2017. 5759-5767.
[62] LI S, CHAN A B. 3D human pose estimation from monocular images with deep convolutional neural network[C]//Asian Conference on Computer Vision. Springer, 2014: 332-347.
[63] FU H, GONG M, WANG C, et al. Deep ordinal regression network for monocular depth estimation[C]//CVPR 2018. 2002-2011.
[64] PAVLAKOS G, ZHOU X, DERPANIS K G, et al. Coarse-to-fine volumetric prediction for single-image 3d human pose[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 7025-7034.
[65] NIBALI A, HE Z, MORGAN S, et al. 3d human pose estimation with 2d marginal heatmaps [C]//2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019: 1477-1485.
[66] MARTINEZ J, HOSSAIN R, ROMERO J, et al. A simple yet effective baseline for 3D human pose estimation[C]//ICCV 2017. 2659-2668.
[67] TEKIN B, MÁRQUEZ-NEILA P, SALZMANN M, et al. Learning to fuse 2d and 3d image cues for monocular body pose estimation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 3941-3950.
[68] CI H, WANG C, MA X, et al. Optimizing network structure for 3d human pose estimation[C]// Proceedings of the IEEE/CVF international conference on computer vision. 2019: 2262-2271.
[69] ROGEZ G, WEINZAEPFEL P, SCHMID C. Lcr-net: Localization-classification-regression for human pose[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 3433-3441.
[70] MOON G, CHANG J Y, LEE K M. Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2019: 10133-10142.
[71] WANG C, LI J, LIU W, et al. Hmor: Hierarchical multi-person ordinal relations for monocular multi-person 3d pose estimation[C]//European Conference on Computer Vision. Springer, 2020: 242-259.
[72] MEHTA D, SOTNYCHENKO O, MUELLER F, et al. Single-shot multi-person 3d body pose estimation from monocular rgb input[J]. arXiv preprint arXiv:1712.03453, 2017.
[73] NIE X, FENG J, ZHANG J, et al. Single-stage multi-person pose machines[C]//ICCV 2019. 6951-6960.
[74] DONG J, JIANG W, HUANG Q, et al. Fast and robust multi-person 3D pose estimation from multiple views[C]//CVPR 2019. 7792-7801.
[75] CHEN L, AI H, CHEN R, et al. Cross-view tracking for multi-human 3D pose estimation at over 100 FPS[C]//CVPR 2020. 3279-3288.
[76] HUANG C, JIANG S, LI Y, et al. End-to-end dynamic matching network for multi-view multi-person 3D pose estimation[C]//European Conference on Computer Vision. Springer, 2020: 477-493.
[77] FABBRI M, LANZI F, CALDERARA S, et al. Compressed volumetric heatmaps for multi-person 3D pose estimation[C]//CVPR 2020. 7204-7213.
[78] MEHTA D, SOTNYCHENKO O, MUELLER F, et al. Single-shot multi-person 3d pose estimation from monocular rgb[C]//2018 International Conference on 3D Vision (3DV). IEEE, 2018: 120-130.
[79] ZHANG Z, WANG C, QIU W, et al. Adafuse: Adaptive multiview fusion for accurate human pose estimation in the wild[J]. International Journal of Computer Vision, 2021, 129(3): 703-718.
[80] MOON G, CHANG J Y, LEE K M. V2v-posenet: Voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 5079-5088.
[81] TU H, WANG C, ZENG W. Voxelpose: Towards multi-camera 3d human pose estimation in wild environment[C]//European Conference on Computer Vision. Springer, 2020: 197-212.
[82] SIFRE L, MALLAT S. Rigid-motion scattering for texture classification[J]. arXiv preprint arXiv:1403.1687, 2014.
[83] XIE S, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 1492-1500.
[84] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 1-9.
[85] ZHANG T, QI G J, XIAO B, et al. Interleaved group convolutions[C]//Proceedings of the IEEE international conference on computer vision. 2017: 4373-4382.
[86] ZHOU D, HOU Q, CHEN Y, et al. Rethinking bottleneck structure for efficient mobile network design[C]//European Conference on Computer Vision. Springer, 2020: 680-697.
[87] HOWARD A, SANDLER M, CHU G, et al. Searching for mobilenetv3[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2019: 1314-1324.
[88] MA N, ZHANG X, ZHENG H, et al. Shufflenet V2: practical guidelines for efficient CNN architecture design[J/OL]. CoRR, 2018, abs/1807.11164. http://arxiv.org/abs/1807.11164.
[89] YU C, WANG J, PENG C, et al. Bisenet: Bilateral segmentation network for real-time semantic segmentation[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 325-341.
[90] ZHAO H, QI X, SHEN X, et al. Icnet for real-time semantic segmentation on high-resolution images[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 405-420.
[91] ZOPH B, LE Q V. Neural architecture search with reinforcement learning[J]. arXiv preprint arXiv:1611.01578, 2016.
[92] YAO X. Evolutionary artificial neural networks[J]. International journal of neural systems, 1993, 4(03): 203-222.
[93] MCNALLY W, VATS K, WONG A, et al. EvoPose2D: Pushing the boundaries of 2D human pose estimation using neuroevolution[Z]. 2020.
[94] REAL E, MOORE S, SELLE A, et al. Large-scale evolution of image classifiers[C]// International Conference on Machine Learning. PMLR, 2017: 2902-2911.
[95] SANKARARAMAN K A, DE S, XU Z, et al. The impact of neural network overparameterization on gradient confusion and stochastic gradient descent[C]//International Conference on Machine Learning. PMLR, 2020: 8469-8479.
[96] LI J, WANG C, ZHU H, et al. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 10863-10872.
[97] BELAGIANNIS V, AMIN S, ANDRILUKA M, et al. 3d pictorial structures for multiple human pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014: 1669-1676.
[98] BELAGIANNIS V, WANG X, SCHIELE B, et al. Multiple human pose estimation with temporally consistent 3d pictorial structures[C]//European Conference on Computer Vision. Springer, 2014: 742-754.
[99] BELAGIANNIS V, AMIN S, ANDRILUKA M, et al. 3d pictorial structures revisited: Multiple human pose estimation[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 38(10): 1929-1942.
[100] ERSHADI-NASAB S, NOURY E, KASAEI S, et al. Multiple human 3d pose estimation from multiview images[J]. Multimedia Tools and Applications, 2018, 77(12): 15573-15601.

Degree Assessment Subcommittee
Department of Computer Science and Engineering
Chinese Library Classification Number
TM301.2
Source Repository
Manually submitted
Output Type: Dissertation
Identifier: http://sustech.caswiz.com/handle/2SGJ60CL/416438
Collection: College of Engineering / Department of Computer Science and Engineering
Recommended Citation (GB/T 7714)
吴钰. 轻量级多人人体姿态实时重建[D]. 深圳. 南方科技大学, 2022.
Files in This Item
File Name / Size: 11930386-吴钰-计算机科学与工程 (4887KB); Access: Restricted

Unless otherwise stated, all content in this system is protected by copyright, and all rights are reserved.