Title

基于 RGB 相机与稀疏惯性数据融合的人体位姿还原

Alternative Title
HUMAN POSE ESTIMATION BASED ON RGB CAMERA AND SPARSE INERTIAL DATA FUSION
Name
方垲文
Name in Pinyin
FANG Kaiwen
Student ID
12132250
Degree Type
Master
Degree Discipline
08 Engineering
Subject Category / Professional Degree Category
08 Engineering
Supervisor
杨再跃
Supervisor's Affiliation
School of System Design and Intelligent Manufacturing
Thesis Defense Date
2024-05-09
Thesis Submission Date
2024-07-03
Degree-Granting Institution
Southern University of Science and Technology
Place of Degree Conferral
Shenzhen
Abstract

Human pose estimation that fuses sparse inertial and visual data is a motion capture technique with great development potential. Where purely vision-based tracking or inertial measurement systems are limited in accuracy, robustness, or cost, sensor fusion offers a lightweight, high-accuracy alternative. However, due to the diversity of skeleton sizes, the limitations of sparse data sources, and unpredictable behavior across motion types, visual-inertial pose estimation still faces accuracy and stability challenges in practical scenarios. Targeting general-purpose, lightweight motion capture, this thesis studies the problem from three angles: multi-sensor data confidence, biomechanical constraints, and stereo-vision human pose fusion optimization.

The thesis formulates human pose reconstruction as an optimization problem and performs its parameter selection with data-driven confidence prior models for the visual and sparse inertial data. Based on the statistical characteristics of the sensor data under different motion patterns, fusion weights that combine prior knowledge with actual measurements are defined, improving the reliability of the fusion process. A nonlinear multivariate optimization method for the sparse-inertial and visual fusion objective is then constructed. In tests, the visual-inertial fused pose reconstruction method achieves a full-body pose error of 10.21°, matching the accuracy of comparable approaches while using half as many inertial sensors.

To address the poor continuity of visual-inertial position estimation, the thesis introduces human-environment contact constraints and displacement continuity constraints from a biomechanical perspective, bounding the predicted root position through foot contact position features and finite differences of the root velocity. In addition, the human silhouette contour is extracted from the camera image and matched to obtain a position reference that is fed into the pose fusion optimization. In tests, the biomechanical constraint terms reduce the root position error to 14.98 cm and the start-to-end closure error to 8.18 cm, with clearly improved accuracy and continuity of the root displacement.

Finally, exploiting the relationship between the stereo disparity map and human depth, the thesis proposes pose estimation based on fusing stereo vision with sparse inertial data, resolving the overall trajectory drift caused by inaccurate monocular depth estimation. Human keypoint positions extracted by the two cameras at the same instant are also checked against the confidence model, supplementing the information sources of the visual confidence prior while improving its reliability. Experiments show that the method reconstructs human pose well across motion patterns: the root position error of 6.17 cm is more than 50% lower than the monocular scheme, and the full-body pose error of 8.23° is 0.6° lower than comparable approaches. Compared with existing methods, this work offers advantages in lightweight motion capture accuracy and cost, providing new directions for related research.
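As a rough illustration of the confidence-weighted fusion objective the abstract describes, the following Python sketch combines visual and sparse inertial data terms with biomechanical penalty terms in one nonlinear least-squares problem. All dimensions, weights, and inputs below are hypothetical placeholders, not the thesis's actual formulation:

import numpy as np
from scipy.optimize import least_squares

# Toy dimensions: 10 frames, 8 tracked joints, 3 sparse IMUs.
# These numbers are illustrative only.
T, J, S = 10, 8, 3
rng = np.random.default_rng(0)

vis_kp  = rng.normal(size=(T, J, 3))            # visual 3D keypoint estimates
imu_ref = rng.normal(size=(T, S, 3))            # targets derived from sparse IMUs
w_vis   = rng.uniform(0.2, 1.0, size=(T, J))    # visual confidence prior per keypoint
w_imu   = rng.uniform(0.5, 1.0, size=(T, S))    # inertial confidence prior per sensor
contact = rng.uniform(size=T) > 0.5             # hypothetical foot-contact flags

def residuals(x):
    pose = x.reshape(T, J, 3)                   # joint positions over time
    # Confidence-weighted visual and inertial data terms.
    r_vis = (w_vis[..., None] * (pose - vis_kp)).ravel()
    r_imu = (w_imu[..., None] * (pose[:, :S] - imu_ref)).ravel()
    # Biomechanical terms: displacement continuity, and zero root
    # velocity while a foot is in contact with the ground.
    vel = np.diff(pose[:, 0], axis=0)           # root velocity (finite differences)
    r_smooth  = 0.5 * vel.ravel()
    r_contact = (contact[1:, None] * vel).ravel()
    return np.concatenate([r_vis, r_imu, r_smooth, r_contact])

sol = least_squares(residuals, vis_kp.ravel())  # initialize from the visual estimate
print("final cost:", sol.cost)

In the actual system the optimization variables would be skeleton pose parameters mapped through forward kinematics rather than free joint positions; the sketch only shows how confidence priors and biomechanical penalties can enter a single objective.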
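The stereo-vision extension rests on the standard depth-from-disparity relation for a rectified camera pair (the thesis's exact notation is not given here):

\[
Z = \frac{f\,B}{d}, \qquad d = u_L - u_R,
\]

where \(Z\) is the depth of a body point, \(f\) the focal length in pixels, \(B\) the stereo baseline, and \(u_L, u_R\) the horizontal image coordinates of its projections in the left and right views. A larger baseline or focal length makes the same depth produce a larger, more reliably measurable disparity, which is why the stereo setup reduces the depth-axis drift that the abstract attributes to the monocular scheme.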

Keywords
Language
Chinese
Training Category
Independent
Year of Enrollment
2021
Year of Degree Conferral
2024-06

Degree Assessment Subcommittee
Mechanics
CLC Number
TP391
Source Repository
Manual submission
Output Type
Dissertation
Item Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/778991
Collection
College of Engineering: School of System Design and Intelligent Manufacturing
Recommended Citation
GB/T 7714
方垲文. 基于 RGB 相机与稀疏惯性数据融合的人体位姿还原[D]. 深圳: 南方科技大学, 2024.
Files in This Item
12132250-方垲文-系统设计与智能(4755KB), restricted access