Title

Development of a Deep Learning Based Edge VSLAM System

Name
任宏伟
Name (Pinyin)
REN Hongwei
Student ID
11930185
Degree Type
Master
Degree Major
0809 Electronic Science and Technology
Discipline Category / Professional Degree Category
08 Engineering
Advisor
余浩 (YU Hao)
Advisor Affiliation
School of Microelectronics (深港微电子学院)
Thesis Defense Date
2022-05-12
Thesis Submission Date
2022-06-13
Degree-Granting Institution
Southern University of Science and Technology
Degree-Granting Location
Shenzhen
Abstract

How to endow robots with a human-like brain is an important question studied jointly in academic research and industrial applications. Over the past decade of rapid progress in artificial intelligence algorithms, the field of machine intelligence has produced many remarkable results; however, the growth of computing power has slowed, raising the complexity of intelligent models has hit a bottleneck, and system development has gradually shifted from centralized complexity toward the distributed edge. The key to intelligent robots lies in environment perception and behavioral decision-making. This thesis focuses on environment perception and studies how to deploy a simultaneous localization and mapping system built on highly complex deep learning algorithms onto edge devices with limited computing and storage resources, that is, the development of a deep learning based edge VSLAM (Visual Simultaneous Localization and Mapping) system. The thesis concentrates on designing two VSLAM modules, visual odometry and loop closure detection, and, building on deep learning, adopts tensorized model compression to realize a VSLAM system with accurate, high-rate inference on the edge.
This thesis proposes ATFVO (Attentive Tensor-compressed Optical Flow Visual Odometry), a monocular visual odometry method built on a convolutional neural network and a tensor-decomposed attentive LSTM (Long Short-Term Memory). By fusing spatial and temporal features, the algorithm regresses the camera body's pose changes more accurately. The visual odometry module takes RGB images as input, extracts optical flow with an optical flow network, performs feature extraction and dimensionality reduction on the flow, feeds the reduced flow features into the recurrent network at the current time step, and outputs the six-degree-of-freedom pose change for that time step. Finally, ATFVO is deployed and tested on edge devices by tensorizing and decomposing the network; experiments show that the model maintains accuracy while running inference at a high rate.
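The following is a minimal PyTorch sketch of such a flow-feature-plus-attentive-LSTM pipeline. The module names, layer sizes, and the simple soft-attention formulation are illustrative assumptions only; the actual ATFVO optical flow network and the tensor compression of the recurrent weights are not reproduced here.

import torch
import torch.nn as nn

class FlowFeatureEncoder(nn.Module):
    # CNN that extracts and reduces features from a 2-channel optical flow map.
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(128, out_dim)

    def forward(self, flow):                        # flow: (B, 2, H, W)
        return self.fc(self.conv(flow).flatten(1))  # (B, out_dim)

class AttentiveFlowOdometry(nn.Module):
    # LSTM over per-frame flow features with soft attention, regressing 6-DoF pose deltas.
    def __init__(self, feat_dim=256, hidden=512):
        super().__init__()
        self.encoder = FlowFeatureEncoder(feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)            # soft attention weight per time step
        self.head = nn.Linear(hidden, 6)            # 3 translation + 3 rotation components

    def forward(self, flow_seq):                    # flow_seq: (B, T, 2, H, W)
        b, t = flow_seq.shape[:2]
        feats = self.encoder(flow_seq.flatten(0, 1)).view(b, t, -1)
        h, _ = self.lstm(feats)                     # (B, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)      # (B, T, 1), attention over time
        return self.head(w * h)                     # per-step 6-DoF pose change, (B, T, 6)

poses = AttentiveFlowOdometry()(torch.randn(2, 5, 2, 64, 64))  # -> shape (2, 5, 6)

In the actual ATFVO, the flow maps are produced by a dedicated optical flow network from consecutive RGB frames, and the recurrent weights are tensor-compressed before edge deployment.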
This thesis also proposes TLCD (Transformer Based Loop Closure Detection), a loop closure detection method built on an attention-based Transformer network. TLCD encodes images in the spirit of a high-level bag-of-words model and detects loop closures through principal component analysis and sequence matching. The module takes RGB images as input, encodes each image into a descriptor vector with the Transformer network, applies dimensionality reduction and similarity matching to the descriptors, and finally outputs the loop closure result. Comparative experiments show that this loop closure detection module outperforms convolutional neural network baselines overall, demonstrating its strong image encoding capability.
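The sketch below illustrates only the descriptor post-processing half of such a loop closure module: per-frame global descriptors (random stand-ins here for Transformer embeddings) are reduced by principal component analysis and compared by cosine similarity over short aligned sequences. The descriptor dimension, temporal gap, sequence length, and threshold are illustrative assumptions rather than the thesis settings.

import numpy as np

def pca_reduce(desc, keep=64):
    # Project the N x D descriptor matrix onto its top `keep` principal components.
    centered = desc - desc.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:keep].T

def loop_candidates(desc, min_gap=50, seq_len=5, threshold=0.85):
    # Return (i, j) frame pairs whose short descriptor sequences are mutually similar.
    d = desc / np.linalg.norm(desc, axis=1, keepdims=True)
    sim = d @ d.T                                   # pairwise cosine similarity
    pairs = []
    for i in range(seq_len, len(desc)):
        for j in range(seq_len, i - min_gap):       # ignore temporally adjacent frames
            score = np.mean([sim[i - k, j - k] for k in range(seq_len)])
            if score > threshold:
                pairs.append((i, j))
    return pairs

# Toy usage with random stand-in descriptors; a real system would first encode each
# RGB frame into a global descriptor with the Transformer network.
descriptors = pca_reduce(np.random.randn(200, 768))
print(loop_candidates(descriptors)[:5])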

Other Abstract

How to endow robots with a human-like brain is an important issue. With the rapid progress of artificial intelligence algorithms in the past decade, many remarkable achievements have been made in the field of robot intelligence. However, the growth of computing power has slowed, increasing the complexity of intelligent models has hit a bottleneck, and the evolution of systems has gradually shifted from centralized complexity toward the distributed edge.
The key to robot intelligence lies in perceiving the environment and making behavioral decisions. This thesis mainly focuses on environment perception and studies how to deploy a complex, deep-learning-based simultaneous localization and mapping system on edge devices whose computing and storage resources are limited. It designs two modules of the VSLAM system, visual odometry and loop closure detection, and, based on deep learning, adopts a tensorized model compression method so that the modules remain accurate while supporting high-frame-rate inference.
In this thesis, ATFVO, a monocular visual odometry based on a convolutional neural network and a tensor-compressed attentive LSTM, is proposed. The algorithm combines spatial and temporal features to better regress the pose changes of the camera body. The module takes RGB images as input, extracts optical flow information through an optical flow network, performs feature extraction and dimensionality reduction on the flow, and uses the reduced optical flow features as the input of the RNN, which outputs the six-degree-of-freedom pose change at the current time step. The optical flow features describe the positional relationships of the physical world as observed by a single camera, and the T-LSTM with a soft attention mechanism effectively conveys the temporal relationship between the pose at the current moment and the pose at the previous moment. Finally, ATFVO is deployed on edge devices by tensorizing and decomposing the network and is tested there. Experiments show that the model can run inference at a high rate while maintaining accuracy.
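To make the tensor-compression step concrete, the sketch below shows one standard way such a compression can be realized: a dense weight is viewed as a higher-order tensor and factorized into small tensor-train cores by sequential SVDs. The shapes and the rank are illustrative assumptions; the factorization and ranks actually used in the thesis are not reproduced here.

import numpy as np

def tt_decompose(tensor, rank):
    # Sequential-SVD tensor-train decomposition of a dense ndarray with a fixed maximum rank.
    cores, r_prev, t = [], 1, tensor
    dims = tensor.shape
    for d in dims[:-1]:
        u, s, vt = np.linalg.svd(t.reshape(r_prev * d, -1), full_matrices=False)
        r = min(rank, len(s))
        cores.append(u[:, :r].reshape(r_prev, d, r))
        t, r_prev = np.diag(s[:r]) @ vt[:r], r
    cores.append(t.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    # Contract the cores back into the full tensor to check the approximation.
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))

# Example: a 256 x 1024 fully connected weight viewed as a (16, 16, 32, 32) tensor.
W = np.random.randn(16, 16, 32, 32)
cores = tt_decompose(W, rank=8)
print(W.size, sum(c.size for c in cores))           # 262144 dense parameters vs. 3456 in TT cores
print(np.linalg.norm(tt_reconstruct(cores) - W) / np.linalg.norm(W))

Note that a random tensor has no low-rank structure, so the reconstruction error printed here is large; the example only shows the parameter-count mechanics, whereas trained network weights typically admit far more accurate low-rank factorizations.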

Keywords
Other Keywords
Language
Chinese
Training Category
Independently trained
Year of Enrollment
2019
Year Degree Conferred
2022-06
Degree Assessment Subcommittee
School of Microelectronics (深港微电子学院)
Chinese Library Classification
TP391.41
Source Repository
Manually submitted
Item Type
Dissertation
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/335688
Collection
南方科技大学-香港科技大学深港微电子学院筹建办公室
Recommended Citation
GB/T 7714
REN Hongwei. Development of a Deep Learning Based Edge VSLAM System[D]. Shenzhen: Southern University of Science and Technology, 2022.