[1] OTTER D W, MEDINA J R, KALITA J K. A Survey of the Usages of Deep Learning for Natural Language Processing[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(2): 604-624.
[2] GYSEL P, PIMENTEL J, MOTAMEDI M, et al. Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018: 5784-5789.
[3] SANTOS P D, ALVES J C, FERREIRA J C. An FPGA Array for Cellular Genetic Algorithms: Application to the Minimum Energy Broadcast Problem[J]. Microprocessors and Microsystems, 2018, 58: 1-12.
[4] CAI H, WANG T, WU Z, et al. On-Device Image Classification with Proxyless Neural Architecture Search and Quantization-Aware Fine-Tuning[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019.
[5] WANG K, LIU Z, LIN Y, et al. HAQ: Hardware-Aware Automated Quantization With Mixed Precision[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[6] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet Classification with Deep Convolutional Neural Networks[J]. Advances in Neural Information Processing Systems, 2012.
[7] SIMONYAN K, ZISSERMAN A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[8] SZEGEDY C, LIU W, JIA Y, et al. Going Deeper with Convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[9] IOFFE S, SZEGEDY C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift[C]//Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015.
[10] HE K, ZHANG X, REN S, et al. Deep Residual Learning for Image Recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[11] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely Connected Convolutional Networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[12] IANDOLA F N, HAN S, MOSKEWICZ M W, et al. SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size[J]. arXiv preprint arXiv:1602.07360, 2016.
[13] HOWARD A G, ZHU M, CHEN B, et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications[J]. arXiv preprint arXiv:1704.04861, 2017.
[14] ZHANG X, ZHOU X, LIN M, et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices[J]. arXiv preprint arXiv:1707.01083, 2017.
[15] ZOPH B, LE Q V. Neural Architecture Search with Reinforcement Learning[J]. arXiv preprint arXiv:1611.01578, 2016.
[16] MILLER G. Designing Neural Networks Using Genetic Algorithms[C]//Proceedings of the 3rd International Conference on Genetic Algorithms, 1989.
[17] NICKOLLS J R, BUCK I, GARLAND M, et al. Scalable Parallel Programming with CUDA[C]//IEEE Hot Chips 20 Symposium, 2008.
[18] JOUPPI N P, YOUNG C, PATIL N, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit[C]//Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, Canada, 2017.
[19] CHEN Y H, EMER J, SZE V, et al. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks[C]//Proceedings of the 43rd ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), Seoul, South Korea, 2016.
[20] CHEN Y H, KRISHNA T, EMER J S, et al. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks[J]. IEEE Journal of Solid-State Circuits, 2017, 52(1): 127-138.
[21] CHEN T S, DU Z D, SUN N H, et al. DianNao: A Small-Footprint High Throughput Accelerator for Ubiquitous Machine-Learning[J]. ACM SIGPLAN Notices, 2014, 49(4): 269-283.
[22] LUO T, LIU S L, LI L, et al. DaDianNao: A Neural Network Supercomputer[J]. IEEE Transactions on Computers, 2017, 66(1): 73-88.
[23] DU Z D, FASTHUBER R, CHEN T S, et al. ShiDianNao: Shifting Vision Processing Closer to the Sensor[C]//Proceedings of the 42nd ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2015.
[24] LIU D F, CHEN T S, LIU S L, et al. PuDianNao: A Polyvalent Machine Learning Accelerator[J]. ACM SIGPLAN Notices, 2015, 50(4): 369-381.
[25] LIU S L, DU Z D, TAO J H, et al. Cambricon: An Instruction Set Architecture for Neural Networks[C]//Proceedings of the 43rd ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), Seoul, South Korea, 2016.
[26] MO H, ZHU W, HU W, et al. A 28nm 12.1TOPS/W Dual-Mode CNN Processor Using Effective-Weight-Based Convolution and Error Compensation-Based Prediction[C]//Proceedings of the 2021 IEEE International Solid-State Circuits Conference, 2021.
[27] PEI J, DENG L, SONG S, et al. Towards artificial general intelligence with hybrid Tianjic chip architecture[J]. Nature, 2019, 572(7767): 106-110.
[28] BIRADAR V B, VISHWAS P G, CHETAN C S, et al. Design and Performance Analysis of Modified Unsigned Braun and Signed Baugh-Wooley Multiplier[C]//Proceedings of the International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques, Mysuru, India, 2017.
[29] SWEE K L S, HIUNG L H. Performance Comparison Review of Radix-Based Multiplier Designs[C]//Proceedings of the 4th International Conference on Intelligent and Advanced Systems, Kuala Lumpur, Malaysia, 2012.
[30] YKUNTAM Y D, PAVANI K, SALADI K. Design and Analysis of High Speed Wallace Tree Multiplier Using Parallel Prefix Adders for VLSI Circuit Designs[C]//Proceedings of the 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2020.
[31] RIAZ M H, AHMED S A, JAVAID Q, et al. Low Power 4x4 Bit Multiplier Design Using Dadda Algorithm and Optimized Full Adder[C]//Proceedings of the 15th International Bhurban Conference on Applied Sciences and Technology, Islamabad, Pakistan, 2018.
[32] PARK J, KIM Y. Design and Implementation of Ternary Carry Lookahead Adder on FPGA[C]//Proceedings of the 20th International Conference on Electronics, Information, and Communication (ICEIC), South Korea, 2021.
[33] KIM T, JAO W. Circuit Optimization Using Carry-Save-Adder Cells[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1998.
[34] PARMAR S, SINGH K P. Design of High Speed Hybrid Carry Select Adder[C]//Proceedings of the 2013 IEEE 3rd International Advance Computing Conference (IACC), 2013.
[35] REN P Z, XIAO Y, CHANG X J, et al. A Comprehensive Survey of Neural Architecture Search: Challenges and Solutions[J]. ACM Computing Surveys, 2021, 54(4).
[36] PARASHAR A, RHU M, MUKKARA A, et al. SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks[C]//Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
[37] LIU Z G, WHATMOUGH P, MATTINA M. Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration[J]. arXiv preprint arXiv:2009.02381, 2020.
[38] UJWALA D, MATHAN N. Review on Performance of Multipliers[J]. Research Journal of Pharmaceutical, Biological and Chemical Sciences, 2017, 8(2): 2668-2672.
[39] HARIKA K, SWETHA B V, RENUKA B, et al. Analysis of Different Multiplication Algorithms & FPGA Implementation[J]. IOSR Journal of VLSI and Signal Processing, 2014, 4(2): 29-35.
[40] PRAJWAL N, AMARESHA S K, YELLAMPALLI S S. Low Power ASIC Implementation of Signed and Unsigned Wallace-Tree with Vedic Multiplier Using Compressors[C]//Proceedings of the International Conference on Smart Technologies for Smart Nation (SmartTechCon), Bengaluru, India, 2017.
[41] KUMM M, GUSTAFSSON O, DE DINECHIN F, et al. Karatsuba with Rectangular Multipliers for FPGAs[C]//Proceedings of the 25th International Symposium on Computer Arithmetic, Amherst, MA, 2018.
[42] DAI L, CHENG Q, WANG Y, et al. An Energy-Efficient Bit-Split-and-Combination Systolic Accelerator for NAS-Based Multi-Precision Convolution Neural Networks[C]//Proceedings of the 27th Asia and South Pacific Design Automation Conference (ASP-DAC), 2022.
[43] LI K, ZHOU J, WANG Y, et al. A Precision-Scalable Energy-Efficient Bit Split-and-Combination Vector Systolic Accelerator for NAS-Optimized DNNs on Edge[C]//Proceedings of the Design, Automation and Test in Europe Conference (DATE), 2022.
[44] SHARMA H, PARK J, SUDA N, et al. Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks[C]//Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, 2018.
[45] CAMUS V, MEI L, ENZ C, et al. Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural Network Processing[J]. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019, 9(4): 697-711.
[46] JO J, KIM S, PARK I C. Energy-Efficient Convolution Architecture Based on Rescheduled Dataflow[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2018, 65(12): 4196-4207.
[47] SOHN J, SWARTZLANDER E. A Fused Floating-Point Three-Term Adder[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2014, 61(10): 2842-2850.
[48] VEERAMACHANENI S, KRISHNA K, AVINASH L, et al. Novel Architecture for High-Speed and Low-Power 3-2, 4-2 and 5-2 Compressors[C]//Proceedings of the International Conference on VLSI Design, 2007.
[49] NAJAFI A, MAZLOOM-NEZHAD B, NAJAFI A. Low-Power and High-Speed 4-2 Compressor[C]//Proceedings of the International Convention on Information and Communication Technology, Electronics and Microelectronics, 2013.
[50] KUMAR S, KUMAR M, et al. 4-2 Compressor Design with New XOR-XNOR Module[C]//Proceedings of the Fourth International Conference on Advanced Computing & Communication Technologies, 2014.
[51] SHOMRON G, HOROWITZ T, WEISER U. SMT-SA: Simultaneous Multithreading in Systolic Arrays[J]. IEEE Computer Architecture Letters, 2019, 18(2): 99-102.
[52] SHARIFY S, LASCORZ A D, MAHMOUD M, et al. Laconic Deep Learning Inference Acceleration[C]//Proceedings of the 46th International Symposium on Computer Architecture (ISCA), Phoenix, AZ, 2019.