Title

一种基于NAS优化的多精度稀疏神经网络加速器

Alternative Title
A Sparse and Mixed-Precision Accelerator for NAS Optimized Convolutional Neural Networks
Name
刘禹岑
Name (Pinyin)
LIU Yucen
Student ID
12132461
Degree Type
Master's
Degree Program
0856 Materials and Chemical Engineering
Discipline Category / Professional Degree Category
0856 Materials and Chemical Engineering
Supervisor
余浩 (YU Hao)
Supervisor's Affiliation
School of Microelectronics (深港微电子学院)
Thesis Defense Date
2023-05-18
Thesis Submission Date
2023-06-21
Degree-Granting Institution
Southern University of Science and Technology (南方科技大学)
Degree-Granting Location
Shenzhen
Abstract

Today, neural networks are being applied ever more widely in fields such as image recognition, natural language processing, speech recognition, and behavior prediction, and there is a growing need to deploy them on edge devices. Model compression makes it possible to run neural network inference on edge devices, but the changes in data and computation patterns introduced by compression prevent the computation schemes of existing accelerators from fully exploiting the results of compression to improve inference efficiency. In addition, neural architecture search (NAS) algorithms can optimize a compressed network model by configuring a suitable compression scheme for each layer, greatly reducing the computation and storage requirements of the whole network while causing only a slight drop in inference accuracy.

Existing accelerator designs cannot adapt well to the complex neural networks produced by NAS algorithms, which combine multiple precisions with multiple sparsity configurations. To meet this class of network compression requirements, this thesis presents a sparse and mixed-precision neural network accelerator (SMP) system with the following features. First, the accelerator supports four data precision modes (1/2/4/8-bit) and structured weight-sparse computation at multiple sparsity levels such as 50%, 75%, and 87.5%, and it introduces an efficient compression format for sparse data at low precision that reduces the overhead of sparse addressing. Second, the accelerator adopts a novel vector systolic computing array, which achieves better timing than a conventional parallel computing array and lower latency than an atomic systolic array. Third, to accommodate on-chip storage of data with multiple precisions and sparsity levels and to reduce wasted SRAM capacity, this thesis proposes an SRAM hybrid splicing scheme; compared with a storage strategy without hybrid splicing, the average computing throughput improves to 3.33x. Finally, typical ASIC accelerators have rigid structures and tightly coupled modules, whereas this accelerator is highly extensible: its internal control unit and control bus protocol decouple the modules, so additional operator units can easily be integrated to support more types of network computation.
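
To make the structured-sparsity idea concrete, the following is a minimal Python sketch of the general technique, assuming a group size of 4 (i.e. 75% structured sparsity) and 4-bit weights; the function names, group size, and data layout are illustrative assumptions, not the accelerator's actual on-chip format. Each group keeps one nonzero weight stored as its quantized value plus a 2-bit in-group index, so the metadata cost is log2(group) bits per retained weight rather than a full sparse address.

```python
# Illustrative sketch only (assumed format): group-structured weight sparsity
# where each group of G weights keeps exactly one nonzero, stored as a
# low-precision value plus a log2(G)-bit in-group index.
import math
from typing import List, Tuple

def compress_structured(weights: List[int], group: int = 4) -> List[Tuple[int, int]]:
    """Keep the largest-magnitude weight of every group; return (index, value) pairs."""
    assert len(weights) % group == 0
    packed = []
    for g in range(0, len(weights), group):
        chunk = weights[g:g + group]
        idx = max(range(group), key=lambda i: abs(chunk[i]))
        packed.append((idx, chunk[idx]))
    return packed

def decompress_structured(packed: List[Tuple[int, int]], group: int = 4) -> List[int]:
    """Expand (index, value) pairs back into a dense vector with zeros."""
    dense = []
    for idx, val in packed:
        chunk = [0] * group
        chunk[idx] = val
        dense.extend(chunk)
    return dense

def compressed_bits(n_groups: int, group: int, weight_bits: int) -> int:
    """Bits for the compressed stream: value bits plus in-group index bits."""
    return n_groups * (weight_bits + int(math.log2(group)))

if __name__ == "__main__":
    w = [0, 3, 0, 0, -2, 0, 0, 0, 0, 0, 0, 7, 0, 0, 5, 0]  # already 75% structured-sparse
    packed = compress_structured(w, group=4)
    assert decompress_structured(packed, group=4) == w
    # 16 dense 4-bit weights = 64 bits vs. 4 groups * (4 + 2) = 24 bits compressed.
    print(packed, compressed_bits(len(packed), group=4, weight_bits=4), "bits")
```

Under these assumptions the 16-weight example drops from 64 bits dense to 24 bits compressed, and the 2-bit in-group index replaces what would otherwise be a per-nonzero sparse address.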

The accelerator design was verified through front-end functional and power simulation with Synopsys EDA tools in a 28 nm process. Netlist simulation was then performed with a mixed-precision, mixed-sparsity NAS-VGG16 network whose Top-1 accuracy is 67.7%: the peak energy efficiency of the 4-bit layers is 10.89 TOPS/W, the peak energy efficiency of the 87.5%-sparsity 8-bit layers is 23.90 TOPS/W, and the average energy efficiency over all layers in netlist simulation is 15 TOPS/W. While retaining generality and extensibility, the accelerator achieves a 1.07-3.89x energy efficiency improvement over other sparse accelerators.

Abstract (Other Language)

Nowadays, neural networks are used more and more widely in fields such as image recognition, natural language processing, speech recognition, and behavior prediction, and there is a growing need to deploy them on edge devices. Model compression makes it possible to deploy neural network inference tasks on edge devices, but the changes in data and computation patterns brought about by network compression make it difficult for existing accelerators to effectively utilize the results of compression to improve inference efficiency. In addition, neural architecture search (NAS) algorithms can optimize compressed network models, configuring a suitable compression scheme for each network layer and greatly reducing the computation and storage requirements of the entire network while only slightly reducing its inference accuracy.

Current accelerator designs cannot adapt well to the complex neural networks produced by NAS algorithms, which are configured with multiple precisions and multiple sparsities. To meet these network compression requirements and efficiently support network layers with many different configurations, this paper proposes a sparse and multi-precision neural network accelerator (SMP) system with the following characteristics. Firstly, the accelerator supports four data precision modes (1/2/4/8-bit) and structured weight-sparse computation at multiple sparsity levels such as 50%, 75%, and 87.5%, together with an efficient compression format for low-precision sparse data that reduces the overhead of sparse addresses. Secondly, the accelerator adopts an innovative vector systolic computing array, which has better timing than traditional parallel computing arrays and lower latency than atomic systolic arrays. In addition, to accommodate storage of multiple precisions and multiple sparsities and to reduce waste of SRAM capacity, we propose an SRAM hybrid splicing scheme, which improves the average computing throughput to 3.33x compared with a storage strategy without splicing. Finally, whereas typical ASIC accelerators have fixed structures and high module coupling, this accelerator has good scalability: its internal control unit and control bus protocol decouple the modules, so other operator units can easily be integrated to support more types of network computation.
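
As a rough illustration of why mixed-precision storage benefits from word-level packing, the sketch below packs weights of a given bit-width densely into fixed-width memory words so that a low-precision layer occupies proportionally fewer words instead of leaving most of each entry unused. This is a simplified stand-in, not the accelerator's actual SRAM hybrid splicing scheme; the 32-bit word width and the helper names are assumptions for illustration.

```python
# Simplified sketch (assumed word width and layout, not the real SRAM splicing):
# pack 1/2/4/8-bit values densely into fixed 32-bit words.
WORD_BITS = 32  # assumed word width for illustration

def pack_words(values, bits):
    """Pack unsigned `bits`-wide values into 32-bit words, lowest bits first."""
    assert WORD_BITS % bits == 0
    per_word = WORD_BITS // bits
    words = []
    for i in range(0, len(values), per_word):
        word = 0
        for j, v in enumerate(values[i:i + per_word]):
            assert 0 <= v < (1 << bits), "value must fit the chosen bit-width"
            word |= v << (j * bits)
        words.append(word)
    return words

def unpack_words(words, bits, count):
    """Inverse of pack_words: recover `count` values of width `bits`."""
    per_word = WORD_BITS // bits
    mask = (1 << bits) - 1
    values = []
    for word in words:
        for j in range(per_word):
            values.append((word >> (j * bits)) & mask)
    return values[:count]

if __name__ == "__main__":
    w4 = [3, 15, 0, 7, 1, 2, 9, 4, 8, 5]              # weights of a 4-bit layer
    packed = pack_words(w4, bits=4)                    # 10 values fit in 2 words
    assert unpack_words(packed, bits=4, count=len(w4)) == w4
    # Storing one value per word would need 10 words; dense packing needs 2.
    print(len(packed), "words")
```

The same routine handles 1-, 2-, and 8-bit layers by changing `bits`, which conveys the intuition behind avoiding SRAM capacity waste across the precision modes.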

This paper uses Synopsys EDA tools to perform front-end functional and power simulations of the accelerator implementation in a 28 nm process. The accelerator is then evaluated with netlist simulation using a mixed-precision, mixed-sparsity NAS-VGG16 network with a Top-1 accuracy of 67.7%. The peak energy efficiency of the 4-bit layers is 10.89 TOPS/W, that of the 87.5%-sparsity 8-bit layers is 23.90 TOPS/W, and the average energy efficiency over the full netlist simulation is 15 TOPS/W. Compared with other sparse accelerators, it achieves an energy efficiency improvement of 1.07-3.89x.

Keywords
Other Keywords
Language
Chinese
Training Category
Independent training
Year of Enrollment
2021
Year Degree Conferred
2023-06

Degree Evaluation Subcommittee
Materials and Chemical Engineering
Chinese Library Classification (CLC) Number
TN492
Source Repository
Manual submission
Item Type
Thesis/Dissertation
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/543926
Collection
南方科技大学-香港科技大学深港微电子学院筹建办公室 (SUSTech-HKUST Shenzhen-Hong Kong School of Microelectronics Preparatory Office)
Recommended Citation (GB/T 7714)
刘禹岑. 一种基于NAS优化的多精度稀疏神经网络加速器[D]. 深圳: 南方科技大学, 2023.