Title

Design of a Mixed-Precision Systolic Accelerator for NAS-Optimized Convolutional Neural Networks

Alternative Title
A SYSTOLIC MIXED-BIT-WIDTH ACCELERATOR FOR NAS-OPTIMIZED CONVOLUTION NEURAL NETWORKS
Name
代柳瑶 (Dai Liuyao)
Name in Pinyin
DAI Liuyao
Student ID
11930188
Degree Type
Master's
Degree Discipline
080903 Microelectronics and Solid-State Electronics
Discipline Category / Professional Degree Category
08 Engineering
Supervisor
Yu Hao (余浩)
Supervisor's Affiliation
School of Microelectronics (Shenzhen-Hong Kong Microelectronics Institute)
Thesis Defense Date
2022-05-12
Thesis Submission Date
2022-06-14
Degree-Granting Institution
Southern University of Science and Technology
Degree-Granting Location
Shenzhen
Abstract

In recent years, convolutional neural networks (CNNs) have developed rapidly: their application scenarios have become ever broader and their accuracy keeps improving, but this has made network structures increasingly complex and the amount of computation ever larger. Quantizing the computation bit width of the insensitive layers of a CNN can reduce the energy consumed by computation and storage without sacrificing network accuracy. Quantization of CNNs and the corresponding energy-efficient hardware design are therefore very important for edge-computing applications with limited hardware resources. Neural architecture search (NAS) methods can be used to optimize mixed-bit-width CNN models; to serve networks whose layers require different computation bit widths, dedicated low-power, high-throughput mixed-bit-width convolution accelerators are strongly needed.

In CNN accelerator design, several schemes currently support mixed-bit-width multiply-accumulate (MAC) computation. High-bit-width-split designs reduce the extra logic needed for configurability, but this leads to low throughput in low-bit-width computation; low-bit-width-combination designs raise low-bit-width throughput by recombining low-bit-width units in parallel, but at an increased hardware cost. Moreover, more than 99% of the computation in a CNN consists of MAC operations, so MAC power consumption has a large impact on the overall power of the accelerator. To make high-bit-width and low-bit-width computation compatible, this thesis proposes a bit-split-and-combination hardware accelerator to overcome this bottleneck. Since 4-bit precision satisfies the accuracy requirements of most convolutional layers, the proposed MAC unit takes 4-bit operation as its baseline and is compatible upward with 8-bit multiplication and downward with 2-bit multiplication. Second, for partial-product generation and accumulation in MAC computation, we propose a mixed-bit-width Radix-4 Booth algorithm to reduce MAC power consumption, and optimize the decoder and encoder of the Radix-4 Booth scheme. Finally, exploiting the high parallelism of CNNs, a fine-grained systolic mixed-bit-width convolution dataflow is developed, which improves data reuse.
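To make the partial-product reduction concrete, the following is a minimal Python sketch of standard Radix-4 Booth recoding, the textbook algorithm that the proposed mixed-bit-width encoder/decoder builds on; the function names and the pure-software formulation are illustrative assumptions, not code from the thesis.

```python
def booth_radix4_digits(y: int, n_bits: int) -> list:
    """Recode an n_bits-wide two's-complement multiplier y into n_bits // 2
    signed digits in {-2, -1, 0, +1, +2} such that
        y == sum(d * 4**i for i, d in enumerate(digits)).
    Each digit selects a single partial product (0, +/-X or +/-2X), roughly
    halving the partial-product count versus bit-by-bit multiplication."""
    mask = (1 << n_bits) - 1
    y_ext = (y & mask) << 1                      # append the implicit y_{-1} = 0 bit
    table = {0b000: 0, 0b001: +1, 0b010: +1, 0b011: +2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    return [table[(y_ext >> (2 * i)) & 0b111] for i in range(n_bits // 2)]


def booth_multiply(x: int, y: int, n_bits: int = 8) -> int:
    """Multiply two signed n_bits-wide integers from the Booth digits of y."""
    return sum(d * x * 4**i for i, d in enumerate(booth_radix4_digits(y, n_bits)))


# An 8-bit multiply needs only four Booth partial products instead of eight.
assert booth_multiply(-77, 53, 8) == -77 * 53
assert booth_multiply(53, -77, 8) == 53 * -77
```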
For evaluation, this thesis quantizes the computation bit widths of the VGG-16, ResNet-50, and LeNet-5 networks with the NAS method; without affecting network accuracy, the networks are quantized into mixed 2-bit, 4-bit, and 8-bit networks, and the power consumption of the designed hardware multipliers and accelerator is simulated with Synopsys EDA tools at a 28 nm process node. In mixed-bit-width MAC operation, the proposed MAC unit reaches up to 1.11× and 1.57× the energy efficiency of low-bit-width-combination and high-bit-width-split MAC units, respectively. Compared with the published accelerator designs Gemmini, Bit-fusion, and Bit-serial, the proposed accelerator achieves up to a 3.26× energy-efficiency improvement on the mixed-bit-width VGG-16, ResNet-50, and LeNet-5 benchmark networks.

Alternative Abstract

In recent years, convolutional neural networks (CNNs) have developed rapidly; their application scenarios have become increasingly broad and their accuracy has kept improving, but this has also led to increasingly complex network structures and a growing computational workload. For a CNN, quantizing the computation bit width of insensitive layers can reduce the energy consumed by computation and storage without sacrificing network accuracy. Therefore, in edge-computing applications with limited hardware resources, CNN quantization and the corresponding energy-efficient hardware design are of great importance. Neural architecture search (NAS) methods are employed to optimize CNNs into mixed-bit-width networks. To satisfy the resulting computation requirements, mixed-bit-width convolution accelerators with low power and high throughput are highly desired.

Several methods exist to support mixed-bit-width multiply-accumulate (MAC) operations in CNN accelerator designs. The high-bit-width-split method minimizes the additional logic gates needed for configurability, but its throughput in low-bit-width mode is poor. The low-bit-width-combination method improves low-bit-width computational throughput by recombining low-bit-width units in parallel, but increases the hardware cost. In addition, more than 99% of the computation in a CNN consists of MAC operations, so the power consumption of the MAC units has a large impact on the overall power consumption of the accelerator. To make high-bit-width and low-bit-width computation compatible, this thesis proposes a bit-split-and-combination hardware accelerator to overcome this bottleneck. Since 4-bit precision meets the accuracy requirements of most convolutional layers, the proposed MAC unit takes 4-bit operation as its baseline and is compatible upward with 8-bit multiplication and downward with 2-bit multiplication. Second, for partial-product generation and accumulation in the MAC datapath, we propose a mixed-bit-width Radix-4 Booth algorithm to reduce the power consumption of MAC computation, and optimize the encoder and decoder of the Radix-4 Booth scheme. Finally, exploiting the high parallelism of CNNs, a mixed-bit-width systolic convolution dataflow is developed, which improves data reuse and transmission efficiency.
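As a behavioural illustration of the bit-split-and-combination idea (a sketch under the assumption of signed-high/unsigned-low nibble splitting; the function names are hypothetical and this is not the thesis's RTL), the Python snippet below shows how one signed 8-bit × 8-bit product can be assembled from four sub-products computed on a 4-bit baseline unit, while the same unit serves 4-bit operands directly at full throughput. In 2-bit mode the same splitting idea would presumably be applied one level further down, so each 4-bit unit hosts several 2-bit products in parallel.

```python
def split_signed8(x: int):
    """Split a signed 8-bit operand into a signed high nibble (-8..7) and an
    unsigned low nibble (0..15), so that x == hi * 16 + lo."""
    lo = x & 0xF
    hi = (x - lo) >> 4
    return hi, lo


def mac4(a: int, b: int, acc: int = 0) -> int:
    """Behavioural stand-in for the 4-bit baseline multiply-accumulate unit."""
    return acc + a * b


def mul8_from_4bit_units(a: int, b: int) -> int:
    """Assemble a signed 8-bit product from four 4-bit sub-products:
        a * b = (ah*bh << 8) + ((ah*bl + al*bh) << 4) + al*bl
    In hardware the four mac4 calls map onto parallel 4-bit units and the
    shifts are just wiring into the shared accumulator tree."""
    ah, al = split_signed8(a)
    bh, bl = split_signed8(b)
    return (mac4(ah, bh) << 8) + (mac4(ah, bl, mac4(al, bh)) << 4) + mac4(al, bl)


# 8-bit mode: four 4-bit units combine into one wide multiplier.
assert mul8_from_4bit_units(-100, 87) == -100 * 87
# 4-bit mode: each unit handles its own operand pair at full throughput.
assert mac4(-7, 5) == -35
```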

For evaluation, this thesis uses the NAS method to quantize the computation bit widths of the VGG-16, ResNet-50, and LeNet-5 networks; without affecting network accuracy, these networks are quantized into mixed 2-bit, 4-bit, and 8-bit networks. Synopsys EDA tools are then used to simulate the power consumption of the designed hardware multipliers and accelerator at a 28 nm process node. In mixed-bit-width MAC operation, the proposed MAC unit achieves up to 1.11× and 1.57× the energy efficiency of high-bit-width-split and low-bit-width-combination units, respectively. Compared with the published accelerator designs Gemmini, Bit-fusion, and Bit-serial, the proposed accelerator achieves up to 3.26× higher energy efficiency on the mixed-bit-width VGG-16, ResNet-50, and LeNet-5 benchmarks.
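For readers unfamiliar with how a per-layer bit-width assignment is applied, here is a generic sketch of symmetric uniform quantization at 2-, 4-, or 8-bit per layer; the layer names, the example bit-width map, and the simple rounding scheme are illustrative assumptions and are not taken from the thesis's NAS flow or its accuracy results.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, n_bits: int) -> np.ndarray:
    """Symmetric uniform quantization of a weight tensor to signed n_bits,
    returning the de-quantized values that inference would use."""
    qmax = 2 ** (n_bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = np.clip(np.round(w / scale), -qmax - 1, qmax)   # signed integer codes
    return codes * scale

# Hypothetical per-layer bit widths, as a NAS search over {2, 4, 8} might assign them.
layer_bits = {"conv1": 8, "conv2_x": 4, "conv3_x": 4, "conv4_x": 2, "fc": 8}

weights = {name: np.random.randn(64, 64).astype(np.float32) for name in layer_bits}
quantized = {name: quantize_symmetric(w, layer_bits[name]) for name, w in weights.items()}
```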

Keywords
Other Keywords
Language
Chinese
Training Category
Independent training
Year of Enrollment
2019
Year of Degree Conferral
2022-05
Reference List

[1] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. Imagenet classification with deep convolutional neural networks[J]. Advances in neural information processing systems, 2012, 25:1106-1114.

[2] ZHUANG B, SHEN C, TAN M, et al. Towards effective low-bitwidth convolutional neural networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7920-7928.

[3] COLLOBERT R, WESTON J, BOTTOU L, et al. Natural language processing (almost) from scratch[J]. Journal of machine learning research, 2011, 12: 2493-2537.

[4] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[A]. 2014.

[5] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 779-788.

[6] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 1-9.

[7] HAN S, MAO H, DALLY W J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding[A]. 2015.

[8] HOWARD A G, ZHU M, CHEN B, et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications[A]. 2017.

[9] HINTON G, VINYALS O, DEAN J, et al. Distilling the knowledge in a neural network[A]. 2015.

[10] JACOB B, KLIGYS S, CHEN B, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 2704-2713.

[11] MCKINSTRY J L, ESSER S K, APPUSWAMY R, et al. Discovering low-precision networks close to full-precision networks for efficient embedded inference[A]. 2018.

[12] ZHOU Y, MOOSAVI-DEZFOOLI S M, CHEUNG N M, et al. Adaptive quantization for deep neural network[C]//Proceedings of the AAAI Conference on Artificial Intelligence: volume 32. 2018.

[13] FROMM J, PATEL S, PHILIPOSE M. Heterogeneous bitwidth binarization in convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2018, 31: 4006-4015.

[14] LI H, DE S, XU Z, et al. Training quantized nets: A deeper understanding[J]. Advances in Neural Information Processing Systems, 2017, 30: 5813-5823.

[15] COURBARIAUX M, BENGIO Y, DAVID J P. Binaryconnect: Training deep neural networks with binary weights during propagations[J]. Advances in neural information processing systems, 2015, 28: 3123-3131.

[16] RASTEGARI M, ORDONEZ V, REDMON J, et al. Xnor-net: Imagenet classification using binary convolutional neural networks[C]//European conference on computer vision. Springer, 2016: 525-542.

[17] ZHOU S, WU Y, NI Z, et al. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients[A]. 2016.

[18] ZHUANG B, SHEN C, TAN M, et al. Towards effective low-bitwidth convolutional neural networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7920-7928.

[19] WANG P, HU Q, ZHANG Y, et al. Two-step quantization for low-bit neural networks[C]//Proceedings of the IEEE Conference on computer vision and pattern recognition. 2018: 4376-4384.

[20] PARK E, KIM D, YOO S. Energy-efficient neural network accelerator based on outlier-aware low-precision computation[C]//2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018: 688-698.

[21] LAI L, SUDA N, CHANDRA V. Deep convolutional neural network inference with floating point weights and fixed-point activations[A]. 2017.

[22] HE Q, WEN H, ZHOU S, et al. Effective quantization methods for recurrent neural networks [A]. 2016.

[23] JUDD P, ALBERICIO J, HETHERINGTON T, et al. Stripes: Bit-serial deep neural network computing[C]//2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016: 1-12.

[24] WANG K, LIU Z, LIN Y, et al. Haq: Hardware-aware automated quantization with mixed precision[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 8612-8620.

[25] HAN S, MAO H, DALLY W. Compressing deep neural networks with pruning, trained quantization and Huffman coding[A]. 2015.

[26] LIN J, RAO Y, LU J, et al. Runtime neural pruning[J]. Advances in neural information processing systems, 2017, 30: 2178-2188.

[27] ZHU C, HAN S, MAO H, et al. Trained ternary quantization[A]. 2016.

[28] CHOI J, WANG Z, VENKATARAMANI S, et al. Pact: Parameterized clipping activation for quantized neural networks[A]. 2018.

[29] JACOB B, KLIGYS S, CHEN B, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 2704-2713.

[30] HAN S, MAO H, DALLY W. Compressing deep neural networks with pruning, trained quantization and Huffman coding[A]. 2015.

[31] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.

[32] LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning[A]. 2015.

[33] JOUPPI N P, YOUNG C, PATIL N, et al. In-datacenter performance analysis of a tensor processing unit[C]//Proceedings of the 44th annual international symposium on computer architecture. 2017: 1-12.

[34] CHEN Y H, EMER J, SZE V. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks[J]. ACM SIGARCH computer architecture news, 2016, 44(3): 367-379.

[35] DAI L, CHENG Q, WANG Y, et al. An Energy-Efficient Bit-Split-and-Combination Systolic Accelerator for NAS-Based Multi-Precision Convolution Neural Networks[C]//2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC). 2022: 448-453.

[36] MAO W, LI K, XIE X, et al. A Reconfigurable Multiple-Precision Floating-Point Dot Product Unit for High-Performance Computing[C]//2021 Design, Automation Test in Europe Conference Exhibition (DATE). 2021: 1793-1798.

[37] MAO W, LI K, CHENG Q, et al. A Configurable Floating-Point Multiple-Precision Processing Element for HPC and AI Converged Computing[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2022, 30(2): 213-226.

[38] LI K, MAO W, XIE X, et al. Multiple-Precision Floating-Point Dot Product Unit for Efficient Convolution Computation[C]//2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS). 2021: 1-4.

[39] SHARMA H, PARK J, SUDA N, et al. Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network[C]//2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018: 764-775.

[40] MOONS B, UYTTERHOEVEN R, DEHAENE W, et al. 14.5 envision: A 0.26-to-10tops/w subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm fdsoi[C]//2017 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2017: 246-247.

[41] SANKARADAS M, JAKKULA V, CADAMBI S, et al. A massively parallel coprocessor for convolutional neural networks[C]//2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors. IEEE, 2009: 53-60.

[42] SRIRAM V, COX D, TSOI K H, et al. Towards an embedded biologically-inspired machine vision processor[C]//2010 International Conference on Field-Programmable Technology. IEEE, 2010: 273-278.

[43] CHAKRADHAR S, SANKARADAS M, JAKKULA V, et al. A dynamically configurable coprocessor for convolutional neural networks[C]//Proceedings of the 37th annual international symposium on Computer architecture. 2010: 247-257.

[44] PEEMEN M, SETIO A A, MESMAN B, et al. Memory-centric accelerator design for convolutional neural networks[C]//2013 IEEE 31st International Conference on Computer Design (ICCD). IEEE, 2013: 13-19.

[45] GOKHALE V, JIN J, DUNDAR A, et al. A 240 g-ops/s mobile coprocessor for deep neural networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2014: 682-687.

[46] GUPTA S, AGRAWAL A, GOPALAKRISHNAN K, et al. Deep learning with limited numerical precision[C]//International conference on machine learning. PMLR, 2015: 1737-1746.

[47] ZHANG C, LI P, SUN G, et al. Optimizing fpga-based accelerator design for deep convolutional neural networks[C]//Proceedings of the 2015 ACM/SIGDA international symposium on field programmable gate arrays. 2015: 161-170.

[48] CHEN T, DU Z, SUN N, et al. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning[J]. ACM SIGARCH Computer Architecture News, 2014, 42(1): 269-284.

[49] DU Z, FASTHUBER R, CHEN T, et al. ShiDianNao: Shifting vision processing closer to the sensor[C]//Proceedings of the 42nd Annual International Symposium on Computer Architecture. 2015: 92-104.

[50] CHEN Y, LUO T, LIU S, et al. Dadiannao: A machine-learning supercomputer[C]//2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2014: 609-622.

[51] YOO H J, PARK S, BONG K, et al. A 1.93 tops/w scalable deep learning/inference processor with tetra-parallel mimd architecture for big data applications[C]//IEEE international solid-state circuits conference. IEEE, 2015: 80-81.

[52] CAVIGELLI L, GSCHWEND D, MAYER C, et al. Origami: A convolutional network accelerator[C]//Proceedings of the 25th edition on Great Lakes Symposium on VLSI. 2015: 199-204.

[53] WU B, WANG Y, ZHANG P, et al. Mixed precision quantization of convnets via differentiable neural architecture search[A]. 2018.

[54] CAI H, ZHU L, HAN S. Proxylessnas: Direct neural architecture search on target task and hardware[A]. 2018.

[55] WU B, DAI X, ZHANG P, et al. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 10734-10742.

[56] REN A, ZHANG T, YE S, et al. Admm-nn: An algorithm-hardware co-design framework of dnns using alternating direction methods of multipliers[C]//Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. 2019: 925-938.

[57] DING C, LIAO S, WANG Y, et al. Circnn: accelerating and compressing deep neural networks using block-circulant weight matrices[C]//Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. 2017: 395-408.

[58] WEN W, WU C, WANG Y, et al. Learning structured sparsity in deep neural networks[J]. Advances in neural information processing systems, 2016, 29: 2074-2082.

[59] RYU S, KIM H, YI W, et al. Bitblade: Area and energy-efficient precision-scalable neural network accelerator with bitwise summation[C]//Proceedings of the 56th Annual Design Automation Conference 2019. 2019: 1-6.

Degree Evaluation Subcommittee
School of Microelectronics (Shenzhen-Hong Kong Microelectronics Institute)
Chinese Library Classification Number
TP303
Source Repository
Manual submission
Item Type
Degree thesis
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/335835
Collection
Preparatory Office of the SUSTech-HKUST Shenzhen-Hong Kong Microelectronics Institute
Recommended Citation
GB/T 7714
DAI Liuyao. Design of a Mixed-Precision Systolic Accelerator for NAS-Optimized Convolutional Neural Networks[D]. Shenzhen: Southern University of Science and Technology, 2022.