Title

基于费马数变换的卷积神经网络加速器设计与优化

Alternative Title
DESIGN AND OPTIMIZATION OF CONVOLUTIONAL NEURAL NETWORK ACCELERATOR BASED ON FERMAT NUMBER TRANSFORM
Name
Name in Pinyin
CHEN Bingzhen
Student ID
12132103
Degree Type
Master
Degree Major
080902 Circuits and Systems
Discipline Category
08 Engineering
Supervisor
YE Tao
Supervisor's Affiliation
Institute of Nanoscience and Applications
Thesis Defense Date
2024-05-08
Thesis Submission Date
2024-06-24
Degree-Granting Institution
Southern University of Science and Technology
Degree-Granting Location
Shenzhen
Abstract
With the rapid progress of artificial intelligence, deep learning, as one of its key branches, has received widespread attention. Convolutional neural networks, classic algorithms in deep learning, are widely applied in fields such as intelligent driving and image processing. Convolution is the computational foundation of CNN applications and accounts for most of the computation in a CNN, so improving its speed is especially important. With the deep integration of artificial intelligence and the Internet of Things, demand for edge computing keeps growing, and the need to design energy-efficient edge-computing chips has become increasingly urgent. Against this backdrop, the open-source RISC-V processor, with its free and open-source nature, concise design, and strong extensibility, has become a preferred choice for designing edge-computing chips.
This thesis proposes a convolutional neural network accelerator for edge computing based on the open-source RISC-V RI5CY processor, named RI5CY-FNT. The design contains a convolution acceleration unit based on the Fermat Number Transform (FNT), a pooling unit, and an activation unit, and custom instructions are designed for the acceleration modules. The Fermat Number Transform is chosen as the convolution acceleration algorithm: all of its computations are over real numbers, which significantly reduces complexity compared with the complex-valued Fast Fourier Transform (FFT). Compared with the mainstream Winograd algorithm, FNT supports more convolution kernel sizes, including 3x3 and 5x5 kernels, without consuming additional resources. In addition, a fast-FNT convolution accelerator is built around a new encoding scheme that reduces multiplication and modulo operations to bit operations, and is implemented as a dedicated convolution accelerator. Corresponding RISC-V custom instructions are then designed for the accelerator's structure, reducing instruction overhead and accelerating the convolution computation.
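The key arithmetic trick behind the encoding scheme — that reduction modulo a Fermat number needs no division — can be illustrated in plain Python. This is a hedged software sketch of the general principle, not the thesis's hardware encoding: since 2^B ≡ -1 (mod 2^B + 1), the B-bit chunks of a number contribute with alternating signs, so the reduction needs only shifts, masks, and adds.

```python
B = 8                      # F_3 = 2^8 + 1 = 257 (assumed parameters for illustration)
F = (1 << B) + 1
MASK = (1 << B) - 1

def mod_fermat(x):
    # Reduce a non-negative integer modulo 2^B + 1 using only
    # shifts, masks, and additions: 2^B ≡ -1 (mod F), so B-bit
    # chunks of x alternate in sign.
    r, sign = 0, 1
    while x:
        r += sign * (x & MASK)
        x >>= B
        sign = -sign
    # small final correction, still addition-only
    while r < 0:
        r += F
    while r >= F:
        r -= F
    return r

print(mod_fermat(65535), 65535 % F)   # → 0 0
```

The same identity is what lets an FNT accelerator replace multiplications by twiddle factors (powers of two) with barrel shifts followed by this kind of addition-only reduction.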
Using classic convolutional neural networks as test cases, this thesis trains and quantizes the networks with PyTorch, wraps the custom instructions into C functions via inline assembly, and builds the networks in C. Experiments on the core-v-verif virtual verification platform and on an FPGA platform show that the RI5CY-FNT processor's instructions function correctly and that CNN inference runs normally. DC simulation of the RI5CY-FNT processor with the SMIC 55 nm CMOS process shows that, compared with the original RI5CY processor, it adds about 12% area and 23% power. FPGA experiments show that when running inference on LeNet-5, RI5CY-FNT reduces energy consumption by about 40% compared with the original RI5CY. Moreover, relative to the original processor, RI5CY-FNT achieves a 3.6x speedup on LeNet-5 inference with 3x3 kernels, a 10.6x speedup with 5x5 kernels, and a 5.5x speedup on VGG16 inference.
Alternative Abstract

With the rapid advancement of AI technology, deep learning, as a key branch of it, has garnered widespread attention. Convolutional Neural Networks (CNNs), as classic algorithms in deep learning, are extensively used in fields like autonomous driving and image processing. Convolution, as the computational foundation in CNN applications, occupies most of the computation in CNNs, making the improvement of its operation speed particularly important. With the deep integration of artificial intelligence and Internet of Things technology, the demand for edge computing is increasing, and the need to design high-performance edge computing chips has become more urgent. Against this backdrop, the RISC-V open-source processor, with its free and open-source nature, simple design, and powerful scalability, has become the preferred solution for designing edge computing chips.

This article proposes a convolutional neural network accelerator based on the RISC-V RI5CY open-source processor for edge computing, named RI5CY-FNT. This design includes a convolution acceleration unit based on the Fermat Number Transform (FNT), a pooling unit and an activation unit, and custom instructions are designed for the acceleration module. The Fermat Number Transform algorithm is chosen as the convolution acceleration algorithm in this paper. All calculations in the Fermat Number Transform are based on real numbers, which significantly reduces complexity compared to Fast Fourier Transform (FFT) calculations based on complex numbers. Compared with the mainstream Winograd algorithm, this algorithm can select more convolution kernel sizes without additional resource consumption, including 3 x 3 and 5 x 5 convolution kernels. In addition, the convolution accelerator for fast Fermat number transform is designed with a new encoding method, which simplifies multiplication and modulo operations into bit operations, and is designed into a dedicated convolution accelerator. Then, corresponding RISC-V custom instructions are designed for the structure of the accelerator, reducing instruction overhead and accelerating the convolution computation process.
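The FNT-based convolution described above can be sketched in a few lines of Python. This is a minimal software illustration with assumed parameters (Fermat number F_3 = 257, length-16 transform, root 2), not the thesis's accelerator design: because 2 is a primitive 16th root of unity modulo 257, every twiddle factor is a power of two, which is what makes a hardware realization multiplication-free.

```python
# Circular convolution via the Fermat Number Transform over Z_257.
F = 257   # Fermat number F_3 = 2^8 + 1
N = 16    # transform length; 2 has multiplicative order 16 mod 257

def fnt(x, root):
    # naive O(N^2) number-theoretic transform modulo F
    return [sum(x[n] * pow(root, k * n, F) for n in range(N)) % F
            for k in range(N)]

def fnt_convolve(a, b):
    # forward transforms, pointwise product, then inverse transform
    A, B = fnt(a, 2), fnt(b, 2)
    C = [(x * y) % F for x, y in zip(A, B)]
    inv_root = pow(2, F - 2, F)   # 2^{-1} mod 257 (Fermat's little theorem)
    inv_N = pow(N, F - 2, F)      # 16^{-1} mod 257
    return [(v * inv_N) % F for v in fnt(C, inv_root)]

# zero-padded so the circular convolution equals the linear one
a = [1, 2, 3] + [0] * 13
b = [4, 5] + [0] * 14
print(fnt_convolve(a, b)[:4])   # → [4, 13, 22, 15]
```

As in the FFT approach, convolution becomes a pointwise product in the transform domain; unlike the FFT, all arithmetic stays in integer modular arithmetic, with no rounding error.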

This paper uses classic convolutional neural networks as test cases, uses PyTorch for network training and parameter quantization, encapsulates custom instructions into C language functions through embedded assembly, and uses C language to build convolutional neural networks. Experiments on the virtual verification platform core-v-verif and FPGA platform show that the RI5CY-FNT processor instruction functions correctly and performs CNN inference tasks normally. The RI5CY-FNT processor is DC simulated based on the SMIC CMOS 55nm process. Compared with the original RI5CY processor, it increases the area by about 12% and the power consumption by 23%. Experiments based on the FPGA platform show that when the RI5CY-FNT processor performs inference tasks on LeNet-5, it reduces energy consumption by about 40% compared to the original RI5CY processor. And the RI5CY-FNT processor, compared to the original processor, achieves a speedup of 3.6x with a 3 x 3 convolution kernel and 10.6x with a 5 x 5 convolution kernel when performing inference tasks on the LeNet-5 network, and achieves a speedup of 5.5x when performing inference tasks on the VGG16 network.
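The parameter quantization step mentioned above maps trained floating-point weights to small integers so they can be processed in integer modular arithmetic. The thesis performs this with PyTorch and does not spell out the exact scheme here, so the following is only a hedged plain-Python sketch of generic symmetric linear quantization (the bit width and per-layer scale are assumptions for illustration):

```python
def quantize(weights, bits=8):
    # symmetric linear quantization: map floats to signed integers
    # in [-(2^(bits-1)-1), 2^(bits-1)-1] with a single per-tensor scale
    qmax = (1 << (bits - 1)) - 1          # e.g. 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    # recover approximate float weights from integer codes
    return [v * scale for v in q]

w = [0.5, -1.27, 0.03]
q, scale = quantize(w)
print(q)   # small signed integers; dequantize(q, scale) approximates w
```

After quantization, convolutions operate entirely on integers, which is a precondition for computing them exactly with a number-theoretic transform such as the FNT.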

Keywords
Alternative Keywords
Language
Chinese
Training Category
Independently Trained
Year of Enrollment
2021
Year Degree Conferred
2024-06
Degree Assessment Subcommittee
Electronic Science and Technology
Chinese Library Classification Number
TN47
Source Repository
Manually Submitted
Output Type: Dissertation
Item Identifier: http://sustech.caswiz.com/handle/2SGJ60CL/766012
Collection: Southern University of Science and Technology, College of Engineering, Department of Electronic and Electrical Engineering
Recommended Citation
GB/T 7714
CHEN Bingzhen. Design and Optimization of Convolutional Neural Network Accelerator Based on Fermat Number Transform[D]. Shenzhen: Southern University of Science and Technology, 2024.
Files in This Item
File Name/Size | Document Type | Version Type | Access Type | License
12132103-陈炳臻-电子与电气工程 (4315KB) | Dissertation | Restricted Access | CC BY-NC-SA
