Title

面向量化卷积神经网络加速的RISC-V专用指令集处理器

Alternative Title
A RISC-V Application-Specific Instruction-set Processor for Quantized Convolutional Neural Network Acceleration
Name
徐志远
Name (Pinyin)
XU Zhiyuan
Student ID
12132153
Degree Type
Master
Degree Discipline
080902 Circuits and Systems
Subject Category
08 Engineering
Supervisor
叶涛
Supervisor Affiliation
Institute of Nanoscience and Applications
Thesis Defense Date
2024-05-08
Thesis Submission Date
2024-06-24
Degree-Granting Institution
Southern University of Science and Technology
Degree-Granting Location
Shenzhen
Abstract

The rapid growth in the number of Internet of Things (IoT) devices and applications has ushered in a new era known as Edge Artificial Intelligence (Edge AI). The growing demand for executing deep neural networks (DNNs) on low-power, resource-constrained edge devices poses significant technical challenges. Edge devices typically operate under strict resource constraints, such as limited storage capacity and power budgets, so improving model inference efficiency within those constraints is the most effective route to energy savings. Current research focuses on accelerating edge inference through efficient DNN model design and improved hardware architectures. In particular, fixed-point quantized convolutional neural networks (CNNs) can substantially reduce computation and bandwidth overhead while maintaining comparable accuracy. Edge devices, however, still lack flexible support for inference with low bit-width quantized neural networks (QNNs).

To address this, starting from the perspective of single-instruction multiple-data (SIMD) computing architecture and targeting the edge-side demand for low bit-width QNN computation, this thesis proposes and designs an application-specific instruction-set processor based on the open RISC-V instruction set architecture to support low-precision fixed-point QNN inference. The main research contributions are as follows:

(1) Instruction set extension: By studying mainstream SIMD instruction sets and analyzing ways to increase the parallelism of QNN computation, a lightweight, configurable RISC-V SIMD custom instruction set is proposed. The extension allows instructions of different precision bit-widths to be combined modularly to meet a range of resource configurations.

(2) Microarchitecture design: Based on the extended instruction set, configurable execution-unit modules optimized for FPGA timing are implemented. Through tight coupling with the modular VexRiscv processor, a pipeline-optimized processor achieving high clock frequency and high performance is realized on a Xilinx FPGA platform.

(3) Convolution acceleration and inference framework: The custom instruction set is integrated into an open-source tiny machine learning framework, implementing convolution functions for multiple bit-widths. The performance of the execution units is evaluated through benchmarks and QNN model inference tasks and compared against other state-of-the-art architectures.

Experimental results show average speedups of 20.8x, 7.6x, 3.0x, and 1.7x for 2-bit, 4-bit, 8-bit, and 16-bit quantized convolution, respectively. For 8-bit quantized MobileNetV1 inference, the end-to-end speedup exceeds 1.5x, with look-up table (LUT) and flip-flop (FF) resource overheads of about 26% and 7%, achieving inference performance comparable to an ARM Cortex-M4.


Keywords
Alternative Keywords
Language
Chinese
Training Category
Independent training
Enrollment Year
2021
Degree Conferral Date
2024-06

Degree Assessment Subcommittee
Electronic Science and Technology
Chinese Library Classification (CLC)
TN47
Source Repository
Manual submission
Output Type
Dissertation
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/765892
Collection
Southern University of Science and Technology, College of Engineering, Department of Electronic and Electrical Engineering
Recommended Citation
GB/T 7714
XU Zhiyuan. A RISC-V Application-Specific Instruction-set Processor for Quantized Convolutional Neural Network Acceleration[D]. Shenzhen: Southern University of Science and Technology, 2024.
Files in This Item
12132153-徐志远-电子与电气工程 (2613 KB): dissertation, restricted access (full text by request)