Title

Model Compression and Inference Acceleration for Deep Convolutional Neural Networks Based on Lookup Tables

Other Title
CONVOLUTIONAL NEURAL NETWORK MODEL COMPRESSION AND INFERENCE ACCELERATION BASED ON LOOKUP TABLE
Name
Xu Shiyu
Student ID
11849314
Degree Type
Master's
Degree Discipline
Electronics and Communication Engineering
Supervisor
Ye Tao
Thesis Defense Date
2020-05-25
Thesis Submission Date
2020-07-22
Degree-Granting Institution
Harbin Institute of Technology
Degree Conferral Location
Shenzhen
Abstract
Convolutional neural networks (CNNs) are widely used in object detection and image classification, but their massive parameter counts and computational costs limit their deployment on mobile terminals with scarce computing power. Parameter quantization effectively reduces model storage and increases computation speed, and is one way to lower the computational load of CNNs. When both operands of the multiplications in a CNN are quantized, the products of all operand combinations can be pre-computed and stored before inference, and the original multiplication operations can be replaced by lookups in a product lookup table (LUT). Compared with floating-point multiplication, LUT-based multiplication occupies fewer resources and is more efficient. However, because the parameter distributions differ considerably across layers and channels, previous LUT-based CNNs, in order to maintain accuracy after quantization, either use large LUTs to store the products or quantize each convolutional layer independently with a separate multiplication LUT per layer. Both approaches lead to excessive LUT memory usage and costly repeated memory reloading. To address these problems, this thesis introduces weight standardization so that the parameter distributions of different layers converge, allowing all layers of a CNN to share a single LUT; it also adopts non-uniform parameter quantization based on iterative clustering to compensate for the accuracy loss caused by low-precision quantization. With these adjustments to the training strategy, a single 16 × 16 multiplication LUT can replace all multiplications in the CNN, with almost no accuracy loss relative to the full-precision models on ResNet, VGG-Net, and AlexNet, while the size and number of LUTs are greatly reduced compared with other LUT-based CNNs that compute the multiplications in convolution through lookup tables.

To verify the effectiveness of the algorithm at the hardware level, this thesis implements a LUT-based accelerated CNN inference system with an FPGA as the target hardware platform. Based on the characteristics of LUT multiplication, a synchronous dataflow computing architecture suited to LUT multiplication is designed, and several optimizations, including matrix partitioning, input reordering, and parallel-lookup matrix multiplication, are proposed to optimize the convolution implementation. Parameter-configurable LUT-based basic CNN modules, including convolutional layers, pooling layers, activation functions, and fully connected layers, are implemented with C++ templates to improve the efficiency of model deployment and verification. The LUT-based CNN inference adopted in this thesis outperforms a fixed-point multiplication implementation of the same precision in resource utilization, power consumption, and speed. Experiments show that, at the same accuracy, LUT-based CNN inference reduces BRAM usage by 56.1%, DSP usage by 52.1%, and power consumption by 21% compared with the fixed-point implementation, and reaches nearly 4.5 GOPS of computing throughput on the PYNQ-Z2, about 59 times faster than the ARM Cortex-A9 processor on the PYNQ.
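The weight standardization step mentioned above is commonly defined per output channel; the following is a minimal sketch of that standard formulation (the per-channel granularity and the epsilon term are assumptions for illustration, not details taken from this record):

\hat{W}_{i,j} = \frac{W_{i,j} - \mu_i}{\sigma_i + \epsilon},
\qquad
\mu_i = \frac{1}{n}\sum_{j=1}^{n} W_{i,j},
\qquad
\sigma_i^2 = \frac{1}{n}\sum_{j=1}^{n}\left(W_{i,j} - \mu_i\right)^2

Here i indexes an output channel (filter) and n is its fan-in. After standardization, every layer's weights follow approximately the same zero-mean, unit-variance profile, which is what allows one shared product LUT to serve the whole network.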
Other Abstract
Convolutional neural networks (CNNs) have been widely applied to computer vision tasks and have achieved dramatic accuracy improvements. However, the massive parameters and heavy computation required by CNNs limit their deployment on mobile terminals that lack computing power. Parameter quantization with lower bit-width is a common approach to reducing the computation load of CNN inference. With the parameters replaced by fixed-width binaries, multiplication operations can be replaced by a lookup table (LUT), where the multiplier-multiplicand operands serve as the table index and the pre-calculated products serve as the table entries. Because the histogram profiles of the parameters in different layers/channels differ significantly in a CNN, previous LUT-based computation methods have to use different LUTs for each layer/channel, and consequently demand larger memory space along with extra access time and power consumption. In this work, we first normalize the parameters' Gaussian profiles of different layers/channels to have similar means and variances, and further quantize the normalized parameters into fixed-width values through iterative clustering. Because of the normalized parameter profile, only a single compact LUT (16 × 16 entries) is needed to replace all multiplications in the whole network. Experiments on image classification tasks demonstrate that with a compact 256-entry LUT we can achieve accuracy comparable to 32-bit floating-point calculation while significantly reducing the computation load and memory space. Compared with previous work using LUT-based convolution, the size and quantity of the LUTs used for the CNN are significantly reduced. To verify the effectiveness of the algorithm at the hardware level, this work implements a CNN inference system based on a single lookup table, using an FPGA as the target hardware platform. Based on the characteristics of lookup-table multiplication, a synchronous dataflow computational architecture for the LUT-based CNN is designed, and a novel set of optimizations, e.g. memory partitioning and stream rearrangement, is proposed to enable efficient mapping of the LUT-based network to hardware. Basic CNN modules, including LUT-based convolution, pooling layers, and fully connected layers, are implemented in C++. Experiments show that the LUT-based CNN on the PYNQ-Z2 FPGA platform is superior to a fixed-point implementation in resource usage, latency, and throughput: it saves 56.1% of BRAM, 52.1% of DSP utilization, and 21% of power consumption compared with the fixed-point implementation, and achieves nearly 4.5 GOPS of computing throughput on the PYNQ-Z2, which is 59× faster than the ARM Cortex-A9 processor.
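As a rough illustration of the LUT-based multiplication described above, the C++ sketch below builds a single 16 × 16 product table from two hypothetical 4-bit centroid codebooks and replaces each multiplication in an inner product by a table read. All names, the float accumulator, and the codebook layout are illustrative assumptions, not the thesis implementation.

#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: one 16 x 16 product LUT shared by all layers.
// Weights and activations are represented by 4-bit cluster indices (0..15);
// the table stores the pre-computed products of the corresponding centroids.
struct ProductLUT {
    std::array<float, 16 * 16> table{};

    // Fill the table once before inference from the two centroid codebooks.
    void build(const std::array<float, 16>& w_centroids,
               const std::array<float, 16>& a_centroids) {
        for (std::size_t i = 0; i < 16; ++i)
            for (std::size_t j = 0; j < 16; ++j)
                table[i * 16 + j] = w_centroids[i] * a_centroids[j];
    }

    // A multiplication inside convolution becomes a single table read.
    float mul(std::uint8_t w_idx, std::uint8_t a_idx) const {
        return table[w_idx * 16 + a_idx];
    }
};

// Inner product of one convolution window, accumulating looked-up products.
float dot(const ProductLUT& lut,
          const std::uint8_t* w_idx, const std::uint8_t* a_idx, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += lut.mul(w_idx[i], a_idx[i]);
    return acc;
}

On an FPGA, a table of this size fits comfortably in on-chip memory and can be replicated for parallel lookups, which is consistent with the BRAM/DSP savings reported above.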
Keywords
Other Keywords
Language
Chinese
Training Category
Joint Training
Document Type
Thesis
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/142862
Collection
College of Engineering_Department of Electronic and Electrical Engineering
Affiliation
Southern University of Science and Technology
Recommended Citation
GB/T 7714
Xu Shiyu. Model Compression and Inference Acceleration for Deep Convolutional Neural Networks Based on Lookup Tables[D]. Shenzhen: Harbin Institute of Technology, 2020.
