Title

Design of a Multi-Precision ReRAM Computing-in-Memory Accelerator Based on a Charge-Domain Accumulation Readout Circuit

Alternative Title
DESIGN OF MIXED-PRECISION RERAM COMPUTING-IN-MEMORY ACCELERATOR BASED ON CHARGE DOMAIN ACCUMULATION AND READOUT CIRCUIT
Name
Name (Pinyin)
LIU Jun
Student ID
12132458
Degree Type
Master
Degree Program
0856 Materials and Chemical Engineering
Subject Category / Professional Degree Category
0856 Materials and Chemical Engineering
Supervisor
毛伟
Supervisor Affiliation
深港微电子学院
Thesis Defense Date
2023-05-18
Thesis Submission Date
2023-06-28
Degree-Granting Institution
Southern University of Science and Technology
Place of Degree Conferral
Shenzhen
Abstract
In recent years, convolutional neural networks (CNNs) have demonstrated powerful capabilities in fields such as image recognition, object detection, and autonomous driving. However, CNN models have grown ever larger and more structurally complex, and traditional hardware accelerators based on the von Neumann architecture, constrained by the "memory wall" and "power wall" bottlenecks, can no longer scale their compute fast enough to keep pace with the evolution of neural network (NN) models.
To accelerate CNN inference, on the software side this thesis applies a neural architecture search (NAS) algorithm to optimize the network structure and to compress both the number and the precision of the weights, quantizing the CNN into a multi-precision network with 2-bit, 4-bit, and 8-bit weights. On the hardware side, exploiting the advantages of the non-volatile memory (NVM) device ReRAM, this thesis proposes an accelerator based on a computing-in-memory (CIM) architecture, which promises to break the von Neumann bottleneck: by fusing storage and computation, CIM greatly reduces the power and latency of memory accesses. Nearly 99% of the computation in a CNN consists of multiply-accumulate (MAC) operations, which demands strong parallel computing capability from a CIM accelerator; this thesis therefore proposes a charge-domain CIM macro that isolates the nonlinear resistance deviation of ReRAM, improving system throughput and energy efficiency at the cell level. In conventional designs, the accumulation readout circuit, comprising a digital shift-accumulate module and an analog-to-digital conversion module, consumes about 70% of the total system power. This thesis optimizes the accumulation readout circuit by proposing a multi-precision accumulator based on charge-domain computation, which applies the binary weighting of input bits and weight bits while supporting parallel readout, and a charge-domain readout circuit that achieves energy-efficient analog-to-digital conversion; a scheme that reduces the number of analog-to-digital converter (ADC) comparison cycles further lowers the ADC power in the readout circuit.
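The abstract does not detail the quantizer, so as a generic illustration of quantizing weights to the 2/4/8-bit multi-precision integers mentioned above, here is a symmetric per-tensor uniform quantization sketch in Python (the function name, rounding scheme, and clipping policy are assumptions for illustration, not the thesis's NAS-driven method):

```python
def quantize_uniform(weights, bits):
    """Symmetric per-tensor uniform quantization of real-valued weights
    to signed `bits`-bit integers, returning (quantized values, scale)."""
    qmax = (1 << (bits - 1)) - 1                     # e.g. 127 for 8-bit, 1 for 2-bit
    scale = (max(abs(w) for w in weights) or 1.0) / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

# Dequantized values approximate the originals: w ~= q * scale
ws = [0.5, -0.25, 1.0, -1.0]
q8, s8 = quantize_uniform(ws, 8)   # 8-bit: fine-grained levels
q2, s2 = quantize_uniform(ws, 2)   # 2-bit: only levels {-1, 0, +1}
```

Lower bit widths shrink both storage and the per-cell conductance levels a ReRAM array must hold, at the cost of quantization error that the NAS search trades off per layer.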
Each circuit block was simulated for function and performance in an SMIC 28 nm process, and the simulation results matched expectations. The crossbar array built from the CIM macro achieves a maximum row parallelism of 1536. The multi-precision accumulator, used for post-weighting in the analog domain, reduces the number of ADCs by 87.5%. The ADC comparison-cycle reduction scheme achieves up to an 84.3% reduction in ADC power and a 21.3% reduction in accumulation readout circuit power. A network verification environment was built with MemTorch, a PyTorch-based simulator, and the inference performance of the proposed accelerator was tested with a NAS-optimized ResNet-18 on the CIFAR-10 dataset, reaching a Top-1 inference accuracy of 93.46%. The designed CIM accelerator achieves a multi-precision average energy efficiency of 180.61 TOPS/W, a 7.81x to 15.21x improvement over other multi-precision CIM accelerators.
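The multi-bit MAC that the charge-domain accumulator performs in analog can be sketched digitally: each pairing of input-bit plane i and weight-bit plane j yields a 1-bit dot product that contributes with binary weight 2^(i+j), which is exactly the shift-accumulate step the charge-domain circuit replaces. A minimal Python sketch, assuming unsigned operands (names and bit widths are illustrative, not the thesis's implementation):

```python
def mac_bit_serial(inputs, weights, in_bits=8, w_bits=8):
    """Dot product computed from bit-wise partial sums, as a CIM crossbar would:
    for each (input-bit i, weight-bit j) pair, a 1-bit dot product is taken
    across all rows in parallel, then accumulated with binary weight 2**(i+j)."""
    acc = 0
    for i in range(in_bits):
        for j in range(w_bits):
            # 1-bit x 1-bit products summed down one column (row-parallel in hardware)
            partial = sum(((x >> i) & 1) * ((w >> j) & 1)
                          for x, w in zip(inputs, weights))
            acc += partial << (i + j)   # digital shift-accumulate / analog weighting
    return acc

# The bit-serial result equals the direct integer dot product
xs = [3, 17, 250, 96]
ws = [5, 2, 1, 200]
assert mac_bit_serial(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```

Because the weighting of all bit planes is linear, it can be applied before analog-to-digital conversion; performing it in the charge domain is what lets the design share one ADC across many bit planes and cut the ADC count.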
Keywords
Language
Chinese
Training Category
Independent training
Year of Enrollment
2021
Year of Degree Conferral
2023-06
Degree Assessment Subcommittee
Materials and Chemical Engineering
Chinese Library Classification
TN492
Source Repository
Manual submission
Output Type
Dissertation
Item Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/544463
Collection
南方科技大学-香港科技大学深港微电子学院筹建办公室
Recommended Citation
GB/T 7714
刘俊. 基于电荷域累加读出电路的多精度ReRAM存算加速器设计[D]. 深圳: 南方科技大学, 2023.