Title

Automatic Performance Optimization of Small and Irregular Matrix Multiplication on ARM Architectures

Other Title
AUTOMATIC KERNEL GENERATION FOR IRREGULAR GEMM OPTIMIZATION ON ARM ARCHITECTURES
Name
吴都
Name (Pinyin)
WU Du
Student ID
12032265
Degree Type
Master's
Degree Discipline
0809 Electronic Science and Technology
Discipline Category / Professional Degree Category
08 Engineering
Supervisors
潘毅 / 孟金涛
Supervisor Affiliations
Shenzhen University of Advanced Technology (under preparation) / Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Thesis Defense Date
2023-05-19
Thesis Submission Date
2023-07-05
Degree-Granting Institution
Southern University of Science and Technology
Place of Degree Conferral
Shenzhen
Abstract

General Matrix Multiply (GEMM) is a fundamental building block of both traditional scientific simulation and emerging deep learning. As application algorithms evolve, the sizes and shapes of GEMM's input matrices vary, and small and irregularly shaped matrices have become increasingly common; this poses the first new challenge for GEMM performance optimization. Deep learning must now be deployed across many hardware platforms: models are typically trained on large CPU, GPU, and TPU clusters and then run inference on personal computers, mobile phones, embedded devices, and accelerators (such as FPGAs and ASICs) to deliver real-time AI services. ARM is the most common architecture in mobile devices, is increasingly used in personal computers and large servers, and even serves as the primary architecture of supercomputers. Sustaining both high performance and high portability of matrix multiplication on ARM therefore constitutes the second and third new challenges. To address these three challenges, this thesis implements efficient matrix multiplication on ARM-based embedded devices and server-class CPUs. We propose autoGEMM, a library that runs broadly across ARM architectures and pushes beyond the performance limits of existing irregular matrix multiplication implementations. By combining code generation with hand-optimized core assembly fragments, autoGEMM produces high-performance GEMM kernels for matrices of various shapes and for different hardware configurations, maximizing the library's performance across a wide range of ARM devices. autoGEMM also builds a runtime cost model from the latencies of the load and store instructions in the GEMM micro-kernel; this model guides a dynamic register blocking and splitting algorithm (RBSA) implemented in the TVM framework to optimize the kernels' compute intensity, and it prunes TVM's search space so that auto-tuning finds better parameter combinations. autoGEMM demonstrates its advantages over existing work on four ARM CPUs. For small matrices, autoGEMM reaches 97.6%, 98.3%, 98.4%, and 96.5% of the theoretical hardware peak on Huawei Kunpeng 920, Amazon Graviton2, Ampere Altra, and Apple M2, respectively, consistently maintaining a performance advantage of more than 5% over state-of-the-art libraries such as LIBXSMM and LIBSHALOM. For irregular matrices, single-core autoGEMM outperforms OpenBLAS by 1.3x on average (up to 1.9x) and Eigen by 1.5x on average (up to 2.0x) across the four devices. In multi-core runs, autoGEMM achieves average speedups over OpenBLAS and Eigen of 2x (up to 4x) on Huawei Kunpeng 920, 1.8x (up to 3.3x) on Amazon Graviton2, 1.8x (up to 3.4x) on Ampere Altra, and 1.7x (up to 3.3x) on Apple M2.
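The register-blocking trade-off that RBSA navigates can be illustrated with a minimal, self-contained sketch. The code below is not the thesis's actual RBSA or cost model; it assumes a simple BLIS-style register-counting model (helper names `vregs_needed` and `intensity` are illustrative) for an ARMv8-A NEON FP32 micro-kernel: an mr x nr block must fit its accumulators and operand registers into the 32 vector registers, while its compute intensity, FMLA instructions per vector load in the k-loop body, grows with the block size.

```python
import math

# Illustrative sketch only (assumed model, not the thesis's RBSA):
# choose an (mr x nr) FP32 register block for an ARMv8-A NEON GEMM
# micro-kernel by maximizing FMLAs per vector load under the budget
# of 32 vector registers (v0..v31), each holding four FP32 lanes.

VLEN = 4        # FP32 lanes per 128-bit NEON register
NUM_VREGS = 32  # architectural vector registers on ARMv8-A

def vregs_needed(mr: int, nr: int) -> int:
    """Vector registers held live in the k-loop body."""
    acc = mr * (nr // VLEN)        # C accumulators: mr rows x nr/4 vectors
    a_regs = math.ceil(mr / VLEN)  # A column fragment (lanes broadcast by FMLA)
    b_regs = nr // VLEN            # B row fragment
    return acc + a_regs + b_regs

def intensity(mr: int, nr: int) -> float:
    """FMLA instructions per vector load in one k iteration."""
    fmla = mr * (nr // VLEN)
    loads = math.ceil(mr / VLEN) + nr // VLEN
    return fmla / loads

# Enumerate feasible block shapes and pick the most compute-intense one.
candidates = [(mr, nr)
              for mr in range(1, 17)
              for nr in range(VLEN, 33, VLEN)
              if vregs_needed(mr, nr) <= NUM_VREGS]

mr, nr = max(candidates, key=lambda s: intensity(*s))
print(f"best block {mr}x{nr}, intensity {intensity(mr, nr):.2f}")
# -> best block 8x12, intensity 4.80
```

In this static model the 8x12 block maximizes FMLAs per load while occupying 29 of the 32 vector registers, which is why shapes of roughly that size are common in hand-written ARMv8 FP32 GEMM kernels; per the abstract, the thesis's RBSA goes further by building the cost model from the measured latencies of the micro-kernel's load and store instructions rather than a static register count.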

Keywords
Language
Chinese
Training Category
Independent training
Year of Enrollment
2020
Year of Degree Conferral
2023-07
Reference List

[1] GUENNEBAUD G, JACOB B, et al. Eigen v3[EB/OL]. 2010. http://eigen.tuxfamily.org.
[2] VAN ZEE F G, VAN DE GEIJN R A. BLIS: A framework for rapidly instantiating BLAS functionality[J]. ACM Transactions on Mathematical Software (TOMS), 2015, 41(3): 1-33.
[3] ZHANG X. OpenBLAS library[EB/OL]. 2012. https://github.com/xianyi/OpenBLAS.
[4] GOTO K, GEIJN R A V D. Anatomy of high-performance matrix multiplication[J]. ACM Transactions on Mathematical Software (TOMS), 2008, 34(3): 1-25.
[5] INTEL. MKL[EB/OL]. https://software.intel.com/en-us/mkl.
[6] AMD. AOCL[EB/OL]. https://developer.amd.com/amd-aocl/.
[7] ARM. ARM Compute Library[EB/OL]. https://github.com/ARM-software/ComputeLibrary.
[8] DUKHAN M. NNPACK library[EB/OL]. https://github.com/Maratyszcza/NNPACK.
[9] ALIBABA. MNN[EB/OL]. https://github.com/alibaba/MNN.
[10] LAN H, MENG J, HUNDT C, et al. FeatherCNN: Fast inference computation with TensorGEMM on ARM architectures[J]. IEEE Transactions on Parallel and Distributed Systems, 2019, 31(3): 580-594.
[11] TENCENT. NCNN[EB/OL]. https://github.com/Tencent/ncnn.
[12] TENCENT. TNN[EB/OL]. https://github.com/Tencent/TNN.
[13] HEINECKE A, HENRY G, HUTCHINSON M, et al. LIBXSMM: accelerating small matrix multiplications by runtime code generation[C]//The International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE, 2016.
[14] LAVIN A, GRAY S. Fast algorithms for convolutional neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4013-4021.
[15] FRISON G, KOUZOUPIS D, SARTOR T, et al. BLASFEO: Basic linear algebra subroutines for embedded optimization[J]. ACM Transactions on Mathematical Software (TOMS), 2018, 44(4): 1-30.
[16] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[17] DUKHAN M. The indirect convolution algorithm[A]. 2019.
[18] YANG W, FANG J, DONG D, et al. LIBSHALOM: optimizing small and irregular-shaped matrix multiplications on ARMv8 multi-cores[C]//The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC). IEEE, 2021.
[19] MENG J, ZHUANG C, CHEN P, et al. Automatic Generation of High-Performance Convolution Kernels on ARM CPUs for Deep Learning[J]. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(11): 2885-2899.
[20] STEPHENS N. ARMv8-A next-generation vector architecture for HPC[C]//2016 IEEE Hot Chips 28 Symposium (HCS). IEEE, 2016: 1-31.
[21] FLUR S, GRAY K E, PULTE C, et al. Modelling the ARMv8 architecture, operationally: Concurrency and ISA[C]//Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. 2016: 608-621.
[22] DONGARRA J. Report on the Fujitsu Fugaku system[J]. University of Tennessee-Knoxville Innovative Computing Laboratory, Tech. Rep. ICL-UT-20-06, 2020.
[23] BAER J L. Microprocessor architecture: from simple pipelines to chip multiprocessors[M]. Cambridge University Press, 2009.
[24] BAGHDADI R, RAY J, ROMDHANE M B, et al. Tiramisu: A polyhedral compiler for expressing fast and portable code[C]//2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2019: 193-205.
[25] VASILACHE N, ZINENKO O, THEODORIDIS T, et al. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions[A]. 2018.
[26] LATTNER C, AMINI M, BONDHUGULA U, et al. MLIR: A compiler infrastructure for the end of Moore’s law[A]. 2020.
[27] TANG S, ZHAI J, WANG H, et al. FreeTensor: a free-form DSL with holistic optimizations for irregular tensor programs[C]//Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 2022: 872-887.
[28] RAGAN-KELLEY J, BARNES C, ADAMS A, et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines[J]. ACM SIGPLAN Notices, 2013, 48(6): 519-530.
[29] CHEN T, MOREAU T, JIANG Z, et al. TVM: An automated end-to-end optimizing compiler for deep learning[C]//13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 2018: 578-594.
[30] CHEN T, MOREAU T, JIANG Z, et al. TVM: end-to-end optimization stack for deep learning: volume 11[A]. 2018: 20.
[31] LATTNER C, ADVE V. LLVM: A compilation framework for lifelong program analysis & transformation[C]//International Symposium on Code Generation and Optimization, 2004. CGO 2004. IEEE, 2004: 75-86.
[32] CHEN T, ZHENG L, YAN E, et al. Learning to optimize tensor programs[J]. Advances in Neural Information Processing Systems, 2018, 31.
[33] CHEN T, HE T, BENESTY M, et al. Xgboost: extreme gradient boosting[J]. R package version 0.4-2, 2015, 1(4): 1-4.
[34] ZHENG L, JIA C, SUN M, et al. Ansor: Generating high-performance tensor programs for deep learning[C]//Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 2020: 863-879.
[35] GAO X, CUI W, ZHANG L, et al. OpEvo: an evolutionary method for tensor operator optimization[C]//Proceedings of the AAAI Conference on Artificial Intelligence: volume 35. 2021: 12320-12327.
[36] ARM LIMITED. ARM architecture[EB/OL]. https://www.arm.com/en/.
[37] HUAWEI. Kunpeng 920 chip[EB/OL]. https://e.huawei.com/cn/products/computing/kunpeng/server-board.
[38] AWS. AWS Graviton2 M6g[EB/OL]. https://aws.amazon.com/en/ec2/graviton/.
[39] AMPERE. Ampere Altra[EB/OL]. https://amperecomputing.com/processors/ampere-altra.
[40] WIKICHIP. Apple M1[EB/OL]. https://en.wikichip.org/wiki/apple/mx/m1.
[41] ARM LIMITED. ARMv5 Architecture Reference Manual[EB/OL]. https://developer.arm.com/documentation/ddi0100/i/.
[42] ARM LIMITED. Neon Programmer’s Guide for Armv8-A: Introducing Neon for Armv8-A[EB/OL]. https://developer.arm.com/documentation/102474.
[43] STEPHENS N, BILES S, BOETTCHER M, et al. The ARM scalable vector extension[J]. IEEE Micro, 2017, 37(2): 26-39.
[44] ARM LIMITED. The Scalable Matrix Extension (SME), for Armv9-A Arm Architecture Reference Manual Supplement[EB/OL]. https://developer.arm.com/documentation/ddi0616.
[45] WILLIAMS S, WATERMAN A, PATTERSON D. Roofline: an insightful visual performance model for multicore architectures[J]. Communications of the ACM, 2009, 52(4): 65-76.
[46] KÅGSTRÖM B, LING P, VAN LOAN C. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark[J]. ACM Transactions on Mathematical Software (TOMS), 1998, 24(3): 268-302.
[47] MAO Y, ZHOU H, GUI X, et al. Exploring convolution neural network for branch prediction[J]. IEEE Access, 2020, 8: 152008-152016.
[48] EYERMAN S, HEIRMAN W, VAN DEN STEEN S, et al. Enabling Branch-Mispredict Level Parallelism by Selectively Flushing Instructions[C]//MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. 2021: 767-778.
[49] SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2818-2826.
[50] HOWARD A G, ZHU M, CHEN B, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[A]. 2017.
[51] IANDOLA F N, HAN S, MOSKEWICZ M W, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size[A]. 2016.

Degree Assessment Subcommittee
Electronic Science and Technology
Chinese Library Classification Number
520.3020
Source Repository
Manual submission
Document Type
Thesis
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/545090
Collection
Southern University of Science and Technology – Shenzhen University of Advanced Technology (CAS, under preparation) joint training
Recommended Citation (GB/T 7714)
WU Du. Automatic Performance Optimization of Small and Irregular Matrix Multiplication on ARM Architectures[D]. Shenzhen: Southern University of Science and Technology, 2023.
Files in This Item
File Name/Size: 12032265-吴都-中国科学院深圳理 (3198 KB) · Access: Restricted (full text available on request)