[1] GUENNEBAUD G, JACOB B, et al. Eigen v3[EB/OL]. 2010. http://eigen.tuxfamily.org.
[2] VAN ZEE F G, VAN DE GEIJN R A. BLIS: A framework for rapidly instantiating BLAS functionality[J]. ACM Transactions on Mathematical Software (TOMS), 2015, 41(3): 1-33.
[3] ZHANG X. OpenBLAS library[EB/OL]. 2012. https://github.com/xianyi/OpenBLAS.
[4] GOTO K, VAN DE GEIJN R A. Anatomy of high-performance matrix multiplication[J]. ACM Transactions on Mathematical Software (TOMS), 2008, 34(3): 1-25.
[5] INTEL. MKL[EB/OL]. https://software.intel.com/en-us/mkl.
[6] AMD. AOCL[EB/OL]. https://developer.amd.com/amd-aocl/.
[7] ARM. ARM Compute Library[EB/OL]. https://github.com/ARM-software/ComputeLibrary.
[8] DUKHAN M. NNPACK library[EB/OL]. https://github.com/Maratyszcza/NNPACK.
[9] ALIBABA. MNN[EB/OL]. https://github.com/alibaba/MNN.
[10] LAN H, MENG J, HUNDT C, et al. FeatherCNN: Fast inference computation with TensorGEMM on ARM architectures[J]. IEEE Transactions on Parallel and Distributed Systems, 2019, 31(3): 580-594.
[11] TENCENT. NCNN[EB/OL]. https://github.com/Tencent/ncnn.
[12] TENCENT. TNN[EB/OL]. https://github.com/Tencent/TNN.
[13] HEINECKE A, HENRY G, HUTCHINSON M, et al. LIBXSMM: accelerating small matrix multiplications by runtime code generation[C]//The International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE, 2016.
[14] LAVIN A, GRAY S. Fast algorithms for convolutional neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4013-4021.
[15] FRISON G, KOUZOUPIS D, SARTOR T, et al. BLASFEO: Basic linear algebra subroutines for embedded optimization[J]. ACM Transactions on Mathematical Software (TOMS), 2018, 44(4): 1-30.
[16] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[17] DUKHAN M. The indirect convolution algorithm[A]. 2019.
[18] YANG W, FANG J, DONG D, et al. LIBSHALOM: optimizing small and irregular-shaped matrix multiplications on ARMv8 multi-cores[C]//The International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE, 2021.
[19] MENG J, ZHUANG C, CHEN P, et al. Automatic Generation of High-Performance Convolution Kernels on ARM CPUs for Deep Learning[J]. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(11): 2885-2899.
[20] STEPHENS N. ARMv8-A next-generation vector architecture for HPC[C]//2016 IEEE Hot Chips 28 Symposium (HCS). IEEE, 2016: 1-31.
[21] FLUR S, GRAY K E, PULTE C, et al. Modelling the ARMv8 architecture, operationally: Concurrency and ISA[C]//Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. 2016: 608-621.
[22] DONGARRA J. Report on the Fujitsu Fugaku system[R]. University of Tennessee-Knoxville Innovative Computing Laboratory, Tech. Rep. ICL-UT-20-06, 2020.
[23] BAER J L. Microprocessor architecture: from simple pipelines to chip multiprocessors[M]. Cambridge University Press, 2009.
[24] BAGHDADI R, RAY J, ROMDHANE M B, et al. Tiramisu: A polyhedral compiler for expressing fast and portable code[C]//2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2019: 193-205.
[25] VASILACHE N, ZINENKO O, THEODORIDIS T, et al. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions[A]. 2018.
[26] LATTNER C, AMINI M, BONDHUGULA U, et al. MLIR: A compiler infrastructure for the end of Moore’s law[A]. 2020.
[27] TANG S, ZHAI J, WANG H, et al. FreeTensor: a free-form DSL with holistic optimizations for irregular tensor programs[C]//Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 2022: 872-887.
[28] RAGAN-KELLEY J, BARNES C, ADAMS A, et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines[J]. ACM SIGPLAN Notices, 2013, 48(6): 519-530.
[29] CHEN T, MOREAU T, JIANG Z, et al. TVM: An automated end-to-end optimizing compiler for deep learning[C]//13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 2018: 578-594.
[30] CHEN T, MOREAU T, JIANG Z, et al. TVM: end-to-end optimization stack for deep learning[A]. 2018.
[31] LATTNER C, ADVE V. LLVM: A compilation framework for lifelong program analysis & transformation[C]//International Symposium on Code Generation and Optimization, 2004. CGO 2004. IEEE, 2004: 75-86.
[32] CHEN T, ZHENG L, YAN E, et al. Learning to optimize tensor programs[J]. Advances in Neural Information Processing Systems, 2018, 31.
[33] CHEN T, HE T, BENESTY M, et al. XGBoost: extreme gradient boosting[J]. R package version 0.4-2, 2015, 1(4): 1-4.
[34] ZHENG L, JIA C, SUN M, et al. Ansor: Generating high-performance tensor programs for deep learning[C]//Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 2020: 863-879.
[35] GAO X, CUI W, ZHANG L, et al. OpEvo: an evolutionary method for tensor operator optimization[C]//Proceedings of the AAAI Conference on Artificial Intelligence: volume 35. 2021: 12320-12327.
[36] ARM. ARM architecture[EB/OL]. https://www.arm.com/en/.
[37] HUAWEI. Kunpeng 920 chip[EB/OL]. https://e.huawei.com/cn/products/computing/kunpeng/server-board.
[38] AWS. AWS Graviton2 M6g[EB/OL]. https://aws.amazon.com/en/ec2/graviton/.
[39] AMPERE. Ampere Altra[EB/OL]. https://amperecomputing.com/processors/ampere-altra.
[40] WIKICHIP. Apple M1[EB/OL]. https://en.wikichip.org/wiki/apple/mx/m1.
[41] ARM. ARMv5 Architecture Reference Manual[EB/OL]. https://developer.arm.com/documentation/ddi0100/i/.
[42] ARM. Neon Programmer’s Guide for Armv8-A: Introducing Neon for Armv8-A[EB/OL]. https://developer.arm.com/documentation/102474.
[43] STEPHENS N, BILES S, BOETTCHER M, et al. The ARM scalable vector extension[J]. IEEE Micro, 2017, 37(2): 26-39.
[44] ARM. The Scalable Matrix Extension (SME), for Armv9-A Arm Architecture Reference Manual Supplement[EB/OL]. https://developer.arm.com/documentation/ddi0616.
[45] WILLIAMS S, WATERMAN A, PATTERSON D. Roofline: an insightful visual performance model for multicore architectures[J]. Communications of the ACM, 2009, 52(4): 65-76.
[46] KÅGSTRÖM B, LING P, VAN LOAN C. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark[J]. ACM Transactions on Mathematical Software (TOMS), 1998, 24(3): 268-302.
[47] MAO Y, ZHOU H, GUI X, et al. Exploring convolution neural network for branch prediction[J]. IEEE Access, 2020, 8: 152008-152016.
[48] EYERMAN S, HEIRMAN W, VAN DEN STEEN S, et al. Enabling Branch-Mispredict Level Parallelism by Selectively Flushing Instructions[C]//MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. 2021: 767-778.
[49] SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2818-2826.
[50] HOWARD A G, ZHU M, CHEN B, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[A]. 2017.
[51] IANDOLA F N, HAN S, MOSKEWICZ M W, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size[A]. 2016.