[1] JIANG Y, NEYSHABUR B, MOBAHI H, et al. Fantastic generalization measures and where to find them[A]. 2019.
[2] BARTLETT P L, MENDELSON S. Rademacher and Gaussian complexities: Risk bounds and structural results[J]. Journal of Machine Learning Research, 2002, 3(Nov): 463-482.
[3] GAO W, ZHOU Z H. Dropout Rademacher complexity of deep neural networks[J]. Science China Information Sciences, 2016, 59(7): 1-12.
[4] BARTLETT P L, FOSTER D J, TELGARSKY M J. Spectrally-normalized margin bounds for neural networks[J]. Advances in Neural Information Processing Systems, 2017, 30.
[5] GOLOWICH N, RAKHLIN A, SHAMIR O. Size-independent sample complexity of neural networks[C/OL]//BUBECK S, PERCHET V, RIGOLLET P. Proceedings of Machine Learning Research: volume 75 Proceedings of the 31st Conference on Learning Theory. PMLR, 2018: 297-299. https://proceedings.mlr.press/v75/golowich18a.html.
[6] VAPNIK V, LEVIN E, LE CUN Y. Measuring the VC-dimension of a learning machine[J]. Neural Computation, 1994, 6(5): 851-876.
[7] BARTLETT P, MAIOROV V, MEIR R. Almost linear VC dimension bounds for piecewise polynomial networks[J]. Advances in Neural Information Processing Systems, 1998, 11.
[8] BARTLETT P L, HARVEY N, LIAW C, et al. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks[J/OL]. Journal of Machine Learning Research, 2019, 20(63): 1-17. http://jmlr.org/papers/v20/17-612.html.
[9] NEYSHABUR B, BHOJANAPALLI S, MCALLESTER D, et al. Exploring generalization in deep learning[J]. Advances in Neural Information Processing Systems, 2017, 30.
[10] LIANG T, POGGIO T, RAKHLIN A, et al. Fisher-Rao metric, geometry, and complexity of neural networks[C/OL]//CHAUDHURI K, SUGIYAMA M. Proceedings of Machine Learning Research: volume 89 Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics. PMLR, 2019: 888-896. https://proceedings.mlr.press/v89/liang19a.html.
[11] ZHANG C, BENGIO S, HARDT M, et al. Understanding deep learning (still) requires rethinking generalization[J]. Communications of the ACM, 2021, 64(3): 107-115.
[12] NAGARAJAN V, KOLTER J Z. Uniform convergence may be unable to explain generalization in deep learning[J]. Advances in Neural Information Processing Systems, 2019, 32.
[13] MOHRI M, ROSTAMIZADEH A, TALWALKAR A. Foundations of machine learning[M]. MIT Press, 2018.
[14] ALLEN-ZHU Z, LI Y, LIANG Y. Learning and generalization in overparameterized neural networks, going beyond two layers[J]. Advances in Neural Information Processing Systems, 2019, 32.
[15] ARORA S, GE R, NEYSHABUR B, et al. Stronger generalization bounds for deep nets via a compression approach[C]//International Conference on Machine Learning. PMLR, 2018: 254-263.
[16] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1-9.
[17] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[18] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2012, 25.
[19] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[J]. OpenAI Blog, 2019, 1(8): 9.
[20] BELKIN M, HSU D, MA S, et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off[J]. Proceedings of the National Academy of Sciences, 2019, 116(32): 15849-15854.
[21] NAKKIRAN P, KAPLUN G, BANSAL Y, et al. Deep double descent: Where bigger models and more data hurt[J]. Journal of Statistical Mechanics: Theory and Experiment, 2021, 2021(12): 124003.
[22] LAFON M, THOMAS A. Understanding the double descent phenomenon[Z]. 2022.
[23] LI Z, XIE C, WANG Q. Asymptotic normality and confidence intervals for prediction risk of the min-norm least squares estimator[C]//International Conference on Machine Learning. PMLR, 2021: 6533-6542.
[24] LIAO Z, COUILLET R, MAHONEY M W. A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent[J]. Advances in Neural Information Processing Systems, 2020, 33: 13939-13950.
[25] MARTIN C H, MAHONEY M W. Traditional and heavy-tailed self regularization in neural network models[A]. 2019.
[26] MARTIN C H, MAHONEY M W. Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks[C]//Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM, 2020: 505-513.
[27] MARTIN C H, MAHONEY M W. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning[J]. Journal of Machine Learning Research, 2021, 22(165): 1-73.
[28] MARTIN C H, PENG T S, MAHONEY M W. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data[J]. Nature Communications, 2021, 12(1): 1-13.
[29] MARTIN C H, MAHONEY M W. Post-mortem on a deep learning contest: a Simpson’s paradox and the complementary roles of scale metrics versus shape metrics[A]. 2021.
[30] MARČENKO V A, PASTUR L A. Distribution of eigenvalues for some sets of random matrices[J]. Mathematics of the USSR-Sbornik, 1967, 1(4): 457.
[31] DAVIS R A, PFAFFEL O, STELZER R. Limit theory for the largest eigenvalues of sample covariance matrices with heavy-tails[J]. Stochastic Processes and their Applications, 2014, 124(1): 18-50.
[32] AUFFINGER A, BEN AROUS G, PÉCHÉ S. Poisson convergence for the largest eigenvalues of heavy tailed random matrices[C]//Annales de l’IHP Probabilités et statistiques: volume 45. 2009: 589-610.
[33] SOSHNIKOV A. Poisson statistics for the largest eigenvalues of Wigner random matrices with heavy tails[J]. Electronic Communications in Probability, 2004, 9: 82-91.
[34] DAVIS R A, MIKOSCH T, PFAFFEL O. Asymptotic theory for the sample covariance matrix of a heavy-tailed multivariate time series[J]. Stochastic Processes and their Applications, 2016, 126(3): 767-799.
[35] DAVIS R A, HEINY J, MIKOSCH T, et al. Extreme value analysis for the sample autocovariance matrices of heavy-tailed multivariate time series[J]. Extremes, 2016, 19(3): 517-547.
[36] BURDA Z, JURKIEWICZ J. Heavy-tailed random matrices[A]. 2009.
[37] BELINSCHI S, DEMBO A, GUIONNET A. Spectral measure of heavy tailed band and covariance random matrices[J]. Communications in Mathematical Physics, 2009, 289(3): 1023-1055.
[38] HEINY J, YAO J. Limiting distributions for eigenvalues of sample correlation matrices from heavy-tailed populations[A]. 2020.
[39] YIN Y Q, BAI Z D, KRISHNAIAH P R. On the limit of the largest eigenvalue of the large dimensional sample covariance matrix[J]. Probability Theory and Related Fields, 1988, 78(4): 509-521.
[40] JOHNSTONE I M. On the distribution of the largest eigenvalue in principal components analysis[J]. The Annals of Statistics, 2001, 29(2): 295-327.
[41] BAIK J, SILVERSTEIN J W. Eigenvalues of large sample covariance matrices of spiked population models[J]. Journal of Multivariate Analysis, 2006, 97(6): 1382-1408.
[42] BAIK J, BEN AROUS G, PÉCHÉ S. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices[J]. The Annals of Probability, 2005, 33(5): 1643-1697.
[43] PAUL D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model[J]. Statistica Sinica, 2007, 17(4): 1617-1642.
[44] JOHNSTONE I M, LU A Y. On consistency and sparsity for principal components analysis in high dimensions[J]. Journal of the American Statistical Association, 2009, 104(486): 682-693.
[45] BOUCHAUD J P, MÉZARD M. Universality classes for extreme-value statistics[J]. Journal of Physics A: Mathematical and General, 1997, 30(23): 7997.
[46] MENG X, YAO J. Impact of classification difficulty on the weight matrices spectra in deep learning and application to early-stopping[A]. 2021.
[47] BLUM A L, RIVEST R L. Training a 3-node neural network is NP-complete[J]. Neural Networks, 1992, 5(1): 117-127.
[48] AUER P, HERBSTER M, WARMUTH M K. Exponentially many local minima for single neurons[J]. Advances in Neural Information Processing Systems, 1995, 8.
[49] KESKAR N S, SOCHER R. Improving generalization performance by switching from Adam to SGD[A]. 2017.
[50] ZHOU P, FENG J, MA C, et al. Towards theoretically understanding why SGD generalizes better than Adam in deep learning[J]. Advances in Neural Information Processing Systems, 2020, 33: 21285-21296.
[51] HODGKINSON L, MAHONEY M. Multiplicative noise and heavy tails in stochastic optimization[C]//International Conference on Machine Learning. PMLR, 2021: 4262-4274.
[52] SIMSEKLI U, SENER O, DELIGIANNIDIS G, et al. Hausdorff dimension, heavy tails, and generalization in neural networks[J]. Advances in Neural Information Processing Systems, 2020, 33: 5138-5151.
[53] BARSBEY M, SEFIDGARAN M, ERDOGDU M A, et al. Heavy tails in SGD and compressibility of overparametrized neural networks[J]. Advances in Neural Information Processing Systems, 2021, 34: 29364-29378.
[54] NEYSHABUR B, BHOJANAPALLI S, SREBRO N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks[A]. 2017.
[55] MANDT S, HOFFMAN M D, BLEI D M. Stochastic gradient descent as approximate Bayesian inference[A]. 2017.
[56] WANG J, WANG C, LIN Q, et al. Adversarial attacks and defenses in deep learning for image recognition: A survey[J]. Neurocomputing, 2022.
[57] SZEGEDY C, ZAREMBA W, SUTSKEVER I, et al. Intriguing properties of neural networks[A]. 2013.
[58] MAGNUS J R, NEUDECKER H. Matrix differential calculus with applications in statistics and econometrics[M]. John Wiley & Sons, 2019.
[59] RAGHUNATHAN A, STEINHARDT J, LIANG P. Certified defenses against adversarial examples[A]. 2018.
[60] LECUN Y, BOSER B, DENKER J S, et al. Backpropagation applied to handwritten zip code recognition[J]. Neural Computation, 1989, 1(4): 541-551.
[61] THAMM M, STAATS M, ROSENOW B. Random matrix analysis of deep neural network weight matrices[J]. Physical Review E, 2022, 106(5): 054124.