Title

深度强化学习的效率提升方法研究

Alternative Title
Methods for Improving Efficiency of Deep Reinforcement Learning
Name
杨琪
Name (Pinyin)
YANG Qi
Student ID
11930392
Degree Type
Master
Degree Discipline
0809 Electronic Science and Technology
Subject Category
08 Engineering
Supervisor
唐珂
Supervisor's Affiliation
Department of Computer Science and Engineering
Thesis Defense Date
2022-05-08
Thesis Submission Date
2022-06-16
Degree Granting Institution
南方科技大学
Degree Granting Place
Shenzhen
Abstract

Reinforcement learning is one of the key approaches to building decision-making intelligence. Model-free methods such as deep reinforcement learning do not rely on an environment model; instead, they optimize a policy model in a data-driven way to obtain a behavior policy. On the one hand, deep reinforcement learning has attracted wide attention in recent years for its potential generality and has achieved several important advances in sequential decision-making scenarios. On the other hand, existing methods suffer from long training times and demanding hardware requirements, which severely limit their practical application. Improving the efficiency of deep reinforcement learning algorithms is therefore particularly important.

Most previous deep reinforcement learning algorithms follow an iterative "sample then optimize" training pattern over a given policy model. This thesis studies both the policy-optimization and the data-sampling sides of this pattern, aiming to improve the performance of the policy obtained within a limited training budget.

For policy parameter optimization, this thesis proposes a parallelizable training algorithm based on random embedding. For the large-scale optimization of neural networks with redundant variables, the algorithm exploits the strengths of gradient-free methods in a much lower-dimensional search space. Experiments show that, compared with baseline algorithms, the proposed instantiation converges faster while also converging to better solutions.

For training-data selection, this thesis focuses on how to allocate computational resources preferentially to the key training data, so as to accelerate training without degrading the policy's generalization performance. It proposes a value metric for training data based on policy performance, and builds a selective sampling mechanism over the training data on top of it. Experiments show that, combined with a policy optimization algorithm, the proposed mechanism achieves generalization performance comparable to training on all samples while using only half of the training resources and training data.

Abstract (Other)

Reinforcement learning (RL) is one of the most important methods in decision intelligence. Without requiring a model of the environment, deep reinforcement learning (DRL), as a model-free approach, learns a policy through data-driven optimization. On the one hand, DRL has gained widespread attention over the last decade for its promising generality and has made great progress in various sequential decision-making scenarios. On the other hand, existing DRL methods suffer from long training times and high hardware requirements, which hinder their real-world application. It is therefore crucial to improve the efficiency of DRL algorithms. Given a policy model, previous DRL algorithms mostly follow an iterative pattern of sampling and optimization. This thesis focuses on improving the performance attainable within a limited computational budget in both phases: policy optimization and data sampling. For policy optimization, the thesis proposes an embedding-based, parallelizable algorithm. On the large-scale optimization problem posed by an over-parameterized network, the algorithm takes full advantage of gradient-free methods by searching in a relatively low-dimensional space. Experiments show that the proposed algorithm converges faster and to better policies than state-of-the-art baselines. For data sampling, the thesis studies how to assign computational resources to the key training data, thereby accelerating training without losing generalization performance. It proposes a value metric for training data based on policy performance and, building on it, a selective sampling mechanism over the training set. Experiments show that, combined with a state-of-the-art policy optimizer, the mechanism achieves generalization performance comparable to that of a policy trained on all data, while using only half of the training resources and data.
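To make the random-embedding idea concrete, the following Python sketch optimizes a high-dimensional policy parameter vector through a fixed random projection from a low-dimensional search space, driven by a basic evolution-strategy loop. This is only a minimal illustration of the general technique: the projection matrix, population settings, and the toy objective standing in for environment rollouts are assumptions, not the algorithm proposed in the thesis.

```python
# Minimal sketch: gradient-free policy search in a random low-dimensional embedding.
import numpy as np

D = 10_000   # number of policy-network parameters (high-dimensional space)
d = 100      # dimension of the embedded search space
A = np.random.randn(D, d) / np.sqrt(d)   # fixed random embedding matrix (assumption)

def episode_return(theta: np.ndarray) -> float:
    """Placeholder for rolling out the policy with parameters `theta` in the
    environment and returning the cumulative reward (toy objective here)."""
    return -float(np.sum(theta ** 2))

def es_search(iterations=200, pop_size=32, sigma=0.05, lr=0.02) -> np.ndarray:
    """Basic evolution-strategy loop run in the d-dimensional space.
    Each low-dimensional candidate z is mapped to full parameters via A @ z,
    so the expensive rollouts can be evaluated in parallel across the population."""
    z = np.zeros(d)
    for _ in range(iterations):
        noise = np.random.randn(pop_size, d)
        returns = np.array([episode_return(A @ (z + sigma * eps)) for eps in noise])
        ranks = (returns - returns.mean()) / (returns.std() + 1e-8)
        z += (lr / (pop_size * sigma)) * (noise.T @ ranks)   # ES gradient estimate
    return A @ z                                              # final high-dim parameters

best_theta = es_search()
```

The key design point the sketch reflects is that all search effort is spent in the d-dimensional space, while the network still receives a full D-dimensional parameter vector through the embedding.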
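The selective-sampling idea can likewise be sketched as a sampler that scores each training instance (for example, a procedurally generated level) and draws the next one with probability increasing in that score. The specific score used below, the gap between the best return observed on a level and the current policy's return, is a hypothetical stand-in for the policy-performance-based value metric described in the abstract.

```python
# Minimal sketch: selective sampling over training instances by a performance-based score.
import numpy as np

class SelectiveSampler:
    def __init__(self, num_levels: int, temperature: float = 0.5):
        self.best_return = np.full(num_levels, -np.inf)  # -inf marks unvisited levels
        self.last_return = np.zeros(num_levels)
        self.temperature = temperature

    def update(self, level_id: int, episode_return: float) -> None:
        """Record the latest rollout result for a level."""
        self.best_return[level_id] = max(self.best_return[level_id], episode_return)
        self.last_return[level_id] = episode_return

    def sample(self) -> int:
        """Draw the next training level, favoring levels where the current policy
        lags furthest behind the best return seen on that level."""
        unseen = np.flatnonzero(np.isinf(self.best_return))
        if unseen.size > 0:                      # visit every level at least once
            return int(np.random.choice(unseen))
        gap = self.best_return - self.last_return
        probs = np.exp((gap - gap.max()) / self.temperature)
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))

# Schematic use in a training loop (rollout() is assumed to exist):
#   level = sampler.sample()
#   ret = rollout(policy, level)
#   sampler.update(level, ret)
```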

Keywords
Other Keywords
Language
Chinese
Training Category
Independent training
Year of Enrollment
2019
Year Degree Conferred
2022-06
Degree Assessment Subcommittee
Department of Computer Science and Engineering
CLC Number
TP183.0
Source
Manual submission
Item Type
Thesis
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/335889
Collection
College of Engineering, Department of Computer Science and Engineering
Recommended Citation
GB/T 7714
杨琪. 深度强化学习的效率提升方法研究[D]. 深圳: 南方科技大学, 2022.
Files in This Item
11930392-杨琪-计算机科学与工程 (3517 KB): restricted access