Title

种群协助的高效强化学习研究

Alternative Title
Efficient Population-assisted Reinforcement Learning
Name
郑博文
Name (Pinyin)
ZHENG Bowen
Student ID
12032944
Degree Type
Master
Degree Discipline
0809 Electronic Science and Technology
Discipline Category / Professional Degree Category
08 Engineering
Supervisor
程然
Supervisor's Affiliation
Department of Computer Science and Engineering
Thesis Defense Date
2023-05-13
Thesis Submission Date
2023-07-01
Degree-Granting Institution
南方科技大学 (Southern University of Science and Technology)
Place of Degree Conferral
Shenzhen
Abstract

In recent years, with the explosive development of deep learning, deep reinforcement learning algorithms have achieved remarkable results in many domains. Model-free reinforcement learning algorithms are widely used for policy search problems. However, further research has found that, because their bootstrapped learning paradigm differs from the supervised learning paradigm, these algorithms suffer from insufficient sample diversity and brittle convergence. On the other hand, population-based neuroevolution methods have emerged as an alternative approach to policy search, but these population-based optimizers are limited by their relatively inefficient operators and high sampling complexity. Recent studies have attempted to combine the two approaches by connecting them through a shared experience replay buffer: the parameters optimized by reinforcement learning are used to accelerate the evolutionary algorithm, while the data sampled by the individuals of the population assist reinforcement learning. In this thesis, algorithms that follow this idea are collectively referred to as population-assisted reinforcement learning (PaRL) algorithms. This thesis investigates potential problems in existing PaRL algorithms and proposes an efficient hybrid algorithm. The main contributions are as follows:

We propose a unified and scalable learning framework. The framework accommodates the parallel nature of populations and enables efficient training of population-assisted reinforcement learning algorithms. It also supports training other types of algorithms, allowing fairer comparisons.

Based on this framework, we propose a simple yet efficient evolution strategy algorithm for population-assisted reinforcement learning. Compared with previous work, this method significantly improves the performance of the population.

To understand how off-policy population data affect reinforcement learning, we analyze the off-policy reinforcement learning algorithms used in population-assisted reinforcement learning. We identify a previously overlooked problem: the discrepancy between the data distribution of the population and that of the target actor introduces errors into the reinforcement learning updates. We design experiments to verify the impact of this error. To address it, we propose a compensation mechanism based on a dual experience replay buffer, which exploits the corrective effect of on-policy data to reduce the error caused by population data, allowing reinforcement learning to make better use of the population data.
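To make the scheme described in this abstract concrete, the following is a minimal, purely illustrative Python sketch of the overall population-assisted training loop. Every helper name here (evaluate_fn, rl_update_fn, actor_params_fn, load_params_fn) is a hypothetical placeholder standing in for the real environment-interaction and RL-update code; this is not the thesis's actual implementation.

```python
# Illustrative sketch of a population-assisted RL (PaRL) loop under the
# assumptions stated above; all callables are hypothetical placeholders.
import random

def parl_training_loop(population, evaluate_fn, rl_update_fn, actor_params_fn,
                       load_params_fn, generations=100, batch_size=256,
                       rl_updates_per_gen=50):
    shared_buffer = []  # shared experience replay buffer
    for _ in range(generations):
        # 1) Each individual interacts with the environment; its transitions
        #    are added to the shared replay buffer.
        fitnesses = []
        for individual in population:
            episode_return, transitions = evaluate_fn(individual)
            shared_buffer.extend(transitions)
            fitnesses.append(episode_return)

        # 2) The off-policy RL agent trains on batches drawn from the shared
        #    buffer, i.e., on data produced by the whole population.
        for _ in range(rl_updates_per_gen):
            if len(shared_buffer) >= batch_size:
                rl_update_fn(random.sample(shared_buffer, batch_size))

        # 3) The RL agent's optimized actor parameters are injected back into
        #    the population (here: replacing the worst individual), so that
        #    gradient information accelerates the evolutionary search.
        worst = fitnesses.index(min(fitnesses))
        load_params_fn(population[worst], actor_params_fn())
    return population
```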

Other Abstract (English)

With the remarkable development of deep learning in recent years, deep reinforcement learning (RL) algorithms have demonstrated exceptional performance across various domains. Model-free reinforcement learning algorithms, in particular, are extensively employed for policy search problems. However, further research has revealed that these algorithms suffer from insufficiently diverse exploration and brittle convergence due to their bootstrap learning paradigm, which differs from supervised learning. On the other hand, population-based approaches such as neuroevolution are gaining traction as alternative solutions for policy search. However, these population-based techniques rely on relatively inefficient heuristic operators and exhibit high sampling complexity, constraining their potential. Recently, some algorithms have attempted to integrate the two approaches by connecting them through a shared replay buffer: they use the optimized parameters from the reinforcement learning method to expedite the evolutionary algorithm and utilize data sampled by individuals of the population to aid reinforcement learning. We collectively refer to algorithms following this concept as population-assisted reinforcement learning (PaRL). In this thesis, we investigate potential issues in existing population-assisted reinforcement learning methods and subsequently propose a high-performance hybrid algorithm, with the following main contributions:

We introduce a unified and scalable learning framework that aligns with the parallel nature of populations, enabling faster algorithm training; the framework also supports training other types of algorithms, which allows fairer comparisons.
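As a purely generic illustration of why such a framework pays off, the sketch below evaluates population members in parallel with a process pool: population rollouts are mutually independent and therefore embarrassingly parallel. The toy_rollout function is a hypothetical stand-in, and the thesis's framework is a distributed design rather than this simple pool.

```python
# Illustrative only: parallel evaluation of independent population members.
from concurrent.futures import ProcessPoolExecutor

def toy_rollout(params):
    # Stand-in for running one episode with policy parameters `params`
    # and returning its episode return; here just a dummy score.
    return -sum(p * p for p in params)

def evaluate_population(param_list, num_workers=4):
    # Each individual's rollout is independent, so rollouts can be dispatched
    # to separate worker processes and gathered in order.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(toy_rollout, param_list))

if __name__ == "__main__":
    population = [[0.1 * i, 0.2 * i] for i in range(8)]
    print(evaluate_population(population))
```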
  
Based on this learning framework, we propose a simple and efficient evolution strategy algorithm for PaRL, which significantly enhances the population's performance compared to previous work.
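For readers unfamiliar with evolution strategies, the sketch below shows a generic perturb-evaluate-recombine ES update in the style of natural/OpenAI-style evolution strategies. It is not the specific ES algorithm proposed in the thesis, whose details are not given in this abstract.

```python
# Generic evolution strategy update (illustrative; not the thesis's algorithm).
import numpy as np

def es_step(theta, fitness_fn, pop_size=32, sigma=0.1, lr=0.01, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    noise = rng.standard_normal((pop_size, theta.size))        # Gaussian perturbations
    candidates = theta + sigma * noise                          # perturbed parameter vectors
    returns = np.array([fitness_fn(c) for c in candidates])     # evaluate each candidate
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize fitnesses
    grad_estimate = noise.T @ advantages / (pop_size * sigma)   # score-function gradient estimate
    return theta + lr * grad_estimate                           # gradient-ascent step

# Toy usage: maximize -||x||^2, i.e., drive theta toward zero.
# theta = np.ones(4)
# for _ in range(200):
#     theta = es_step(theta, lambda x: -np.sum(x ** 2))
```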
  
We scrutinize the off-policy reinforcement learning algorithm used in population-assisted reinforcement learning and identify a previously overlooked update error, caused by the discrepancy in data distribution between the population and the RL agent. We conduct empirical analyses to ascertain the impact of this error. To address this issue, we further propose a dual replay buffer design for the PaRL framework that leverages the corrective effect of on-policy data to mitigate the error introduced by the off-policy population data, thereby allowing reinforcement learning to better utilize the information from the population data.
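A minimal sketch of how such a dual replay buffer might look is given below. The class name, buffer capacities, and the fixed 50/50 sampling mix are illustrative assumptions, not the thesis's exact design.

```python
# Illustrative dual-replay-buffer sketch: each RL minibatch mixes transitions
# from the large shared (population) buffer with the agent's own recent,
# near on-policy transitions. Sizes and mixing ratio are assumptions.
import random
from collections import deque

class DualReplayBuffer:
    def __init__(self, shared_capacity=1_000_000, onpolicy_capacity=10_000):
        self.shared = deque(maxlen=shared_capacity)      # population + agent data
        self.onpolicy = deque(maxlen=onpolicy_capacity)  # recent agent-only data

    def add_population(self, transition):
        self.shared.append(transition)

    def add_agent(self, transition):
        # The agent's own transitions go into both buffers.
        self.shared.append(transition)
        self.onpolicy.append(transition)

    def sample(self, batch_size, onpolicy_fraction=0.5):
        # Mix the two sources so that near on-policy data can correct the
        # update error introduced by the mismatched population distribution.
        n_on = min(int(batch_size * onpolicy_fraction), len(self.onpolicy))
        n_off = min(batch_size - n_on, len(self.shared))
        batch = random.sample(self.onpolicy, n_on) + random.sample(self.shared, n_off)
        random.shuffle(batch)
        return batch
```

In practice the mixing ratio would be a tunable hyperparameter rather than a fixed constant.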

Keywords
Other Keywords
Language
Chinese
Training Category
Independent training
Year of Enrollment
2020
Year of Degree Conferral
2023-06
Degree Evaluation Subcommittee
Electronic Science and Technology
CLC Number (Chinese Library Classification)
TP183
Source Repository
Manually submitted
Output Type
Thesis
Item Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/544753
Collection
工学院_计算机科学与工程系 (College of Engineering, Department of Computer Science and Engineering)
Recommended Citation
GB/T 7714
郑博文. 种群协助的高效强化学习研究[D]. 深圳: 南方科技大学, 2023.
Files in This Item
12032944-郑博文-计算机科学与工 (4009 KB), restricted access (full text available on request)