Title

复杂强化学习策略搜索的分布式演化算法 (Distributed Evolutionary Algorithms for Policy Search of Complex Reinforcement Learning)

Other Title
Distributed evolutionary algorithms for policy search of complex reinforcement learning
Name
张友魁
Name in Pinyin
ZHANG Youkui
Student ID
11930585
Degree Type
Master
Degree Discipline
0809 Electronic Science and Technology
Discipline Category / Professional Degree Category
08 Engineering
Supervisor
史玉回
Supervisor's Affiliation
Department of Computer Science and Engineering
Thesis Defense Date
2022-05-08
Thesis Submission Date
2022-06-13
Degree-Granting Institution
Southern University of Science and Technology
Place of Degree Conferral
Shenzhen
Abstract

With the rapid development of artificial intelligence, reinforcement learning has received increasing attention and application in sequential decision-making problems. Many real-world sequential decision problems are continuous and black-box and have sparse rewards, and their parameterized decision policies are usually embedded in a high-dimensional policy representation space; these factors make policy learning more difficult. We call policy learning problems with these characteristics, namely continuity, black-box evaluation, and high-dimensional policy representation, complex reinforcement learning policy search problems. Direct policy search solves a reinforcement learning problem by transforming it into a general continuous black-box optimization problem. For complex reinforcement learning policy search problems, this thesis adopts distributed evolutionary algorithms. The research proceeds in three stages: first, a study of distributed implementations of existing evolutionary algorithms; then, distributed evolutionary algorithms for large-scale black-box optimization problems; and finally, building on these results, distributed evolutionary algorithms for complex reinforcement learning policy search problems. Specifically, this thesis implements a distributed evolutionary algorithm library for existing evolutionary algorithms on top of the distributed computing library Ray. Then, for general high-dimensional black-box optimization problems, the design challenges of distributed evolutionary algorithms, especially communication cost, are analyzed; to improve exploration ability and sampling effectiveness, an algorithm is designed on a distributed framework with a star-topology island model, a multi-population parallel algorithm is proposed, and experiments on 2000-dimensional rotated benchmark functions preliminarily confirm the search potential of this parallel algorithm on high-dimensional black-box optimization problems. Finally, for complex reinforcement learning policy search problems, a hierarchical distributed evolutionary algorithm framework that decouples global and local exploration is adopted, and a distributed evolutionary algorithm is implemented with a master-slave distributed framework; its performance is verified on six continuous control tasks from the reinforcement learning benchmark library Gym, and the experiments demonstrate improvements in the exploration ability and time efficiency of the distributed algorithm.
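The multi-population parallel search on a star-topology island model can be pictured with the minimal sketch below, assuming Ray's actor API and a toy sphere objective standing in for the rotated benchmark functions; the simple evolution strategy inside each island, the migration rule, and all names and parameters are illustrative assumptions rather than the thesis's actual algorithm.

```python
# Minimal sketch (not the thesis algorithm): a star-topology island model in which
# several subpopulations evolve independently and periodically exchange their best
# solutions through a central hub. Names and parameters are illustrative assumptions.
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

def sphere(x):
    # Toy black-box objective standing in for a high-dimensional benchmark function.
    return float(np.dot(x, x))

@ray.remote
class Island:
    """One subpopulation evolved by a simple (mu, lambda) evolution strategy."""
    def __init__(self, dim, sigma=0.3, pop_size=20, parents=5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.mean = self.rng.standard_normal(dim)
        self.sigma, self.pop_size, self.parents = sigma, pop_size, parents

    def evolve(self, steps=10):
        for _ in range(steps):
            pop = [self.mean + self.sigma * self.rng.standard_normal(self.mean.size)
                   for _ in range(self.pop_size)]
            fit = [sphere(x) for x in pop]
            best = np.argsort(fit)[:self.parents]
            self.mean = np.mean([pop[i] for i in best], axis=0)
        return self.mean, sphere(self.mean)

    def immigrate(self, x):
        # Accept the migrant as the new search center only if it is better.
        if sphere(x) < sphere(self.mean):
            self.mean = np.array(x)

dim, n_islands = 200, 4
islands = [Island.remote(dim, seed=i) for i in range(n_islands)]
for _ in range(5):
    # Evolve all islands in parallel, then gather their best solutions at the hub.
    results = ray.get([isl.evolve.remote(steps=10) for isl in islands])
    hub_best = min(results, key=lambda r: r[1])[0]
    # Star topology: the hub broadcasts the current global best back to every island.
    ray.get([isl.immigrate.remote(hub_best) for isl in islands])
print("best fitness:", min(r[1] for r in results))
```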

Other Abstract

With the rapid development of artificial intelligence, reinforcement learning (RL) has attracted increasing attention and found growing application in sequential decision-making. In many real-world sequential decision problems that are continuous and black-box and have sparse rewards, the parameterized decision policies usually lie in a high-dimensional policy representation space, which makes policy learning more difficult. We refer to policy learning problems with these characteristics of continuity, black-box evaluation, and high-dimensional policy representation as complex reinforcement learning policy search problems. Direct policy search (DPS) methods transform an RL problem into a general continuous black-box optimization problem. In this thesis, we use distributed evolutionary algorithms (DEAs) to solve complex reinforcement learning policy search problems. The research proceeds in three stages: implementing DEAs from existing evolutionary algorithms; applying DEAs to large-scale black-box optimization problems; and applying DEAs to complex reinforcement learning policy search problems. Specifically, we implement a DEA library on top of the distributed computing library Ray. Then, for general high-dimensional black-box optimization problems, we analyze the design challenges of DEAs, especially communication costs. To improve exploration ability and sampling effectiveness, we design an algorithm on a star-topology island distributed model, propose a multi-population parallel algorithm, and test it on 2000-dimensional rotated benchmark functions; the experiments preliminarily confirm the search potential of the parallel algorithm on high-dimensional black-box optimization problems. Finally, for the continuous control tasks in complex reinforcement learning policy search problems, we adopt a hierarchical DEA framework that decouples global and local exploration and implement a DEA method using a master-slave distributed model. We verify its performance on six continuous control tasks from the RL benchmark library Gym; the experiments show that the distributed algorithm improves exploration ability and time efficiency.
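To illustrate the direct policy search formulation described above, the minimal sketch below treats a policy's flat parameter vector as the search point and the episodic return on a Gym continuous control task as the black-box fitness; in a master-slave DEA such evaluations would be dispatched to worker processes. The environment id, the classic gym reset/step signatures, and all names are assumptions made for illustration only.

```python
# Minimal sketch (not the thesis code): episodic return of a linear policy on a
# Gym continuous control task, used as a black-box fitness for policy search.
# Assumes the classic gym API (reset() -> obs, step() -> (obs, reward, done, info)).
import gym
import numpy as np

ENV_ID = "Pendulum-v1"  # any continuous control task; this id is an assumption

def linear_policy(params, obs, act_dim):
    # Reshape the flat search vector into a linear map from observation to action.
    W = params.reshape(act_dim, -1)
    return np.tanh(W @ obs)

def fitness(params, episodes=1):
    # Black-box objective: average episodic return (to be maximized by the EA).
    env = gym.make(ENV_ID)
    act_dim = env.action_space.shape[0]
    total = 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = linear_policy(params, np.asarray(obs), act_dim)
            obs, reward, done, _ = env.step(action)
            total += reward
    env.close()
    return total / episodes

if __name__ == "__main__":
    env = gym.make(ENV_ID)
    n_params = env.observation_space.shape[0] * env.action_space.shape[0]
    env.close()
    print("return of the zero policy:", fitness(np.zeros(n_params)))
```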

Keywords
Other Keywords
Language
Chinese
Training Category
Independent training
Year of Enrollment
2019
Year of Degree Conferral
2022-06

Degree Assessment Subcommittee
Department of Computer Science and Engineering
Chinese Library Classification Number
TP301.6
Source Database
Manually submitted
Item Type
Thesis
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/335731
Collection
College of Engineering_Department of Computer Science and Engineering
Recommended Citation
GB/T 7714
张友魁. 复杂强化学习策略搜索的分布式演化算法[D]. 深圳: 南方科技大学, 2022.
Files in This Item
File Name/Size: 11930585-张友魁-计算机科学与工 (6537 KB); Access: Restricted
