Title

多样性条件策略的训练方式研究

Other Title
EXPLORING TRAINING APPROACHES FOR DIVERSE CONDITIONAL POLICIES
Name
吴培霖
Name in Pinyin
WU Peilin
Student ID
12132364
Degree Type
Master's
Degree Program
085410 Artificial Intelligence
Discipline Category / Professional Degree Category
08 Engineering
Supervisor
杨鹏 (YANG Peng)
Supervisor's Affiliation
Department of Computer Science and Engineering
Thesis Defense Date
2024-05-12
Thesis Submission Date
2024-07-06
Degree-Granting Institution
Southern University of Science and Technology
Degree-Granting Location
Shenzhen
Abstract

Leveraging the strong function-approximation capability of neural networks and end-to-end training, deep reinforcement learning has achieved remarkable results on decision-making problems such as game AI, autonomous driving, and robot control. However, classical reinforcement learning algorithms are typically concerned only with obtaining an optimal policy and pay little attention to policy diversity. Diversity-oriented reinforcement learning aims to train a population of diverse policies; it can provide innovative solutions for content-generation applications, strengthen a system's robustness to environmental perturbations, and improve policy generalization in multi-agent settings. Policy diversity has received growing attention in recent years, and many studies on diversity-oriented reinforcement learning now exist. Nevertheless, most existing methods represent multiple policies with separate sets of independent parameters, so knowledge cannot be shared across policies, which leads to high memory usage, low training efficiency, and an inability to generalize to novel policies.

To address these problems, this thesis studies diversity-oriented reinforcement learning algorithms based on conditional policies. Because parameters are shared, this approach enables the reuse of policy knowledge and the generalization of behavior. Building on existing work on conditional policies, the thesis applies the technical strengths of unsupervised reinforcement learning and goal-conditioned reinforcement learning to the training of diverse policies and proposes two algorithms for diverse conditional policies. For the need for multiple solutions in single-agent settings, a reward-signal allocation mechanism and a noise-based discriminator training scheme are designed to promote the generation of state-diverse conditional policies, which can be used to explore potentially different solutions in an environment. For the lack of generalization in multi-agent policies, a reward-value-diversity conditional policy training scheme is proposed, which can simulate opponents or partners of different proficiency levels encountered in real applications, and curriculum learning is incorporated to improve training efficiency.
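
The state-diversity method is described above only at the level of its components (a reward-signal allocation mechanism and noise-based discriminator training). The following is a minimal sketch, not the thesis's implementation: it shows a DIAYN-style diversity reward computed from a skill discriminator, with Gaussian input noise standing in for the noise training mentioned in the abstract; the network sizes, noise scale, and uniform skill prior are assumptions.

```python
# Minimal sketch of a discriminator-based diversity reward (DIAYN-style).
# Assumptions: discrete latent skills with a uniform prior, Gaussian input noise
# as a stand-in for the discriminator noise training described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkillDiscriminator(nn.Module):
    """Predicts which latent skill z generated a visited state s."""

    def __init__(self, state_dim: int, n_skills: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_skills),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # logits over skills


def diversity_reward(disc: SkillDiscriminator, state: torch.Tensor,
                     z: torch.Tensor, n_skills: int, noise_std: float = 0.1):
    """Intrinsic reward r = log q(z|s) - log p(z), computed on a noise-perturbed state."""
    noisy_state = state + noise_std * torch.randn_like(state)
    log_q = F.log_softmax(disc(noisy_state), dim=-1)
    log_p_z = -torch.log(torch.tensor(float(n_skills)))  # uniform prior p(z) = 1/K
    return log_q.gather(-1, z.unsqueeze(-1)).squeeze(-1) - log_p_z
```

In such a scheme, a latent-conditioned policy π(a|s, z) would be trained to maximize this reward alongside the allocated task reward, while the discriminator is trained by cross-entropy to recover z from visited states.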

To validate the effectiveness of the proposed algorithms, experiments are first conducted on multimodal maze navigation, classic control problems, and MuJoCo continuous-control environments, verifying the ability of state-diverse conditional policies to discover different solutions. Second, reward-value-diverse conditional policies are trained in the Overcooked kitchen-cooperation environment; policies conditioned on different reward values are then sampled for rehearsal rollouts with the cooperating policy. The results show that the cooperating policy's zero-shot coordination performance improves significantly, demonstrating that reward-value-diverse conditional policies can effectively improve the generalization of multi-agent policies.
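
The abstract describes this zero-shot evaluation only in outline. Below is a minimal, hypothetical sketch of the underlying idea of a reward-value-conditioned partner: a single policy takes a target episode return as an extra input, so sampling different target values yields partners of different proficiency for rehearsal rollouts with the cooperating policy. The names, dimensions, architecture, and the thesis's curriculum schedule are assumptions.

```python
# Minimal sketch of a reward-value-conditioned partner policy (hypothetical names
# and dimensions; the thesis's architecture and curriculum schedule are not shown).
import torch
import torch.nn as nn


class ReturnConditionedPolicy(nn.Module):
    """Policy conditioned on the observation and a target episode return."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def act(self, obs: torch.Tensor, target_return: float) -> int:
        cond = torch.cat([obs, torch.tensor([target_return])], dim=-1)
        return torch.distributions.Categorical(logits=self.net(cond)).sample().item()


# Usage sketch: pair partners of different proficiency with the cooperating policy.
obs_dim, n_actions = 96, 6                # placeholder Overcooked-style dimensions
partner = ReturnConditionedPolicy(obs_dim, n_actions)
for target in (20.0, 60.0, 120.0):        # low / medium / high target returns
    obs = torch.zeros(obs_dim)            # placeholder observation from the env
    action = partner.act(obs, target)
```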

Keywords
Language
Chinese
Training Category
Independently trained
Year of Enrollment
2021
Year Degree Conferred
2024-06

Degree Assessment Subcommittee
Electronic Science and Technology
Chinese Library Classification Number
TP181
Source Repository
Manual submission
Document Type
Dissertation/Thesis
Item Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/779036
Collection
College of Engineering_Department of Computer Science and Engineering
Recommended Citation
GB/T 7714
吴培霖. 多样性条件策略的训练方式研究[D]. 深圳: 南方科技大学, 2024.
Files in This Item
File Name/Size | Document Type | Version | Access | License
12132364-吴培霖-计算机科学与工(3863KB) | -- | -- | Restricted access | --