Title

Research on the Exploration Problem in the Pre-training and Online Deployment of Model-Based Reinforcement Learning

Alternative Title
EXPLORATION PROBLEM FOR THE PRE-TRAINING AND FINE-TUNING OF MODEL-BASED REINFORCEMENT LEARNING
Name
周国晨
Name (Pinyin)
ZHOU Guochen
Student ID
12132378
Degree Type
Master
Degree Discipline
0809 Electronic Science and Technology
Discipline Category / Professional Degree Category
08 Engineering
Supervisor
史玉回
Supervisor's Affiliation
Department of Computer Science and Engineering
Thesis Defense Date
2024-05
Thesis Submission Date
2024-07-01
Degree-Granting Institution
南方科技大学 (Southern University of Science and Technology)
Place of Degree Conferral
Shenzhen
Abstract

This thesis studies the exploration problem that arises when pre-trained model-based reinforcement learning policies are deployed online, that is, how to guarantee that the deployed algorithm can adapt when the offline and online environments differ. Traditional reinforcement learning algorithms typically require a large number of interactions with the environment to learn a policy, while in real-world settings such interactions are often costly. To apply reinforcement learning to practical problems, the sample efficiency of the algorithms therefore has to be improved. Model-based reinforcement learning improves sample efficiency by building a model of the environment to reduce the agent's interactions with the real environment, while pre-training-based reinforcement learning learns a policy directly from previously collected data, likewise reducing online interaction. Combining these two ideas, this thesis designs a model-based reinforcement learning method in which a pre-trained policy adapts quickly to downstream tasks with only a small amount of online interaction. The method consists of two parts: offline pre-training and online deployment.

In the offline pre-training stage, the thesis proposes a multi-objective reinforcement learning method that takes environment return and model conservativeness as its two learning objectives and obtains a set of policies that are Pareto-optimal with respect to these objectives. In the offline experiments, the algorithm is compared against other offline reinforcement learning algorithms on the D4RL offline datasets commonly used for such comparisons, which verifies the effectiveness of the offline-stage algorithm.

In the online deployment stage, the thesis introduces the idea of multi-armed bandits and proposes a model-based hierarchical reinforcement learning algorithm that guides the selection and training of the policies obtained offline, thereby addressing the exploration problem during online deployment. The online simulation experiments consider two settings: one in which the offline and online environments are identical, and one in which they differ. In the identical setting, the D4RL environments used in the offline stage are reused; in the differing setting, some parameters of the offline environment are modified in order to test the exploration capability of the proposed algorithm. The simulation results verify the effectiveness of the designed algorithm across these online environments.

In summary, this thesis proposes a model-based offline-to-online reinforcement learning method. By obtaining, in the offline stage, policies with different trade-offs between model return and model uncertainty, and by selecting and optimizing among these policies in the online stage with a multi-armed-bandit-based hierarchical reinforcement learning scheme, the method adapts quickly to environments with different degrees of discrepancy, effectively mitigates the initial performance drop at the start of online deployment, and alleviates the slow policy improvement during online fine-tuning, thus addressing the exploration problem of deploying offline reinforcement learning algorithms online.
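As a rough illustration of the bandit-based policy selection described above, the minimal Python sketch below treats each pre-trained Pareto-optimal policy as an arm of a UCB1 bandit and uses the observed online episode return as the bandit reward. The names policies and run_episode and the exploration constant c are illustrative assumptions, not the implementation used in the thesis, which additionally relies on a learned environment model and hierarchical training.

import math

def ucb_select(counts, means, t, c=2.0):
    # Play every untried arm (policy) once before scoring.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    # UCB1 score: empirical mean return plus an exploration bonus.
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

def online_policy_selection(policies, run_episode, num_episodes=100):
    # Treat each pre-trained (Pareto-optimal) policy as a bandit arm and
    # use its online episode return as the bandit reward.
    counts = [0] * len(policies)
    means = [0.0] * len(policies)
    for t in range(1, num_episodes + 1):
        i = ucb_select(counts, means, t)
        ret = run_episode(policies[i])  # one episode in the online environment
        counts[i] += 1
        means[i] += (ret - means[i]) / counts[i]  # incremental mean update
    return max(range(len(policies)), key=means.__getitem__)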

Keywords
Language
Chinese
Training Category
Independent training
Year of Enrollment
2021
Year of Degree Conferral
2024-07
Degree Assessment Subcommittee
Electronic Science and Technology
CLC Number
TP301.6
Source Database
Manual submission
Document Type
Degree thesis
Item Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/778718
Collection
College of Engineering / Department of Computer Science and Engineering
Recommended Citation
GB/T 7714
周国晨. 基于模型的强化学习预训练及在线部署时的探索问题研究[D]. 深圳: 南方科技大学, 2024.