Title

面向动态物料运输调度的约束强化学习研究

Alternative Title
DYNAMIC MATERIAL HANDLING VIA CONSTRAINED REINFORCEMENT LEARNING
Name
胡呈鹏
Name (Pinyin)
HU Chengpeng
Student ID
12132333
Degree Type
Master's
Degree Discipline
0809 Electronic Science and Technology
Discipline Category / Professional Degree Category
08 Engineering
Supervisor
刘佳琳
Supervisor's Affiliation
Department of Computer Science and Engineering
Thesis Defence Date
2024-05-12
Thesis Submission Date
2024-06-25
Degree-Granting Institution
Southern University of Science and Technology
Degree-Granting Location
Shenzhen
Abstract

In recent years, artificial intelligence has shown great potential in smart logistics. Building AI-based scheduling systems in flexible warehouses and workshops and using automated guided vehicles for material handling can not only effectively reduce labour costs but also improve the efficiency of operational workflows. However, owing to the complexity of real-world environments, scheduling systems frequently have to cope with various dynamic events, such as the arrival of new tasks and the sudden breakdown of vehicles, while also accounting for multiple factors such as the time constraints of transport tasks. Traditional dispatching rules struggle to handle these dynamic logistics scenarios efficiently. Consequently, building a scheduling system that is efficient, safe, adaptive and robust in such complex and changeable logistics scenarios has become a major challenge in this research area.

To address the above difficulties and challenges, this thesis studies constrained reinforcement learning algorithms for the dynamic material handling scheduling problem. The proposed constrained-reinforcement-learning-based dynamic scheduling algorithms perform well across a variety of dynamic material handling scenarios, delivering efficient, safe, adaptive and robust scheduling solutions and offering useful insights for research on smart logistics. The main contributions of this thesis are threefold: (1) The dynamic material handling scheduling problem, which involves multiple dynamic events (e.g., newly arriving tasks and vehicle breakdowns) and mixed constraints (e.g., task tardiness and vehicle availability), is modelled mathematically; to overcome the limitations of existing simulation environments, an open-source and extensible simulation environment, DMH-GYM, is developed together with a diverse set of problem instances, facilitating follow-up research. (2) Building on this mathematical model, the dynamic material handling scheduling problem is formulated as a constrained Markov decision process with its state space, action space, reward function and cost function defined, and a dynamic scheduling algorithm based on reinforcement learning with mixed constraints, Reward Constrained Policy Optimisation with Masking, is proposed to handle the mixed constraints of task tardiness and vehicle availability. (3) The sparse-feedback and instance-uncertainty issues of the dynamic material handling scheduling problem are analysed, and a dynamic scheduling algorithm based on Adaptive Constrained Evolutionary Reinforcement Learning is proposed; it handles constraints through an ordinal-based intrinsic ranking method and is trained with population-based gradient search and an adaptive instance-selection strategy. Comparative experiments against a range of existing reinforcement learning algorithms, together with ablation studies, noise-robustness experiments and cross-validation, show that the algorithm achieves efficient scheduling decisions that are safe, adaptive and robust. This research not only improves the effectiveness of solving the dynamic material handling scheduling problem, but also provides solutions applicable to similar scheduling problems.
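As an illustration of the reward-constrained policy update with action masking described in contribution (2), the following minimal Python snippet shows one plausible way to combine a Lagrangian cost penalty with invalid-action masking. It is a sketch under assumptions, not the thesis implementation: the function names, tensor shapes and the choice of PyTorch are hypothetical.

# Minimal sketch (not the thesis code): a policy-gradient loss that combines a
# Lagrangian penalty on an auxiliary cost (e.g. task tardiness) with masking of
# unavailable actions (e.g. broken-down vehicles). All names below are hypothetical.
import torch
import torch.nn.functional as F

def masked_logits(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Unavailable actions get probability zero: set their logits to -inf before softmax.
    return logits.masked_fill(~mask.bool(), float("-inf"))

def rcpo_style_loss(logits, mask, actions, reward_adv, cost_adv, lam):
    # Log-probabilities of the taken actions under the masked policy.
    logp = F.log_softmax(masked_logits(logits, mask), dim=-1)
    logp_a = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # Maximise the reward advantage while penalising the cost advantage by lambda.
    return -(logp_a * (reward_adv - lam * cost_adv)).mean()

def update_lambda(lam, mean_episode_cost, cost_limit, step_size=1e-2):
    # Dual ascent: increase lambda while the observed cost exceeds its limit.
    return max(0.0, lam + step_size * (mean_episode_cost - cost_limit))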
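For the ordinal, rank-based constraint handling mentioned in contribution (3), the following self-contained Python sketch shows stochastic ranking in the style of Runarsson and Yao [93]: candidate policies are ordered by episodic return and constraint violation without a fixed penalty weight. The population representation and the comparison probability p_f are illustrative assumptions, not the thesis's actual algorithm.

# Minimal sketch (not the thesis code): stochastic ranking used to order a
# population of candidate policies by return (to maximise) and constraint
# violation (to minimise) without a hand-tuned penalty coefficient.
import random

def stochastic_rank(returns, violations, p_f=0.45):
    # Bubble-sort-like passes: when a constraint is violated, compare by return
    # with probability p_f, otherwise by violation; feasible pairs are always
    # compared by return.
    idx = list(range(len(returns)))
    n = len(idx)
    for _ in range(n):
        swapped = False
        for i in range(n - 1):
            a, b = idx[i], idx[i + 1]
            both_feasible = violations[a] == 0 and violations[b] == 0
            use_return = both_feasible or random.random() < p_f
            better_first = returns[a] >= returns[b] if use_return else violations[a] <= violations[b]
            if not better_first:
                idx[i], idx[i + 1] = b, a
                swapped = True
        if not swapped:
            break
    return idx  # indices of population members, best first

# Example: rank four hypothetical policies by (return, violation).
ranking = stochastic_rank(returns=[10.0, 12.0, 8.0, 11.0], violations=[0.0, 3.0, 0.0, 1.0])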

Keywords
Language
Chinese
Training Category
Independent training
Year of Enrolment
2021
Year of Degree Conferral
2024-06
Reference List

[1] 赖李媛君, 张霖, 任磊, 等. 工业互联网智能调度建模与方法研究综述[J]. 计算机集成制造系统, 2022, 28: 1966-1980.
[2] 胡戎. 基于大数据技术在传统物流产业中的智慧化物流转型路径分析[J]. 数字通信世界, 2023: 163-165.
[3] 吴忠胜. 人工智能技术在智慧物流发展中的应用[J]. 中国航务周刊, 2023: 75-77.
[4] HU H, JIA X L, HE Q X, et al. Deep reinforcement learning based AGVs real-time scheduling with mixed rule for flexible shop floor in industry 4.0[J]. Computers & Industrial Engineering, 2020, 149: 106749.
[5] 王凌, 王晶晶, 吴楚格. 绿色车间调度优化研究进展[J]. 控制与决策, 2018, 33: 385-391.
[6] 梁承姬, 张石东, 王钰, 等. 基于改进 DQN 算法的自动化码头 AGV 调度问题研究[J/OL]. 系统仿真学报, 2024: 1-11. DOI: 10.16182/j.issn1004731x.joss.23-0912.
[7] OUELHADJ D, PETROVIC S. A survey of dynamic scheduling in manufacturing systems[J]. Journal of Scheduling, 2009, 12(4): 417-431.
[8] SABUNCUOGLU I. A study of scheduling rules of flexible manufacturing systems: A simulation approach[J]. International Journal of Production Research, 1998, 36(2): 527-546.
[9] CHRYSSOLOURIS G, SUBRAMANIAM V. Dynamic scheduling of manufacturing job shops using genetic algorithms[J]. Journal of Intelligent Manufacturing, 2001, 12(3): 281-293.
[10] TAY J C, HO N B. Evolving dispatching rules using genetic programming for solving multi-objective flexible job-shop problems[J]. Computers & Industrial Engineering, 2008, 54(3): 453-473.
[11] ZHANG Y F, ZHANG G, DU W, et al. An optimization method for shopfloor material handling based on real-time and multi-source manufacturing data[J]. International Journal of Production Economics, 2015, 165: 282-292.
[12] SUTTON R S, BARTO A G. Reinforcement Learning: An Introduction[M]. MIT press, 2018.
[13] SILVER D, HUANG A, MADDISON C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484-489.
[14] HAARNOJA T, ZHOU A, ABBEEL P, et al. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor[C]//International Conference on Machine Learning. PMLR, 2018: 1861-1870.
[15] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
[16] LI M J P, SANKARAN P, KUHL M E, et al. Simulation analysis of a deep reinforcement learning approach for task selection by autonomous material handling vehicles[C]//2018 Winter Simulation Conference (WSC). IEEE, 2018: 1073-1083.
[17] JEONG Y, AGRAWAL T K, FLORES-GARCÍA E, et al. A reinforcement learning model for material handling task assignment and route planning in dynamic production logistics environment[J]. Procedia CIRP, 2021, 104: 1807-1812.
[18] 王凌, 潘子肖. 基于深度强化学习与迭代贪婪的流水车间调度优化[J/OL]. 控制与决策, 2021, 36: 2609-2617. DOI: 10.13195/j.kzyjc.2020.0608.
[19] GARCIA J, FERNÁNDEZ F. Safe exploration of state and action spaces in reinforcement learning[J]. Journal of Artificial Intelligence Research, 2012, 45: 515-564.
[20] ALTMAN E. Constrained Markov Decision Processes: Stochastic Modeling[M]. Routledge, 1999.
[21] SRINIVASAN K, EYSENBACH B, HA S, et al. Learning to be safe: Deep RL with a safety critic[A]. 2020. arXiv preprint arXiv:2010.14603.
[22] LIU Y S, HALEV A, LIU X. Policy learning with constraints in model-free reinforcement learning: A survey[C]//The 30th International Joint Conference on Artificial Intelligence (IJCAI). 2021: 1-8.
[23] ZOU W Q, PAN Q K, WANG L. An effective multi-objective evolutionary algorithm for solving the AGV scheduling problem with pickup and delivery[J]. Knowledge-Based Systems, 2021, 218: 106881.
[24] PEI J, HU C, LIU J, et al. Bi-Objective Splitting Delivery VRP with Loading Constraints and Restricted Access[C]//2021 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2021: 01-09.
[25] 黄戈文, 蔡延光, 戚远航, 等. 自适应遗传灰狼优化算法求解带容量约束的车辆路径问题[J]. 电子学报, 2019, 47: 2602-2610.
[26] YAN Y, CHOW A H, HO C P, et al. Reinforcement learning for logistics and supply chain management: Methodologies, state of the art, and future opportunities[J]. Transportation Research Part E: Logistics and Transportation Review, 2022, 162: 102712.
[27] SINGH N, DANG Q V, AKCAY A, et al. A matheuristic for AGV scheduling with battery constraints[J]. European Journal of Operational Research, 2022, 298(3): 855-873.
[28] KAPLANOĞLU V, ŞAHIN C, BAYKASOĞLU A, et al. A multi-agent based approach to dynamic scheduling of machines and automated guided vehicles (AGV) in manufacturing systems by considering AGV breakdowns[J]. International Journal of Engineering Research & Innovation, 2015, 7(2): 32-38.
[29] BLACKSTONE J H, PHILLIPS D T, HOGG G L. A state-of-the-art survey of dispatching rules for manufacturing job shop operations[J]. The International Journal of Production Research, 1982, 20(1): 27-45.
[30] JEONG B H, RANDHAWA S U. A multi-attribute dispatching rule for automated guided vehicle systems[J]. International Journal of Production Research, 2001, 39(13): 2817-2832.
[31] MIN H S, YIH Y. Selection of dispatching rules on multiple dispatching decision points in real-time scheduling of a semiconductor wafer fabrication system[J]. International Journal of Production Research, 2003, 41(16): 3921-3941.
[32] CHIANG D M, GUO R S, PAI F Y. Improved customer satisfaction with a hybrid dispatching rule in semiconductor back-end factories[J]. International Journal of Production Research, 2008, 46(17): 4903-4923.
[33] CHEN C, XI L F, ZHOU B H, et al. A multiple-criteria real-time scheduling approach for multiple-load carriers subject to LIFO loading constraints[J]. International Journal of Production Research, 2011, 49(16): 4787-4806.
[34] SAHIN C, DEMIRTAS M, EROL R, et al. A multi-agent based approach to dynamic scheduling with flexible processing capabilities[J]. Journal of Intelligent Manufacturing, 2017, 28(8): 1827-1845.
[35] LIU S D, TAN P H, KURNIAWAN E, et al. Dynamic scheduling for pickup and delivery with time windows[C]//2018 IEEE 4th World Forum on Internet of Things. IEEE, 2018: 767-770.
[36] WANG W B, ZHANG Y F, ZHONG R Y. A proactive material handling method for CPS enabled shop-floor[J]. Robotics and Computer-integrated Manufacturing, 2020, 61: 101849.
[37] VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature, 2019, 575(7782): 350-354.
[38] CHEN C, XIA B, ZHOU B H, et al. A reinforcement learning based approach for a multiple-load carrier scheduling problem[J]. Journal of Intelligent Manufacturing, 2015, 26(6): 1233-1245.
[39] XUE T, ZENG P, YU H. A reinforcement learning method for multi-AGV scheduling in manufacturing[C]//2018 IEEE International Conference on Industrial Technology (ICIT). IEEE, 2018: 1557-1561.
[40] KARDOS C, LAFLAMME C, GALLINA V, et al. Dynamic scheduling in a job-shop production system with reinforcement learning[J]. Procedia CIRP, 2021, 97: 104-109.
[41] STEPHEN A, ZHOU C H, CHEW E P, et al. Q-learning based automated guided vehicles recharging scheduling in container terminal[C]//IIE Annual Conference. Proceedings. Institute of Industrial and Systems Engineers (IISE), 2020: 215-220.
[42] GOVINDAIAH S, PETTY M D. Applying reinforcement learning to plan manufacturing material handling[J]. Discover Artificial Intelligence, 2021, 1(1): 1-33.
[43] GU W B, LI Y X, TANG D B, et al. Using real-time manufacturing data to schedule a smart factory via reinforcement learning[J]. Computers & Industrial Engineering, 2022, 171: 108406.
[44] HUANG J P, GAO L, LI X Y, et al. A cooperative hierarchical deep reinforcement learning based multi-agent method for distributed job shop scheduling problem with random job arrivals [J]. Computers & Industrial Engineering, 2023, 185: 109650.
[45] LIU S, ZHANG Y, TANG K, et al. How good is neural combinatorial optimization? A systematic evaluation on the traveling salesman problem[J]. IEEE Computational Intelligence Magazine, 2023, 18(3): 14-28.
[46] NG A Y, HARADA D, RUSSELL S. Policy invariance under reward transformations: Theory and application to reward shaping[C]//ICML: Vol. 99. 1999: 278-287.
[47] BORKAR V S. An actor-critic algorithm for constrained Markov decision processes[J]. Systems & Control Letters, 2005, 54(3): 207-213.
[48] TESSLER C, MANKOWITZ D J, MANNOR S. Reward constrained policy optimization[C/OL]//International Conference on Learning Representations. 2019: 1-15. https://openreview.net/forum?id=SkfrvsA9FX.
[49] CALIAN D A, MANKOWITZ D J, ZAHAVY T, et al. Balancing constraints and rewards with meta-gradient D4PG[C/OL]//International Conference on Learning Representations. 2021: 1-22. https://openreview.net/forum?id=TQt98Ya7UMP.
[50] LIU Y, DING J, LIU X. IPO: Interior-point policy optimization under constraints[C]//the AAAI Conference on Artificial Intelligence: Vol. 34. 2020: 4940-4947.
[51] YANG T Y, ROSCA J, NARASIMHAN K, et al. Projection-based constrained policy optimization[C/OL]//International Conference on Learning Representations. 2020: 1-24. https://openreview.net/forum?id=rke3TJrtPS.
[52] ALSHIEKH M, BLOEM R, EHLERS R, et al. Safe reinforcement learning via shielding[C]// the AAAI Conference on Artificial Intelligence: Vol. 32. 2018: 1-23.
[53] HUANG S, ONTAÑÓN S. A closer look at invalid action masking in policy gradient algorithms[C/OL]//BARTÁK R, KESHTKAR F, FRANKLIN M. Thirty-Fifth International Florida Artificial Intelligence Research Society Conference. 2022: 1-10. https://doi.org/10.32473/flairs.v35i.130584.
[54] BROCKMAN G, CHEUNG V, PETTERSSON L, et al. OpenAI Gym[A]. 2016. arXiv:1606.01540.
[55] QIU L, HSU W J, HUANG S Y, et al. Scheduling and routing algorithms for AGVs: A survey [J]. International Journal of Production Research, 2002, 40(3): 745-760.
[56] SIEMENS. Tecnomatix Plant Simulation[EB/OL]. https://plm.sw.siemens.com/en-US/tecnomatix/.
[57] ANYLOGIC. AnyLogic[EB/OL]. https://www.anylogic.cn/.
[58] SIMUL8. SIMUL8[EB/OL]. https://www.simul8.com/.
[59] MATHWORKS. Simulink[EB/OL]. https://ww2.mathworks.cn/products/simulink.html.
[60] CHEVALIER-BOISVERT M, WILLEMS L, PAL S. Minimalistic Gridworld Environment for Gymnasium[J/OL]. GitHub repository, 2018. https://github.com/Farama-Foundation/Minigrid.
[61] TERRY J K, BLACK B, HARI A, et al. PettingZoo: Gym for Multi-Agent Reinforcement Learning[C]//International Conference on Neural Information Processing Systems: Vol. 34. 2021: 15032-15043.
[62] PEREZ-LIEBANA D, SAMOTHRAKIS S, TOGELIUS J, et al. General video game AI: Competition, challenges and opportunities[C]//Thirtieth AAAI Conference on Artificial Intelligence: Vol. 30. 2016: 1-3.
[63] PEREZ-LIEBANA D, LIU J, KHALIFA A, et al. General Video Game AI: A Multitrack Framework for Evaluating Agents, Games, and Content Generation Algorithms[J]. IEEE Transactions on Games, 2019, 11(3): 195-214.
[64] COBBE K, HESSE C, HILTON J, et al. Leveraging procedural generation to benchmark reinforcement learning[C]//International Conference on Machine Learning. 2020: 2048-2056.
[65] ZHANG H, CUI Y. AI Olympics Competition[EB/OL]. 2022. https://github.com/jidiai/Competition_Olympics-Integrated.
[66] BEATTIE C, LEIBO J Z, TEPLYASHIN D, et al. DeepMind lab[A]. 2016. arXiv preprint arXiv:1612.03801.
[67] CHEVALIER-BOISVERT M. MiniWorld: Minimalistic 3D Environment for RL & Robotics Research[EB/OL]. 2018. https://github.com/maximecb/gym-miniworld.
[68] GENESERETH M, LOVE N, PELL B. General game playing: Overview of the AAAI competition[J]. AI Magazine, 2005, 26(2): 62-62.
[69] ZHOU H, ZHOU Y, ZHANG H, et al. Botzone: A competitive and interactive platform for game AI education[C]//ACM Turing 50th Celebration Conference-China. 2017: 1-5.
[70] LANCTOT M, LOCKHART E, LESPIAU J B, et al. OpenSpiel: A framework for reinforcement learning in games[A]. 2019. arXiv preprint arXiv:1908.09453.
[71] STEPHENSON M, PIETTE E, SOEMERS D J, et al. Ludii as a competition platform[C]//IEEE Conference on Games. IEEE, 2019: 1-8.
[72] GAINA R, BALLA M, DOCKHORN A, et al. TAG: A tabletop games framework[C]//CEUR Workshop Proceedings. 2020: 1-7.
[73] DOCKHORN A, GRUESO J H, JEURISSEN D, et al. STRATEGA: A General Strategy Games Framework[C]//AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment Workshop on Artificial Intelligence for Strategy Games. 2020: 1-7.
[74] BAMFORD C. Griddly: A platform for AI research in games[J]. Software Impacts, 2021, 8: 100066.
[75] BHONKER N, ROZENBERG S, HUBARA I. Playing SNES in the Retro Learning Environment[J/OL]. ICLR Workshop, 2017. https://openreview.net/forum?id=HysBZSqlx.
[76] JULIANI A, BERGES V P, TENG E, et al. Unity: A general platform for intelligent agents[A]. 2018. arXiv preprint arXiv:1809.02627.
[77] ZHOU Z H, YU Y, QIAN C. Evolutionary Learning: Advances in Theories and Algorithms [M]. Springer, 2019.
[78] YE D H, LIU Z, SUN M F, et al. Mastering complex control in MOBA games with deep reinforcement learning[C]//the AAAI Conference on Artificial Intelligence: Vol. 34. 2020: 6672-6679.
[79] MNIH V, BADIA A P, MIRZA M, et al. Asynchronous methods for deep reinforcement learning [C]//International Conference on Machine Learning. PMLR, 2016: 1928-1937.
[80] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms [A]. 2017. arXiv preprint arXiv:1707.06347.
[81] WENG J Y, CHEN H Y, YAN D, et al. Tianshou: A highly modularized deep reinforcement learning library[J/OL]. Journal of Machine Learning Research, 2022, 23(267): 1-6. http://jmlr.org/papers/v23/21-1127.html.
[82] PATHAK D, AGRAWAL P, EFROS A A, et al. Curiosity-driven exploration by self-supervised prediction[C]//International Conference on Machine Learning. PMLR, 2017: 2778-2787.
[83] OSTROVSKI G, BELLEMARE M G, OORD A, et al. Count-based exploration with neural density models[C]//International Conference on Machine Learning. PMLR, 2017: 2721-2730.
[84] ANDRYCHOWICZ M, WOLSKI F, RAY A, et al. Hindsight experience replay[J]. Advances in Neural Information Processing Systems, 2017, 30: 1-11.
[85] KHADKA S, TUMER K. Evolution-guided policy gradient in reinforcement learning[C]// International Conference on Neural Information Processing Systems. Curran Associates Inc., 2018: 1196–1208.
[86] BODNAR C, DAY B, LIÓ P. Proximal distilled evolutionary reinforcement learning[C]//the AAAI Conference on Artificial Intelligence: Vol. 34. 2020: 3283-3290.
[87] YANG P, ZHANG H, YU Y L, et al. Evolutionary reinforcement learning via cooperative coevolutionary negatively correlated search[J]. Swarm and Evolutionary Computation, 2022, 68: 100974.
[88] RAILEANU R, GOLDSTEIN M, YARATS D, et al. Automatic data augmentation for generalization in reinforcement learning[J]. Advances in Neural Information Processing Systems, 2021, 34: 5402-5415.
[89] JIANG M, GREFENSTETTE E, ROCKTÄSCHEL T. Prioritized level replay[C]//International Conference on Machine Learning. PMLR, 2021: 4940-4950.
[90] SALIMANS T, HO J, CHEN X, et al. Evolution strategies as a scalable alternative to reinforcement learning[A]. 2017. arXiv preprint arXiv:1703.03864.
[91] CONTI E, MADHAVAN V, PETROSKI SUCH F, et al. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents[J]. Advances in Neural Information Processing Systems, 2018, 31.
[92] WATANABE K, HASHEM M, WATANABE K, et al. Evolutionary optimization of constrained problems[J]. Evolutionary Computations: New Algorithms and their Applications to Evolutionary Robots, 2004: 53-64.
[93] RUNARSSON T P, YAO X. Stochastic ranking for constrained evolutionary optimization[J]. IEEE Transactions on Evolutionary Computation, 2000, 4(3): 284-294.
[94] RUNARSSON T, YAO X. Constrained evolutionary optimization[M]//Evolutionary Optimization. Springer, 2003: 87-113.
[95] TANG K, MEI Y, YAO X. Memetic algorithm with extended neighborhood search for capacitated arc routing problems[J]. IEEE Transactions on Evolutionary Computation, 2009, 13(5): 1151-1166.
[96] LI B D, TANG K, LI J L, et al. Stochastic ranking algorithm for many-objective optimization based on multiple indicators[J]. IEEE Transactions on Evolutionary Computation, 2016, 20(6): 924-938.
[97] WIERSTRA D, SCHAUL T, GLASMACHERS T, et al. Natural evolution strategies[J]. The Journal of Machine Learning Research, 2014, 15(1): 949-980.

Degree Assessment Subcommittee
Electronic Science and Technology
Chinese Library Classification Number
TP18
Source Repository
Manual submission
Document Type
Degree thesis
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/766049
Collection
Southern University of Science and Technology, College of Engineering, Department of Computer Science and Engineering
Recommended Citation
GB/T 7714
胡呈鹏. 面向动态物料运输调度的约束强化学习研究[D]. 深圳: 南方科技大学, 2024.
Files in This Item
File Name/Size: 12132333-胡呈鹏-计算机科学与工(5045KB); Access: Restricted
