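SUSTech Knowledge Commons (SUSTech KC): Colocation Scheduling and Integration Platform for Online-Offline Tasks Oriented to Evolutionary Optimization of Cloud Services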

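Title	Colocation Scheduling and Integration Platform for Online-Offline Tasks Oriented to Evolutionary Optimization of Cloud Services (面向演化优化云服务的在离线任务混部调度及其集成平台)
Alternative Title	COLOCATION SCHEDULING AND INTEGRATION PLATFORM FOR ONLINE-OFFLINE TASKS ORIENTED TO EVOLUTIONARY OPTIMIZATION OF CLOUD SERVICES
Name	曹建琦
Name (Pinyin)	CAO Jianqi
Student ID	12032467
Degree Type	Master's
Degree Major	0809 Electronic Science and Technology
Discipline Category / Professional Degree Category	08 Engineering
Advisor	杨鹏 (YANG Peng)
Advisor's Affiliation	Department of Statistics and Data Science
Thesis Defense Date	2023-05-13
Thesis Submission Date	2023-06-26
Degree-Granting Institution	Southern University of Science and Technology
Place of Degree Conferral	Shenzhen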
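Abstract	Evolutionary algorithms are an important class of algorithms for solving optimization problems. Because they are naturally parallelizable, they can exploit the computing power of a server cluster to widen the search range and improve optimization results, and because of their intelligence they can be applied to a variety of complex optimization problems. They can therefore be offered externally as a cloud service for solving optimization problems. On the one hand, existing evolutionary algorithm frameworks and algorithm platforms provide no convenient path to distributed containerization, which makes it difficult to run evolutionary algorithms as distributed computations on a server cluster quickly. On the other hand, different types of online and offline evolutionary algorithm tasks have different runtime resource requirements and quality-of-service requirements, yet existing algorithm platforms do not implement colocation scheduling for different types of evolutionary algorithms. This thesis studies both problems. First, it classifies distributed evolutionary algorithms into online and offline algorithm tasks according to their runtime characteristics, designs a reinforcement-learning-based colocation scheduling method to improve colocation scheduling efficiency on server clusters, and validates the method against other colocation methods in a simulated environment using open-source datasets. Next, it builds a container cloud platform for scheduling and running distributed evolutionary algorithms on top of the container orchestration platform Kubernetes, and integrates operations and observability components based on cloud-native technologies. Finally, it deploys the reinforcement-learning-based colocation scheduling method in a real container cloud environment and validates its scheduling on different types of evolutionary algorithm tasks. The distributed evolutionary algorithm scheduling platform designed and developed in this thesis improves the efficiency of running and scheduling algorithms and simplifies their deployment and scheduling. In the simulated environment, the reinforcement-learning-based online-offline colocation scheduling method reduces the server's unusable CPU resource rate by 45.74% and increases CPU utilization by 7.84%. In the real container cloud environment, compared with other colocation methods, it increases server CPU utilization by 4% to 10% and reduces the unusable CPU resource rate by 4% to 20%.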
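Keywords	Distributed evolutionary algorithms; optimization cloud service; reinforcement learning; Kubernetes; colocation scheduling
Language	Chinese
Training Category	Independently trained
Year of Enrollment	2020
Year of Degree Conferral	2023-06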
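Degree Evaluation Subcommittee	Electronic Science and Technology
Chinese Library Classification Number	TP311.1
Source Database	Manual submission
Output Type	Thesis
Item Identifier	http://sustech.caswiz.com/handle/2SGJ60CL/544032
Collection	College of Engineering_Department of Computer Science and Engineering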
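Recommended Citation (GB/T 7714)	CAO Jianqi. Colocation Scheduling and Integration Platform for Online-Offline Tasks Oriented to Evolutionary Optimization of Cloud Services[D]. Shenzhen: Southern University of Science and Technology, 2023.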

Files in This Item
File Name/Size	Document Type	Version Type	Access Type	License	Action
12032467-曹建琦-计算机科学与工（4143KB）	--	--	Restricted access	--	Request full text