Intelligent Resource Scheduling at Scale: A Machine Learning Perspective

Submitted by grigby1 on Tue, 12/01/2020 - 4:37pm

Title	Intelligent Resource Scheduling at Scale: A Machine Learning Perspective
Publication Type	Conference Paper
Year of Publication	2018
Authors	Yang, R., Ouyang, X., Chen, Y., Townend, P., Xu, J.
Conference Name	2018 IEEE Symposium on Service-Oriented System Engineering (SOSE)
Date Published	March 2018
Publisher	IEEE
ISBN Number	978-1-5386-5207-7
Keywords	ad-hoc heuristics, cloud computing, cloud-scale, Collaboration, composability, data centers, exhibited heterogeneity, Human Behavior, human factors, intelligent resource scheduling, Internet-scale Computing Security, Internet-scale systems, large-scale resource scheduling, Large-scale systems, learning (artificial intelligence), machine learning, Metrics, ML, multidimensional resource requirements, nonfunctional constraints, performance-centric node classification, Policy Based Governance, Processor scheduling, pubcrawl, quality of service, resilience, Resiliency, resource allocation, Resource management, Resource Scheduling, Scalability, scheduling, server characteristics, Servers, straggler, straggler mitigation, Task Analysis, workload
Abstract	Resource scheduling in a computing system addresses the problem of packing tasks with multi-dimensional resource requirements and non-functional constraints. The exhibited heterogeneity of workload and server characteristics in Cloud-scale or Internet-scale systems is adding further complexity and new challenges to the problem. Compared with existing solutions based on ad-hoc heuristics, Machine Learning (ML) has the potential to improve further the efficiency of resource management in large-scale systems. In this paper we will describe and discuss how ML could be used to understand automatically both workloads and environments, and to help to cope with scheduling-related challenges such as consolidating co-located workloads, handling resource requests, guaranteeing application's QoSs, and mitigating tailed stragglers. We will introduce a generalized ML-based solution to large-scale resource scheduling and demonstrate its effectiveness through a case study that deals with performance-centric node classification and straggler mitigation. We believe that an ML-based method will help to achieve architectural optimization and efficiency improvement.
URL	https://ieeexplore.ieee.org/document/8359158
DOI	10.1109/SOSE.2018.00025
Citation Key	yang_intelligent_2018
