Proteus: Distributed machine learning task scheduling based on Lyapunov optimization

  • Yongan Xiang
  • Bin Shi
  • Chongle Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

With the prevalence of machine learning applications, an increasing number of machine learning tasks are being migrated to cloud computing platforms. A cloud for machine learning consists of heterogeneous resources such as GPUs, CPUs, and memory, which makes resource management challenging. How to schedule machine learning tasks and allocate appropriate GPU resources to them, so that the cluster maximizes resource utilization and reduces task computing time, has therefore become a concern in both industry and academia. This paper proposes a scheduling strategy named Proteus, which is based on Lyapunov optimization. With Proteus, tasks achieve a minimum turnaround time over a long time horizon, where turnaround time is the time from the submission of a task to its completion. After a comprehensive analysis, we implement the scheduling algorithm and conduct several simulation experiments. The experimental results show that our scheduling strategy achieves significant improvements in most task scheduling environments, reducing the turnaround time of tasks to 40%-50% of the original. This shows that Proteus can provide higher resource utilization and performance for cloud clusters while reducing task turnaround time.

Original language: English
Title of host publication: Proceedings - 2021 IEEE International Conference on Joint Cloud Computing, JCC 2021 and 2021 9th IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, MobileCloud 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 9-15
Number of pages: 7
ISBN (Electronic): 9781665434799
DOIs
State: Published - 2021
Event: 12th IEEE International Conference on Joint Cloud Computing, JCC 2021 and 2021 9th IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, MobileCloud 2021 - Virtual, Online, United Kingdom
Duration: 23 Aug 2021 to 26 Aug 2021

Publication series

Name: Proceedings - 2021 IEEE International Conference on Joint Cloud Computing, JCC 2021 and 2021 9th IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, MobileCloud 2021

Conference

Conference: 12th IEEE International Conference on Joint Cloud Computing, JCC 2021 and 2021 9th IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, MobileCloud 2021
Country/Territory: United Kingdom
City: Virtual, Online
Period: 23/08/21 to 26/08/21

Keywords

  • Distributed machine learning
  • Lyapunov optimization
  • Task scheduling
