
Efficient multi-job federated learning scheduling with fault tolerance

  • Boqian Fu
  • Fahao Chen
  • Shengli Pan
  • Peng Li
  • Zhou Su
  • The University of Aizu
  • Beijing University of Posts and Telecommunications

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Federated Learning (FL) has emerged as a promising learning approach for utilizing data distributed across edge devices. However, existing works mainly focus on single-job FL systems. In practice, multiple FL jobs will be submitted simultaneously. How to schedule multiple FL jobs is crucial for client resource utilization and job efficiency. In addition, existing works assume that clients are always available during FL jobs, which is often not a reality since clients could be unavailable for FL jobs due to various reasons. To address these challenges, in this paper, we introduce a novel fault-tolerance multi-job scheduling strategy aimed at optimizing job efficiency and resource utilization. The basic idea of our approach is a redundancy-based fault tolerance mechanism, which is designed to ensure the robustness of FL jobs even with insufficient clients. The mechanism strategically selects clients for redundant model training. Based on the mechanism, the scheduling algorithm prioritizes urgent FL jobs, facilitating their completion and obviating the need for prolonged waiting periods for additional client availability. We conduct extensive experiments to demonstrate the effectiveness of the proposed method, which can significantly outperform other baseline methods.

Original language: English
Article number: 71
Journal: Peer-to-Peer Networking and Applications
Volume: 18
Issue number: 2
DOI
Publication status: Published - Apr 2025

