TY - JOUR
T1 - Efficient multi-job federated learning scheduling with fault tolerance
AU - Fu, Boqian
AU - Chen, Fahao
AU - Pan, Shengli
AU - Li, Peng
AU - Su, Zhou
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
PY - 2025/4
Y1 - 2025/4
N2 - Federated Learning (FL) has emerged as a promising approach for learning from data distributed across edge devices. However, existing work mainly focuses on single-job FL systems, whereas in practice multiple FL jobs are often submitted simultaneously. How these jobs are scheduled is crucial for client resource utilization and job efficiency. In addition, existing work assumes that clients are always available during FL jobs, which often does not hold in practice because clients can become unavailable for various reasons. To address these challenges, we introduce a novel fault-tolerant multi-job scheduling strategy aimed at optimizing job efficiency and resource utilization. The core of our approach is a redundancy-based fault tolerance mechanism designed to keep FL jobs robust even when clients are insufficient. The mechanism strategically selects clients for redundant model training. Building on this mechanism, the scheduling algorithm prioritizes urgent FL jobs so that they can complete without waiting for prolonged periods for additional clients to become available. Extensive experiments demonstrate the effectiveness of the proposed method, which can significantly outperform baseline methods.
AB - Federated Learning (FL) has emerged as a promising approach for learning from data distributed across edge devices. However, existing work mainly focuses on single-job FL systems, whereas in practice multiple FL jobs are often submitted simultaneously. How these jobs are scheduled is crucial for client resource utilization and job efficiency. In addition, existing work assumes that clients are always available during FL jobs, which often does not hold in practice because clients can become unavailable for various reasons. To address these challenges, we introduce a novel fault-tolerant multi-job scheduling strategy aimed at optimizing job efficiency and resource utilization. The core of our approach is a redundancy-based fault tolerance mechanism designed to keep FL jobs robust even when clients are insufficient. The mechanism strategically selects clients for redundant model training. Building on this mechanism, the scheduling algorithm prioritizes urgent FL jobs so that they can complete without waiting for prolonged periods for additional clients to become available. Extensive experiments demonstrate the effectiveness of the proposed method, which can significantly outperform baseline methods.
KW - Fault tolerance
KW - Federated learning
KW - Heterogeneity
KW - Multi-job scheduling
UR - https://www.scopus.com/pages/publications/85217651303
U2 - 10.1007/s12083-024-01847-z
DO - 10.1007/s12083-024-01847-z
M3 - Article
AN - SCOPUS:85217651303
SN - 1936-6442
VL - 18
JO - Peer-to-Peer Networking and Applications
JF - Peer-to-Peer Networking and Applications
IS - 2
M1 - 71
ER -