Accelerating Distributed Training on Parameter Server Architecture With Path-Aware Multicast

Chuanying Yuan, Tian Pan, Yutong Li, Junkai Huang, Guohao Ruan, Haonan Li, Hao Li, Yan Zou, Jiao Zhang, Tao Huang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

It is observed that the bottleneck in distributed training has shifted from computation to communication, due to contention among concurrent transmissions and substantial redundant traffic. In the Parameter Server (PS) architecture, the server aggregates gradients from multiple workers and then distributes updated model parameters back to the workers in a one-to-many manner. Currently, model parameters are distributed via unicast, which sends multiple identical copies of the data and leads to significant bandwidth waste. Although multicast can save bandwidth, current approaches have two main drawbacks: on one hand, many protocols require maintaining excessive multicast state inside the network; on the other hand, the lack of coordination among multiple multicast trees can still lead to path conflicts. In this work, we propose path-aware multicast, which comprises in-network multicast tree reservation and per-hop controlled multicast. Specifically, before each round of model parameter distribution, the server queries the network for a multicast tree that satisfies the bandwidth requirement. The computed multicast tree is returned with bandwidth reserved at its tree nodes. Model parameters are then forwarded along the multicast tree with hop-by-hop control. After the multicast completes, the reserved network resources are released. Our evaluation shows that in an 8 × 8 spine-leaf topology, path-aware multicast improves link load balancing by 32.6% compared to random multicast and accelerates model parameter distribution by up to nearly N× compared to unicast, where N is the number of workers.
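The reservation cycle the abstract describes (query for a tree that meets the bandwidth demand, reserve it, multicast, release) can be sketched in Python for a small spine-leaf fabric. This is an illustrative sketch only, not the paper's implementation: the `SpineLeaf` class, the uniform link capacity, and the least-loaded spine selection are all assumptions introduced here.

```python
# Hypothetical sketch of path-aware multicast tree reservation on a
# two-tier spine-leaf fabric. A multicast tree is modeled as one spine
# plus the leaf<->spine links toward the server leaf and each worker leaf.

LINK_CAPACITY = 100  # assumed uniform link capacity (e.g., Gbps)

class SpineLeaf:
    def __init__(self, n_spines, n_leaves):
        # Residual bandwidth of every leaf<->spine link, keyed (leaf, spine).
        self.residual = {(l, s): LINK_CAPACITY
                         for l in range(n_leaves) for s in range(n_spines)}

    def _tree_links(self, server_leaf, worker_leaves, spine):
        return [(server_leaf, spine)] + [(w, spine) for w in worker_leaves]

    def find_tree(self, server_leaf, worker_leaves, demand):
        """Return a spine whose tree links all have residual >= demand,
        preferring the spine with the most slack (load balancing)."""
        best, best_slack = None, -1
        for spine in {s for (_, s) in self.residual}:
            slack = min(self.residual[e] for e in
                        self._tree_links(server_leaf, worker_leaves, spine))
            if slack >= demand and slack > best_slack:
                best, best_slack = spine, slack
        return best  # None if no tree satisfies the demand

    def reserve(self, server_leaf, worker_leaves, spine, demand):
        for e in self._tree_links(server_leaf, worker_leaves, spine):
            self.residual[e] -= demand

    def release(self, server_leaf, worker_leaves, spine, demand):
        for e in self._tree_links(server_leaf, worker_leaves, spine):
            self.residual[e] += demand

# One round of model-parameter distribution: query, reserve, send, release.
fabric = SpineLeaf(n_spines=8, n_leaves=8)
spine = fabric.find_tree(server_leaf=0, worker_leaves=[1, 2, 3], demand=40)
fabric.reserve(0, [1, 2, 3], spine, 40)
# ... multicast the parameters along the tree with hop-by-hop control ...
fabric.release(0, [1, 2, 3], spine, 40)
```

In this toy model, reserving bandwidth before sending is what prevents two concurrent multicast trees from colliding on the same leaf-spine link, which is the path-conflict problem the abstract attributes to uncoordinated multicast.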

Original language: English
Title of host publication: ICC 2025 - IEEE International Conference on Communications
Editors: Matthew Valenti, David Reed, Melissa Torres
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1396-1401
Number of pages: 6
ISBN (Electronic): 9798331505219
State: Published - 2025
Event: 2025 IEEE International Conference on Communications, ICC 2025 - Montreal, Canada
Duration: 8 Jun 2025 - 12 Jun 2025

Publication series

Name: IEEE International Conference on Communications
ISSN (Print): 1550-3607

Conference

Conference: 2025 IEEE International Conference on Communications, ICC 2025
Country/Territory: Canada
City: Montreal
Period: 8/06/25 - 12/06/25
