TY - GEN
T1 - Accelerating Distributed Training on Parameter Server Architecture With Path-Aware Multicast
AU - Yuan, Chuanying
AU - Pan, Tian
AU - Li, Yutong
AU - Huang, Junkai
AU - Ruan, Guohao
AU - Li, Haonan
AU - Li, Hao
AU - Zou, Yan
AU - Zhang, Jiao
AU - Huang, Tao
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - It is observed that the bottleneck in distributed training has shifted from computation to communication due to contention in concurrent transmissions and substantial redundant traffic. In the Parameter Server (PS) architecture, the server aggregates gradients from multiple workers and then distributes updated model parameters back to the workers in a one-to-many manner. Currently, model parameters are distributed via unicast, sending multiple identical copies of the data, which leads to significant bandwidth waste. Although multicast can save bandwidth, current approaches have two main drawbacks: on one hand, many protocols require maintaining excessive multicast state inside the network; on the other hand, the lack of coordination among multiple multicast trees can still lead to path conflicts. In this work, we propose path-aware multicast, which includes in-network multicast tree reservation and per-hop control multicast. Specifically, before each round of model parameter distribution, the server queries the network for a multicast tree that satisfies the bandwidth requirement. The calculated multicast tree is then returned with bandwidth reserved at its tree nodes. Next, model parameters are forwarded with hop-by-hop control along the multicast tree. After the multicast is completed, the reserved network resources are released. Our evaluation shows that in an 8 × 8 spine-leaf topology, path-aware multicast improves link load balancing by 32.6% compared to random multicast and accelerates model parameter distribution by up to nearly N× compared to unicast, where N is the number of workers.
AB - It is observed that the bottleneck in distributed training has shifted from computation to communication due to contention in concurrent transmissions and substantial redundant traffic. In the Parameter Server (PS) architecture, the server aggregates gradients from multiple workers and then distributes updated model parameters back to the workers in a one-to-many manner. Currently, model parameters are distributed via unicast, sending multiple identical copies of the data, which leads to significant bandwidth waste. Although multicast can save bandwidth, current approaches have two main drawbacks: on one hand, many protocols require maintaining excessive multicast state inside the network; on the other hand, the lack of coordination among multiple multicast trees can still lead to path conflicts. In this work, we propose path-aware multicast, which includes in-network multicast tree reservation and per-hop control multicast. Specifically, before each round of model parameter distribution, the server queries the network for a multicast tree that satisfies the bandwidth requirement. The calculated multicast tree is then returned with bandwidth reserved at its tree nodes. Next, model parameters are forwarded with hop-by-hop control along the multicast tree. After the multicast is completed, the reserved network resources are released. Our evaluation shows that in an 8 × 8 spine-leaf topology, path-aware multicast improves link load balancing by 32.6% compared to random multicast and accelerates model parameter distribution by up to nearly N× compared to unicast, where N is the number of workers.
UR - https://www.scopus.com/pages/publications/105018468108
U2 - 10.1109/ICC52391.2025.11160989
DO - 10.1109/ICC52391.2025.11160989
M3 - Conference contribution
AN - SCOPUS:105018468108
T3 - IEEE International Conference on Communications
SP - 1396
EP - 1401
BT - ICC 2025 - IEEE International Conference on Communications
A2 - Valenti, Matthew
A2 - Reed, David
A2 - Torres, Melissa
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE International Conference on Communications, ICC 2025
Y2 - 8 June 2025 through 12 June 2025
ER -