SWARM Parallelism - Training Large Models Can Be Surprisingly Communication-Efficient

HSE
M. Ryabinin, T. Dettmers, M. Diskin, A. Borzunov

(2023)

13th September 2024 (last updated 13th September 2024)
  • Introduction

    HSE paper devising an adaptive load-balancing strategy to enable pipeline parallelism for unreliable, heterogeneous workers connected over the Internet

  • Motivation

    Utilise ubiquitous heterogeneous computing resources for training large-scale foundation models, overcoming limitations of traditional data centre approaches

  • Problem

    Traditional pipeline parallelism is bottlenecked by the slowest worker and is not robust to worker failures, leading to inefficient resource utilisation

  • Methodology

    The SWARM framework partitions a set of workers into swarms, where each peer in a swarm handles the same subset of layers. There are two main components enabling robustness and efficiency:

    1. Stochastic wiring: Dynamically routes requests between pipeline stages based on worker performance (e.g. higher throughput workers are assigned more requests)
    2. Adaptive swarm balancing: Reallocates workers between stages to balance workload and handle failures/additions (e.g. move workers from underutilised to overutilised stages)
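    The stochastic wiring idea can be illustrated with a minimal sketch. The class below is a hypothetical illustration, not the paper's implementation: each stage keeps a smoothed throughput estimate per downstream peer (the `StochasticRouter` name, the EMA update, and the `alpha` parameter are assumptions for this sketch) and samples the next peer with probability proportional to that estimate, so faster workers receive more requests and failed workers simply stop being sampled:

    ```python
    import random

    class StochasticRouter:
        """Hypothetical sketch of SWARM-style stochastic wiring: sample the
        next-stage peer with probability proportional to its measured
        throughput (illustrative, not the paper's actual implementation)."""

        def __init__(self, peers, alpha=0.5):
            self.alpha = alpha  # EMA smoothing factor (assumed value)
            # Start with an optimistic uniform estimate for every peer.
            self.throughput = {p: 1.0 for p in peers}

        def choose_peer(self):
            # Weighted sampling: higher-throughput peers get more requests.
            peers = list(self.throughput)
            weights = [self.throughput[p] for p in peers]
            return random.choices(peers, weights=weights, k=1)[0]

        def report(self, peer, requests_per_s):
            # Update the estimate after observing a completed batch.
            old = self.throughput[peer]
            self.throughput[peer] = (1 - self.alpha) * old + self.alpha * requests_per_s

        def drop(self, peer):
            # A failed or departed peer stops receiving requests.
            self.throughput.pop(peer, None)

    # Usage: a peer reporting 8 req/s ends up sampled far more often
    # than one reporting 2 req/s.
    router = StochasticRouter(["fast", "slow"])
    router.report("fast", 8.0)
    router.report("slow", 2.0)
    counts = {"fast": 0, "slow": 0}
    for _ in range(1000):
        counts[router.choose_peer()] += 1
    ```

    Adaptive swarm balancing then acts one level up: when one stage's aggregate throughput falls well below the others', a peer from an underutilised swarm would be reassigned to the bottleneck stage.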
  • Results

    Experimental results show higher GPU utilisation and good convergence in a heterogeneous, unreliable distributed setup


© Mika Senghaas 2024