(2023)
HSE paper devising an adaptive load-balancing strategy to enable pipeline parallelism for unreliable, heterogeneous workers connected over the Internet
Utilise ubiquitous heterogeneous computing resources for training large-scale foundation models, overcoming limitations of traditional data centre approaches
Traditional pipeline parallelism suffers from two problems: throughput is bottlenecked by the slowest worker, and the pipeline is not robust to worker failures, leading to inefficient resource utilisation
The SWARM framework partitions the set of workers into swarms, where each peer in a swarm handles the same subset of layers. Two main components enable robustness and efficiency: stochastic wiring, where each peer routes micro-batches to a peer in the next swarm with probability proportional to that peer's measured throughput (so faster peers receive more work and failed peers are simply skipped), and adaptive swarm rebalancing, which periodically moves peers from under-utilised stages to the current bottleneck stage
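The stochastic wiring idea can be illustrated with a minimal sketch. This is not the paper's implementation: the class name, the EMA update, and the `drop` method are assumptions made for illustration; the core idea shown is only the throughput-weighted random choice of the next-stage peer.

```python
import random

class StochasticRouter:
    """Hypothetical sketch of SWARM-style stochastic wiring: each
    micro-batch is sent to a peer in the next pipeline stage, chosen
    with probability proportional to that peer's measured throughput."""

    def __init__(self, peers):
        # peers: dict mapping peer id -> initial throughput estimate
        self.throughput = dict(peers)

    def pick_peer(self, rng=random):
        # Faster peers are proportionally more likely to receive work.
        ids = list(self.throughput)
        weights = [self.throughput[p] for p in ids]
        return rng.choices(ids, weights=weights, k=1)[0]

    def report(self, peer, measured, alpha=0.5):
        # Exponential moving average keeps estimates current as device
        # speed or network conditions change (assumed update rule).
        self.throughput[peer] = (1 - alpha) * self.throughput[peer] + alpha * measured

    def drop(self, peer):
        # A failed peer is simply removed; its traffic shifts to the rest.
        self.throughput.pop(peer, None)
```

With this weighting, a straggler degrades throughput only in proportion to its share of traffic rather than stalling the whole pipeline, and a crashed peer is handled by rerouting rather than restarting the run.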
Experimental results show higher GPU utilisation and good convergence in a heterogeneous, unreliable distributed setup
© Mika Senghaas 2024