(2022)
Introduces a novel scheduling algorithm for decentralised, parallelised training of foundation models, achieving up to 4.8x speedup over state-of-the-art approaches by optimising communication between heterogeneous compute resources
Unlocks untapped decentralised, heterogeneous compute resources for training large-scale foundation models
Models device placement as an allocation problem over a communication graph, minimising the communication costs of data and pipeline parallelism
Proposes a hybrid genetic algorithm to solve the allocation problem
Experimental results demonstrate up to 4.8x throughput gain in a worldwide geo-distributed worker setup when compared to state-of-the-art systems such as Megatron and DeepSpeed
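The allocation idea above can be sketched with a toy genetic algorithm: assign tasklets (e.g. pipeline stages times data-parallel replicas) to devices so that pairwise communication cost is minimised. The cost model, capacity penalty, and all names here are illustrative assumptions, not the paper's exact formulation (which hybridises the genetic search with further heuristics).

```python
import random

random.seed(0)
T, D = 8, 4  # hypothetical counts: T tasklets, D devices
# comm[i][j]: assumed communication volume between tasklets i and j
comm = [[abs(i - j) for j in range(T)] for i in range(T)]
# bw[a][b]: assumed per-unit communication cost between devices a and b
bw = [[0 if a == b else 1 + abs(a - b) for b in range(D)] for a in range(D)]
cap = T // D  # toy per-device capacity (tasklets per device)

def cost(assign):
    """Communication cost of an assignment (tasklet -> device), plus a
    penalty for overloading any device beyond its capacity."""
    c = sum(comm[i][j] * bw[assign[i]][assign[j]]
            for i in range(T) for j in range(i + 1, T))
    loads = [assign.count(d) for d in range(D)]
    return c + 1000 * sum(max(0, load - cap) for load in loads)

def evolve(pop_size=40, gens=100, mut_rate=0.2):
    """Plain genetic algorithm: elitist selection, one-point crossover,
    point mutation over device assignments."""
    pop = [[random.randrange(D) for _ in range(T)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=cost)
        survivors = pop[:pop_size // 2]  # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, T)           # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mut_rate:         # point mutation
                child[random.randrange(T)] = random.randrange(D)
            children.append(child)
        pop = survivors + children
    return min(pop, key=cost)

best = evolve()
print(best, cost(best))
```

Even this bare-bones version beats naive placements (e.g. piling every tasklet onto one device), which hints at why a search-based allocator pays off when device-to-device bandwidths are highly uneven.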
© Mika Senghaas 2024