(2023)
Develops a novel algorithm for inference of large-scale LLMs in a geographically distributed, unreliable setting over the Internet
Auto-regressive decoding in LLMs is the bottleneck: without caching, each generated token re-processes the entire prefix, and the attention caches that avoid this recomputation are hard to maintain once the model is spread across unreliable machines
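The redundancy is easy to see in a toy example: without a key/value cache every decoding step re-encodes the full prefix, while a cache only ever appends the newest token. The snippet below is an illustrative sketch; `embed` and `attend` are stand-ins of my own, not code from the paper.

```python
# Toy illustration of how much re-encoding each decoding strategy needs.
# `embed` and `attend` are deterministic stand-ins, not real model code.
import numpy as np

d = 8
embed = lambda tok: np.random.default_rng(tok).standard_normal(d)

def attend(q, K, V):
    w = np.exp(K @ q - (K @ q).max())
    return (w / w.sum()) @ V

def decode_no_cache(prefix, steps):
    encoded = 0
    for _ in range(steps):
        K = V = np.stack([embed(t) for t in prefix])  # re-encode whole prefix
        attend(embed(prefix[-1]), K, V)
        encoded += len(prefix)
        prefix = prefix + [len(prefix)]               # pretend next token
    return encoded

def decode_with_cache(prefix, steps):
    K = V = np.stack([embed(t) for t in prefix])      # encode prefix once
    encoded = 0
    for _ in range(steps):
        attend(embed(prefix[-1]), K, V)
        prefix = prefix + [len(prefix)]
        e = embed(prefix[-1])[None, :]
        K, V = np.vstack([K, e]), np.vstack([V, e])   # append, never recompute
        encoded += 1
    return encoded

print(decode_no_cache(list(range(4)), 100))   # work grows quadratically with length
print(decode_with_cache(list(range(4)), 100)) # work grows linearly
```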
Existing methods for LLM inference do not account for geographically distributed, unreliable setups, leading either to inefficient communication or to redundant computation in failure-prone systems
Introduces a new swarm-based algorithm that separates devices into clients and servers and uses attention caches to reduce communication and computation. Fault tolerance comes from caching activations on both the client and server side, so a failed server can be replaced and its cache rebuilt efficiently.
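A minimal sketch of that recovery path, assuming hypothetical `Server` / `find_server` helpers (none of these names come from the paper): the client keeps its own copy of the activations already sent to each pipeline stage, so when a server drops out it can replay them to a replacement and rebuild that server's attention cache.

```python
# Sketch of client-side fault tolerance with server-side attention caches.
# All classes and helpers here are illustrative stubs, not the paper's code.
import random

class Server:
    """Stub for a remote server holding a contiguous block of layers."""
    def __init__(self, layers):
        self.layers = layers
        self.kv_cache = []                      # past activations for attention

    def forward(self, hidden):
        if random.random() < 0.05:              # simulated server failure
            raise ConnectionError("server dropped out")
        self.kv_cache.append(hidden)
        return hidden + 1                       # stand-in for running the layers

    def restore_cache(self, past_inputs):
        self.kv_cache = list(past_inputs)       # rebuild the attention cache

def find_server(layers, exclude=None):
    """Stub for swarm discovery: return a fresh server serving these layers."""
    return Server(layers)

class Client:
    def __init__(self, stages):
        self.stages = stages                               # e.g. [(0, 40), (40, 80)]
        self.servers = {s: find_server(s) for s in stages}
        self.sent = {s: [] for s in stages}                # client-side activation cache

    def forward_one_token(self, hidden):
        for s in self.stages:
            past = list(self.sent[s])           # inputs from previous tokens
            self.sent[s].append(hidden)         # remember this stage's new input too
            while True:
                try:
                    hidden = self.servers[s].forward(hidden)
                    break
                except ConnectionError:
                    # Pick a replacement server and replay the cached inputs so
                    # its attention cache matches before retrying this token.
                    self.servers[s] = find_server(s, exclude=self.servers[s])
                    self.servers[s].restore_cache(past)
        return hidden

client = Client([(0, 40), (40, 80)])
outputs = [client.forward_one_token(float(t)) for t in range(8)]
```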
Outperforms caching-with-restarts and cache-less inference strategies, balancing performance and fault tolerance across all tested failure rates and network conditions, on models including LLaMA 2 (70B) and BLOOM (176B), with 10x higher throughput than parameter offloading