(2023)
Develops a novel algorithm for inference of large-scale LLMs in a geographically distributed, unreliable setting over the Internet
Auto-regressive decoding in LLMs is the bottleneck: without caching, each generated token re-processes the entire prefix, and the attention caches that avoid this recomputation are hard to maintain once the model is spread across unreliable machines
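The redundancy is easy to see in a toy example: without a key/value cache every decoding step re-encodes the full prefix, while a cache only ever appends the newest token. The snippet below is an illustrative sketch; `embed` and `attend` are stand-ins of my own, not code from the paper.

```python
# Toy illustration of how much re-encoding each decoding strategy needs.
# `embed` and `attend` are deterministic stand-ins, not real model code.
import numpy as np

d = 8
embed = lambda tok: np.random.default_rng(tok).standard_normal(d)

def attend(q, K, V):
    w = np.exp(K @ q - (K @ q).max())
    return (w / w.sum()) @ V

def decode_no_cache(prefix, steps):
    encoded = 0
    for _ in range(steps):
        K = V = np.stack([embed(t) for t in prefix])  # re-encode whole prefix
        attend(embed(prefix[-1]), K, V)
        encoded += len(prefix)
        prefix = prefix + [len(prefix)]               # pretend next token
    return encoded

def decode_with_cache(prefix, steps):
    K = V = np.stack([embed(t) for t in prefix])      # encode prefix once
    encoded = 0
    for _ in range(steps):
        attend(embed(prefix[-1]), K, V)
        prefix = prefix + [len(prefix)]
        e = embed(prefix[-1])[None, :]
        K, V = np.vstack([K, e]), np.vstack([V, e])   # append, never recompute
        encoded += 1
    return encoded

print(decode_no_cache(list(range(4)), 100))   # work grows quadratically with length
print(decode_with_cache(list(range(4)), 100)) # work grows linearly
```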
Existing methods for LLM inference do not account for geographically distributed, unreliable setups, leading either to inefficient communication or to redundant computation in failure-prone systems
Introduces a new swarm-based algorithm that separates devices into clients and servers and uses attention caches to reduce communication and computation. Fault tolerance comes from caching activations on both the client and server side, so a failed server can be replaced and its cache rebuilt efficiently.
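A minimal sketch of that recovery path, assuming hypothetical `Server` / `find_server` helpers (none of these names come from the paper): the client keeps its own copy of the activations already sent to each pipeline stage, so when a server drops out it can replay them to a replacement and rebuild that server's attention cache.

```python
# Sketch of client-side fault tolerance with server-side attention caches.
# All classes and helpers here are illustrative stubs, not the paper's code.
import random

class Server:
    """Stub for a remote server holding a contiguous block of layers."""
    def __init__(self, layers):
        self.layers = layers
        self.kv_cache = []                      # past activations for attention

    def forward(self, hidden):
        if random.random() < 0.05:              # simulated server failure
            raise ConnectionError("server dropped out")
        self.kv_cache.append(hidden)
        return hidden + 1                       # stand-in for running the layers

    def restore_cache(self, past_inputs):
        self.kv_cache = list(past_inputs)       # rebuild the attention cache

def find_server(layers, exclude=None):
    """Stub for swarm discovery: return a fresh server serving these layers."""
    return Server(layers)

class Client:
    def __init__(self, stages):
        self.stages = stages                               # e.g. [(0, 40), (40, 80)]
        self.servers = {s: find_server(s) for s in stages}
        self.sent = {s: [] for s in stages}                # client-side activation cache

    def forward_one_token(self, hidden):
        for s in self.stages:
            past = list(self.sent[s])           # inputs from previous tokens
            self.sent[s].append(hidden)         # remember this stage's new input too
            while True:
                try:
                    hidden = self.servers[s].forward(hidden)
                    break
                except ConnectionError:
                    # Pick a replacement server and replay the cached inputs so
                    # its attention cache matches before retrying this token.
                    self.servers[s] = find_server(s, exclude=self.servers[s])
                    self.servers[s].restore_cache(past)
        return hidden

client = Client([(0, 40), (40, 80)])
outputs = [client.forward_one_token(float(t)) for t in range(8)]
```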
Outperforms caching-with-restarts and cache-less inference strategies, balancing performance and fault tolerance across all tested failure rates and network conditions, on models including LLaMA 2 (70B) and BLOOM (176B), with 10x higher throughput than parameter offloading