
Distributed Inference and Fine-tuning of Large Language Models Over The Internet

HSE
A. Borzunov, M. Ryabinin, A. Chumachenko

(2023)

14th September 2024
  • Introduction

    Develops a novel algorithm for inference of large-scale LLMs in a distributed, unreliable setting over the Internet

  • Motivation

    Auto-regressive decoding generates one token at a time, and without an attention cache each step redundantly recomputes the entire prefix, making distributed inference challenging, especially in unreliable environments (see the first sketch after this list)

  • Problem

    Existing methods for LLM inference do not account for geographically distributed and unreliable setups, leading to either inefficient communication or redundant computation in failure-prone systems

  • Methodology

    Introduces a new swarm-based algorithm that separates devices into clients and servers and uses attention caches to cut both communication and computation. Fault tolerance is ensured by caching activations on both the client and server sides, so that generation can recover efficiently when a server fails (see the second sketch after this list).

  • Results

    Outperforms both caching-with-restarts and cache-less inference strategies, balancing performance and fault tolerance across all tested environments and models, including LLaMA 2 (70B) and BLOOM (176B), with 10x higher throughput compared to parameter offloading

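To make the recomputation point concrete, here is a minimal toy sketch (my own illustration, not the paper's code) contrasting cache-less decoding, which recomputes keys and values for the whole prefix at every step, with decoding over a key/value cache, which does only incremental work per token. The single-head, query-projection-free attention is a deliberate simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_k, W_v = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def attention(q, K, V):
    """Scaled dot-product attention for one query over all keys/values."""
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_cacheless(embeddings):
    """Recomputes keys/values for the entire prefix at every step."""
    outputs = []
    for t in range(1, len(embeddings) + 1):
        prefix = embeddings[:t]
        K, V = prefix @ W_k, prefix @ W_v        # redundant recomputation
        outputs.append(attention(prefix[-1], K, V))
    return outputs

def decode_cached(embeddings):
    """Computes one new key/value pair per step and reuses the cache."""
    K_cache, V_cache, outputs = [], [], []
    for x in embeddings:
        K_cache.append(x @ W_k)                  # incremental cache update
        V_cache.append(x @ W_v)
        outputs.append(attention(x, np.array(K_cache), np.array(V_cache)))
    return outputs

tokens = rng.standard_normal((8, d))             # 8 toy "token embeddings"
assert np.allclose(decode_cacheless(tokens), decode_cached(tokens))
```

Both variants produce the same outputs; the cached one simply avoids redone work that grows quadratically with sequence length, which is the cost the paper's servers avoid by keeping per-client attention caches.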

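And a hedged sketch of the fault-tolerance idea under my own naming (`Client`, `ToyServer`, `find_server`, and `ServerFailure` are illustrative stand-ins, not the authors' API): the client keeps a copy of every input it has sent, so when a server fails mid-generation it can rebuild a replacement server's attention cache by replaying that history instead of restarting generation.

```python
class ServerFailure(Exception):
    """Raised when a remote pipeline stage becomes unreachable."""

class ToyServer:
    """Stand-in for a remote pipeline stage holding a per-client cache."""
    def __init__(self, fail_after=None):
        self.cache = []
        self.fail_after = fail_after         # simulate a crash for the demo
    def forward(self, x):
        if self.fail_after is not None and len(self.cache) >= self.fail_after:
            raise ServerFailure
        self.cache.append(x)                 # server-side attention cache grows
        return sum(self.cache)               # placeholder for a real activation

class Client:
    """Keeps a client-side history of sent inputs for failure recovery."""
    def __init__(self, find_server):
        self.find_server = find_server       # hypothetical discovery hook
        self.server = find_server()
        self.sent_inputs = []

    def step(self, x):
        """Run one decoding step, recovering transparently from failures."""
        self.sent_inputs.append(x)
        while True:
            try:
                return self.server.forward(x)
            except ServerFailure:
                # Replace the dead server and replay history to rebuild
                # its cache, instead of restarting generation from scratch.
                self.server = self.find_server()
                for past in self.sent_inputs[:-1]:
                    self.server.forward(past)

pool = iter([ToyServer(fail_after=2), ToyServer()])
client = Client(find_server=lambda: next(pool))
print([client.step(float(t)) for t in range(1, 5)])   # survives the crash
```

In the paper's setting, the replayed values are past activations for a server's pipeline stage, so rebuilding a cache costs one forward pass over the history rather than a full restart of decoding.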