arxiv:2509.21841

Zeppelin: Balancing Variable-length Workloads in Data Parallel Large Model Training

Published on Sep 26, 2025

Abstract

Training large language models (LLMs) with increasingly long and varying sequence lengths introduces severe load imbalance challenges in large-scale data-parallel training. Recent frameworks attempt to mitigate these issues through data reorganization or hybrid parallel strategies. However, they often overlook how computational and communication costs scale with sequence length, resulting in suboptimal performance. We identify three critical challenges: (1) varying computation-to-communication ratios across sequences of different lengths in distributed attention, (2) mismatch between static NIC-GPU affinity and dynamic parallel workloads, and (3) distinct optimal partitioning strategies required for quadratic attention versus linear components. To address these challenges, we present Zeppelin, a novel training system that integrates three key techniques: (1) a hierarchical sequence partitioning method for the attention module that reduces communication overhead and balances computation, supported by an efficient attention engine that applies divergent parallel strategies; (2) a routing layer that orchestrates inter-node transfers to fully utilize NIC bandwidth; and (3) a remapping layer that transforms sequence layouts between attention and linear modules, ensuring high computational efficiency across both. Comprehensive evaluations across diverse configurations show that Zeppelin delivers an average 2.80x speedup over state-of-the-art methods.

AI-generated summary

Zeppelin addresses load imbalance in training large language models by optimizing sequence partitioning, inter-node communication, and module-specific parallel strategies.
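To make the third challenge concrete, the sketch below shows how the same batch of variable-length sequences balances differently across data-parallel ranks depending on whether per-sequence cost is modeled as quadratic (attention) or linear (MLP and projection layers) in sequence length. This is an illustrative toy, not Zeppelin's partitioning algorithm: the greedy scheduler, the cost functions, and all names here are assumptions introduced for illustration only.

```python
# Illustrative sketch (not the paper's method): greedily assign variable-length
# sequences to data-parallel ranks under two different cost models, showing why
# attention (quadratic in length) and linear modules (linear in length) prefer
# different partitions. All functions and names below are hypothetical.

from heapq import heappush, heappop


def attention_cost(seq_len: int) -> float:
    # Self-attention FLOPs grow roughly quadratically with sequence length.
    return float(seq_len) ** 2


def linear_cost(seq_len: int) -> float:
    # MLP / projection FLOPs grow roughly linearly with the number of tokens.
    return float(seq_len)


def greedy_partition(seq_lens, num_ranks, cost_fn):
    """Assign each sequence to the currently least-loaded rank (LPT-style greedy)."""
    heap = [(0.0, rank) for rank in range(num_ranks)]
    assignment = {rank: [] for rank in range(num_ranks)}
    # Placing the longest sequences first improves balance for a greedy scheduler.
    for seq_len in sorted(seq_lens, reverse=True):
        load, rank = heappop(heap)
        assignment[rank].append(seq_len)
        heappush(heap, (load + cost_fn(seq_len), rank))
    return assignment


if __name__ == "__main__":
    batch = [8192, 4096, 4096, 2048, 1024, 1024, 512, 256]
    print("attention-balanced:", greedy_partition(batch, num_ranks=4, cost_fn=attention_cost))
    print("linear-balanced:   ", greedy_partition(batch, num_ranks=4, cost_fn=linear_cost))
```

Running this shows that the attention-balanced and token-balanced assignments diverge for the same batch, which is the kind of mismatch the paper's remapping layer between the attention and linear modules is designed to bridge.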
