What Happens When You Start Adding GPUs?
Introduction
Open-source models now scale to 100B+ parameters (LLMs, video diffusion transformers, multimodal architectures). At this scale, the question shifts from "how do I train a model?" to "how do I train it efficiently?", especially since the source code is already available. And because these models ship with their weights, it also becomes "how do I run inference efficiently?" Both are optimization problems.
In this article, I benchmark what happens when you start adding GPUs for inference, diving into a parallelism strategy commonly found in open-source optimizations and investigating its trade-offs.
Background
Adding GPUs to share work is called parallelism. It is used because a single GPU eventually runs out of memory, and attention is one of the most memory-intensive components in these models: its memory usage grows with sequence length, making long-context inference a challenge.
It is generally advised that if a model fits comfortably on a single GPU and the sequence length is small, a parallelism strategy is unnecessary1. In that case, attention can be efficiently optimized on a single GPU (as discussed for FlashAttention in my previous article).
Case Example
There are many ways to distribute computation, but today we will focus on one: Sequence Parallelism (SP), applied to attention.
In a typical single-GPU setup, attention operates on the entire sequence at once. The data is represented as:
[batch_size, seq_len, hidden_size]
You can think of this as a large input holding all the data. The model then creates projections (Q, K, and V), where all computations are performed on a single GPU.
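To make the shapes concrete, here is a minimal single-device sketch in NumPy. The weight matrices and dimensions are illustrative placeholders (a real model learns its projections and uses multiple heads); the point is only that Q, K, V, and the attention output all live on one device and cover the full sequence.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((batch_size, seq_len, hidden_size))

# Illustrative projection weights; a real model learns these.
w_q = rng.standard_normal((hidden_size, hidden_size))
w_k = rng.standard_normal((hidden_size, hidden_size))
w_v = rng.standard_normal((hidden_size, hidden_size))

# All three projections are computed on the same device,
# over the entire sequence at once.
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Scaled dot-product attention over the full seq_len x seq_len score matrix.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(hidden_size)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v

print(out.shape)  # (2, 8, 16) — same [batch_size, seq_len, hidden_size] shape
```

Note that the `scores` matrix is `seq_len x seq_len` per batch element: this quadratic term is where the memory growth with sequence length comes from.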
However, when SP is used, the input is split across the sequence dimension:
[batch_size, seq_len, hidden_size]
becomes
[batch_size, seq_len / N, hidden_size] per GPU
Each GPU processes only its own chunk of seq_len / N tokens. This reduces per-GPU memory and enables even longer sequences2.
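The split itself is simple to picture. A sketch of the partitioning (simulated here on one machine; the GPU count and dimensions are arbitrary examples):

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 8, 16
n_gpus = 4  # hypothetical device count
x = np.random.default_rng(0).standard_normal((batch_size, seq_len, hidden_size))

# Split along the sequence dimension: one chunk per GPU.
chunks = np.array_split(x, n_gpus, axis=1)

for rank, chunk in enumerate(chunks):
    print(rank, chunk.shape)  # each chunk is (2, 2, 16), i.e. seq_len / N tokens
```

Concatenating the chunks back along axis 1 recovers the original tensor, so no information is lost by the split itself; the complication is purely in the cross-chunk dependencies discussed next.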
Once you split along the sequence dimension, tokens depend on tokens stored on other GPUs. This means that when Q, K, and V are distributed, devices need to communicate to access each other's information. SP therefore introduces a trade-off between memory savings and communication overhead.
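One common way to resolve the cross-chunk dependency is for each device to keep its local Q but gather K and V from all devices before computing attention. The sketch below simulates this on one machine with plain NumPy, standing in the all-gather with a concatenation; a real implementation would use a collective such as `torch.distributed.all_gather` (or a ring-style exchange to avoid materializing full K/V). All names and sizes here are illustrative.

```python
import numpy as np

batch_size, seq_len, hidden_size, n_gpus = 2, 8, 16, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((batch_size, seq_len, hidden_size))
w = rng.standard_normal((3, hidden_size, hidden_size))  # shared Q, K, V weights

# Each "GPU" projects only its local seq_len / N tokens.
chunks = np.array_split(x, n_gpus, axis=1)
local_qkv = [(c @ w[0], c @ w[1], c @ w[2]) for c in chunks]

# Simulated all-gather: every device needs every other device's K and V.
# This concatenation is the communication step in a real distributed setup.
k_full = np.concatenate([k for _, k, _ in local_qkv], axis=1)
v_full = np.concatenate([v for _, _, v in local_qkv], axis=1)

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Each device: its local Q attends over the full, gathered K/V.
outs = [softmax(q @ k_full.transpose(0, 2, 1) / np.sqrt(hidden_size)) @ v_full
        for q, _, _ in local_qkv]
out = np.concatenate(outs, axis=1)  # matches the single-device result
```

Because projections act per token, concatenating the per-device outputs reproduces the single-device attention output exactly; what changes is where the memory lives and how much data crosses the interconnect.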
Benchmark
- I will run an experiment to examine this trade-off in detail.

References