
Maximizing generation efficiency during RL training
RL is back. After a period of skepticism, techniques like RLHF and the more recent RLAIF have become central to pushing the state of the art, transforming promising base models into capable, aligned agents. But as the foundational ScaleRL paper clearly outlined, our understanding of how to scale RL lags well behind our understanding of how to scale pre-training for LLMs.
One particularly active area of research is how to balance generation and training so that GPUs are utilized as efficiently as possible. For many top labs, as much as 90% of the GPU fleet dedicated to an RL run isn't actually training the model. It's generating the data the model will train on by running inference on a sampled set of prompts. The trainer, often a comparatively small cluster of GPUs, sits idle, waiting for the massive generation fleet to complete its work.
Between asynchronous inference and engine switching, there is a lot of interesting work being done in this space. This post will dive a bit deeper into the problem and survey several research papers that propose (or more often, hint at) potential solutions, as well as some of the work that went into Ai2's latest release, Olmo 3.
More on the problem of balancing generation and training
To motivate the problem further, we must look at the fundamental RL loop. For a given task – an example might be solving a math problem – the process looks something like this (a minimal code sketch follows the list):
- Generate: A prompt (the math problem) is given to the model. The model generates a response or "rollout" which usually involves a long chain of reasoning leading to a final answer. In modern setups like Group Relative Policy Optimization (GRPO), the model samples multiple responses to the same prompt (like perhaps 16 responses).
- Evaluate: The responses are evaluated against the task's reward signal. Did the model get the right answer? Each response is scored: correct ones are rewarded, incorrect ones are not.
- Train: The model is updated via an RL algorithm like PPO or GRPO. The update is simple: do more of what led to the good outcomes, and less of what led to the bad ones.
- Repeat: Sample a new set of prompts and go back to step 1.
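To make the loop concrete, here is a minimal Python sketch of one synchronous iteration. Everything here is illustrative: `generate_rollouts`, `score_response`, and `policy_update` are hypothetical stand-ins for your inference engine, reward function, and trainer, and the advantage computation follows the group-relative idea GRPO is named for rather than any particular library's implementation.

```python
import statistics

GROUP_SIZE = 16  # rollouts sampled per prompt, matching the GRPO example above

def rl_step(policy, prompts, generate_rollouts, score_response, policy_update):
    """One synchronous generate -> evaluate -> train iteration (sketch)."""
    batch = []
    for prompt in prompts:
        # 1) Generate: sample a group of responses from the current policy.
        rollouts = generate_rollouts(policy, prompt, n=GROUP_SIZE)

        # 2) Evaluate: score each rollout, e.g. 1.0 if the final answer is correct.
        rewards = [score_response(prompt, r) for r in rollouts]

        # Group-relative advantage: how much better each rollout did than its siblings.
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # avoid dividing by zero
        batch.extend((r, (reward - mean) / std) for r, reward in zip(rollouts, rewards))

    # 3) Train: reinforce rollouts with positive advantage, suppress the rest.
    return policy_update(policy, batch)
```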
It’s step 1 here that causes a major bottleneck. For complex reasoning tasks like the ones many labs are doing RL for, it’s been empirically shown that longer chains of thought produce better results. It's not uncommon for models to generate rollouts of 32K tokens or more. Moreover, this is an autoregressive generation process, meaning the model generates one token at a time, sequentially; essentially 32K separate forward passes through the network.
The bottleneck in autoregressive generation is that every forward pass has to read the model's weights and the growing KV cache out of GPU memory; and because of the size of the KV cache, the batch size for each pass is typically very limited. That makes it hard to amortize these expensive memory transfers over enough useful work per pass.
Training is different. During training we run a single forward (and backward) pass across a large batch at the full sequence length of whatever token window the model is training on, whereas during inference the effective sequence length per step is always 1 (generating one token at a time). So during training we are compute-bound instead of memory-bound, which is a much more efficient place to be given today's hardware constraints.
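A rough back-of-the-envelope calculation makes the contrast concrete. The numbers below are illustrative (a hypothetical 7B-parameter model in bf16, ignoring the KV cache and attention FLOPs entirely), but they show why single-token decoding is starved for memory bandwidth while a large training batch is not:

```python
# Rough back-of-the-envelope: arithmetic intensity (FLOPs per byte moved)
# for single-token decoding vs. a large training batch. Numbers are illustrative.

PARAMS = 7e9          # hypothetical 7B-parameter model
BYTES_PER_PARAM = 2   # bf16 weights

def decode_intensity(batch_size: int) -> float:
    """Decoding: every step re-reads all weights to produce one token per sequence."""
    flops = 2 * PARAMS * batch_size          # ~2 FLOPs per parameter per token
    bytes_moved = PARAMS * BYTES_PER_PARAM   # weights streamed from GPU memory each step
    return flops / bytes_moved

def train_intensity(tokens_in_batch: int) -> float:
    """Training: one pass over the weights is amortized across the whole batch."""
    flops = 6 * PARAMS * tokens_in_batch     # ~6 FLOPs per parameter per token (fwd + bwd)
    bytes_moved = PARAMS * BYTES_PER_PARAM
    return flops / bytes_moved

print(decode_intensity(batch_size=8))            # ~8 FLOPs/byte: far below what GPUs can feed
print(train_intensity(tokens_in_batch=4096 * 8)) # ~1e5 FLOPs/byte: comfortably compute-bound
```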
The end result of all of this is that a training step that might take a minute and a half is stuck waiting for a generation step that can take ten minutes or more. And with GPU prices being what they are, that kind of idle time is highly undesirable.
Running synchronously vs. parallelization
The most straightforward approach to RL training is to run this generation-training loop synchronously. You generate a complete batch of data with the current model, stop the generators, train the model on that fresh data, and then restart the generators with the newly updated model weights. This is a purely on-policy approach: the policy that generates the data is the exact same policy that you are training on.
The benefits of this approach are stability and simplicity: it is fairly intuitive. The training data is a perfect reflection of the model's current capabilities, its strengths, and its weaknesses. But the downside, as we've discussed, is significant inefficiency. Your expensive training cluster spends most of its time sitting idle.
The obvious solution is to somehow run these two tasks simultaneously: have the generators run continuously, creating data for future steps while the trainer is busy with the current step. This turns the algorithm into an off-policy one – which adds instability – because the generators will be running an older version of the model (the behavior policy) to create data that will be used to train the new version of the model (the target policy). You can tune the degree of asynchronicity here as well, choosing how many steps N the generators are "allowed" to lag behind the trainer.
This introduces a new, challenging variable: data staleness. The data being generated in step N+3 is based on a model that hasn't learned the lessons from steps N, N+1, and N+2. If the model made an obvious mistake in that stale data—a mistake it has since learned to correct—is that data still useful? Or is it actively harmful, teaching the model lessons it has already moved past? How stale is too stale?
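As a mental model for the bounded version of this setup, here is a simplified producer/consumer sketch. The bounded queue enforces the "N steps behind" contract: generators block once they get too far ahead of the trainer. All the names (`get_latest_policy`, `publish_weights`, and so on) are placeholders, not the API of any particular framework.

```python
import queue

# Generators may run at most MAX_OFFSET batches ahead of the trainer: the bounded
# queue blocks producers once they get too far in front of the consumer.
MAX_OFFSET = 3
rollout_queue = queue.Queue(maxsize=MAX_OFFSET)

def generator_loop(get_latest_policy, generate_batch):
    """Producer: always generates with the newest weights it has seen."""
    while True:
        version, policy = get_latest_policy()
        # Blocks here whenever the trainer has fallen MAX_OFFSET batches behind.
        rollout_queue.put((version, generate_batch(policy)))

def trainer_loop(train_on, publish_weights):
    """Consumer: trains on whatever arrives, then broadcasts the new weights."""
    step = 0
    while True:
        version, batch = rollout_queue.get()
        train_on(batch)        # off-policy: this batch came from policy `version` <= step
        step += 1
        publish_weights(step)  # generators pick the new weights up on their next pull
```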
A survey of modern solutions and research for maximizing generation efficiency
Answering the question of "how stale is too stale?" is an active area of research, blending empirical experimentation with some clever infrastructure design. There is no silver bullet, but several compelling approaches are emerging, ranging from algorithmic adjustments to what can only be described as unusual infrastructure hacks.
Outside of what’s published here, many labs (such as Ai2) are taking an empirical approach, running several experiments and essentially determining how stale is too stale. Speaking of which…
1) Embracing staleness and managing stragglers
It turns out you can get away with a surprising amount of staleness. The aforementioned ScaleRL paper, authored by a cross-discipline group of academics and Meta researchers, explored this trade-off directly in the context of their 70B Llama 3-V model. In their asynchronous RL setup, they allowed the generators to get up to k steps ahead of the trainer. They found that a k value of around 8 – meaning the data could be up to 8 training steps old – had minimal impact on performance while dramatically improving hardware utilization.

This approach, however, has to contend with the "straggler" problem. Not all generations are created equal. A simple problem might be solved in 100 tokens, while a difficult one might take the full 32,000-token maximum allotted for chain-of-thought (CoT) reasoning. These long-running generations can become exceptionally stale relative to the average. If batch N contains a straggler that only finishes when the trainer is on step N+50, that data point is far older than its peers.
The solution is often to simply throw it away. You set a staleness threshold and discard any data that's too old, accepting the wasted computation as a cost of doing business.
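In code, that threshold is just a filter at batch-assembly time. This is a toy sketch: the `step_started` field, the default threshold, and the returned "wasted" fraction are illustrative bookkeeping, not anyone's published implementation.

```python
def drop_stragglers(finished_rollouts, current_step, max_age=8):
    """Discard rollouts generated by a policy more than `max_age` steps old."""
    kept = [r for r in finished_rollouts if current_step - r["step_started"] <= max_age]
    # The discarded fraction is pure wasted compute -- the cost of doing business.
    wasted = 1 - len(kept) / max(len(finished_rollouts), 1)
    return kept, wasted
```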
2) Intelligent data filtering and curriculum
Beyond simply managing staleness, a different (and perhaps more sophisticated) approach is to be more selective about the data you train on in the first place. This is a form of automated curriculum learning.
The same ScaleRL paper discusses two powerful filtering techniques.
The first is Zero Variance Filtering. If you ask a model to generate 16 solutions to a problem and it gets the answer right 100% of the time, that problem is too easy. The model has nothing to learn. Conversely, if it gets it right 0% of the time, the problem may be too hard, and trying to learn from it could be inefficient. The researchers at Ai2 have found that the most valuable data comes from problems where the success rate is somewhere in the middle, between 25% and 75%. Filtering for this "Goldilocks zone" focuses the training on the most informative examples.
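A sketch of what that filter looks like in practice. The 25%–75% band comes from the text above; the function name and data shapes are illustrative.

```python
def keep_prompt(rewards, low=0.25, high=0.75):
    """Zero-variance filtering plus a success-rate band.

    `rewards` holds the 0/1 scores for one prompt's group of rollouts.
    Prompts the model always solves (or never solves) carry no learning
    signal; the band keeps the informative "Goldilocks zone" in between.
    """
    pass_rate = sum(rewards) / len(rewards)
    if pass_rate in (0.0, 1.0):      # zero variance: nothing to learn from this group
        return False
    return low <= pass_rate <= high

print(keep_prompt([1] * 16))           # False: solved every time, too easy
print(keep_prompt([0] * 16))           # False: never solved, likely too hard
print(keep_prompt([1] * 8 + [0] * 8))  # True: 50% pass rate, maximally informative
```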
The second is Adaptive Prompt Filtering. This takes the idea a step further by maintaining a score for each prompt based on the model's historical performance. Prompts that are consistently solved are down-weighted, while those in the valuable learning zone are prioritized. This dynamically adjusts the curriculum to the model's evolving capabilities. In the ScaleRL paper, the threshold for prompt removal was a pass rate of ≥ 0.9.
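A minimal way to render adaptive prompt filtering is to keep a running estimate of each prompt's pass rate and retire prompts that cross the removal threshold. The 0.9 threshold is the figure cited from ScaleRL; the exponential moving average and its decay constant are illustrative choices, not the paper's exact mechanism.

```python
from collections import defaultdict

REMOVE_AT = 0.9   # retire a prompt once its pass rate reaches this (the >= 0.9 figure above)
DECAY = 0.8       # EMA decay constant -- an illustrative choice, not from the paper

pass_rate_ema = defaultdict(float)
retired = set()

def update_prompt_stats(prompt_id, rewards):
    """Fold the latest group's pass rate into the prompt's running estimate."""
    rate = sum(rewards) / len(rewards)
    pass_rate_ema[prompt_id] = DECAY * pass_rate_ema[prompt_id] + (1 - DECAY) * rate
    if pass_rate_ema[prompt_id] >= REMOVE_AT:
        retired.add(prompt_id)  # consistently solved: stop spending rollouts on it

def sampling_pool(all_prompt_ids):
    """Prompts still in the valuable learning zone stay in the sampling pool."""
    return [p for p in all_prompt_ids if p not in retired]
```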
3) The distributed systems approach: PipelineRL
While some teams focus on the data, others have tackled the problem at the infrastructure level. The goal is to keep every GPU busy, one way or another, leading to some novel ideas in system design.
The influential PipelineRL paper from ServiceNow reframes the problem entirely. Instead of viewing RL as a tight, synchronous loop, it treats it as a distributed computing problem: a producer-consumer pipeline.
In this model, the system is decoupled into two main components:
- The actors (the data generators) are the producers. Their only job is to continuously run inference with their current version of the model policy, generating experiences and pushing them into a shared, centralized replay buffer.
- The learner (the trainer) is the consumer. It runs on its own schedule, constantly pulling batches of data from the replay buffer to perform its training updates.
After an update, the learner broadcasts its new, improved policy weights back out to the actors, who then seamlessly adopt them for future generations. Critically, the actors do not stop any of their existing chains of thought, but instead continue token generation on top of an effectively stale KV cache. Even though the cached key/value tensors are from an "older" model, in practice the end product of the CoT is usually good enough.
This asynchronous, decoupled architecture smooths out the entire process. An actor never has to sit idle waiting for its in-flight generations to finish before adopting new weights; it can always be generating new data. Likewise, the generators never have to stop and wait for a training step to complete. Each component can operate at its maximum possible speed, dramatically increasing overall hardware utilization.
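The distinctive part of this design is the in-flight weight update: an actor keeps its partially generated sequences (and their KV caches) when new weights arrive and simply continues decoding with the fresh parameters. The sketch below illustrates that idea at the pseudocode level; the `engine` and `weight_updates` interfaces are invented for the example and do not correspond to PipelineRL's actual code.

```python
def actor_loop(engine, weight_updates, prompts, push_rollout, max_new_tokens=32_768):
    """Producer that never pauses its in-flight generations to take a weight update."""
    active = [engine.start(p) for p in prompts]       # in-flight sequences + KV caches
    while active:
        if weight_updates.has_new():
            # Swap in the learner's latest weights *without* resetting the KV cache.
            # The cached keys/values were computed under an older policy, but in
            # practice the finished chain of thought is still good enough to train on.
            engine.load_weights(weight_updates.latest())

        for seq in list(active):
            seq.append(engine.decode_one_token(seq))  # one autoregressive step
            if seq.finished() or seq.length() >= max_new_tokens:
                push_rollout(seq)                     # hand off to the replay buffer
                active.remove(seq)
```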
4) Kimi's Hybrid Deployment / Engine Switching
While PipelineRL offers an elegant architectural solution, the team at Moonshot AI faced a problem of such massive scale that it required a different, more audacious approach. As detailed in their Kimi K2 paper, their model was so large that it required a minimum of 256 GPUs just to hold it in memory. The financial implication of allowing a 256-GPU cluster to sit idle, even for a few minutes, is, as we say in the business, “not good.”
Their solution, which they call Hybrid Deployment or "engine switching," is unusual. Instead of having separate clusters for training and inference, they use a single, massive cluster for both, dynamically reconfiguring its purpose on the fly. The process is as follows:
- First, the cluster is configured for training. The training engine, model weights, and massive optimizer states are loaded into the 256 GPUs' memory, and a single training step is performed.
- The main issue here is sharding. Each engine uses a different sharding paradigm, yet the inference engine must still obtain updated parameters from the training engine.
- To solve this, they created a Checkpoint Engine that is co-located on training nodes to manage these parameter states. Each checkpoint engine worker broadcasts all of its parameters to every other worker.
- Then, the inference workers just pick and choose the parameter chunks they need from this large broadcasted “stream” of data.
The authors note the obvious tradeoff of broadcasting this much data even when only a subset is needed for inference. But they believe it’s the right decision because it’s significantly simpler than alternatives, and the ability to fully decouple training and inference makes maintenance and testing way easier.
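To see why the broadcast-everything approach is simple, here is a toy, single-process rendering of the handoff: every checkpoint-engine worker contributes its parameter chunks to a full broadcast, and each inference worker keeps only the chunks its own sharding layout needs. All names and the dictionary-based "stream" are illustrative; the real system moves tensors over high-bandwidth collectives.

```python
def broadcast_all(training_shards):
    """Each training worker contributes its shards; everyone sees the full stream."""
    stream = {}
    for worker_shards in training_shards:
        stream.update(worker_shards)          # full parameter set, keyed by chunk id
    return stream

def gather_for_inference(stream, needed_chunk_ids):
    """An inference worker picks out only the chunks its own layout requires."""
    return {cid: stream[cid] for cid in needed_chunk_ids}

# Toy example: two training workers each hold half of four parameter chunks.
training_shards = [{"w0": 0.1, "w1": 0.2}, {"w2": 0.3, "w3": 0.4}]
stream = broadcast_all(training_shards)
# An inference worker sharded differently only needs w1 and w2 from the stream.
print(gather_for_inference(stream, ["w1", "w2"]))   # {'w1': 0.2, 'w2': 0.3}
```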
Conclusion and further reading
Like I mentioned earlier, maximizing generation efficiency during RL training is an area of active research without any clear best practices at the moment. One can glean snippets and ideas from popular papers about scaling RL, but like the rest of the scaling-RL "stack," there is significantly less consensus on these topics than there is for core pre-training of LLMs.
Additionally, many of the details and examples in this post are taken from Ai2's RL setup; they released their state-of-the-art, completely open reasoning model, Olmo 3, last week, along with all of the weights, training code, etc. (when I say completely open, I mean it). You can read their technical breakdown here.



