Verifiers supports RL training with prime-rl as well as the included vf.RLTrainer, both of which can be orchestrated via a single TOML config.
If your primary goal is to train a model on a Verifiers Environment, we recommend using prime-rl, which distills the best practices from our research team’s experience and the broader community into a stable, easy-to-use recipe. If you want to hack on new training algorithms and are less concerned with maximum performance or advanced features, you can use the included RLTrainer (via vf-rl), whose core files are under 1000 lines of code and include only the most essential logic for fairly-performant async off-policy training (with the same core algorithm as prime-rl).
prime-rl
We recommend using the prime-rl trainer, and provide a basic setup guide below. See the prime-rl documentation for more information.
To get started, follow the prime-rl setup instructions; this will install the prime-rl trainer and its dependencies, and set up a default configuration for training with the included wiki-search Environment.
Then, you can start training with the uv run prime-rl command, pointing it at your TOML config (see Configuration below).
Configuration
prime-rl can be used with a single TOML file via its native rl.py script, which is used by the uv run prime-rl command from verifiers.
Example configuration file for the primeintellect/wiki-search Environment with Qwen/Qwen3-4B-Instruct-2507:
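As a rough sketch only (the section and key names below are assumptions rather than the verified prime-rl schema; consult the prime-rl documentation for the authoritative example):

```toml
# Illustrative prime-rl RL config sketch; key names are assumptions.

[trainer.model]
name = "Qwen/Qwen3-4B-Instruct-2507"   # policy model to train (assumed key)

[orchestrator.environment]
id = "primeintellect/wiki-search"      # Verifiers Environment to train on (assumed key)

[orchestrator.sampling]
max_tokens = 4096                      # max tokens generated per turn
temperature = 1.0                      # sampling temperature
```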
vf.RLTrainer
The included RLTrainer is a minimal, hackable training loop based on transformers.Trainer that supports both full-parameter finetuning and LoRA training. RLTrainer can be viewed as a “baby” prime-rl that adopts a similar default training recipe (async CISPO with one-step off-policy overlap), intended for single-node test runs with dense models. The primary files (trainer.py and orchestrator.py, located in verifiers/rl/trainer/) are under 1000 lines of code, and are designed to be a convenient starting point for writing your own training loop.
The feature set is intentionally kept minimal and focused. Users seeking maximum performance, MoE support, multi-node training, multidimensional parallelism, and other advanced features should use the prime-rl trainer.
To use vf.RLTrainer in your own project, install verifiers with its RL extras. Then set up a default training configuration at configs/vf-rl/wiki-search.toml by running uv run vf-setup, and launch training with the vf-rl entry point.
Configuration
vf-rl can be used with a single TOML file, largely mirroring the configuration options for prime-rl but with some key differences in organization and feature sets.
Example configuration file for the primeintellect/wiki-search Environment with Qwen/Qwen3-4B-Instruct-2507:
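A minimal sketch of what such a config might contain; the [trainer.args] keys are the ones described below, while the other section and key names are assumptions:

```toml
# Illustrative vf-rl config sketch; only the [trainer.args] keys are documented here,
# the [model]/[env] section and key names are assumptions.

[model]
name = "Qwen/Qwen3-4B-Instruct-2507"   # policy model to train (assumed key)

[env]
id = "primeintellect/wiki-search"      # Environment to train on (assumed key)

[trainer.args]
rollouts_per_example = 16   # group size: completions per prompt
micro_batch_size = 4        # rollouts per GPU per step
batch_size = 512            # total rollouts per global batch
learning_rate = 1e-6
max_steps = 500
max_tokens = 2048           # generation length per turn
temperature = 1.0
```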
Key Hyperparameters
Batch Configuration
Key fields in [trainer.args] (illustrated in the sketch below):
- rollouts_per_example: completions per prompt (group size). Larger groups (16-32) increase reward diversity, but also increase training time and memory usage.
- micro_batch_size: rollouts per GPU per step; limited by the GPU memory left after model weights.
- batch_size: total rollouts per global batch; must be divisible by micro_batch_size * world_size and by rollouts_per_example.
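For example (values are placeholders, not recommendations):

```toml
[trainer.args]
rollouts_per_example = 16   # completions sampled per prompt (group size)
micro_batch_size = 4        # rollouts processed per GPU per step
batch_size = 512            # 512 is divisible by 16, and by 4 * world_size
                            # for typical world sizes (e.g. 8 GPUs -> 32)
```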
Generation Parameters
Both prime-rl and vf-rl support configurable generation parameters, including:
- max_tokens: maximum number of tokens to generate per turn
- temperature: sampling temperature
- top_p: top-p sampling
- top_k: top-k sampling
- min_p: minimum probability for sampling
- repetition_penalty: repetition penalty for sampling

In prime-rl, these parameters are configured in the [orchestrator.sampling] section; in vf-rl, they are configured in the [trainer.args] section.
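For instance, in prime-rl (values are illustrative placeholders):

```toml
[orchestrator.sampling]
max_tokens = 4096
temperature = 1.0
top_p = 1.0
top_k = -1               # "-1 disables top-k" is an assumption; check your schema
min_p = 0.0
repetition_penalty = 1.0
```

In vf-rl, the same parameter names go under [trainer.args] instead.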
Training Schedule
Core fields in [trainer.args] (sketched below):
- learning_rate, lr_scheduler_type, warmup_steps, max_steps
- max_grad_norm, bf16, gradient_checkpointing
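A sketch of these fields (values are placeholders; the scheduler name assumes transformers-style identifiers, since RLTrainer builds on transformers.Trainer):

```toml
[trainer.args]
learning_rate = 1e-6
lr_scheduler_type = "constant_with_warmup"   # transformers scheduler name; choice is illustrative
warmup_steps = 10
max_steps = 500
max_grad_norm = 1.0
bf16 = true
gradient_checkpointing = true
```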
LoRA Training
LoRA training is supported in both prime-rl and vf-rl. In prime-rl, it can be configured via the [trainer.model.experimental.lora] section. In vf-rl, it is enabled by default and can be configured via the [trainer.args] section.
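For example, in prime-rl (the section name comes from above, but the rank/alpha key names are assumptions):

```toml
[trainer.model.experimental.lora]
rank = 32    # assumed key name for LoRA rank
alpha = 64   # assumed key name for LoRA scaling
```

In vf-rl, the equivalent LoRA options live in [trainer.args] alongside the other trainer settings.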
RL Rules of Thumb
RL is notoriously sensitive to implementation details. Here’s practical guidance:
Before Training
- Evaluate baseline performance: If your model gets 0% reward after 10+ attempts, the task is too hard
- Check task difficulty: If baseline is already 80%+, consider harder examples
- Ensure reward diversity: You want varied scores within each generation group
Stability vs Performance Trade-offs
For more aggressive training (higher risk of collapse):
- Increase learning rate (3e-5 to 1e-4 for LoRA, 3e-6 to 1e-5 for full finetuning)
- Decrease rollouts_per_example and batch_size for faster generation

For more stable training (lower risk of collapse):
- Increase rollouts_per_example (16-32)
- Increase batch_size (512-1024)
- Use larger models (14B+)
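Concretely, the two regimes might look like these [trainer.args] sketches (vf-rl field names; the learning rates assume a LoRA run, and the exact values are placeholders within the ranges above):

```toml
# Aggressive: faster iteration, higher risk of collapse
[trainer.args]
learning_rate = 5e-5        # within the 3e-5 to 1e-4 LoRA range
rollouts_per_example = 8    # smaller group for faster generation (placeholder)
batch_size = 256            # smaller global batch (placeholder)
```

```toml
# Stable: slower, lower risk of collapse
[trainer.args]
learning_rate = 1e-5        # conservative LoRA learning rate (assumption)
rollouts_per_example = 32   # 16-32
batch_size = 1024           # 512-1024
```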
Troubleshooting
Common Issues
Non-Increasing Chat Templates: The Qwen3 and DeepSeek-R1 model series both remove <think> sections from messages when processing inputs, which violates the increasing context requirement for multi-turn training. We provide versions of many of these models with modified chat templates here.
OOM during generation:
- Reduce rollouts_per_example or micro_batch_size
- Use LoRA instead of full finetuning
- Check that the vLLM server has sufficient memory

Training instability (collapsing or noisy rewards):
- Decrease learning rate
- Increase rollouts_per_example
- Increase batch_size

Rewards not improving:
- Increase learning rate
- Leverage continuous rewards
- Use online difficulty filtering
- Calibrate difficulty appropriately via smarter models, easier tasks
Next Steps
- Explore Environments to create custom tasks
- Review Components for advanced patterns
- See the examples directory on GitHub for complete training scripts