Math Reasoning
Train models to solve mathematical problems step-by-step, using symbolic verification to reward correct answers.
Environment type: SingleTurnEnv with MathRubric
Why RL works here: Models learn to produce correct final answers through trial and error. The reward signal is binary and cheap to compute — symbolic math verification checks whether the model’s \boxed{} answer matches the ground truth, without needing an LLM judge.
Example environment:
- Start with GSM8K for validation — baseline models typically score 40–70%, leaving room for improvement.
- For harder tasks (AIME, competition math), use a larger model like Qwen/Qwen3-235B-A22B-Thinking-2507 and increase max_tokens.
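To make the binary reward concrete, here is a minimal standalone sketch of a boxed-answer check. It is not the MathRubric implementation: the function names (extract_boxed, boxed_reward) are hypothetical, and it compares plain numbers with Python's fractions module instead of performing full symbolic verification.

```python
from fractions import Fraction
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1  # brace depth, so nested LaTeX like \frac{1}{2} survives
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def _as_number(s: str) -> Optional[Fraction]:
    """Parse strings such as '1,000', '0.5' or '1/2'; None if not numeric."""
    s = s.replace(",", "").replace("$", "").strip()
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def boxed_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the boxed answer matches the ground truth."""
    answer = extract_boxed(completion)
    if answer is None:
        return 0.0
    a, b = _as_number(answer), _as_number(ground_truth)
    if a is not None and b is not None:
        return 1.0 if a == b else 0.0
    # Fallback (an assumption of this sketch): whitespace-insensitive match.
    return 1.0 if answer.replace(" ", "") == ground_truth.replace(" ", "") else 0.0
```

For example, boxed_reward("The answer is \\boxed{0.5}", "1/2") scores the completion as correct because both sides parse to the same fraction. A real symbolic verifier would also accept equivalent expressions that this sketch rejects.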
Code Generation with Sandboxes
Train models to write correct code by executing their solutions in sandboxed environments and verifying outputs against test cases.
Environment type: PythonEnv or SandboxEnv
Why RL works here: The model gets a concrete pass/fail signal from running code. Unlike static checking, execution-based verification catches subtle bugs and rewards solutions that actually work. Multi-turn interaction lets the model iteratively debug when tests fail.
Example environment:
- Use PythonEnv for Python-specific tasks; it provides a persistent REPL that the model can use across turns.
- Use SandboxEnv for multi-language tasks or when you need shell access.
- Set max_turns to 3–5 to let the model iterate on failing test cases.
- Consider a partial reward for passing some but not all tests, rather than all-or-nothing scoring.
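The partial-reward idea from the last tip can be sketched as the fraction of test cases a submitted function passes. This is an illustration only: partial_test_reward and load_function are hypothetical names, and the sketch runs code with exec in the current process, which is not a sandbox. In a real environment the execution would happen inside PythonEnv or SandboxEnv.

```python
from typing import Any, Callable, Sequence, Tuple

# (positional args, expected return value)
TestCase = Tuple[tuple, Any]


def load_function(source: str, name: str) -> Callable[..., Any]:
    """Execute model-written source and return the named function."""
    namespace: dict = {}
    exec(source, namespace)  # NOT sandboxed; for illustration only
    return namespace[name]


def partial_test_reward(source: str, name: str, tests: Sequence[TestCase]) -> float:
    """Return the fraction of tests passed, in [0.0, 1.0]."""
    if not tests:
        return 0.0
    try:
        func = load_function(source, name)
    except Exception:
        return 0.0  # code that fails to load passes nothing
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case counts as a failure
    return passed / len(tests)
```

A solution that passes three of four tests scores 0.75 instead of 0, which gives the model a signal to improve on while it is still partly wrong.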
Multi-Turn Games and Puzzles
Train models on interactive tasks where they must take actions over multiple turns, receiving feedback after each move.
Environment type: Custom MultiTurnEnv subclass
Why RL works here: Games provide dense, structured reward signals. The model learns strategies through repeated play — each rollout is a complete game, and the final score becomes the reward. Multi-turn structure naturally teaches planning and sequential decision-making.
Example environment (word guessing game):
- Games are excellent for validating your setup since they tend to show clear reward improvements within a small number of steps.
- Shape rewards to be gradient-rich — instead of just 0/1 for win/loss, give partial credit (e.g., reward based on number of turns taken to win).
- The built-in alphabet-sort environment is a great starting point; install it with prime env install primeintellect/alphabet-sort.
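As one way to shape a win/loss reward by the number of turns taken, the sketch below scales a win linearly so that faster wins score higher. The function name and the linear schedule are assumptions chosen for illustration, not part of any built-in environment.

```python
def turn_shaped_reward(won: bool, turns_used: int, max_turns: int) -> float:
    """Reward a win more when it takes fewer turns; a loss scores 0.0."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    if not won:
        return 0.0
    # Clamp so out-of-range turn counts cannot push the reward outside (0, 1].
    turns_used = min(max(turns_used, 1), max_turns)
    # Win on turn 1 -> 1.0; win on the last allowed turn -> 1 / max_turns.
    return (max_turns - turns_used + 1) / max_turns
```

With max_turns=6, a first-turn win scores 1.0 and a win on the final turn scores 1/6, so every win still beats every loss while the model is nudged toward shorter games.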
Tool Use and Agentic Tasks
Train models to use tools effectively, calling the right tool with the right arguments to accomplish a goal.
Environment type: ToolEnv or MCPEnv
Why RL works here: Tool use requires the model to reason about which tool to call, compose correct arguments, interpret results, and decide on next steps. RL training lets the model learn this decision-making loop through practice, improving both tool selection and argument construction.
Example environment (research assistant with search):
- Use JudgeRubric with an LLM judge for open-ended tasks where exact matching isn’t feasible.
- Store API keys for external services (judge models, search APIs) in a secrets.env file and reference it with env_file.
- Monitor tool call counts via the automatic metrics; if the model isn’t calling tools, the task may need a clearer prompt.
- MCPEnv is useful when your tools are already implemented as MCP servers.
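To illustrate what a tool call count measures, here is a small sketch that counts calls in a rollout. It assumes rollouts are stored as OpenAI-style chat messages, where assistant messages carry a tool_calls list. That storage format and both function names are assumptions; the automatic metrics mentioned above already report this without extra code.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def count_tool_calls(messages: List[Message]) -> int:
    """Count tool calls issued by the assistant in one rollout."""
    total = 0
    for message in messages:
        if message.get("role") != "assistant":
            continue
        # tool_calls may be absent or None on plain text replies.
        total += len(message.get("tool_calls") or [])
    return total


def no_tool_rate(rollouts: List[List[Message]]) -> float:
    """Fraction of rollouts in which the model never called a tool."""
    if not rollouts:
        return 0.0
    idle = sum(1 for messages in rollouts if count_tool_calls(messages) == 0)
    return idle / len(rollouts)
```

A high no_tool_rate early in training is the situation the tip above describes: the model answers directly, and the prompt may need to state more clearly that tools are available.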
Multi-Environment Training
Train a single model on multiple tasks simultaneously to improve generalization.
Why RL works here: Training on diverse tasks prevents the model from overfitting to a single task’s reward surface. The model learns transferable skills (reasoning, tool use, instruction following) that improve performance across all tasks.
Training config:
- Run baseline evaluations on each environment before training to understand starting performance.
- Use W&B logging to compare per-environment reward curves during training.
Workflow Summary
Regardless of the use case, the typical Hosted Training workflow is:
Build your environment
Create an environment with a dataset, harness, and rubric using the verifiers library.
Monitor and iterate
Watch reward curves on the dashboard, adjust your environment or config, and re-run.
Deploy
Download the trained LoRA adapter or deploy it for inference.
Getting Started
Launch your first Hosted Training run in minutes.
End-to-End Run
Detailed walkthrough of a complete training run.
Environments
Learn how to build custom environments with verifiers.
Advanced Configs
Multi-environment training, evals, and more.