Training Recipes - Prime Intellect Docs

Each example here covers a common RL use case: what kind of environment to build, a minimal working implementation, a training config, and practical tips. Use these as starting points or drop-in templates for your own runs on Lab. If you haven’t launched a training run yet, start with the Getting Started guide first.

Math Reasoning

Train models to solve mathematical problems step by step, using symbolic verification to reward correct answers.

Environment type: SingleTurnEnv with MathRubric

Why RL works here: Models learn to produce correct final answers through trial and error. The reward signal is binary and cheap to compute: symbolic math verification checks whether the model’s \boxed{} answer matches the ground truth, without needing an LLM judge.

Example environment:

import verifiers as vf
from datasets import load_dataset

def load_environment(split: str = "train", num_examples: int = -1) -> vf.Environment:
    ds = load_dataset("openai/gsm8k", split=split)
    dataset = vf.Dataset.from_hf(ds, question_col="question", answer_col="answer")
    if num_examples > 0:
        dataset = dataset.select(range(num_examples))

    rubric = vf.MathRubric()
    return vf.SingleTurnEnv(
        dataset=dataset,
        rubric=rubric,
        system_prompt="Solve the problem step by step. Put your final answer in \\boxed{}.",
    )

Training config:

model = "Qwen/Qwen3-4B-Instruct-2507"
max_steps = 200
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 1024

[[env]]
id = "your-username/gsm8k"

Tips:

Start with GSM8K for validation — baseline models typically score 40–70%, leaving room for improvement.
For harder tasks (AIME, competition math), use a larger model like Qwen/Qwen3-235B-A22B-Thinking-2507 and increase max_tokens.
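
To make the reward signal described above concrete, here is a minimal sketch of a binary check on a \boxed{} answer. It is not how MathRubric works internally: a plain string comparison stands in for symbolic verification, and the helper names (extract_boxed, gsm8k_final_answer, exact_match_reward) are hypothetical. It relies on the fact that each GSM8K answer field ends with a `#### <number>` line.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, allowing nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def gsm8k_final_answer(answer_field: str) -> str:
    """Take the text after the final '####' and drop thousands separators."""
    return answer_field.split("####")[-1].strip().replace(",", "")


def exact_match_reward(completion_text: str, answer_field: str) -> float:
    """Binary reward: 1.0 if the boxed answer equals the reference string, else 0.0."""
    predicted = extract_boxed(completion_text)
    if predicted is None:
        return 0.0
    expected = gsm8k_final_answer(answer_field)
    return 1.0 if predicted.replace(",", "") == expected else 0.0


reference = "48 / 2 = 24 clips in May.\n48 + 24 = 72 clips in total.\n#### 72"
print(exact_match_reward("48 + 24 = 72, so \\boxed{72}", reference))
print(exact_match_reward("I think the answer is \\boxed{70}", reference))
```

A string comparison treats equivalent forms such as 72 and 72.0 as different answers, which is one reason a symbolic verifier is preferable for real runs.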

Code Generation with Sandboxes

Train models to write correct code by executing their solutions in sandboxed environments and verifying outputs against test cases.

Environment type: PythonEnv or SandboxEnv

Why RL works here: The model gets a concrete pass/fail signal from running code. Unlike static checking, execution-based verification catches subtle bugs and rewards solutions that actually work. Multi-turn interaction lets the model iteratively debug when tests fail.

Example environment:

import verifiers as vf
from datasets import Dataset

def load_environment() -> vf.Environment:
    dataset = Dataset.from_list([
        {
            "question": "Write a function `fibonacci(n)` that returns the nth Fibonacci number.",
            "info": '{"test_code": "assert fibonacci(0) == 0\\nassert fibonacci(1) == 1\\nassert fibonacci(10) == 55"}'
        },
        # ... more examples
    ])

    async def tests_pass(completion, info, state) -> float:
        # Reward 1.0 when the execution output stored in state reports passing tests.
        # state.get() never raises on a missing key, so no try/except is needed.
        exec_result = state.get("exec_result", "")
        return 1.0 if "PASSED" in exec_result else 0.0

    rubric = vf.Rubric(funcs=[tests_pass])
    return vf.PythonEnv(
        dataset=dataset,
        rubric=rubric,
        max_turns=5,
    )

Training config:

model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 300
batch_size = 256
rollouts_per_example = 16

[sampling]
max_tokens = 2048

[[env]]
id = "your-username/code-gen"

Tips:

Use PythonEnv for Python-specific tasks — it provides a persistent REPL that the model can use across turns.
Use SandboxEnv for multi-language tasks or when you need shell access.
Set max_turns to 3–5 to let the model iterate on failing test cases.
Consider a partial reward for passing some but not all tests, rather than all-or-nothing scoring.
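
The partial-reward tip can be sketched as follows. This is an illustration under stated assumptions, not part of the verifiers API: fraction_tests_passed is a hypothetical helper, each line of test_code is assumed to be one independent assert, and exec() runs in the current process only to keep the example self-contained. Model-generated code should be executed inside a sandbox.

```python
def fraction_tests_passed(solution_code: str, test_code: str) -> float:
    """Return the fraction of single-line tests that pass against the solution.

    WARNING: exec() is used here for illustration only; run untrusted
    code in a sandbox, never in the training process.
    """
    tests = [line for line in test_code.splitlines() if line.strip()]
    if not tests:
        return 0.0

    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # defines the candidate function(s)
    except Exception:
        return 0.0  # the solution does not even load

    passed = 0
    for test in tests:
        try:
            exec(test, dict(namespace))  # shallow copy keeps tests independent
            passed += 1
        except Exception:
            pass  # a failed assert or runtime error counts as a failed test
    return passed / len(tests)


test_code = "assert fibonacci(0) == 0\nassert fibonacci(1) == 1\nassert fibonacci(10) == 55"

correct = (
    "def fibonacci(n):\n"
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a\n"
)
# Bug: recurses on n - 1 twice, so only the two base cases are right.
buggy = (
    "def fibonacci(n):\n"
    "    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 1)\n"
)

print(fraction_tests_passed(correct, test_code))
print(fraction_tests_passed(buggy, test_code))
```

Here the buggy solution passes two of the three asserts, so it earns partial credit instead of the zero that all-or-nothing scoring would give it.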

Multi-Turn Games and Puzzles

Train models on interactive tasks where they must take actions over multiple turns, receiving feedback after each move.

Environment type: Custom MultiTurnEnv subclass

Why RL works here: Games provide dense, structured reward signals. The model learns strategies through repeated play: each rollout is a complete game, and the final score becomes the reward. Multi-turn structure naturally teaches planning and sequential decision-making.

Example environment (word guessing game):

import verifiers as vf
from datasets import Dataset
import random

class WordGameEnv(vf.MultiTurnEnv):
    async def setup_state(self, state, **kwargs):
        state["target"] = state["info"]["target_word"]
        state["guesses"] = []
        return await super().setup_state(state, **kwargs)

    async def env_response(self, messages, state):
        guess = messages[-1]["content"].strip().lower()
        target = state["target"]
        state["guesses"].append(guess)

        if guess == target:
            state["won"] = True
            return [{"role": "user", "content": "Correct! You found the word."}]

        # Give hints: which letters are in the right position
        hints = []
        for i, (g, t) in enumerate(zip(guess, target)):
            if g == t:
                hints.append(f"Position {i+1}: correct")
            elif g in target:
                hints.append(f"Position {i+1}: wrong position, letter is in the word")
            else:
                hints.append(f"Position {i+1}: letter not in word")

        return [{"role": "user", "content": "\n".join(hints) + "\nGuess again."}]

    @vf.stop
    async def game_won(self, state):
        return state.get("won", False)


def load_environment() -> vf.Environment:
    words = ["apple", "brain", "cloud", "dance", "eagle"]
    dataset = Dataset.from_list([
        {"question": "Guess the 5-letter word. I'll give you hints after each guess.",
         "info": f'{{"target_word": "{w}"}}'} for w in words
    ])

    async def win_reward(state) -> float:
        if state.get("won"):
            return max(0.2, 1.0 - 0.15 * len(state["guesses"]))
        return 0.0

    rubric = vf.Rubric(funcs=[win_reward])
    return WordGameEnv(dataset=dataset, rubric=rubric, max_turns=8)

Training config:

model = "Qwen/Qwen3-4B-Instruct-2507"
max_steps = 100
batch_size = 128
rollouts_per_example = 8

[sampling]
max_tokens = 256

[[env]]
id = "your-username/word-game"

Tips:

Games are excellent for validating your setup since they tend to show clear reward improvements within a small number of steps.
Shape rewards to be gradient-rich — instead of just 0/1 for win/loss, give partial credit (e.g., reward based on number of turns taken to win).
The built-in alphabet-sort environment is a great starting point — install it with prime env install primeintellect/alphabet-sort.
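
To see what the shaped reward in the word game gives the model, the formula from win_reward can be evaluated on its own. The standalone function win_reward_value below is a hypothetical restatement of that formula with the guess count passed in directly.

```python
def win_reward_value(num_guesses: int, won: bool) -> float:
    """Same formula as win_reward above: faster wins score higher, with a 0.2 floor."""
    if not won:
        return 0.0
    return max(0.2, 1.0 - 0.15 * num_guesses)


# Reward for a win at each possible turn count (max_turns=8 in the example).
for n in range(1, 9):
    print(n, round(win_reward_value(n, won=True), 2))
```

The reward falls from 0.85 for a first-guess win to the 0.2 floor from six guesses onward, while any loss scores 0.0. Even a slow win is therefore clearly better than a loss, and faster wins are better still.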

Tool Use and Agentic Tasks

Train models to use tools effectively, calling the right tool with the right arguments to accomplish a goal.

Environment type: ToolEnv or MCPEnv

Why RL works here: Tool use requires the model to reason about which tool to call, compose correct arguments, interpret results, and decide on next steps. RL training lets the model learn this decision-making loop through practice, improving both tool selection and argument construction.

Example environment (research assistant with search):

import verifiers as vf
from datasets import Dataset

async def web_search(query: str) -> str:
    """Search the web for information.

    Args:
        query: The search query to look up.

    Returns:
        Search results as text.
    """
    # your search implementation
    return await do_search(query)

async def calculate(expression: str) -> str:
    """Evaluate a mathematical expression.

    Args:
        expression: A math expression to evaluate (e.g. "2 + 2 * 3").

    Returns:
        The result of the evaluation.
    """
    try:
        return str(eval(expression))
    except Exception as e:
        return f"Error: {e}"

def load_environment() -> vf.Environment:
    dataset = Dataset.from_list([
        {
            "question": "What is the population of France divided by the population of Switzerland?",
            "answer": "approximately 8.3"
        },
        # ... more examples requiring tool use
    ])

    async def answer_quality(completion, answer, judge) -> float:
        verdict = await judge(completion, answer)
        return 1.0 if "correct" in verdict.lower() else 0.0

    rubric = vf.JudgeRubric(judge_model="gpt-4.1-mini")
    rubric.add_reward_func(answer_quality)

    return vf.ToolEnv(
        dataset=dataset,
        tools=[web_search, calculate],
        rubric=rubric,
        max_turns=10,
    )

Training config:

model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 200
batch_size = 256
rollouts_per_example = 16

[sampling]
max_tokens = 1024

[[env]]
id = "your-username/research-assistant"

env_file = ["secrets.env"]

Tips:

Use JudgeRubric with an LLM judge for open-ended tasks where exact matching isn’t feasible.
Store API keys for external services (judge models, search APIs) in a secrets.env file and reference it with env_file.
Monitor tool call counts via the automatic metrics — if the model isn’t calling tools, the task may need a clearer prompt.
MCPEnv is useful when your tools are already implemented as MCP servers.
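
The calculate tool in the example passes model-written text to eval, which will run arbitrary Python. One possible alternative, sketched below, parses the expression with the standard ast module and permits only numeric literals and basic arithmetic. The name safe_calculate is hypothetical, and the function is synchronous for simplicity; it could be declared async like the tools above.

```python
import ast
import operator

# Whitelisted operators; exponentiation is left out to avoid huge computations.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a parsed expression, rejecting anything not whitelisted."""
    if isinstance(node, ast.Constant):
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")


def safe_calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression without calling eval()."""
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_eval_node(tree.body))
    except Exception as e:
        return f"Error: {e}"


print(safe_calculate("2 + 2 * 3"))
print(safe_calculate("__import__('os').getcwd()"))
```

Function calls, attribute access, and names are all rejected because they are not in the whitelist, so the second call returns an error string instead of executing.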

Multi-Environment Training

Train a single model on multiple tasks simultaneously to improve generalization.

Why RL works here: Training on diverse tasks prevents the model from overfitting to a single task’s reward surface. The model learns transferable skills (reasoning, tool use, instruction following) that improve performance across all tasks.

Training config:

model = "Qwen/Qwen3-235B-A22B-Instruct-2507"
max_steps = 500
batch_size = 512
rollouts_per_example = 16

[sampling]
max_tokens = 2048

[[env]]
id = "primeintellect/gsm8k"
args = { split = "train" }

[[env]]
id = "your-username/code-gen"

[[env]]
id = "primeintellect/alphabet-sort"
args = { min_turns = 3, max_turns = 5 }

[wandb]
project = "multi-env-training"
name = "235b-multi-task"

[eval]
interval = 100

Tips:

Run baseline evaluations on each environment before training to understand starting performance.
Use W&B logging to compare per-environment reward curves during training.

Workflow Summary

Regardless of the use case, the typical Hosted Training workflow is:

1. Build your environment: Create an environment with a dataset, harness, and rubric using the verifiers library.
2. Evaluate baseline: Run prime eval run against your environment to measure where the model starts.
3. Configure and launch training: Write a .toml config and launch with prime train run.
4. Monitor and iterate: Watch reward curves on the dashboard, adjust your environment or config, and re-run.
5. Deploy: Download the trained LoRA adapter or deploy it for inference.

Getting Started

Launch your first Hosted Training run in minutes.

End-to-End Run

Detailed walkthrough of a complete training run.

Environments

Learn how to build custom environments with verifiers.

Advanced Configs

Multi-environment training, evals, and more.
