Overview

Benchmarks are not only useful for evaluation. They can also improve an agent over time: by measuring how the agent performs on a repeatable set of tasks, you can feed those results into a broader learning workflow. On Runloop, benchmarks and scenarios are especially useful when you want to make an agent better at a specific class of work, such as fixing tests, completing coding tasks, or following internal development workflows. These workflows take several forms: some teams use reinforcement learning, while others use benchmarks to support data generation, model selection, prompt iteration, curriculum design, or other agent improvement strategies.

Reinforcement Learning Is One Common Pattern

At a high level, reinforcement learning workflows usually include four stages:
  1. Policy inference: A policy or model generates actions for a task.
  2. Rollouts: The agent is run on tasks so you can observe its behavior and outputs.
  3. Reward computation: Each rollout is scored to determine how well the agent performed.
  4. Policy optimization: Those rewards are used to update the policy so future behavior improves.
Runloop helps most directly with the rollout and reward computation parts of this loop. You can run agents on benchmark scenarios, capture the results, and collect scores or rewards for each run. The policy optimization step typically happens in your own training infrastructure. Other learning strategies can use the same benchmark runs and scores differently, even when they are not doing RL policy optimization.
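As a sketch, the rollout and reward-computation stages of this loop can be modeled as follows. The `run_rollout` and `compute_reward` functions, the toy policy, and the expected outputs are illustrative stand-ins for this example, not Runloop APIs; in practice the rollout would execute an agent in a real environment and the reward would come from a scorer.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rollout:
    scenario_id: str
    output: str

def run_rollout(scenario_id: str, policy: Callable[[str], str]) -> Rollout:
    # Stages 1-2: the policy generates an output for the task, and the
    # rollout captures what the agent actually produced.
    return Rollout(scenario_id=scenario_id, output=policy(scenario_id))

def compute_reward(rollout: Rollout, expected: dict[str, str]) -> float:
    # Stage 3: score the rollout against the scenario's success criteria.
    return 1.0 if rollout.output == expected[rollout.scenario_id] else 0.0

# A toy policy and expected outputs stand in for a real model and scorer.
expected = {"fix-test-1": "PASS", "fix-test-2": "PASS"}
policy = lambda scenario_id: "PASS" if scenario_id == "fix-test-1" else "FAIL"

rewards = [
    compute_reward(run_rollout(sid, policy), expected) for sid in expected
]
# Stage 4 (policy optimization) would consume `rewards` in your own
# training infrastructure; it is intentionally outside this loop.
```

The key design point is the separation of concerns: rollouts and rewards are produced on the benchmark side, while the optimizer that consumes them lives elsewhere.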

How Runloop Fits In

Runloop gives you the infrastructure to execute agents against repeatable tasks in realistic environments:
  • Use scenarios to define the task, environment, and success criteria.
  • Use benchmarks to group scenarios into a reusable training or evaluation suite.
  • Use scorers to convert task outcomes into rewards or quality signals.
This makes it possible to use the same benchmark assets for both evaluation and improvement. You can establish a baseline, run repeated rollouts, inspect results, and measure whether a learning or training method is improving the behaviors you care about.
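To illustrate how these pieces relate, here is a minimal local model of a benchmark grouping scenarios, each with its own scorer, used to establish a baseline. The `Scenario` and `Benchmark` classes, the scorer lambdas, and the agent stub are hypothetical and are not the Runloop SDK.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable

@dataclass
class Scenario:
    name: str
    scorer: Callable[[str], float]  # maps an agent output to a score in [0, 1]

@dataclass
class Benchmark:
    name: str
    scenarios: list[Scenario] = field(default_factory=list)

def run_benchmark(benchmark: Benchmark, agent: Callable[[str], str]) -> dict[str, float]:
    # One rollout per scenario; each scorer converts the outcome into a signal.
    return {s.name: s.scorer(agent(s.name)) for s in benchmark.scenarios}

suite = Benchmark(
    name="internal-dev-suite",
    scenarios=[
        Scenario("lint-fix", scorer=lambda out: 1.0 if "clean" in out else 0.0),
        Scenario("test-fix", scorer=lambda out: 1.0 if "passed" in out else 0.0),
    ],
)

# A stub agent stands in for a real agent run; repeat after each training
# iteration and compare against the baseline to measure improvement.
baseline_agent = lambda task: "lint clean" if task == "lint-fix" else "tests failed"
scores = run_benchmark(suite, baseline_agent)
baseline = mean(scores.values())
```

Because the same suite is reusable, the identical rollout-and-score pass serves both evaluation (tracking the baseline) and improvement (generating reward signals).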

Improving Agents with Existing or Custom Scenarios

There are two common ways to use benchmarks for agent development:
  1. Start with existing scenarios when you want to improve performance on known public tasks or standard workflows.
  2. Create custom scenarios when you want to teach an agent to perform better on your own codebase, tools, or task patterns.
In both cases, the basic pattern is the same:
  1. Define the tasks you care about.
  2. Run the agent against those tasks.
  3. Score the outputs using your benchmark scorers.
  4. Use those scores to gauge agent performance and feed a broader learning workflow, whether that is reinforcement learning or another improvement strategy.
If you need to target a specialized behavior, you can create custom scenarios and custom scorers so the reward signal reflects the outcomes that matter for your use case.
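For example, a custom scorer might blend several quality signals into a single reward. The field names and weights below are illustrative assumptions, not a prescribed scoring scheme.

```python
def custom_scorer(result: dict) -> float:
    # Weight test pass rate most heavily, and apply a small penalty for
    # large diffs so the agent is rewarded for minimal, targeted changes.
    pass_rate = result["tests_passed"] / max(result["tests_total"], 1)
    size_penalty = min(result["lines_changed"] / 500, 1.0) * 0.2
    return max(pass_rate - size_penalty, 0.0)

# 9/10 tests pass (0.9) with a 250-line diff (penalty 0.1), so reward = 0.8.
result = {"tests_passed": 9, "tests_total": 10, "lines_changed": 250}
reward = custom_scorer(result)
```

Shaping the reward this way lets the learning workflow optimize for the trade-offs you actually care about, not just raw task completion.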

Need Help Designing a Training Workflow?

Training workflows can vary a lot depending on the agent architecture, optimization method, and data pipeline you are using. If you want help designing a benchmark-driven training loop for your team, contact sales@runloop.ai.