Overview
Benchmarks are not only useful for evaluation. They can also be used to improve an agent over time by measuring how it performs on a repeatable set of tasks and feeding those results into a broader learning workflow. On Runloop, benchmarks and scenarios are especially useful when you want to make an agent better at a specific class of work, such as fixing tests, completing coding tasks, or following internal development workflows. These workflows can take several forms. Some teams use reinforcement learning, while others use benchmarks to support data generation, model selection, prompt iteration, curriculum design, or other agent improvement strategies.
Reinforcement Learning Is One Common Pattern
At a high level, reinforcement learning workflows usually include four stages:
- Policy inference: A policy or model generates actions for a task.
- Rollouts: The agent is run on tasks so you can observe its behavior and outputs.
- Reward computation: Each rollout is scored to determine how well the agent performed.
- Policy optimization: Those rewards are used to update the policy so future behavior improves.
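The four stages above can be sketched as a toy loop. This is a minimal illustration, not a Runloop API or a production training setup: the action names, the `TRUE_REWARD` table standing in for a scorer, and the `train` function are all hypothetical, and the update rule is a simple gradient-bandit (softmax preference) update chosen only because it fits in a few lines.

```python
import math
import random

# Hypothetical toy setup: three candidate actions for a single task.
# TRUE_REWARD stands in for a scorer; it is an assumption, not a Runloop API.
TRUE_REWARD = {"patch_a": 0.1, "patch_b": 0.9, "patch_c": 0.3}
ACTIONS = list(TRUE_REWARD)


def softmax(prefs):
    """Convert preference scores into a probability distribution."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    """Run the four-stage loop and return the final action probabilities."""
    rng = random.Random(seed)
    prefs = [0.0] * len(ACTIONS)  # policy parameters, one per action
    baseline = 0.0                # running average of observed rewards
    for t in range(1, steps + 1):
        # 1. Policy inference: sample an action from the current policy.
        probs = softmax(prefs)
        idx = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        # 2. Rollout: here just a lookup; a real rollout would execute
        #    the agent in an environment and record what it did.
        outcome = ACTIONS[idx]
        # 3. Reward computation: score the rollout.
        reward = TRUE_REWARD[outcome]
        # 4. Policy optimization: shift preferences toward actions that
        #    scored above the running average.
        baseline += (reward - baseline) / t
        advantage = reward - baseline
        for i in range(len(prefs)):
            indicator = 1.0 if i == idx else 0.0
            prefs[i] += lr * advantage * (indicator - probs[i])
    return dict(zip(ACTIONS, softmax(prefs)))


if __name__ == "__main__":
    for action, prob in train().items():
        print(f"{action}: {prob:.3f}")
```

In a real workflow the rollout and reward steps are where task execution and scoring would plug in; the policy update would typically be handled by a separate training system.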
How Runloop Fits In
Runloop gives you the infrastructure to execute agents against repeatable tasks in realistic environments:
- Use scenarios to define the task, environment, and success criteria.
- Use benchmarks to group scenarios into a reusable training or evaluation suite.
- Use scorers to convert task outcomes into rewards or quality signals.
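To make the relationship between these three concepts concrete, here is a small data-model sketch. The `Scenario`, `Benchmark`, `pass_rate_scorer`, and `collect_rewards` names, and the `tests_passed` / `tests_total` outcome fields, are hypothetical and chosen for illustration; they are not Runloop SDK types, and the real objects may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names below are hypothetical illustrations, not Runloop SDK types.
Outcome = Dict[str, int]


@dataclass
class Scenario:
    name: str
    task: str                           # what the agent is asked to do
    scorer: Callable[[Outcome], float]  # maps an outcome to a reward in [0, 1]


@dataclass
class Benchmark:
    name: str
    scenarios: List[Scenario] = field(default_factory=list)


def pass_rate_scorer(outcome: Outcome) -> float:
    """Reward is the fraction of tests passing; 0.0 when no tests ran."""
    total = outcome.get("tests_total", 0)
    if total <= 0:
        return 0.0
    return outcome.get("tests_passed", 0) / total


def collect_rewards(benchmark: Benchmark, outcomes: Dict[str, Outcome]) -> Dict[str, float]:
    """Apply each scenario's scorer to its recorded outcome.

    Scenarios with no recorded outcome are scored against an empty outcome.
    """
    return {
        s.name: s.scorer(outcomes.get(s.name, {}))
        for s in benchmark.scenarios
    }


if __name__ == "__main__":
    bench = Benchmark(
        name="fix-tests",
        scenarios=[
            Scenario("flaky-parser", "Fix the failing parser test", pass_rate_scorer),
            Scenario("date-bug", "Fix the off-by-one date bug", pass_rate_scorer),
        ],
    )
    outcomes = {
        "flaky-parser": {"tests_passed": 3, "tests_total": 4},
        "date-bug": {"tests_passed": 2, "tests_total": 2},
    }
    print(collect_rewards(bench, outcomes))
```

A graded scorer like this one gives partial credit, which can be a more informative signal than a strict pass or fail when an agent is still far from solving a task.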
Improving Agents with Existing or Custom Scenarios
There are two common ways to use benchmarks for agent development:
- Start with existing scenarios when you want to improve performance on known public tasks or standard workflows.
- Create custom scenarios when you want to teach an agent to perform better on your own codebase, tools, or task patterns.
In either case, the workflow follows the same basic steps:
- Define the tasks you care about.
- Run the agent against those tasks.
- Score the outputs using your benchmark scorers.
- Use those scores to gauge agent performance and feed a broader learning workflow, whether that is reinforcement learning or another improvement strategy.
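As a non-reinforcement-learning example of this loop, the sketch below compares two candidate agents on the same task set and picks the one with the higher mean score, which is roughly what model selection or prompt iteration looks like. Everything here is a stand-in: the arithmetic tasks, the `agent_v1` / `agent_v2` functions, and the exact-match `score` function are hypothetical and do not call any Runloop API.

```python
from statistics import mean

# Step 1: define the tasks you care about (hypothetical toy tasks).
TASKS = [
    {"id": "add", "input": (2, 3), "expected": 5},
    {"id": "add-negative", "input": (-4, 1), "expected": -3},
    {"id": "add-zero", "input": (0, 0), "expected": 0},
]


def agent_v1(a, b):
    """Stand-in for a weaker agent: mishandles negative inputs."""
    return abs(a) + abs(b)


def agent_v2(a, b):
    """Stand-in for an improved agent."""
    return a + b


def score(task, output):
    """Exact-match scorer: 1.0 on success, 0.0 otherwise."""
    return 1.0 if output == task["expected"] else 0.0


def evaluate(agent):
    """Steps 2 and 3: run the agent on every task and score each output."""
    return {task["id"]: score(task, agent(*task["input"])) for task in TASKS}


def summarize(candidates):
    """Step 4: reduce per-task scores to one mean score per candidate."""
    return {name: mean(evaluate(agent).values()) for name, agent in candidates.items()}


if __name__ == "__main__":
    summary = summarize({"v1": agent_v1, "v2": agent_v2})
    best = max(summary, key=summary.get)
    print(summary, "best:", best)
```

Keeping the per-task scores from `evaluate` around, rather than only the mean, makes it easier to see which tasks a candidate regressed on.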
Where to Go Next
- Custom Benchmarks for building reusable benchmark suites.
- Creating Scenarios for defining tasks and environments.
- Custom Scorers for designing reward signals and success criteria.
