Benchmarks are frequently cited when comparing model performance, but evaluation is only one of the things you can do with them. Runloop provides a suite of tools to help you evaluate and improve your agent’s performance, detect regressions, and fix common problems. The central challenge in working with benchmarks is one of scale: running a single benchmark on one machine can take weeks, if not months. Runloop lets you run benchmarks at scale in a secure environment.
Documentation Index
Fetch the complete documentation index at: https://docs.runloop.ai/llms.txt
Use this file to discover all available pages before exploring further.
Orchestrated vs Interactive Benchmarks
Runloop supports two ways to run benchmarks:
Orchestrated Benchmarks
Recommended for most users. Submit a benchmark job via the CLI and let
Runloop handle everything: provisioning devboxes, running agents, scoring
results, and aggregating outputs. Compare multiple agents side-by-side with
a single command.
Best for:
- Running a full benchmark suite
- Comparing multiple agents
- Reinforcement learning
- CI/CD integration
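As a concrete illustration, here is a minimal Python sketch of the "fire and forget" orchestrated flow. The client package (runloop_api_client) is real, but this overview does not document the exact CLI flags or SDK surface, so the method and field names below are assumptions rather than the confirmed API.

```python
import os

from runloop_api_client import Runloop  # real package; the benchmark methods below are assumptions

# Assumption: the API key is read from the environment and passed as bearer_token.
client = Runloop(bearer_token=os.environ["RUNLOOP_API_KEY"])

# Assumption: starting a run by benchmark ID hands everything to Runloop, which
# provisions a devbox per scenario, executes the agent, scores each scenario,
# and aggregates the results into a single run record.
run = client.benchmarks.start_run(benchmark_id="bmk_YOUR_BENCHMARK_ID")

# The run executes asynchronously; track it by ID in the dashboard or poll the API.
print("benchmark run started:", run.id)
```

In day-to-day use this same flow is typically triggered from the CLI or a CI/CD pipeline rather than an ad hoc script.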
Interactive Benchmarks
For users who need fine-grained control. Use the SDK to drive benchmark
execution step-by-step, with full access to the devbox at any point during
the run.
Best for:
- Debugging agent behavior
- Customizing execution logic
- Benchmark development
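For comparison, here is a minimal interactive sketch in Python. It assumes SDK calls for starting a single scenario run, executing commands on its devbox, and then scoring and completing the run; those method names are assumptions and may differ from the actual SDK.

```python
import os

from runloop_api_client import Runloop  # real package; the scenario-run and scoring calls are assumptions

client = Runloop(bearer_token=os.environ["RUNLOOP_API_KEY"])

# Assumption: starting a scenario run provisions a devbox preloaded with the
# scenario's environment and returns a handle that exposes the devbox ID.
scenario_run = client.scenarios.start_run(scenario_id="scn_YOUR_SCENARIO_ID")
devbox_id = scenario_run.devbox_id

# Because you hold the devbox, you can drive it directly at any point: run your
# agent, then inspect the filesystem or rerun commands while debugging.
result = client.devboxes.execute_sync(devbox_id, command="python run_my_agent.py")
print(result.stdout)

# Assumption: in interactive mode, scoring and completion are explicit steps you
# trigger yourself; these method names are guesses at the SDK surface.
scored_run = client.scenarios.runs.score_and_complete(scenario_run.id)
print("scenario run completed:", scored_run.id)
```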
When to Use Each Mode
| Use Case | Recommended Mode |
|---|---|
| Running a full benchmark suite | Orchestrated |
| Comparing multiple agents | Orchestrated |
| Reinforcement learning | Orchestrated |
| CI/CD integration | Orchestrated |
| Iterative development | Orchestrated |
| Debugging agent behavior | Interactive |
| Custom execution logic | Interactive |
Main Features
Runloop enables you to customize every aspect of benchmark creation and execution:
- Orchestrated Benchmarks: Run benchmarks at cloud scale using your agent or a public agent with a single CLI command. Runloop handles provisioning, execution, scoring, and teardown automatically.
- Public Benchmarks: Run your agent against well-known open-source benchmarks like Terminal-Bench 2, AIME, and more.
- Custom Benchmarks: Craft your own scenarios and benchmarks to train or evaluate your agent on a private codebase or dataset.
- Custom Scorers: Create custom scorers to evaluate agents across multiple dimensions, such as security, cost, performance, and compliance.
- Training Using Benchmarks: Learn how benchmark runs and scores can support reinforcement learning workflows and targeted agent improvement.
- Reports & Insights: Identify problems and visualize your agent’s performance changes in the Runloop dashboard.
Key Concepts
Whether you’re running orchestrated or interactive benchmarks, you’ll work with the following key concepts:
- Scenario: A single, self-contained test case or task in which an agent is given a problem and is expected to modify a target environment to solve it.
- Benchmark: A set of Scenarios that can be run together to produce an overall performance score. Benchmarks can be made up of any number and combination of Scenarios — even Scenarios from other Benchmarks.
- Scoring Function / Scorer: A script or function invoked to grade an agent’s performance on a Scenario, producing a score from 0.0 to 1.0.
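To illustrate the scorer contract, here is a minimal, generic Python scorer that grades a scenario by the fraction of passing tests. How a scorer is wired into Runloop is not covered in this overview, so the "print the score" convention is an assumption; only the 0.0 to 1.0 output range is taken from the definition above.

```python
# Illustrative scorer only: registration with Runloop and the result-reporting
# mechanism are assumptions; the 0.0-1.0 range is the documented contract.
# This script grades a scenario by the fraction of passing pytest tests.
import re
import subprocess
import sys

proc = subprocess.run(["pytest", "-q", "--tb=no"], capture_output=True, text=True)

# Best-effort parse of the pytest summary line, e.g. "3 passed, 1 failed in 0.12s".
passed = failed = 0
match = re.search(r"(\d+) passed", proc.stdout)
if match:
    passed = int(match.group(1))
match = re.search(r"(\d+) failed", proc.stdout)
if match:
    failed = int(match.group(1))

total = passed + failed
score = passed / total if total else 0.0

print(f"{score:.2f}")  # emit the scenario grade in the 0.0-1.0 range
sys.exit(0)
```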
Getting Started
- Orchestrated (Recommended)
- Interactive
