Benchmarks are frequently cited when comparing model performance, but evaluation is not all you can do with benchmarks. Runloop provides a suite of tools to help you evaluate and improve your agent’s performance, detect regressions, and fix common problems. The central challenge in working with benchmarks is one of scale: running a single benchmark on one machine can take weeks, if not months. Runloop allows you to run benchmarks at scale in a secure environment.

Orchestrated vs Interactive Benchmarks

Runloop supports two ways to run benchmarks:

Orchestrated Benchmarks

Recommended for most users. Submit a benchmark job via the CLI and let Runloop handle everything: provisioning devboxes, running agents, scoring results, and aggregating outputs. Compare multiple agents side-by-side with a single command.

Best for:
  • Running a full benchmark suite
  • Comparing multiple agents
  • Reinforcement learning
  • CI/CD integration

Interactive Benchmarks

For users who need fine-grained control. Use the SDK to drive benchmark execution step-by-step, with full access to the devbox at any point during the run.

Best for:
  • Debugging agent behavior
  • Customizing execution logic
  • Benchmark development

When to Use Each Mode

Use Case                       | Recommended Mode
Running a full benchmark suite | Orchestrated
Comparing multiple agents      | Orchestrated
Reinforcement learning         | Orchestrated
CI/CD integration              | Orchestrated
Iterative development          | Orchestrated
Debugging agent behavior       | Interactive
Custom execution logic         | Interactive

Main Features

Runloop enables you to customize every aspect of benchmark creation and execution:
  • Orchestrated Benchmarks: Run benchmarks at cloud scale using your agent or a public agent with a single CLI command. Runloop handles provisioning, execution, scoring, and teardown automatically.
  • Public Benchmarks: Run your agent against well-known open-source benchmarks such as terminal bench 2, AIME, and more.
  • Custom Benchmarks: Craft your own scenarios and benchmarks to train or evaluate your agent on a private codebase or dataset.
  • Custom Scorers: Create custom scorers to evaluate agents across multiple dimensions, such as security, cost, performance, and compliance.
  • Training Using Benchmarks: Learn how benchmark runs and scores can support reinforcement learning workflows and targeted agent improvement.
  • Reports & Insights: Identify problems and visualize your agent’s performance changes in the Runloop dashboard.

Key Concepts

Whether you’re running orchestrated or interactive benchmarks, you’ll work with the following key concepts:
  • Scenario: A scenario is a single, self-contained test case or task where an agent is given a problem and is expected to modify a target environment to solve it.
  • Benchmark: A set of Scenarios that can be run together to produce an overall performance score. Benchmarks can be made up of any number and combination of Scenarios — even Scenarios from other Benchmarks.
  • Scoring Function / Scorer: A script or function that is invoked to grade the performance of a Scenario from 0.0 to 1.0.
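
To make the relationship between these three concepts concrete, here is a minimal Python sketch. It is an illustration only and does not use the Runloop SDK: the `Scenario` and `Benchmark` classes, the `file_contains` helper, and the dictionary standing in for a target environment are all hypothetical names. It also assumes the overall benchmark score is the unweighted mean of the scenario scores, which the text above does not specify.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical types for illustration only; these are not Runloop SDK classes.
Environment = Dict[str, str]              # stand-in for a target environment (path -> contents)
Scorer = Callable[[Environment], float]   # grades the final state of an environment


@dataclass
class Scenario:
    """A single self-contained task plus the scorer that grades it."""
    name: str
    problem: str
    scorer: Scorer

    def score(self, env: Environment) -> float:
        # Clamp so a misbehaving scorer cannot leave the 0.0 to 1.0 range.
        return max(0.0, min(1.0, self.scorer(env)))


@dataclass
class Benchmark:
    """A set of scenarios that are run and scored together."""
    name: str
    scenarios: List[Scenario] = field(default_factory=list)

    def overall_score(self, results: Dict[str, Environment]) -> float:
        # Assumption: the overall score is the unweighted mean of scenario scores.
        return mean(s.score(results[s.name]) for s in self.scenarios)


def file_contains(path: str, needle: str) -> Scorer:
    """Build a scorer that returns 1.0 if `path` contains `needle`, else 0.0."""
    def scorer(env: Environment) -> float:
        return 1.0 if needle in env.get(path, "") else 0.0
    return scorer


fix_bug = Scenario("fix-bug", "Make add() return a + b",
                   file_contains("calc.py", "return a + b"))
add_docs = Scenario("add-docs", "Document add() in the README",
                    file_contains("README.md", "add("))

# The same scenario can belong to more than one benchmark.
smoke = Benchmark("smoke", [fix_bug])
full = Benchmark("full", [fix_bug, add_docs])

# Final environment state per scenario after a (simulated) agent run.
results = {
    "fix-bug": {"calc.py": "def add(a, b):\n    return a + b\n"},
    "add-docs": {"README.md": "No docs yet."},
}

print(smoke.overall_score(results))  # 1.0
print(full.overall_score(results))   # 0.5
```

In this sketch the agent solved `fix-bug` but not `add-docs`, so the one-scenario benchmark scores 1.0 while the two-scenario benchmark scores 0.5. A real scorer would typically inspect the devbox (for example by running tests) rather than a dictionary.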

Getting Started