Benchmarks are most often cited when comparing model performance, but evaluation is only one of the things you can do with them. Runloop provides a suite of tools to help you evaluate and improve your agent’s performance, detect regressions, and fix common problems. The central challenge in working with benchmarks is scale: running a single benchmark on one machine can take weeks, if not months. Runloop lets you run benchmarks at scale in a secure environment.

Orchestrated vs Interactive Benchmarks

Runloop supports two ways to run benchmarks:

Orchestrated Benchmarks

Recommended for most users. Submit a benchmark job via the CLI and let Runloop handle everything: provisioning devboxes, running agents, scoring results, and aggregating outputs. Compare multiple agents side-by-side with a single command.

Best for:
  • Running a full benchmark suite
  • Comparing multiple agents
  • Reinforcement learning
  • CI/CD integration
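
As a rough illustration of the orchestrated flow, the sketch below submits a benchmark run programmatically with the Python SDK instead of the CLI. The benchmarks.start_run method and its parameters are assumptions about the SDK surface made for illustration only; check the SDK reference for the exact call.

```python
# Sketch: submit an orchestrated benchmark run with the Python SDK.
# NOTE: benchmarks.start_run and its fields are assumed here for illustration;
# consult the SDK reference for the authoritative method names.
import os

from runloop_api_client import Runloop

client = Runloop(bearer_token=os.environ["RUNLOOP_API_KEY"])

# Kick off a run of an existing benchmark. Runloop provisions a devbox per
# scenario, executes the agent, scores each result, and aggregates the output.
run = client.benchmarks.start_run(
    benchmark_id="bmd_example",        # hypothetical benchmark ID
    run_name="my-agent-vs-baseline",   # hypothetical run label
)

print(run.id, run.state)
```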

Interactive Benchmarks

For users who need fine-grained control. Use the SDK to drive benchmark execution step-by-step, with full access to the devbox at any point during the run.

Best for:
  • Debugging agent behavior
  • Customizing execution logic
  • Benchmark development

When to Use Each Mode

Use Case                        Recommended Mode
------------------------------  ----------------
Running a full benchmark suite  Orchestrated
Comparing multiple agents       Orchestrated
Reinforcement learning          Orchestrated
CI/CD integration               Orchestrated
Iterative development           Orchestrated
Debugging agent behavior        Interactive
Custom execution logic          Interactive

Main Features

Runloop enables you to customize every aspect of benchmark creation and execution:
  • Orchestrated Benchmarks: Run benchmarks at cloud scale using your agent or a public agent with a single CLI command. Runloop handles provisioning, execution, scoring, and teardown automatically.
  • Public Benchmarks: Run your agent against well-known open-source benchmarks such as Terminal-Bench 2, AIME, and more.
  • Custom Benchmarks: Craft your own scenarios and benchmarks to train or evaluate your agent on a private codebase or dataset.
  • Custom Scorers: Create custom scorers to evaluate agents across multiple dimensions, such as security, cost, performance, and compliance.
  • Training Using Benchmarks: Learn how benchmark runs and scores can support reinforcement learning workflows and targeted agent improvement.
  • Reports & Insights: Identify problems and visualize your agent’s performance changes in the Runloop dashboard.

Key Concepts

Whether you’re running orchestrated or interactive benchmarks, you’ll work with the following key concepts:
  • Scenario: A single, self-contained test case or task in which an agent is given a problem and is expected to modify a target environment to solve it.
  • Benchmark: A set of Scenarios that can be run together to produce an overall performance score. Benchmarks can be made up of any number and combination of Scenarios — even Scenarios from other Benchmarks.
  • Scoring Function / Scorer: A script or function that is invoked to grade an agent’s performance on a Scenario, producing a score from 0.0 to 1.0.
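
To make the Scorer idea concrete, here is a small, illustrative scoring function in Python. It is a conceptual sketch rather than Runloop's scorer contract: it grades a scenario run by the fraction of acceptance tests that pass and clamps the result to the expected 0.0 to 1.0 range.

```python
# Illustrative scorer: grade a scenario run by the fraction of passing tests.
# This is a conceptual sketch, not Runloop's actual scorer interface.
def score_scenario(passed_tests: int, total_tests: int) -> float:
    """Return a score between 0.0 and 1.0 for a single scenario run."""
    if total_tests <= 0:
        return 0.0
    score = passed_tests / total_tests
    # Clamp defensively so the scorer always stays within the expected range.
    return max(0.0, min(1.0, score))


# Example: 7 of 10 acceptance tests pass, giving a score of 0.7.
print(score_scenario(7, 10))
```

A scorer can weigh multiple dimensions (for example correctness, cost, or security checks), as long as it ultimately reduces them to a single value between 0.0 and 1.0.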

Getting Started