Overview of Benchmarks & Scenarios on Runloop

Benchmarks are frequently cited when comparing model performance, but evaluation is not their only use. Runloop provides a suite of tools to help you evaluate and improve your agent’s performance, detect regressions, and fix common problems. The central challenge in working with benchmarks is scale: running a single benchmark on one machine can take weeks, if not months. Runloop lets you run benchmarks at scale in a secure environment with a few lines of code.

Main Features

Runloop enables you to customize every aspect of benchmark creation and execution, including:

Run Public Benchmarks: Run your agent against a matrix of well-known, open-source benchmarks, such as SWE-bench.
Custom Benchmarks: Craft your own scenarios and benchmarks to train or evaluate your agent on a private codebase or dataset.
Custom Scorers: Create custom scorers to evaluate agents across multiple dimensions, such as security, cost, performance, and compliance.
Reports & Insights: Identify problems and visualize your agent’s performance changes in the Runloop dashboard.

Key Concepts

Whether you’re using public or custom benchmarks, keep the following key concepts in mind:

Scenario: A single, self-contained test case or task in which an agent is given a problem and is expected to modify a target environment to solve it.
Benchmark: A set of Scenarios that can be run together to produce an overall performance score. Benchmarks can be made up of any number and combination of Scenarios, including Scenarios from other Benchmarks.
Scoring Function / Scorer: A script or function that is invoked to grade an agent’s performance on a Scenario, on a scale from 0.0 to 1.0.

Next, learn how to run public benchmarks.
