Benchmarks are frequently cited when comparing model performance, but evaluation is only part of what benchmarks can do for you. Runloop provides a suite of tools to help you evaluate and improve your agent’s performance, detect regressions, and fix common problems. The central challenge in working with benchmarks is one of scale: running a single benchmark on one machine can take weeks, if not months. Runloop lets you run benchmarks at scale in a secure environment with a few lines of code.

Main Features

Runloop enables you to customize every aspect of benchmark creation and execution, including:
  • Run Public Benchmarks: Easily run your agent against a matrix of well-known, open-source benchmarks such as SWE-bench (see the SDK sketch after this list).
  • Custom Benchmarks: Craft your own scenarios and benchmarks to train or evaluate your agent on a private codebase or dataset.
  • Custom Scorers: Create custom scorers to evaluate agents across multiple dimensions, such as security, cost, performance, and compliance.
  • Reports & Insights: Identify problems and visualize your agent’s performance changes in the Runloop dashboard.
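
For example, here is a minimal sketch of what launching a public benchmark run could look like with the Runloop Python SDK. The resource and method names (`benchmarks.list`, `benchmarks.start_run`), their parameters, and the attribute names on the returned objects are assumptions for illustration, not confirmed API; consult the SDK reference for the exact surface.

```python
# Hypothetical sketch: kick off a public benchmark run with the Runloop Python SDK.
# The method names, parameters, and response attributes below are assumptions
# for illustration; check the SDK reference for the real API surface.
import os

from runloop_api_client import Runloop

# Constructor argument and environment variable name are assumed defaults.
client = Runloop(bearer_token=os.environ["RUNLOOP_API_KEY"])

# Find a public benchmark definition (e.g. SWE-bench) among the available benchmarks.
swe_bench = next(b for b in client.benchmarks.list() if "SWE-bench" in b.name)

# Start a run of that benchmark against your agent.
run = client.benchmarks.start_run(
    benchmark_id=swe_bench.id,
    run_name="my-agent-v2-swe-bench",
)
print(f"Started benchmark run {run.id}")
```

Because each Scenario in a run targets its own environment, a large benchmark can fan out instead of running serially on one machine, which is what makes the scale described above practical.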

Key Concepts

Whether you’re using public or custom benchmarks, keep the following key concepts in mind:
  • Scenario: A scenario is a single, self-contained test case or task where an agent is given a problem and is expected to modify a target environment to solve it.
  • Benchmark: A set of Scenarios that can be run together to produce an overall performance score. Benchmarks can be made up of any number and combination of Scenarios — even Scenarios from other Benchmarks.
  • Scoring Function / Scorer: A script or function that is invoked to grade an agent’s performance on a Scenario with a score from 0.0 to 1.0 (a sketch follows this list).
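
To make the Scorer contract concrete, here is a minimal sketch of a custom scoring script. It assumes the scorer runs inside the Scenario’s environment after the agent has finished, that the repository lives at an assumed path, and that printing a float between 0.0 and 1.0 to stdout is how the score is reported; ruff and pytest are stand-ins for whatever checks your Scenario actually needs.

```python
#!/usr/bin/env python3
"""Minimal sketch of a custom Scorer.

Grades the agent's work on several weighted dimensions and reports a single
score between 0.0 and 1.0. The repository path and the convention of printing
the score to stdout are assumptions for illustration.
"""
import subprocess


def passes(cmd: list[str], cwd: str) -> bool:
    """Return True if the command exits with status 0."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0


def score(repo_dir: str = "/home/user/testbed") -> float:
    """Combine weighted checks into a score in [0.0, 1.0]."""
    checks = [
        (0.2, passes(["python", "-m", "compileall", "-q", "."], repo_dir)),  # code still parses
        (0.2, passes(["ruff", "check", "."], repo_dir)),                     # lint is clean
        (0.6, passes(["pytest", "-q"], repo_dir)),                           # test suite passes
    ]
    return sum(weight for weight, ok in checks if ok)


if __name__ == "__main__":
    print(f"{score():.2f}")
```

Weighting several checks like this is one way to cover the multiple dimensions mentioned above; a scorer can just as well return a hard 0.0 or 1.0 for pass/fail.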
Next, learn how to run public benchmarks.