Main Features
Runloop enables you to customize every aspect of benchmark creation and execution, including:
- Run Public Benchmarks: Easily run your agent against a matrix of well-known and open source benchmarks, such as SWE-bench.
- Custom Benchmarks: Craft your own scenarios and benchmarks to train or evaluate your agent on a private codebase or dataset.
- Custom Scorers: Create custom scorers to evaluate agents across multiple dimensions, such as security, cost, performance, and compliance.
- Reports & Insights: Identify problems and visualize your agent’s performance changes in the Runloop dashboard.
Key Concepts
Whether you’re using public or custom benchmarks, keep the following key concepts in mind:
- Scenario: A single, self-contained test case or task in which an agent is given a problem and is expected to modify a target environment to solve it.
- Benchmark: A set of Scenarios that can be run together to produce an overall performance score. A Benchmark can combine any number of Scenarios, including Scenarios that also belong to other Benchmarks.
- Scoring Function / Scorer: A script or function that is invoked to grade an agent’s performance on a Scenario, producing a score from 0.0 to 1.0.
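
The relationship between these three concepts can be sketched in a few lines of plain Python. This is a conceptual model only, not the Runloop SDK: the names `Scenario`, `Benchmark`, `run_benchmark`, and `clamp` are hypothetical, the environment is simplified to an in-memory dictionary of files, and averaging the per-Scenario scores is an assumption about how an overall score might be produced.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical types for illustration only; these are NOT Runloop SDK classes.
Environment = Dict[str, str]              # assumed model: file path -> contents
Scorer = Callable[[Environment], float]   # grades the final environment
Agent = Callable[[str, Environment], None]


@dataclass
class Scenario:
    """A single, self-contained task with its own environment and scorer."""
    name: str
    problem: str
    environment: Environment
    scorer: Scorer


@dataclass
class Benchmark:
    """A set of Scenarios that are run together."""
    name: str
    scenarios: List[Scenario]


def clamp(score: float) -> float:
    """Keep a scorer's result inside the 0.0 to 1.0 range."""
    return max(0.0, min(1.0, score))


def run_benchmark(benchmark: Benchmark, agent: Agent) -> float:
    """Run every Scenario and average the scores (averaging is an assumption)."""
    scores = []
    for scenario in benchmark.scenarios:
        env = dict(scenario.environment)  # copy so each run starts from a clean state
        agent(scenario.problem, env)      # the agent modifies the environment
        scores.append(clamp(scenario.scorer(env)))
    return mean(scores) if scores else 0.0


# Two toy Scenarios with simple pass/fail scorers.
fix_typo = Scenario(
    name="fix-typo",
    problem="Fix the spelling mistake in greeting.txt",
    environment={"greeting.txt": "helo world"},
    scorer=lambda env: 1.0 if "hello" in env.get("greeting.txt", "") else 0.0,
)
add_license = Scenario(
    name="add-license",
    problem="Add a LICENSE file",
    environment={},
    scorer=lambda env: 1.0 if "LICENSE" in env else 0.0,
)


def toy_agent(problem: str, env: Environment) -> None:
    """A stand-in agent that only knows how to fix the typo."""
    if "greeting.txt" in env:
        env["greeting.txt"] = env["greeting.txt"].replace("helo", "hello")


# A Scenario can belong to more than one Benchmark.
smoke = Benchmark("smoke", [fix_typo])
full = Benchmark("full", smoke.scenarios + [add_license])

print(run_benchmark(smoke, toy_agent))  # 1.0
print(run_benchmark(full, toy_agent))   # 0.5
```

In this sketch the scorer inspects only the final state of the environment, which mirrors the idea that a Scenario is graded on what the agent changed rather than on how it got there. A scorer could equally return partial credit, such as the fraction of passing tests, as long as the result stays between 0.0 and 1.0.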
