Main Features
Runloop Benchmarking includes several tools to save you time while optimizing your agent:
- Run Public Benchmarks: Easily run your agent against a matrix of well-known and open source benchmarks, such as SWE-bench (see the sketch after this list).
- Run Custom Benchmarks: Create your own custom test cases, then evaluate the agent’s performance against them at scale.
- Reports & Insights: As you run benchmarks over time, you will see how your agent’s performance changes in the Runloop dashboard.
Key Concepts
Whether you’re using public or custom benchmarks, keep the following key concepts in mind:
- Scenario: A single test case where an agent is given a problem and is expected to modify a target environment to solve it. Scenarios help test AI agents in realistic coding environments. Benchmarks may use other terms for a scenario, such as Task, TestCase, or Example.
- Scoring Function: A script or function that runs after the agent completes its task to validate whether the solution works. These functions generate a final score between 0 and 1 to indicate performance (see the sketch after this list).
- Benchmark: A collection of Scenarios designed to evaluate AI agents on a broader set of tasks. Benchmarks help measure agent capabilities systematically.
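
As a concrete illustration, a scoring function is often just a script that inspects the environment after the agent has finished and reports a number between 0 and 1. The sketch below assumes a scenario whose success criterion is that the project's pytest suite passes; the repository path and test command are placeholders, not part of any Runloop-defined interface.

```python
# Minimal sketch of a scoring function, assuming the scenario's success
# criterion is "the project's pytest suite passes". The repo path and
# test command are placeholders for illustration.
import subprocess


def score_solution(repo_path: str = "/workspace/repo") -> float:
    """Run the test suite and return a score between 0 and 1."""
    result = subprocess.run(
        ["python", "-m", "pytest", "--tb=no", "-q"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    # Simplest possible rubric: full credit if every test passes,
    # zero otherwise. A finer-grained scorer could parse the pytest
    # summary and return passed / total instead.
    return 1.0 if result.returncode == 0 else 0.0


if __name__ == "__main__":
    print(score_solution())
```

Any logic that maps the post-run state of the environment to a number between 0 and 1 can serve as a scoring function; the all-or-nothing rubric above is just the simplest case.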