Looking to run benchmarks quickly? For most use cases, we recommend
Orchestrated Benchmarks, which let
you run full benchmark suites with a single CLI command. This page describes
the interactive approach, which gives you fine-grained control over each
scenario run and full access to the devbox at any point during execution.
Interactive Benchmarks Overview
Interactive benchmarks use the Runloop SDK to drive benchmark execution step-by-step. This approach is ideal when you need:
- Full control over the execution flow
- Direct access to the devbox during a run
- Custom logic between scenario steps
- Debugging and iterative development
- Synthetic trajectory generation
Use the AsyncRunloop client if you’re working with Python.
Viewing Public Benchmarks
We’re constantly adding new supported datasets. To view the up-to-date list of supported public Benchmarks, list them through the API.
Are we missing your favorite open source benchmark? Let us know at support@runloop.ai.
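As a rough illustration of the listing call, the sketch below wraps it in a small helper. It assumes the Python SDK client exposes `client.benchmarks.list_public()` and that each returned item has `id` and `name` attributes; treat both as assumptions to check against the SDK reference. The helper name `public_benchmark_lines`, the stand-in client, and its sample data are hypothetical and exist only so the snippet runs without credentials.

```python
from types import SimpleNamespace
from typing import Any, List


def public_benchmark_lines(client: Any) -> List[str]:
    """Return one "id: name" line per public benchmark.

    Assumption: `client.benchmarks.list_public()` exists and yields objects
    with `id` and `name` attributes. Verify this against the SDK reference.
    """
    return [f"{b.id}: {b.name}" for b in client.benchmarks.list_public()]


# Offline stand-in with made-up data, so the sketch runs without an API key.
fake_client = SimpleNamespace(
    benchmarks=SimpleNamespace(
        list_public=lambda: [
            SimpleNamespace(id="bmd_example", name="example-benchmark")
        ]
    )
)

for line in public_benchmark_lines(fake_client):
    print(line)
```

With a real client in place of `fake_client`, the same helper would print whatever the service currently lists.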
Running Scenarios & Benchmarks
Each Scenario can be run to evaluate an AI agent’s performance. Running a scenario involves:
- Initiating a scenario run.
- Launching a development environment (devbox).
- Running the agent against the problem statement.
- Scoring the results.
- Uploading traces for analysis.
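The steps above can be sketched as a single function that calls them in order. This is a minimal sketch, not the Runloop SDK: `ScenarioBackend`, `FakeBackend`, `run_scenario`, and every method name on them are hypothetical placeholders chosen to mirror the list, and the score returned by the fake is a dummy value rather than a real result.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol


class ScenarioBackend(Protocol):
    """Hypothetical interface: these method names are illustrative only."""

    def start_run(self, scenario_id: str) -> str: ...
    def launch_devbox(self, run_id: str) -> str: ...
    def problem_statement(self, scenario_id: str) -> str: ...
    def score(self, run_id: str) -> float: ...
    def upload_traces(self, run_id: str) -> None: ...


@dataclass
class FakeBackend:
    """In-memory stand-in that records the order of the calls it receives."""

    steps: List[str] = field(default_factory=list)

    def start_run(self, scenario_id: str) -> str:
        self.steps.append("start_run")
        return f"run-{scenario_id}"

    def launch_devbox(self, run_id: str) -> str:
        self.steps.append("launch_devbox")
        return f"devbox-for-{run_id}"

    def problem_statement(self, scenario_id: str) -> str:
        return f"Problem statement for {scenario_id}"

    def score(self, run_id: str) -> float:
        self.steps.append("score")
        return 1.0  # dummy value, not a real benchmark result

    def upload_traces(self, run_id: str) -> None:
        self.steps.append("upload_traces")


# An agent receives the devbox identifier and the problem statement.
Agent = Callable[[str, str], None]


def run_scenario(backend: ScenarioBackend, scenario_id: str, agent: Agent) -> float:
    """Walk through the five steps in order and return the score."""
    run_id = backend.start_run(scenario_id)                   # initiate the run
    devbox_id = backend.launch_devbox(run_id)                 # launch the devbox
    agent(devbox_id, backend.problem_statement(scenario_id))  # run the agent
    result = backend.score(run_id)                            # score the results
    backend.upload_traces(run_id)                             # upload traces
    return result


backend = FakeBackend()
final_score = run_scenario(
    backend, "scenario-1", lambda devbox, task: backend.steps.append("agent")
)
print(backend.steps, final_score)
```

Because the agent is just a callable invoked between launching the devbox and scoring, this is the point where custom logic or debugging hooks would go. Calling `run_scenario` once per scenario in a benchmark and aggregating the scores gives the general shape of a full benchmark run.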
Run a single scenario from a public benchmark
To run a single scenario from a public benchmark against your own agent, first create a scenario run to track the status and results of this run.
Perform a full benchmark run of a public benchmark
Once your agent is excelling at an individual scenario, you will want to test against all Scenarios for a given Benchmark. To do so, perform a full benchmark run of the public benchmark.
Next Steps
- Orchestrated Benchmarks: Run full benchmarks at cloud scale with a single CLI command
- Custom Benchmarks: Create your own benchmarks with custom scenarios and scorers
- Custom Scorers: Build domain-specific scoring functions
