Public Benchmarks
Runloop Public Benchmarks make it simple to validate your coding agent against the most popular open-source coding evaluation datasets. Simply select a benchmark and run your agent against it. Each Benchmark contains a set of Scenarios, one for each test in the dataset. A Scenario contains the problem statement your agent must work through, a pre-built environment containing all of the context needed to complete the job, and a built-in scoring contract to properly evaluate the result for correctness. When working with benchmarks, keep in mind that datasets are typically large and are therefore paged. Similarly, execution can take a long time, so you should prefer the AsyncRunloop client if you're working with Python.
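For example, a minimal async client setup in Python might look like the sketch below. It assumes the runloop-api-client package and an API key in the RUNLOOP_API_KEY environment variable; the authentication parameter name may differ in your SDK version.

```python
# Minimal sketch: create an AsyncRunloop client.
# Assumes the runloop-api-client package and RUNLOOP_API_KEY in the environment.
import os

from runloop_api_client import AsyncRunloop

client = AsyncRunloop(
    bearer_token=os.environ["RUNLOOP_API_KEY"],  # assumption: key is passed as bearer_token
)
```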
Viewing Public Benchmarks
We're constantly adding support for new datasets. To see the up-to-date list of supported public Benchmarks, use the benchmark listing API call, as sketched below.

Are we missing your favorite open source benchmark? Let us know at support@runloop.ai
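The following Python sketch pages through the public Benchmarks. The benchmarks.list_public method name and the id/name fields are assumptions meant to illustrate the shape of the call; check the API reference for the exact signature.

```python
# Illustrative sketch: list and page through the public Benchmarks.
# Method and field names (benchmarks.list_public, .id, .name) are assumptions.
import asyncio
import os

from runloop_api_client import AsyncRunloop


async def main() -> None:
    client = AsyncRunloop(bearer_token=os.environ["RUNLOOP_API_KEY"])
    # Benchmark lists are paged; iterating the result pulls pages on demand.
    async for benchmark in client.benchmarks.list_public():
        print(benchmark.id, benchmark.name)


asyncio.run(main())
```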
Running Scenarios & Benchmarks
Each Scenario can be run to evaluate an AI agent's performance. Running a scenario involves (see the sketch after this list):
- Initiating a scenario run.
- Launching a development environment (devbox).
- Running the agent against the problem statement.
- Scoring the results.
- Uploading traces for analysis.
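A rough end-to-end sketch of those steps is shown below. The scenario-run method names (scenarios.start_run, scenarios.runs.score_and_complete) and the response fields are assumptions standing in for the real SDK surface, and the agent call is a placeholder for your own harness.

```python
# Hedged sketch of a single scenario run; names below are illustrative, not exact.
import asyncio
import os

from runloop_api_client import AsyncRunloop


async def solve_with_agent(devbox_id: str, problem_statement: str) -> None:
    """Placeholder for your agent harness: connect to the devbox,
    work through the problem statement, and leave the fix in place."""
    ...


async def run_one_scenario(scenario_id: str) -> None:
    client = AsyncRunloop(bearer_token=os.environ["RUNLOOP_API_KEY"])

    # 1. Initiate a scenario run, which launches the devbox with the
    #    pre-built environment for this problem (assumed method name).
    run = await client.scenarios.start_run(scenario_id=scenario_id)

    # 2. Run the agent against the problem statement inside the devbox
    #    (field names are assumptions).
    await solve_with_agent(run.devbox_id, run.scenario.problem_statement)

    # 3. Score the result with the built-in scoring contract and upload
    #    traces for analysis (assumed to be a single call here).
    result = await client.scenarios.runs.score_and_complete(run.id)
    print(result.score)


asyncio.run(run_one_scenario("<scenario_id>"))
```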