Public Benchmarks
Runloop Public Benchmarks make it simple to validate your coding agent against the most popular open source coding evaluation datasets. Simply select your favorite benchmark and an agent to run it against. Each Benchmark contains a set of Scenarios, one for each test in the dataset. A Scenario contains the problem statement your agent must work through, a pre-built environment with all of the context needed to complete the job, and a built-in scorer to evaluate the result for correctness.

When working with Benchmarks, keep in mind that benchmark datasets are typically large and are therefore paged. Execution can also take a long time, so prefer the AsyncRunloop client if you're working with Python.
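As a rough illustration, a minimal sketch of paging through a Benchmark's Scenarios with the async Python client might look like the following. The specific names used here (the scenarios.list method, the benchmark_id filter, and the scenario fields printed) are assumptions about the SDK's resource naming rather than a verbatim reference; check the SDK documentation for exact signatures.

```python
import asyncio

from runloop_api_client import AsyncRunloop


async def main() -> None:
    # Assumes RUNLOOP_API_KEY is set in the environment; pass the key
    # explicitly if your setup differs.
    client = AsyncRunloop()

    # Scenario listings are paged; iterating the returned page object is
    # assumed here to fetch additional pages automatically.
    async for scenario in client.scenarios.list(benchmark_id="<benchmark_id>"):
        print(scenario.id, scenario.name)


asyncio.run(main())
```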
Viewing Public Benchmarks
We’re constantly adding new supported datasets. To view the up-to-date list of supported public Benchmarks, call the benchmark listing API, as sketched below.

Are we missing your favorite open source benchmark? Let us know at [email protected]
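A minimal sketch with the Python SDK follows. The list_public method name is assumed to mirror the public-benchmark listing endpoint; verify the exact call against the SDK reference.

```python
from runloop_api_client import Runloop

client = Runloop()  # assumes RUNLOOP_API_KEY is set in the environment

# List the currently supported public Benchmarks; results are paged and the
# returned page object is assumed to be iterable across pages.
for benchmark in client.benchmarks.list_public():
    print(benchmark.id, benchmark.name)
```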
Running Scenarios & Benchmarks
Each Scenario can be run to evaluate an AI agent’s performance. Running a scenario involves the following steps (a sketch follows the list):
- Initiating a scenario run.
- Launching a development environment (devbox).
- Running the agent against the problem statement.
- Scoring the results.
- Uploading traces for analysis.
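Under the same assumptions about the Python SDK's resource names, an end-to-end scenario run might look roughly like this sketch. The start_run, score, and complete calls, and the fields read from the scenario and run objects, are assumed names for the run lifecycle, and run_agent is a placeholder for your own agent logic.

```python
import asyncio

from runloop_api_client import AsyncRunloop


async def run_agent(devbox_id: str, problem_statement: str) -> None:
    """Placeholder for your agent: work inside the devbox until done."""
    ...


async def main() -> None:
    client = AsyncRunloop()  # assumes RUNLOOP_API_KEY is set in the environment

    # 1. Initiate a scenario run; this is assumed to also launch the
    #    pre-built devbox environment for the Scenario.
    run = await client.scenarios.start_run(scenario_id="<scenario_id>")

    # 2. Fetch the Scenario to read its problem statement (field names assumed).
    scenario = await client.scenarios.retrieve("<scenario_id>")

    # 3. Run your agent against the problem statement inside the devbox.
    await run_agent(run.devbox_id, scenario.input_context.problem_statement)

    # 4. Score the results with the Scenario's built-in scorer.
    scored = await client.scenario_runs.score(run.id)
    print(scored)

    # 5. Complete the run, which uploads traces and results for analysis.
    await client.scenario_runs.complete(run.id)


asyncio.run(main())
```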
