Public Benchmarks
Learn how to easily run your agent against popular public benchmarks.
Runloop Public Benchmarks make it simple to validate your coding agent against the most popular open-source coding evaluation datasets.
Each Benchmark contains a set of Scenarios, one for each test in the dataset. A Scenario contains the problem statement your agent must work through, a pre-built environment with all of the context needed to complete the task, and a built-in scoring contract to evaluate the result for correctness.
Viewing Public Benchmarks
We’re constantly adding new supported datasets. For the up-to-date list of supported public Benchmarks, use the following API call:
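A minimal sketch using the Runloop Python SDK (`runloop_api_client`). The `benchmarks.list_public()` call is an assumption based on the SDK’s benchmarks resource; check the API reference if your SDK version names it differently.

```python
import os

from runloop_api_client import Runloop

# Authenticate with your Runloop API key.
client = Runloop(bearer_token=os.environ["RUNLOOP_API_KEY"])

# List the currently supported public Benchmarks.
# (Method name assumed from the SDK's benchmarks resource.)
for benchmark in client.benchmarks.list_public():
    print(benchmark.id, benchmark.name)
```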
Each Benchmark contains a set of Scenarios, each corresponding to a test case in the evaluation dataset.
Running Scenarios & Benchmarks
Each Scenario can be run to evaluate an AI agent’s performance. Running a scenario involves:
- Initiating a scenario run.
- Launching a development environment (devbox).
- Running the agent against the problem statement.
- Scoring the results.
- Uploading traces for analysis.
Run a single scenario from a public benchmark
Here’s an example of how to run your agent against a single scenario from a public benchmark.
First, create a scenario run to track the status and results of this run:
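A minimal sketch, continuing with the `client` from the listing above. The `scenarios.start_run(...)` call and its parameters are assumptions based on the scenario run lifecycle; consult the API reference for exact names.

```python
# Start a run of a single Scenario. Replace the placeholder id with one
# returned by the Benchmark's scenario listing.
scenario_run = client.scenarios.start_run(
    scenario_id="<scenario-id>",
    run_name="my-agent-attempt-1",  # optional label for this attempt (assumed parameter)
)
print(scenario_run.id, scenario_run.state)
```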
When starting a run, Runloop will create a Devbox with the environment specified by the test requirements.
Wait for the devbox used by the scenario to become ready:
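For example, assuming the SDK’s `devboxes.await_running(...)` helper and a `devbox_id` field on the scenario run (verify both against the API reference):

```python
# Block until the Devbox backing this scenario run is ready to accept commands.
devbox = client.devboxes.await_running(scenario_run.devbox_id)
```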
Now, run your agent. How and where your agent runs is up to you. Here’s an example of an agent that leverages the Runloop Devbox that was just created:
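The stand-in agent below simply executes shell commands in the Devbox via `devboxes.execute_sync(...)`; a real agent would drive an LLM loop here, editing files and running tests until the problem is solved. The `input_context.problem_statement` field is an assumption about the Scenario schema.

```python
# Fetch the problem statement for the Scenario (field names assumed; see
# the Scenario schema in the API reference).
scenario = client.scenarios.retrieve(scenario_run.scenario_id)
problem_statement = scenario.input_context.problem_statement

# Stand-in for your agent: run a command inside the Devbox. A real agent
# would iterate with an LLM, editing files and running tests.
result = client.devboxes.execute_sync(
    devbox.id,
    command="ls && cat README.md",
)
print(result.stdout)
```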
Finally, run the scoring function to validate the agent’s performance:
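A sketch of the scoring step; the `score` and `complete` calls on the scenario runs resource are assumptions based on the run lifecycle described above.

```python
# Score the run against the Scenario's built-in scoring contract, then
# mark it complete so results and traces are finalized.
# (Resource path and method names assumed; see the API reference.)
scored = client.scenarios.runs.score(scenario_run.id)
print(scored.scoring_contract_result)

client.scenarios.runs.complete(scenario_run.id)
```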
Perform a full benchmark run of a public benchmark
Once your agent excels at an individual scenario, you’ll want to test it against all Scenarios in a given Benchmark.
Here’s an example of how to perform a full benchmark run of a public benchmark.
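A sketch of the full loop, again with assumed method names (`benchmarks.start_run`, a `pending_scenarios` field on the run, and a hypothetical `run_my_agent` helper standing in for your agent logic from the previous section):

```python
# Kick off a run of the whole Benchmark.
benchmark_run = client.benchmarks.start_run(benchmark_id="<benchmark-id>")

# Run the agent on every Scenario in the Benchmark.
for scenario_id in benchmark_run.pending_scenarios:  # field name assumed
    scenario_run = client.scenarios.start_run(
        scenario_id=scenario_id,
        benchmark_run_id=benchmark_run.id,  # ties this run to the benchmark run
    )
    devbox = client.devboxes.await_running(scenario_run.devbox_id)

    run_my_agent(devbox)  # hypothetical helper: your agent logic from above

    client.scenarios.runs.score(scenario_run.id)
    client.scenarios.runs.complete(scenario_run.id)

# Mark the Benchmark run complete once every Scenario has been scored.
client.benchmarks.runs.complete(benchmark_run.id)
```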
Public Benchmarks make it fast and easy to start evaluating your agent against industry-standard coding evaluations. When you’re ready to expand beyond public datasets or build Benchmarks tailored to your specific needs, move on to creating Custom Benchmarks.