Public Benchmarks

Runloop Public Benchmarks make it simple to validate your coding agent against the most popular open source coding evaluation datasets. Simply select a Benchmark and run your agent against it. Each Benchmark contains a set of Scenarios, one per test in the dataset. A Scenario contains the problem statement your agent must work through, a pre-built environment with all of the context needed to complete the job, and a built-in scoring contract that evaluates the result for correctness. When working with Benchmarks, keep in mind that datasets are typically large and are therefore paged. Execution can also take a long time, so prefer the AsyncRunloop client when working in Python.

Viewing Public Benchmarks

We’re constantly adding support for new datasets. To see the up-to-date list of supported public Benchmarks, use the following API call:
import os
import asyncio
from runloop_api_client import AsyncRunloop

# Note: We are using the AsyncRunloop client throughout the rest of this example.
client = AsyncRunloop(bearer_token=os.environ.get("RUNLOOP_API_KEY"))

async def main():
    page = await client.benchmarks.list_public()
    benchmarks = page.benchmarks
    return benchmarks

benchmarks = asyncio.run(main())
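Benchmark listings are paged, and the snippet above reads only the first page. If you need every supported Benchmark, you can iterate the paginator instead. The sketch below assumes the SDK’s standard cursor-based auto-pagination (iterating the result of list_public yields items across pages); check your SDK version for the exact pagination helpers.
async def list_all_benchmarks():
    # Auto-paginate: iterating the paginator yields benchmarks across all pages.
    all_benchmarks = []
    async for benchmark in client.benchmarks.list_public():
        all_benchmarks.append(benchmark)
    return all_benchmarks

all_benchmarks = asyncio.run(list_all_benchmarks())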
Are we missing your favorite open source benchmark? Let us know at support@runloop.ai
Each Benchmark contains a set of Scenarios, each of which corresponds to a test case in the evaluation dataset.
# The Benchmark definition contains a list of all scenarios contained in the benchmark
print(benchmarks[0].scenario_ids)
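You can also look up an individual Scenario before running it, for example to read its problem statement. The sketch below assumes the SDK exposes a scenarios.retrieve method for fetching a Scenario by ID; verify the method name against your SDK version.
# Fetch a single Scenario to inspect its problem statement before running it.
scenario = await client.scenarios.retrieve(benchmarks[0].scenario_ids[0])
print(scenario.input_context.problem_statement)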

Running Scenarios & Benchmarks

Each Scenario can be run to evaluate an AI agent’s performance. Running a scenario involves:
  1. Initiating a scenario run.
  2. Launching a development environment (devbox).
  3. Running the agent against the problem statement.
  4. Scoring the results.
  5. Uploading traces for analysis.

Run a single scenario from a public benchmark

Here’s an example of how to run a single scenario from a public benchmark against your own agent. First, create a scenario run to track the status and results of this run:

# Note: we are using the async client here.
scenario_id = benchmarks[0].scenario_ids[0]
scenario_run = await client.scenarios.start_run(
    scenario_id=scenario_id,
    run_name="marshmallow-code__marshmallow-1359 test run",
)

When starting a run, Runloop will create a Devbox with the environment specified by the test requirements. Wait for the devbox used by the scenario to become ready:
# Note the async client is used here.
devbox_id = scenario_run.devbox_id
await client.devboxes.await_running(devbox_id)
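Devbox provisioning can take a while. If you want to bound the wait on the client side, you can wrap the call in a timeout; the 10-minute limit below is an illustrative choice, not a recommendation.
import asyncio

# Bound the wait with a client-side timeout (600 seconds is illustrative).
try:
    await asyncio.wait_for(client.devboxes.await_running(devbox_id), timeout=600)
except asyncio.TimeoutError:
    # The devbox never reached the running state in time; handle or re-raise.
    raise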
Now, run your agent. How and where your agent runs is up to you. Here’s an example of an agent that uses the problem statement as the prompt:
problem_statement = scenario_run.scenario.input_context.problem_statement
my_agent = MyAgent(prompt=problem_statement)
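If your agent executes commands inside the devbox (for example, to edit files and run tests there), you can drive it through the devbox command-execution API. This is a sketch, not a prescribed integration: run_my_agent is a hypothetical CLI installed on the devbox, and execute_sync plus its command parameter should be verified against your SDK version.
import shlex

# Run a hypothetical agent CLI inside the devbox, passing the problem statement as the prompt.
result = await client.devboxes.execute_sync(
    devbox_id,
    command=f"run_my_agent --prompt {shlex.quote(problem_statement)}",
)
print(result)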
Finally, run the scoring function to validate the agent’s performance:
# Run the scoring function. Automatically marks the scenario run as done.
# Note the async client is used here.
validated = await client.scenarios.runs.score(scenario_run.id)
print(validated)

Perform a full benchmark run of a public benchmark

Once your agent is excelling at an individual Scenario, you’ll want to test it against all Scenarios in a given Benchmark. Here’s an example of how to perform a full benchmark run of a public Benchmark.
import os
import asyncio
from runloop_api_client import AsyncRunloop

client = AsyncRunloop(bearer_token=os.environ.get("RUNLOOP_API_KEY"))

async def run_full_benchmark():
    # Fetch the first page of public benchmarks and start a full run of the first one
    page = await client.benchmarks.list_public()
    benchmarks = page.benchmarks

    benchmark_run = await client.benchmarks.start_run(
        benchmark_id=benchmarks[0].id,
        run_name="optional run name",
    )

    # Example: iterate scenarios (serialize or parallelize as desired)
    for scenario_id in benchmark_run.pending_scenarios:
        scenario_run = await client.scenarios.start_run(
            scenario_id=scenario_id,
            benchmark_run_id=benchmark_run.id,
        )
        await client.devboxes.await_running(scenario_run.devbox_id)
        # Run your agent here using scenario_run.scenario.input_context.problem_statement
        my_agent = MyAgent(prompt=scenario_run.scenario.input_context.problem_statement)
        await client.scenarios.runs.score(scenario_run.id)

    # Optionally end the benchmark run early (runs auto-complete when pending scenarios are done)
    await client.benchmarks.runs.complete(benchmark_run.id)

asyncio.run(run_full_benchmark())
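Because each Scenario runs in its own Devbox, you can also fan Scenarios out concurrently rather than iterating them serially, as the comment above suggests. The sketch below caps concurrency with a semaphore; the limit of 5 and the run_one helper are illustrative.
async def run_one(scenario_id, benchmark_run_id, semaphore):
    # The semaphore limits how many scenarios run at once.
    async with semaphore:
        scenario_run = await client.scenarios.start_run(
            scenario_id=scenario_id,
            benchmark_run_id=benchmark_run_id,
        )
        await client.devboxes.await_running(scenario_run.devbox_id)
        # Run your agent here using scenario_run.scenario.input_context.problem_statement
        await client.scenarios.runs.score(scenario_run.id)

async def run_benchmark_concurrently(benchmark_run):
    semaphore = asyncio.Semaphore(5)  # illustrative concurrency cap
    await asyncio.gather(
        *(run_one(sid, benchmark_run.id, semaphore) for sid in benchmark_run.pending_scenarios)
    )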
Public Benchmarks make it fast and easy to start evaluating your agent against industry standard coding evaluations. When you’re ready to expand or customize Benchmarks to meet your specific needs, move on to creating Custom Benchmarks.