Public Benchmarks

Runloop Public Benchmarks make it simple to validate your coding agent against the most popular open source coding evaluation datasets. Simply select a Benchmark and run your agent against it. Each Benchmark contains a set of Scenarios, one per test in the dataset. A Scenario contains the problem statement your agent must work through, a pre-built environment with all of the context needed to complete the job, and a built-in scoring contract that evaluates the result for correctness. When working with Benchmarks, keep in mind that datasets are typically large and are therefore paged. Execution can also take a long time, so prefer the AsyncRunloop client when working in Python.

Viewing Public Benchmarks

We’re constantly adding support for new datasets. To see the up-to-date list of supported public Benchmarks, use the following API call:
import os
import asyncio
from runloop_api_client import AsyncRunloop

# Note: We are using the AsyncRunloop client throughout the rest of this example.
client = AsyncRunloop(bearer_token=os.environ.get("RUNLOOP_API_KEY"))

async def main():
    page = await client.benchmarks.list_public()
    benchmarks = page.benchmarks
    return benchmarks

benchmarks = asyncio.run(main())
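Benchmark listings are paged, and the snippet above reads only the first page. If you need every supported Benchmark, you can iterate the paginator instead. The sketch below assumes the SDK’s standard cursor-based auto-pagination (iterating the result of list_public yields items across pages); check your SDK version for the exact pagination helpers.
async def list_all_benchmarks():
    # Auto-paginate: iterating the paginator yields benchmarks across all pages.
    all_benchmarks = []
    async for benchmark in client.benchmarks.list_public():
        all_benchmarks.append(benchmark)
    return all_benchmarks

all_benchmarks = asyncio.run(list_all_benchmarks())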
Are we missing your favorite open source benchmark? Let us know at support@runloop.ai
Each Benchmark contains a set of Scenarios, each of which corresponds to a test case in the evaluation dataset.
# The Benchmark definition contains a list of all scenarios contained in the benchmark
print(benchmarks[0].scenario_ids)
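You can also look up an individual Scenario before running it, for example to read its problem statement. The sketch below assumes the SDK exposes a scenarios.retrieve method for fetching a Scenario by ID; verify the method name against your SDK version.
# Fetch a single Scenario to inspect its problem statement before running it.
scenario = await client.scenarios.retrieve(benchmarks[0].scenario_ids[0])
print(scenario.input_context.problem_statement)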

Running Scenarios & Benchmarks

Each Scenario can be run to evaluate an AI agent’s performance. Running a scenario involves:
  1. Initiating a scenario run.
  2. Launching a development environment (devbox).
  3. Running the agent against the problem statement.
  4. Scoring the results.
  5. Uploading traces for analysis.

Run a single scenario from a public benchmark

Here’s an example of how to run a single scenario from a public benchmark against your own agent. First, create a scenario run to track the status and results of this run:

# Note: we are using the async client here.
scenario_id = benchmarks[0].scenario_ids[0]
scenario_run = await client.scenarios.start_run(
    scenario_id=scenario_id,
    run_name="marshmallow-code__marshmallow-1359 test run",
)

When starting a run, Runloop will create a Devbox with the environment specified by the test requirements. Wait for the devbox used by the scenario to become ready:
# Note the async client is used here.
devbox_id = scenario_run.devbox_id
await client.devboxes.await_running(devbox_id)
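Devbox provisioning can take a while. If you want to bound the wait on the client side, you can wrap the call in a timeout; the 10-minute limit below is an illustrative choice, not a recommendation.
import asyncio

# Bound the wait with a client-side timeout (600 seconds is illustrative).
try:
    await asyncio.wait_for(client.devboxes.await_running(devbox_id), timeout=600)
except asyncio.TimeoutError:
    # The devbox never reached the running state in time; handle or re-raise.
    raise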
Now, run your agent. How and where your agent runs is up to you. Here’s an example of an agent that uses the problem statement as the prompt:
problem_statement = scenario_run.scenario.input_context.problem_statement
my_agent = MyAgent(prompt=problem_statement)
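If your agent executes commands inside the devbox (for example, to edit files and run tests there), you can drive it through the devbox command-execution API. This is a sketch, not a prescribed integration: run_my_agent is a hypothetical CLI installed on the devbox, and execute_sync plus its command parameter should be verified against your SDK version.
import shlex

# Run a hypothetical agent CLI inside the devbox, passing the problem statement as the prompt.
result = await client.devboxes.execute_sync(
    devbox_id,
    command=f"run_my_agent --prompt {shlex.quote(problem_statement)}",
)
print(result)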
Finally, run the scoring function to validate the agent’s performance:
# Run the scoring function. Automatically marks the scenario run as done.
# Note the async client is used here.
validated = await client.scenarios.runs.score(scenario_run.id)
print(validated)

Perform a full benchmark run of a public benchmark

Once your agent is excelling at an individual Scenario, you’ll want to test it against all Scenarios in a given Benchmark. Here’s an example of how to perform a full benchmark run of a public Benchmark.
import os
import asyncio
from runloop_api_client import AsyncRunloop

client = AsyncRunloop(bearer_token=os.environ.get("RUNLOOP_API_KEY"))

async def run_full_benchmark():
    # Fetch the first page of public benchmarks and start a full run of the first one
    page = await client.benchmarks.list_public()
    benchmarks = page.benchmarks

    benchmark_run = await client.benchmarks.start_run(
        benchmark_id=benchmarks[0].id,
        run_name="optional run name",
    )

    # Example: iterate scenarios (serialize or parallelize as desired)
    for scenario_id in benchmark_run.pending_scenarios:
        scenario_run = await client.scenarios.start_run(
            scenario_id=scenario_id,
            benchmark_run_id=benchmark_run.id,
        )
        await client.devboxes.await_running(scenario_run.devbox_id)
        # Run your agent here using scenario_run.scenario.input_context.problem_statement
        my_agent = MyAgent(prompt=scenario_run.scenario.input_context.problem_statement)
        await client.scenarios.runs.score(scenario_run.id)

    # Optionally end the benchmark run early (runs auto-complete when pending scenarios are done)
    await client.benchmarks.runs.complete(benchmark_run.id)

asyncio.run(run_full_benchmark())
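Because each Scenario runs in its own Devbox, you can also fan Scenarios out concurrently rather than iterating them serially, as the comment above suggests. The sketch below caps concurrency with a semaphore; the limit of 5 and the run_one helper are illustrative.
async def run_one(scenario_id, benchmark_run_id, semaphore):
    # The semaphore limits how many scenarios run at once.
    async with semaphore:
        scenario_run = await client.scenarios.start_run(
            scenario_id=scenario_id,
            benchmark_run_id=benchmark_run_id,
        )
        await client.devboxes.await_running(scenario_run.devbox_id)
        # Run your agent here using scenario_run.scenario.input_context.problem_statement
        await client.scenarios.runs.score(scenario_run.id)

async def run_benchmark_concurrently(benchmark_run):
    semaphore = asyncio.Semaphore(5)  # illustrative concurrency cap
    await asyncio.gather(
        *(run_one(sid, benchmark_run.id, semaphore) for sid in benchmark_run.pending_scenarios)
    )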
Public Benchmarks make it fast and easy to start evaluating your agent against industry standard coding evaluations. When you’re ready to expand or customize Benchmarks to meet your specific needs, move on to creating Custom Benchmarks.