Looking to run benchmarks quickly? For most use cases, we recommend
Orchestrated Benchmarks, which let
you run full benchmark suites with a single CLI command. This page describes
the interactive approach, which gives you fine-grained control over each
scenario run and full access to the devbox at any point during execution.
Interactive Benchmarks Overview
Interactive benchmarks use the Runloop SDK to drive benchmark execution step-by-step. This approach is ideal when you need:
- Full control over the execution flow
- Direct access to the devbox during a run
- Custom logic between scenario steps
- Debugging and iterative development
- Synthetic trajectory generation
Each Benchmark contains a set of Scenarios, one for each test in the dataset. Each Scenario contains the problem statement that your agent
must work through, a pre-built environment containing all of the context needed to complete the task, and a built-in scorer
that evaluates the result for correctness.
When working with benchmarks, keep in mind that benchmark datasets are typically large and are therefore paged. Similarly, execution can take a long time, so you should prefer the AsyncRunloop client if you’re working with Python.
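For reference, an async client can be constructed like this. This is a minimal sketch: the runloop_api_client import path, the bearer_token argument, and the RUNLOOP_API_KEY environment variable are assumptions based on common SDK conventions, so adjust them to your setup.
import os
from runloop_api_client import AsyncRunloop

# Assumption: the constructor takes a bearer_token and the API key
# is exported as RUNLOOP_API_KEY
runloop = AsyncRunloop(bearer_token=os.environ["RUNLOOP_API_KEY"])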
Viewing Public Benchmarks
We’re constantly adding new supported datasets. To view the up-to-date list of supported public Benchmarks, use the following API call:
# Query to see the latest list of supported public benchmarks
benchmarks = await runloop.api.benchmarks.list_public()
Each Benchmark lists the Scenarios that correspond to the test cases in its evaluation dataset.
# The Benchmark definition contains a list of all scenarios
# contained in the benchmark
print(benchmarks[0].scenario_ids)
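If you are looking for a particular dataset, you can select it by name. A sketch, assuming each benchmark entry exposes name and id fields (the name filter itself is illustrative):
# Assumption: benchmark entries carry .name, .id, and .scenario_ids
target = next(b for b in benchmarks if "SWE-bench" in b.name)
print(target.id, len(target.scenario_ids))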
Running Scenarios & Benchmarks
Each Scenario can be run to evaluate an AI agent’s performance. Running a scenario involves:
- Initiating a scenario run.
- Launching a development environment (devbox).
- Running the agent against the problem statement.
- Scoring the results.
- Uploading traces for analysis.
Run a single scenario from a public benchmark
Here’s an example of how to run a single scenario from a public benchmark against your own agent.
First, create a scenario run to track the status and results of this run:
# Note: we are using the async client here.
scenario_id = benchmarks[0].scenario_ids[0]
scenario_run = await runloop.api.scenarios.start_run(
    scenario_id=scenario_id,
    run_name="marshmallow-code__marshmallow-1359 test run"
)
When you start a run, Runloop creates a Devbox with the environment
specified by the test requirements.
Wait for the devbox used by the scenario to become ready:
# Note the async client is used here.
devbox = runloop.devbox.from_id(scenario_run.devbox_id)
await devbox.await_running()
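Because you hold a handle to the live devbox, you can inspect or modify the environment before (or while) your agent works. A sketch, assuming the devbox object exposes an execute helper for shell commands (the method name and the command are illustrative, not confirmed SDK calls):
# Assumption: devbox.execute() runs a shell command and returns its output
result = await devbox.execute("git status")
print(result.stdout)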
Now, run your agent. How and where your agent runs is up to you. Here’s an example of an agent that uses the problem statement as the prompt:
problem_statement = scenario_run.scenario.input_context.problem_statement

# MyAgent is a placeholder for your own agent implementation
my_agent = MyAgent(prompt=problem_statement)
Finally, run the scoring function to validate the agent’s performance:
# Run the scoring function. Automatically marks the scenario run as done.
results = await runloop.api.scenarios.runs.score_and_await(scenario_run.id)
print(results)
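The returned run object carries the final state and score details. The attribute names below are assumptions and may differ by SDK version:
# Assumption: the result exposes .state and a scoring result with a numeric score
print(results.state)
if results.scoring_contract_result is not None:
    print("score:", results.scoring_contract_result.score)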
Once your agent is excelling at an individual scenario, you will want to test
against all Scenarios for a given Benchmark.
Here’s an example of how to perform a full benchmark run of a public benchmark.
# Start a full run of the first public benchmark returned
benchmark_run = await runloop.api.benchmarks.start_run(
    benchmark_id=benchmarks[0].id,
    run_name="optional run name"
)
# Example: iterate scenarios (serialize or parallelize as desired)
for scenario_id in benchmark_run.pending_scenarios:
    scenario_run = await runloop.api.scenarios.start_run(
        scenario_id=scenario_id,
        benchmark_run_id=benchmark_run.id
    )
    devbox = runloop.devbox.from_id(scenario_run.devbox_id)
    await devbox.await_running()

    # Run your agent here using scenario_run.scenario.input_context.problem_statement
    my_agent = MyAgent(
        prompt=scenario_run.scenario.input_context.problem_statement
    )

    await runloop.api.scenarios.runs.score(scenario_run.id)
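Because scenario runs are independent, you can fan them out with asyncio rather than running them serially. The sketch below bounds concurrency with a semaphore; the run_scenario and run_all helpers are illustrative and simply wrap the same calls shown above:
import asyncio

async def run_scenario(scenario_id: str) -> None:
    scenario_run = await runloop.api.scenarios.start_run(
        scenario_id=scenario_id,
        benchmark_run_id=benchmark_run.id
    )
    devbox = runloop.devbox.from_id(scenario_run.devbox_id)
    await devbox.await_running()
    # Run your agent here, then score the result
    await runloop.api.scenarios.runs.score(scenario_run.id)

async def run_all(max_concurrency: int = 4) -> None:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(scenario_id: str) -> None:
        async with semaphore:
            await run_scenario(scenario_id)

    await asyncio.gather(*(bounded(sid) for sid in benchmark_run.pending_scenarios))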
Interactive benchmarks make it easy to start evaluating your agent against industry-standard coding evals while retaining full control over the execution process.
Next Steps