What is a Scenario?

A Scenario is a single, self-contained test case or task where an agent is given a problem and is expected to modify a target environment to solve it. Scenarios are the building blocks of both Public Benchmarks and Custom Benchmarks. Each Scenario includes:
  • Problem Statement: The task description your agent will work on.
  • Environment: A devbox environment (from a blueprint or snapshot) that contains all required code and tools.
  • Scorer: One or more scoring functions that determine whether the agent succeeded and emit a score between 0.0 and 1.0.
  • Reference Output: A canonical solution or patch used to validate the scorer. After applying the reference output to the environment, the scorer should emit a score of 1.0.
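Concretely, these components map onto the fields of the scenarios.create API call shown in full at the end of this page. The sketch below is illustrative only; it assumes an async context with an AsyncRunloop client (as in the full example), and the name, snapshot ID, scoring script, and reference output are placeholders.
scenario = await client.scenarios.create(
    name="example-scenario",
    # Problem Statement: the task description handed to the agent
    input_context={"problem_statement": "Fix the failing unit test in tests/test_example.py"},
    # Environment: the devbox snapshot (or blueprint) the run boots from
    environment_parameters={"snapshot_id": "<your-snapshot-id>"},
    # Scorer: one or more scoring functions, each emitting a score between 0.0 and 1.0
    scoring_contract={
        "scoring_function_parameters": [{
            "name": "run_tests",
            "scorer": {
                "type": "bash_script_scorer",
                "bash_script": "pytest -q && echo 1.0 || echo 0.0",
            },
            "weight": 1.0,
        }],
    },
    # Reference Output: a known-good fix used to validate the scorer
    reference_output="<patch or command that solves the problem>",
)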

Creating Scenarios

You can create Scenarios directly in the Runloop dashboard using a guided workflow. This is the fastest way to get started before automating Scenario creation with the API. Creating a Scenario from scratch involves the following high-level steps:

1. Environment setup

Configuring the environment for a Scenario involves selecting a baseline environment and providing any additional parameters that control how the environment is brought up. The baseline can come from a blueprint, or from a snapshot of a devbox captured in the exact system state you want the Scenario to start from.
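For example, you can boot a devbox, install the code and tooling the task needs, and snapshot it; the snapshot then becomes the Scenario's baseline. The calls below mirror the end-to-end example later on this page and assume an async context with an AsyncRunloop client; the repository URL, setup command, and snapshot name are placeholders.
# Bring up a devbox and put it in the exact state the Scenario should start from.
devbox = await client.devboxes.create()
await client.devboxes.execute_and_await_completion(
    devbox.id,
    command="git clone https://github.com/example/project.git && pip install -r project/requirements.txt",
)

# Capture that state; the snapshot ID is what the Scenario's environment references.
snapshot = await client.devboxes.snapshot_disk(devbox.id, name="scenario-baseline")
environment_parameters = {"snapshot_id": snapshot.id}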

2. Configure scoring and reference output

Next, define how success is measured and establish a reference output to validate that your scorer works as expected. You can use simple bash-based scorers or more advanced custom scorers written in Python or TypeScript. Developing a custom scorer is a powerful way to test a specific behavior or edge case and is often an iterative process.
  • Scoring functions: Add one or more scorers that return a score between 0.0 and 1.0.
  • Weights: When combining multiple scorers, assign each one a weight; the weights should sum to 1.0.
  • Reference Output (Optional): Provide a known-good output (such as a patch or command) that your scorer can compare against. The reference solution is kept outside of the devbox to avoid leaking solutions.
For more detail on designing robust scoring logic, see Custom Scorers.
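As a sketch of how weighted scorers combine, the contract below uses two bash scorers whose weights sum to 1.0. The structure follows the scoring_contract in the end-to-end example at the bottom of this page; the scorer names, scripts, and weights themselves are illustrative.
scoring_contract = {
    "scoring_function_parameters": [
        {
            # Does the test suite pass? Contributes 70% of the final score.
            "name": "tests_pass",
            "scorer": {
                "type": "bash_script_scorer",
                "bash_script": "pytest -q && echo 1.0 || echo 0.0",
            },
            "weight": 0.7,
        },
        {
            # Did the agent leave the test files untouched? Contributes 30%.
            "name": "tests_unchanged",
            "scorer": {
                "type": "bash_script_scorer",
                "bash_script": "git diff --quiet -- tests/ && echo 1.0 || echo 0.0",
            },
            "weight": 0.3,
        },
    ],
}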

3. Add to Benchmarks

Once saved, your Scenario can be reused across multiple benchmarks or run on its own. You can also add metadata to organize Scenarios by purpose, programming language, difficulty, or use case.
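As a rough sketch only: assuming the SDK accepts a metadata dict on scenarios.create and exposes a benchmarks.create call that takes a list of Scenario IDs (check the API reference for the exact parameter names), tagging a Scenario and grouping it into a Benchmark might look like this.
# Hypothetical sketch: the metadata field and benchmarks.create signature are
# assumptions, not confirmed API; consult the API reference before relying on them.
scenario = await client.scenarios.create(
    name="fix-failing-test",
    input_context={"problem_statement": "Fix the failing unit test in tests/test_example.py"},
    environment_parameters={"snapshot_id": snapshot.id},
    scoring_contract=scoring_contract,
    metadata={"language": "python", "difficulty": "easy", "use_case": "unit-test-repair"},
)

benchmark = await client.benchmarks.create(
    name="my-custom-benchmark",
    scenario_ids=[scenario.id],
)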

Scenario Execution Lifecycle

When you run a Scenario, your machine works with Runloop to manage the lifecycle of the devbox and the scoring process.
At a high level, a Scenario run goes through the following phases:
  1. Run created: A Scenario run record is created to track execution.
  2. Environment Provisioning: Runloop launches a devbox using the Scenario’s environment configuration and runs any launch scripts or commands.
  3. Agent Mounting (Optional): Your agent is deployed onto the devbox.
  4. Run Execution: Execute arbitrary commands on the devbox. Most commonly, you will want to instruct the agent to work on the problem statement. The agent is expected to modify the environment to solve the problem.
  5. Scoring: The configured scoring functions run and produce a score between 0.0 and 1.0.
  6. Completion & reporting: The run is marked complete and results, logs, and traces are available in the dashboard and via API.
  7. Shutdown: The devbox is shut down and any resources are freed.
For a deeper dive into running Scenarios and Benchmarks programmatically, see Public Benchmarks.
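Condensed from the full example in the next section, the phases above map roughly onto the following API calls (the agent invocation is a placeholder for your own harness):
# 1-2. Create the run and wait for the environment to be provisioned.
scenario_run = await client.scenarios.start_run(
    scenario_id=scenario.id,
    run_name="lifecycle-demo",
)
await client.devboxes.await_running(scenario_run.devbox_id)

# 3-4. Mount your agent and have it work on the problem statement (placeholder).
# my_agent.solve(devbox=scenario_run.devbox_id)

# 5-6. Score the run; results, logs, and traces then appear in the dashboard and API.
result = await client.scenarios.runs.score(scenario_run.id)

# 7. The devbox is shut down and its resources are freed.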

Creating and Running Scenarios via the API

Once you are comfortable with the dashboard workflow, you can automate Scenario creation and execution using the Runloop API. Here’s an end-to-end example that:
  1. Creates an environment snapshot.
  2. Creates a Scenario.
  3. Starts and scores a Scenario run.
import asyncio
from runloop_api_client import AsyncRunloop

# Note: we use the AsyncRunloop client so we can easily await long-running operations.
client = AsyncRunloop()  # API Key is automatically loaded from "RUNLOOP_API_KEY"


async def main():
    # 1. Create a devbox and set up a minimal failing test inside it
    devbox = await client.devboxes.create()

    # Create tests/test_example.py in the devbox. This test will immediately raise,
    # which gives the agent something concrete to fix.
    await client.devboxes.execute_and_await_completion(
        devbox.id,
        command=(
            "mkdir -p tests && "
            # printf (unlike bash's echo) expands \n escapes, so the file is written
            # as a real two-line test instead of a single line with a literal "\n".
            "printf 'def test_example():\\n"
            "    raise Exception(\"intentional failure from test_example\")\\n' "
            "> tests/test_example.py"
        ),
    )

    # Snapshot the devbox after the test file has been created so the scenario
    # environment always contains the failing test.
    snapshot = await client.devboxes.snapshot_disk(
        devbox.id,
        name="my-scenario-baseline",
    )

    # 2. Create the scenario
    scenario = await client.scenarios.create(
        name="My First Scenario",
        input_context={
            "problem_statement": "Fix the failing unit test in tests/test_example.py",
        },
        environment_parameters={
            "snapshot_id": snapshot.id,
        },
        scoring_contract={
            "scoring_function_parameters": [{
                "name": "bash_scorer",
                "scorer": {
                    "type": "bash_script_scorer",
                    "bash_script": "pytest -q && echo 1.0 || echo 0.0",
                },
                "weight": 1.0,
            }],
        },
        # Known-good fix: replacing the failing test with a passing one should make
        # the scorer emit 1.0 when the reference output is applied, validating the scorer.
        reference_output=(
            "printf 'def test_example():\\n    assert True\\n' > tests/test_example.py"
        ),
    )

    # 3. Start a scenario run and wait for the environment to be ready
    scenario_run = await client.scenarios.start_run(
        scenario_id=scenario.id,
        run_name="my-first-scenario-run",
    )

    await client.devboxes.await_running(scenario_run.devbox_id)

    # Run your agent here, using the problem statement as context
    problem_statement = scenario_run.scenario.input_context.problem_statement
    # my_agent = MyAgent(prompt=problem_statement)
    # my_agent.solve(devbox=scenario_run.devbox_id)

    # 4. Score the run
    result = await client.scenarios.runs.score(scenario_run.id)
    print(result.score)


asyncio.run(main())

Where to Go Next