Overview

Benchmarks are collections of one or more scenarios. A scenario is a single, self-contained test case where an agent is given a problem and is expected to modify a target environment to solve it. Once created, scenarios can be run as many times as you want with different agents, parameters and configurations.

Creating Custom Scenarios

Creating custom scenarios allows users to tailor problem statements and environments to their specific needs. This is useful for testing or training agents under controlled conditions, or for building unique challenges. To define your own scenario:
  1. Build a blueprint from a Dockerfile to set up a baseline environment for the test case represented by the scenario (see the sketch after this list).
  2. Define a scoring function to evaluate the outcome of the scenario. The scoring function must return a score between 0 (fail) and 1 (pass).
  3. Create a problem statement that describes the task the agent must complete.
  4. Configure a reference_output; this is a known good output that the agent must achieve, sometimes referred to as the “gold patch” or “canonical solution”.
  5. Create a scenario using the blueprint, problem statement, environment parameters and scoring function.
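For step 1, a minimal sketch of building a blueprint from a Dockerfile might look like the following; the Dockerfile contents are illustrative, and the blueprint_id environment parameter is an assumption to verify against the scenario API reference.
import os
import asyncio
from runloop_api_client import AsyncRunloop

client = AsyncRunloop(bearer_token=os.environ.get("RUNLOOP_API_KEY"))

async def build_baseline_blueprint():
    # Build the baseline environment from a Dockerfile (contents are illustrative).
    # In practice, wait for the blueprint build to complete before referencing it.
    return await client.blueprints.create(
        name="flexbox-bug-baseline",
        dockerfile="FROM node:20\nWORKDIR /app\n",
    )

my_blueprint = asyncio.run(build_baseline_blueprint())

# The blueprint can then be referenced when creating a scenario, e.g.
# environment_parameters={"blueprint_id": my_blueprint.id} (field name assumed).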
Instead of using a blueprint, you can also use a snapshot of a devbox to set up the scenario's baseline environment in the exact system state you want. Example:
import os
import asyncio
from runloop_api_client import AsyncRunloop

client = AsyncRunloop(bearer_token=os.environ.get("RUNLOOP_API_KEY"))

async def main():
    # Create a devbox; in practice, set it up in the desired starting state
    # before snapshotting its disk as the scenario's baseline environment.
    devbox = await client.devboxes.create()
    my_snapshot = await client.devboxes.snapshot_disk(
        devbox.id,
        name="div incorrectly centered in flexbox",
    )

    # Use the snapshot as the scenario's baseline environment.
    my_new_scenario = await client.scenarios.create(
        name="My New Scenario",
        input_context={"problem_statement": "Create a UI component"},
        environment_parameters={"snapshot_id": my_snapshot.id},
        scoring_contract={
            "scoring_function_parameters": [{
                "name": "bash_scorer",
                "scorer": {
                    "type": "bash_script_scorer",
                    "bash_script": "echo 0.0",
                },
                "weight": 1.0,
            }]
        },
        reference_output="echo 1.0",  # known good output ("gold patch") for this scenario
    )
    return my_new_scenario

my_new_scenario = asyncio.run(main())
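
Once the scenario exists, it can be run against an agent as many times as you like. The sketch below reuses client and my_new_scenario from above; the start_run and runs.score_and_complete method names are assumptions based on the scenario run lifecycle, so check the SDK reference for the exact calls.
async def run_scenario():
    # Start a run of the scenario; this boots a devbox from the scenario's
    # environment (the snapshot above) for your agent to work in.
    scenario_run = await client.scenarios.start_run(scenario_id=my_new_scenario.id)

    # ... point your agent at scenario_run.devbox_id and let it work ...

    # When the agent is done, score the run against the scoring contract and
    # mark it complete (method name assumed; see the scenario run API reference).
    return await client.scenarios.runs.score_and_complete(scenario_run.id)

scored_run = asyncio.run(run_scenario())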

Understanding Scoring Functions

Scoring functions are standalone scripts that validate whether a scenario was completed successfully. They grade the solution for correctness and assign a score, which Runloop captures and uses to evaluate the overall performance of a benchmark.

Basic Scoring Function Example

A simple scoring function is a bash script that echoes a score between 0.0 and 1.0:
scoring_function_parameters = [{
    "name": "bash_scorer",
    "scorer": {
        "type": "bash_script_scorer",
        "bash_script": "echo 0.0",
    },
    "weight": 1.0,
}]
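In practice, the script typically runs the scenario's tests and maps the result to a score. A hedged sketch for a pytest-based project might look like this; the /app working directory, test path, and all-or-nothing scoring are illustrative assumptions.
scoring_function_parameters = [{
    "name": "pytest_bash_scorer",
    "scorer": {
        "type": "bash_script_scorer",
        # Run the project's tests and echo 1.0 on success, 0.0 on any failure.
        "bash_script": (
            "cd /app && "
            "if python -m pytest -q tests/; then echo 1.0; else echo 0.0; fi"
        ),
    },
    "weight": 1.0,
}]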

Custom Scoring Functions

To make scoring more reusable and flexible, you can define custom scoring functions. These are used to evaluate performance in specific ways, such as running tests or analyzing output logs. Example:
import asyncio

async def create_custom_scenario():
    # Reuses the client and the snapshot from my_new_scenario defined above.
    my_custom_scenario = await client.scenarios.create(
        name="scenario with custom scorer",
        input_context={"problem_statement": "Create a UI component"},
        environment_parameters={"snapshot_id": my_new_scenario.environment_parameters["snapshot_id"]},
        scoring_contract={
            "scoring_function_parameters": [{
                "name": "my-custom-pytest-script",
                "scorer": {
                    "type": "custom_scorer",
                    "custom_scorer_type": "my-custom-pytest-script",
                    "scorer_params": {"relevant_tests": ["foo.test.py", "bar.test.py"]},
                },
                "weight": 1.0,
            }]
        },
    )
    return my_custom_scenario

my_custom_scenario = asyncio.run(create_custom_scenario())
Note that many scenarios will use the same scoring function with different parameters, depending on the test case.
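For example, the same custom scorer could be reused in another scenario with only its parameters changed (the test file name below is illustrative):
# The same custom scorer, pointed at a different set of tests for another scenario.
reused_scoring_function = {
    "name": "my-custom-pytest-script",
    "scorer": {
        "type": "custom_scorer",
        "custom_scorer_type": "my-custom-pytest-script",
        "scorer_params": {"relevant_tests": ["baz.test.py"]},
    },
    "weight": 1.0,
}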

Custom Benchmarks

Once you have your scenarios and scoring functions defined, you can run all of your custom scenarios as a custom benchmark. You’ll need to create the benchmark instance first, then run it. Here’s how:
import asyncio

async def create_benchmark():
    my_benchmark = await client.benchmarks.create(
        name="py bench",
        scenario_ids=[my_new_scenario.id, my_custom_scenario.id]
    )
    return my_benchmark

my_benchmark = asyncio.run(create_benchmark())
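With the benchmark created, you can kick off a run across all of its scenarios. The sketch below assumes the SDK exposes a benchmarks.start_run method that takes the benchmark id; check the benchmark run API reference for the exact call and for how to monitor the resulting scenario runs.
async def run_benchmark():
    # Start a run that executes every scenario in the benchmark
    # (method name assumed; see the benchmark run API reference).
    return await client.benchmarks.start_run(benchmark_id=my_benchmark.id)

my_benchmark_run = asyncio.run(run_benchmark())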
You can update both scenarios and benchmarks at any time, so you can build them up over time. You can also add or remove scenarios from a benchmark as needed.