Overview

Benchmarks are collections of one or more scenarios. A scenario is a single, self-contained test case where an agent is given a problem and is expected to modify a target environment to solve it. Once created, scenarios can be run as many times as you want with different agents, parameters and configurations.

Creating Custom Scenarios

Creating custom scenarios allows users to tailor problem statements and environments to their specific needs. This is useful for testing or training agents under controlled conditions, or for building unique challenges. To define your own scenario:
  1. Build a blueprint from a Dockerfile to set up a baseline environment for the test case represented by the scenario (see the sketch after this list).
  2. Define a scoring function to evaluate the outcome of the scenario. The scoring function must return a score between 0 (fail) and 1 (pass).
  3. Create a problem statement that describes the task the agent must complete.
  4. Configure a reference_output; this is a known good output that the agent must achieve, sometimes referred to as the “gold patch” or “canonical solution”.
  5. Create a scenario using the blueprint, problem statement, environment parameters and scoring function.
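For step 1, a minimal sketch of building a blueprint from a Dockerfile might look like the following; the Dockerfile contents are illustrative, and the blueprint_id environment parameter is an assumption to verify against the scenario API reference.
import os
import asyncio
from runloop_api_client import AsyncRunloop

client = AsyncRunloop(bearer_token=os.environ.get("RUNLOOP_API_KEY"))

async def build_baseline_blueprint():
    # Build the baseline environment from a Dockerfile (contents are illustrative).
    # In practice, wait for the blueprint build to complete before referencing it.
    return await client.blueprints.create(
        name="flexbox-bug-baseline",
        dockerfile="FROM node:20\nWORKDIR /app\n",
    )

my_blueprint = asyncio.run(build_baseline_blueprint())

# The blueprint can then be referenced when creating a scenario, e.g.
# environment_parameters={"blueprint_id": my_blueprint.id} (field name assumed).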
Instead of using a blueprint, you can also use a snapshot of a devbox to set up the scenario's baseline environment in the exact system state you want. Example:
import os
import asyncio
from runloop_api_client import AsyncRunloop

client = AsyncRunloop(bearer_token=os.environ.get("RUNLOOP_API_KEY"))

async def main():
    # Create a devbox; in practice, set it up in the desired starting state
    # before snapshotting its disk as the scenario's baseline environment.
    devbox = await client.devboxes.create()
    my_snapshot = await client.devboxes.snapshot_disk(
        devbox.id,
        name="div incorrectly centered in flexbox",
    )

    # Use the snapshot as the scenario's baseline environment.
    my_new_scenario = await client.scenarios.create(
        name="My New Scenario",
        input_context={"problem_statement": "Create a UI component"},
        environment_parameters={"snapshot_id": my_snapshot.id},
        scoring_contract={
            "scoring_function_parameters": [{
                "name": "bash_scorer",
                "scorer": {
                    "type": "bash_script_scorer",
                    "bash_script": "echo 0.0",
                },
                "weight": 1.0,
            }]
        },
        reference_output="echo 1.0",  # known good output ("gold patch") for this scenario
    )
    return my_new_scenario

my_new_scenario = asyncio.run(main())
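
Once the scenario exists, it can be run against an agent as many times as you like. The sketch below reuses client and my_new_scenario from above; the start_run and runs.score_and_complete method names are assumptions based on the scenario run lifecycle, so check the SDK reference for the exact calls.
async def run_scenario():
    # Start a run of the scenario; this boots a devbox from the scenario's
    # environment (the snapshot above) for your agent to work in.
    scenario_run = await client.scenarios.start_run(scenario_id=my_new_scenario.id)

    # ... point your agent at scenario_run.devbox_id and let it work ...

    # When the agent is done, score the run against the scoring contract and
    # mark it complete (method name assumed; see the scenario run API reference).
    return await client.scenarios.runs.score_and_complete(scenario_run.id)

scored_run = asyncio.run(run_scenario())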

Understanding Scoring Functions

Scoring functions are standalone scripts that validate whether a scenario was completed successfully. They grade the solution for correctness and assign a score, which Runloop captures and uses to evaluate the overall performance of a benchmark.

Basic Scoring Function Example

A simple scoring function is a bash script that echoes a score between 0.0 and 1.0:
scoring_function_parameters = [{
    "name": "bash_scorer",
    "scorer": {
        "type": "bash_script_scorer",
        "bash_script": "echo 0.0",
    },
    "weight": 1.0,
}]
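In practice, the script typically runs the scenario's tests and maps the result to a score. A hedged sketch for a pytest-based project might look like this; the /app working directory, test path, and all-or-nothing scoring are illustrative assumptions.
scoring_function_parameters = [{
    "name": "pytest_bash_scorer",
    "scorer": {
        "type": "bash_script_scorer",
        # Run the project's tests and echo 1.0 on success, 0.0 on any failure.
        "bash_script": (
            "cd /app && "
            "if python -m pytest -q tests/; then echo 1.0; else echo 0.0; fi"
        ),
    },
    "weight": 1.0,
}]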

Custom Scoring Functions

To make scoring more reusable and flexible, you can define custom scoring functions. These are used to evaluate performance in specific ways, such as running tests or analyzing output logs. Example:
import asyncio

async def create_custom_scenario():
    # Reuses the client and the snapshot from my_new_scenario defined above.
    my_custom_scenario = await client.scenarios.create(
        name="scenario with custom scorer",
        input_context={"problem_statement": "Create a UI component"},
        environment_parameters={"snapshot_id": my_new_scenario.environment_parameters["snapshot_id"]},
        scoring_contract={
            "scoring_function_parameters": [{
                "name": "my-custom-pytest-script",
                "scorer": {
                    "type": "custom_scorer",
                    "custom_scorer_type": "my-custom-pytest-script",
                    "scorer_params": {"relevant_tests": ["foo.test.py", "bar.test.py"]},
                },
                "weight": 1.0,
            }]
        },
    )
    return my_custom_scenario

my_custom_scenario = asyncio.run(create_custom_scenario())
Note that many scenarios will use the same scoring function with different parameters, depending on the test case.
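For example, the same custom scorer could be reused in another scenario with only its parameters changed (the test file name below is illustrative):
# The same custom scorer, pointed at a different set of tests for another scenario.
reused_scoring_function = {
    "name": "my-custom-pytest-script",
    "scorer": {
        "type": "custom_scorer",
        "custom_scorer_type": "my-custom-pytest-script",
        "scorer_params": {"relevant_tests": ["baz.test.py"]},
    },
    "weight": 1.0,
}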

Custom Benchmarks

Once you have your scenarios and scoring functions defined, you can run all of your custom scenarios as a custom benchmark. You’ll need to create the benchmark instance first, then run it. Here’s how:
import asyncio

async def create_benchmark():
    my_benchmark = await client.benchmarks.create(
        name="py bench",
        scenario_ids=[my_new_scenario.id, my_custom_scenario.id]
    )
    return my_benchmark

my_benchmark = asyncio.run(create_benchmark())
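With the benchmark created, you can kick off a run across all of its scenarios. The sketch below assumes the SDK exposes a benchmarks.start_run method that takes the benchmark id; check the benchmark run API reference for the exact call and for how to monitor the resulting scenario runs.
async def run_benchmark():
    # Start a run that executes every scenario in the benchmark
    # (method name assumed; see the benchmark run API reference).
    return await client.benchmarks.start_run(benchmark_id=my_benchmark.id)

my_benchmark_run = asyncio.run(run_benchmark())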
You can update both scenarios and benchmarks at any time, so you can build them up over time. You can also add or remove scenarios from a benchmark as needed.