Creating Custom Scenarios

Creating custom scenarios lets you tailor problem statements and environments to your specific needs. This is useful for testing agents under controlled conditions or for building unique challenges.

To define your own scenario:

  1. Create a development environment (devbox).
  2. Take a snapshot of the environment at a key point in time.
  3. Define a problem statement for the scenario.
  4. Attach scoring functions to measure performance.

Example:

// Create a devbox from a blueprint and snapshot it at the point where the
// problem exists. The snapshot becomes the scenario's starting environment.
const devbox = await runloop.devboxes.create({ blueprint_name: 'bpt_123' });
const mySnapshot = await runloop.devboxes.snapshotDisk(devbox.id, {
  name: 'div incorrectly centered in flexbox',
});

const myNewScenario = await runloop.scenarios.create({
  name: 'My New Scenario',
  input_context: { problem_statement: 'Create a UI component' },
  // Each run of this scenario starts from the snapshot taken above.
  environment_parameters: { snapshot_id: mySnapshot.id },
  scoring_contract: {
    scoring_function_parameters: [
      {
        name: 'bash_scorer',
        scorer: {
          type: 'bash_script_scorer',
          bash_script: 'some script that writes files and validates output',
        },
        weight: 1.0,
      },
    ],
  },
});

Understanding Scoring Functions

Scoring functions validate whether a scenario was successfully completed. These functions help ensure solutions are correct, provide feedback, and assign a score for evaluation.

Basic Scoring Function Example

A simple scoring function is a bash script that echoes a score between 0 and 1:

scoring_function_parameters: [
  {
    name: 'bash_scorer',
    scorer: {
      type: 'bash_script_scorer',
      bash_script: 'echo 1.0',
    },
    weight: 1.0,
  },
],
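In practice, the script usually computes the score from the state of the devbox instead of hard-coding it. The fragment below is an illustrative sketch (the project path and pytest-based test suite are hypothetical): it runs the tests inside the devbox and echoes the fraction of passing tests as the score.

scoring_function_parameters: [
  {
    name: 'pytest_pass_rate',
    scorer: {
      type: 'bash_script_scorer',
      // Run the (hypothetical) test suite and echo the fraction of passing
      // tests as a number between 0 and 1.
      bash_script: `
cd /home/user/project
python -m pytest --tb=no -q > pytest.out 2>&1 || true
passed=$(grep -oE '[0-9]+ passed' pytest.out | grep -oE '[0-9]+' || echo 0)
failed=$(grep -oE '[0-9]+ failed' pytest.out | grep -oE '[0-9]+' || echo 0)
total=$((passed + failed))
if [ "$total" -eq 0 ]; then echo 0; else awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.2f", p/t }'; fi
`,
    },
    weight: 1.0,
  },
],

Because the script runs inside the scenario's environment, it can inspect any files or output the agent produced.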

Custom Scoring Functions

To make scoring more reusable and flexible, you can define custom scoring functions. These are used to evaluate performance in specific ways, such as running tests or analyzing output logs.

Example:

const myCustomScenario = await runloop.scenarios.create({
  name: 'scenario with custom scorer',
  input_context: { problem_statement: 'Create a UI component' },
  environment_parameters: { snapshot_id: mySnapshot.id },
  scoring_contract: {
    scoring_function_parameters: [
      {
        name: 'my-custom-pytest-script',
        scorer: {
          type: 'custom_scorer',
          custom_scorer_type: 'my-custom-pytest-script',
          scorer_params: { relevant_tests: ['foo.test.py', 'bar.test.py'] },
        },
        weight: 1.0,
      },
    ],
  },
});
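The custom_scorer_type above refers to a scorer that has already been registered with your account. As a rough sketch, registration might look like the following, assuming the SDK exposes it under runloop.scenarios.scorers.create (the method and field names are assumptions; confirm them against the API reference):

// Hypothetical sketch: register a reusable scorer under the type name that
// scenarios reference via custom_scorer_type. Method and field names are assumptions.
const myScorer = await runloop.scenarios.scorers.create({
  type: 'my-custom-pytest-script',
  // The scenario's scorer_params (e.g. relevant_tests) are made available to
  // this script, which should echo a score between 0 and 1.
  bash_script: 'script that uses the provided scorer_params and echoes a score between 0 and 1',
});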

Custom Benchmarks

Once you have your scenarios and scoring functions defined, you can run all of your custom scenarios as a custom benchmark.

You’ll need to create the benchmark instance first, then run it. Here’s how:

const myBenchmark = await runloop.benchmarks.create({
  name: 'py bench',
  scenarios: [myNewScenario.id, myCustomScenario.id],
});
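Once the benchmark exists, you can start a run against it. The sketch below assumes the SDK exposes this as benchmarks.startRun with a benchmark_id parameter; confirm the exact method and field names against the API reference.

// Hypothetical sketch: start a run of the benchmark created above.
// The startRun method name and benchmark_id field are assumptions.
const myBenchmarkRun = await runloop.benchmarks.startRun({
  benchmark_id: myBenchmark.id,
});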

You can update both scenarios and benchmarks at any time, so you can build them up over time.
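For example, appending a newly created scenario to an existing benchmark might look like the sketch below, assuming an update method on the benchmarks resource (the method name, its (id, body) signature, and anotherScenario are illustrative; check the API reference):

// Hypothetical sketch: replace the benchmark's scenario list with an expanded one.
// The update method and its signature are assumptions; anotherScenario is illustrative.
await runloop.benchmarks.update(myBenchmark.id, {
  scenarios: [myNewScenario.id, myCustomScenario.id, anotherScenario.id],
});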
