Documentation Index
Fetch the complete documentation index at: https://docs.runloop.ai/llms.txt
Use this file to discover all available pages before exploring further.
Overview
Custom Benchmarks are collections of Scenarios that can be run together to produce an overall performance score. Each Scenario is a single, self-contained test case or task in which an agent is given a problem and is expected to modify a target environment to solve it. The Scenarios you use in a Custom Benchmark can be ones you create yourself or ones sourced from Public Benchmarks. Once created, a Scenario can be run as many times as you want with different agents, parameters, and configurations. If you want to use benchmark runs and scores as part of a reinforcement learning workflow, see Training Using Benchmarks.
Creating Custom Scenarios
Creating custom scenarios allows you to tailor problem statements and environments to your specific needs. This is useful for testing or training agents under controlled conditions, or for building unique challenges. To define your own scenario:
- Create a Devbox image for running your scenario, either by building a Blueprint (e.g., from a Dockerfile) or by snapshotting an existing Devbox.
- Define a scoring function to evaluate the outcome of the scenario. The scoring function must return a score between 0 (fail) and 1 (pass).
- Create a problem statement that describes the task the agent must complete.
- Configure a reference_output: a known-good output that the agent must achieve, sometimes referred to as the “gold patch” or “canonical solution”.
- Create a scenario using the blueprint, problem statement, environment parameters, and scoring function.
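Putting those pieces together, a scenario definition might look roughly like the following sketch. The field names here are illustrative of the shape of a scenario, not a guaranteed match for the exact API schema; consult the API reference for the authoritative fields:

```json
{
  "name": "fix-failing-unit-test",
  "input_context": {
    "problem_statement": "The repository's test suite fails; fix the bug so all tests pass."
  },
  "environment_parameters": {
    "blueprint_id": "bpt_example"
  },
  "scoring_contract": {
    "scoring_function_parameters": [
      { "name": "tests_pass", "weight": 1.0 }
    ]
  },
  "reference_output": "diff --git a/src/app.py b/src/app.py"
}
```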
Understanding Scoring Functions
Scoring functions are standalone scripts that validate whether a scenario was successfully completed. These functions grade solutions for correctness and assign a score for evaluation. The score is captured by Runloop and used to evaluate the overall performance of a benchmark.
Basic Scoring Function Example
A simple scoring function is a bash script that echoes a score between 0 (failure) and 1 (success):
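For instance, a minimal scorer might check for an expected artifact and echo the result. This is a sketch: the file name `solution.txt` is an illustrative success condition, and your scenario defines its own.

```shell
#!/bin/bash
# Minimal scoring function: echo 1 (pass) if the expected artifact
# exists, else 0 (fail). "solution.txt" is an illustrative condition.
score() {
  if [ -f "$1" ]; then
    echo 1
  else
    echo 0
  fi
}

score solution.txt
```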
Custom Scoring Functions
To make scoring more reusable and flexible, you can define custom scoring functions. These evaluate performance in specific ways, such as running a test suite or analyzing output logs.
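As one sketch of a log-analyzing scorer, the function below awards partial credit as the fraction of passing tests. It assumes a test harness wrote one line per test, `PASS` or `FAIL`, to a results file; that log format is an assumption for illustration, not a Runloop convention.

```shell
#!/bin/bash
# Custom scorer: award partial credit as the fraction of passing tests.
# Assumes one "PASS" or "FAIL" line per test in the given log file
# (an illustrative format, not a Runloop convention).
score_from_log() {
  local log="$1"
  local total passed
  total=$(wc -l < "$log")
  passed=$(grep -c '^PASS' "$log")
  if [ "$total" -eq 0 ]; then
    echo 0
    return
  fi
  # Emit a score between 0 and 1 with two decimal places.
  awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.2f\n", p / t }'
}
```

A run with three passing tests out of four would score 0.75, giving the agent partial credit rather than an all-or-nothing result.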
Custom Benchmarks
Once you have your scenarios and scoring functions defined, you can run all of your custom scenarios together as a custom benchmark. You’ll need to create the benchmark instance first, then run it.
Running Custom Benchmarks
Once your benchmark is created, you can run it using either orchestrated or interactive mode:
- Orchestrated (Recommended)
- Interactive
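A benchmark instance essentially ties a name to the set of scenarios it should run. As a rough sketch, a creation payload might look like the following; the field names are illustrative, and the scenario IDs are placeholders, so check the API reference for the exact schema:

```json
{
  "name": "my-custom-benchmark",
  "scenario_ids": ["scn_example_1", "scn_example_2"]
}
```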
Next Steps
- Training Using Benchmarks: See how scenarios, benchmarks, and scorers fit into a high-level RL workflow
- Custom Scorers: Build domain-specific scoring functions
- Creating Scenarios: Deep dive into scenario configuration
- Orchestrated Benchmarks: Run benchmarks at cloud scale
