Build Custom Agent Benchmarks with Runloop
Learn how to create and run custom benchmarks.
Creating Custom Scenarios
Custom scenarios let you tailor problem statements and environments to your needs. They are useful for testing agents under controlled conditions or for building unique challenges.
To define your own scenario:
- Create a development environment (devbox).
- Take a snapshot of the environment at a key point in time.
- Define a problem statement for the scenario.
- Attach scoring functions to measure performance.
Example:
Understanding Scoring Functions
Scoring functions validate whether a scenario was successfully completed. These functions help ensure solutions are correct, provide feedback, and assign a score for evaluation.
Basic Scoring Function Example
A simple scoring function is a bash script that echoes a score between 0 and 1:
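As a minimal sketch of such a script: the check below is hypothetical (the file path, the expected text, and the `score_solution` helper name are illustrative assumptions, not part of Runloop's API). The only behavior taken from the description above is that the script echoes a score between 0 and 1.

```shell
#!/usr/bin/env bash
# Hypothetical check: treat the scenario as solved if the agent produced a
# file that contains the expected text. Both arguments are illustrative.
score_solution() {
  local target_file="$1"
  local expected_text="$2"
  if [ -f "$target_file" ] && grep -qF -- "$expected_text" "$target_file"; then
    echo "1.0"   # full credit
  else
    echo "0.0"   # no credit
  fi
}

# Echo the score. The defaults below are placeholders for a real scenario.
score_solution "${TARGET_FILE:-/tmp/solution.txt}" "${EXPECTED_TEXT:-hello}"
```

Keeping the check in a small function makes it easy to swap in a different condition while the script still prints a single score.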
Custom Scoring Functions
To make scoring more reusable and flexible, you can define custom scoring functions. These are used to evaluate performance in specific ways, such as running tests or analyzing output logs.
Example:
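One way to sketch a reusable scorer, assuming it is still a bash script that echoes a score between 0 and 1: run a list of checks and report the fraction that pass. The `score_by_pass_rate` name and the idea of passing checks as command strings are assumptions made for illustration; how a custom scorer is registered with the platform is not covered here.

```shell
#!/usr/bin/env bash
# Hypothetical reusable scorer: runs every command string given as an
# argument and prints the fraction that exit with status 0.
score_by_pass_rate() {
  local total=0
  local passed=0
  local check
  for check in "$@"; do
    total=$((total + 1))
    # Run each check in its own shell and discard its output.
    if sh -c "$check" >/dev/null 2>&1; then
      passed=$((passed + 1))
    fi
  done
  # With no checks there is nothing to credit, so report zero.
  if [ "$total" -eq 0 ]; then
    echo "0.00"
    return 0
  fi
  # Fixed C locale keeps the decimal separator a dot.
  LC_ALL=C awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.2f\n", p / t }'
}
```

A call such as `score_by_pass_rate "test -f README.md" "grep -q main app.py"` (both checks are placeholders) prints the share of checks that succeeded, which gives partial credit instead of an all-or-nothing result.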
Custom Benchmarks
Once you have your scenarios and scoring functions defined, you can run all of your custom scenarios as a custom benchmark.
You’ll need to create the benchmark instance first, then run it. Here’s how:
You can update both code scenarios and benchmarks at any time, so you can build them up over time.