Overview
Benchmarks are collections of one or more scenarios. A scenario is a single, self-contained test case where an agent is given a problem and is expected to modify a target environment to solve it. Once created, scenarios can be run as many times as you want with different agents, parameters and configurations.

Creating Custom Scenarios
Creating custom scenarios allows users to tailor problem statements and environments to their needs. This is useful for testing or training agents under controlled conditions, or for building unique challenges. To define your own scenario:
- Build a blueprint from a Dockerfile to set up a baseline environment for the test case represented by the scenario.
- Define a scoring function to evaluate the outcome of the scenario. The scoring function must return a score between 0 (fail) and 1 (pass).
- Create a problem statement that describes the task the agent must complete.
- Configure a reference_output: a known good output that the agent must achieve, sometimes referred to as the “gold patch” or “canonical solution”.
- Create a scenario using the blueprint, problem statement, environment parameters and scoring function.
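
As a rough illustration of how these pieces fit together, the sketch below groups them in a plain Python structure and shows one possible scoring function that stays within the required 0 to 1 range. The names ScenarioSpec and line_match_score are hypothetical and are not part of any SDK; the line-matching rule is only an assumed example of a scoring strategy, not the platform's own scorer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScenarioSpec:
    """Hypothetical container for the pieces of a scenario (not a real SDK type)."""
    blueprint: str                       # name of the blueprint built from a Dockerfile
    problem_statement: str               # task description given to the agent
    reference_output: str                # known good output ("gold patch")
    scorer: Callable[[str, str], float]  # (agent_output, reference_output) -> score in [0, 1]


def line_match_score(agent_output: str, reference_output: str) -> float:
    """Return the fraction of non-empty reference lines found in the agent output.

    The result is always between 0.0 (fail) and 1.0 (pass).
    """
    expected = [line.strip() for line in reference_output.splitlines() if line.strip()]
    if not expected:
        # Nothing to compare against, so treat the run as a failure.
        return 0.0
    produced = {line.strip() for line in agent_output.splitlines()}
    matched = sum(1 for line in expected if line in produced)
    return matched / len(expected)


# Example: assemble a spec and score a partially correct agent output.
spec = ScenarioSpec(
    blueprint="example-blueprint",  # hypothetical blueprint name
    problem_statement="Make the config file list both required hosts.",
    reference_output="host=a\nhost=b",
    scorer=line_match_score,
)
score = spec.scorer("host=a", spec.reference_output)
```

Returning a fraction rather than a strict 0 or 1 gives partial credit; a scorer that only distinguishes pass from fail could instead return 1.0 when every reference line matches and 0.0 otherwise.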