Runloop Public Benchmarks make it simple to validate your coding agent against the most popular open-source coding evaluation datasets. Each Benchmark contains a set of Scenarios, one per test in the dataset. A Scenario contains the problem statement your agent must work through, a pre-built environment containing all of the context needed to complete the job, and a built-in scoring contract to properly evaluate the result for correctness.
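Conceptually, a Scenario bundles those three pieces. The sketch below is illustrative only; apart from context.problemStatement, which the examples on this page use, the field names are assumptions rather than the SDK's actual types.

```typescript
// Illustrative shape only, not the SDK's real type definitions.
interface ScenarioSketch {
  id: string;
  context: {
    problemStatement: string; // the task your agent must work through
  };
  environment: unknown; // pre-built Devbox setup with all required context (name assumed)
  scoring: unknown; // built-in scoring contract used to grade the result (name assumed)
}
```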
We’re constantly adding support for new datasets. To see the up-to-date list of supported public Benchmarks, use the following API call:
```typescript
// Query the latest list of supported public benchmarks
// (princeton-nlp/SWE-bench_Lite, etc.)
const { benchmarks } = await runloop.benchmarks.list_public();
```
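If you’re after a specific dataset, you can pick it out of the returned list. A small sketch; the exact field to match on (name below) is an assumption, so inspect the returned benchmark objects for the actual shape.

```typescript
// Find a specific dataset by name (the `name` field is assumed here; check
// the shape of the benchmark objects returned by your SDK version).
const sweBenchLite = benchmarks.find(
  (b) => b.name === 'princeton-nlp/SWE-bench_Lite'
);
```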
Are we missing your favorite open source benchmark? Let us know at support@runloop.ai
Each Benchmark contains a set of Scenarios, each corresponding to a test case in the evaluation dataset.
```typescript
// The Benchmark definition contains a list of all scenarios in the benchmark
console.log(benchmarks[0].scenarioIds);
```
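For the single-scenario walkthrough that follows, grab one scenario ID from that list (plain array indexing, nothing SDK-specific).

```typescript
// Take the first scenario of the first benchmark for the walkthrough below
const scenarioId = benchmarks[0].scenarioIds[0];
```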
Here’s an example of how to run a single scenario from a public benchmark against your own agent. First, create a scenario run to track the status and results of this attempt. When starting a run, Runloop creates a Devbox with the environment specified by the test requirements; wait for that Devbox to become ready before running your agent:
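A minimal sketch of this step, reusing the startRunAndAwaitEnvReady call from the full-benchmark example later on this page; omitting benchmark_run_id for a standalone run is an assumption.

```typescript
// Start the scenario run; this provisions the Devbox and waits until the
// environment is ready (same call as in the full-benchmark example below).
const scenarioRun = await runloop.scenarios.startRunAndAwaitEnvReady({
  scenario_id: scenarioId, // e.g. benchmarks[0].scenarioIds[0]
});
const devboxId = scenarioRun.devbox_id;
```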
Now, run your agent. How and where your agent runs is up to you. Here’s an example of an agent that leverages the Runloop Devbox that was just created:
```typescript
const myAgent = new MyAgent({
  prompt: scenarioRun.scenario.context.problemStatement,
  tools: [runloop.devboxes.shellTools(devboxId)],
});
```
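MyAgent stands in for your own agent implementation; the run() call below is hypothetical and represents however your agent actually executes against the Devbox.

```typescript
// Hypothetical entry point; replace with your agent's real execution loop.
await myAgent.run();
```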
Finally, run the scoring function to validate the agent’s performance:
```typescript
// Run the scoring function. This automatically marks the scenario run as done.
const validateResults = await runloop.scenarioRuns.scoreAndAwait(
  scenarioRun.id
);
console.log(validateResults);
```
Perform a full benchmark run of a public benchmark
Once your agent is excelling at an individual scenario, you will want to test it against all Scenarios in a given Benchmark. Here’s an example of how to perform a full benchmark run of a public benchmark.
```typescript
// Start a full run of the first public benchmark returned
let benchmarkRun = await runloop.benchmarks.startRun({
  benchmark_id: benchmarks[0].id,
  run_name: 'optional run name',
});

// This runs the scenarios one at a time, but you can use any level of parallelism
for (const scenarioId of benchmarkRun.pending_scenarios) {
  // Create a scenario run tied to the benchmark run
  const scenarioRun = await runloop.scenarios.startRunAndAwaitEnvReady({
    scenario_id: scenarioId,
    benchmark_run_id: benchmarkRun.id,
  });
  const devboxId = scenarioRun.devbox_id;

  // Run your agent on the problem at hand to see how it does
  const myAgent = new MyAgent({
    prompt: scenarioRun.scenario.context.problemStatement,
    tools: [runloop.devboxes.shellTools(devboxId)],
  });

  // Score and complete the run. This also properly shuts down the Devbox environment.
  const validateResults = await runloop.scenarios.runs.scoreAndComplete(
    scenarioRun.id
  );
}

// Benchmark runs end automatically once there are no more pending scenarios,
// but you can also end a benchmark run early:
await runloop.benchmarks.runs.complete(benchmarkRun.id);
```
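As the comment in the example notes, scenario runs are independent, so you can also fan them out in parallel. Here is a sketch using Promise.all that reuses only the calls shown above; runOneScenario is a local helper, not an SDK method, and the agent-execution step is left as a comment.

```typescript
// Parallel variant: run every pending scenario concurrently.
async function runOneScenario(scenarioId: string, benchmarkRunId: string) {
  const scenarioRun = await runloop.scenarios.startRunAndAwaitEnvReady({
    scenario_id: scenarioId,
    benchmark_run_id: benchmarkRunId,
  });
  const myAgent = new MyAgent({
    prompt: scenarioRun.scenario.context.problemStatement,
    tools: [runloop.devboxes.shellTools(scenarioRun.devbox_id)],
  });
  // ...run your agent here, then score and complete the run
  return runloop.scenarios.runs.scoreAndComplete(scenarioRun.id);
}

await Promise.all(
  benchmarkRun.pending_scenarios.map((id) => runOneScenario(id, benchmarkRun.id))
);
```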
Public Benchmarks make it fast and easy to start evaluating your agent against industry-standard coding evaluations. When you’re ready to expand or customize Benchmarks to meet your specific needs, move on to creating Custom Benchmarks.