Public Benchmarks

Runloop Public Benchmarks make it simple to validate your coding agent against the most popular, open source coding evaluation datasets.

Each Benchmark contains a set of Scenarios, one for each test case in the dataset. A Scenario contains the problem statement your agent must work through, a pre-built environment with all of the context needed to complete the job, and a built-in scoring contract that evaluates the result for correctness.

Viewing Public Benchmarks

We’re constantly adding support for new datasets. To retrieve the up-to-date list of supported public Benchmarks, use the following API call:

// Query to see the latest list of supported public benchmarks
// princeton-nlp/SWE-bench_Lite, etc
const { benchmarks } = await runloop.benchmarks.listPublic();

Are we missing your favorite open source benchmark? Let us know at support@runloop.ai

Each Benchmark contains a set of Scenarios, each of which corresponds to a test case in the evaluation dataset.

// The Benchmark definition contains a list of all scenarios contained in the benchmark
console.log(benchmarks[0].scenarioIds);
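
If you want to inspect an individual Scenario before running it, you can fetch it by id. The minimal sketch below assumes the SDK exposes a scenarios.retrieve call and mirrors the context.problemStatement field used later in this guide; verify the exact names against your SDK version.

// Sketch (assumes scenarios.retrieve and the field names shown below)
const firstScenario = await runloop.scenarios.retrieve(benchmarks[0].scenarioIds[0]);
console.log(firstScenario.context.problemStatement);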

Running Scenarios & Benchmarks

Each Scenario can be run to evaluate an AI agent’s performance. Running a scenario involves:

  1. Initiating a scenario run.
  2. Launching a development environment (devbox).
  3. Running the agent against the problem statement.
  4. Scoring the results.
  5. Uploading traces for analysis.

Run a single scenario from a public benchmark

Here’s an example of how to run a single scenario from a public benchmark against your own agent.

First, create a scenario run to track the status and results of this run:

const scenarioId = benchmarks[0].scenarioIds[0];
const scenarioRun = await runloop.scenarios.startRun({
    scenario_id: scenarioId,
    run_name: 'marshmallow-code__marshmallow-1359 test run'
});

When starting a run, Runloop will create a Devbox with the environment specified by the test requirements.

Wait for the devbox used by the scenario to become ready:

const devboxId = scenarioRun.devbox_id;
await runloop.devboxes.awaitRunning(devboxId);

Now, run your agent. How and where your agent runs is up to you. Here’s an example of an agent that leverages the Devbox created for this run:

const myAgent = new MyAgent({
    prompt: scenarioRun.scenario.context.problemStatement,
    tools: [runloop.devboxes.shellTools(devboxId)],
});
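
MyAgent above is a placeholder for your own agent implementation, not part of the Runloop SDK. As a minimal sketch, assuming your agent exposes a run() method (purely an assumption for illustration), invoking it might look like:

// Hypothetical: run() is a method on your own agent, not a Runloop API
await myAgent.run();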

Finally, run the scoring function to validate the agent’s performance:

// Run the scoring function. This automatically marks the scenario run as done.
const validateResults = await runloop.scenarioRuns.scoreAndAwait(
    scenarioRun.id
);
console.log(validateResults);
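
Whether scoreAndAwait also tears down the Devbox may depend on your SDK version, so it's worth verifying. If you need to clean up the environment yourself, a minimal sketch using the SDK's devbox shutdown call (an assumption worth confirming in your version) looks like:

// Optional: explicitly shut down the Devbox once you're done with it
await runloop.devboxes.shutdown(devboxId);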

Perform a full benchmark run of a public benchmark

Once your agent performs well on an individual Scenario, you’ll want to test it against all Scenarios in a given Benchmark.

Here’s an example of how to perform a full benchmark run of a public benchmark.

// Start a full run of the first public benchmark returned
let benchmarkRun = await runloop.benchmarks.startRun({
    benchmark_id: benchmarks[0].id,
    run_name: 'optional run name'
});

// This example runs scenarios serially, one at a time; you can also run them
// at any level of parallelism.
for (const scenarioId of benchmarkRun.pending_scenarios) {
    // Create a scenario run tied to the benchmark run
    const scenarioRun = await runloop.scenarios.startRunAndAwaitEnvReady({
        scenario_id: scenarioId,
        benchmark_run_id: benchmarkRun.id
    });

    const devboxId = scenarioRun.devbox_id;

    // Run your agent on the problem at hand to see how it does
    const myAgent = new MyAgent({
        prompt: scenarioRun.scenario.context.problemStatement,
        tools: [runloop.devboxes.shellTools(devboxId)],
    });

    // Score and complete the run. This also properly shuts down the Devbox environment.
    const validateResults = await runloop.scenarios.runs.scoreAndComplete(
        scenarioRun.id
    );
    console.log(validateResults);
}

// Benchmark runs end automatically once there are no more pending scenarios,
// but you can also end a benchmark run early:
await runloop.benchmarks.runs.complete(benchmarkRun.id);
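
If you want to run Scenarios concurrently instead of one at a time, a minimal sketch using the same SDK calls as above might look like the following (the concurrency handling itself is plain Promise.all, not a Runloop feature, and MyAgent remains a placeholder for your own agent):

// Kick off every pending scenario concurrently and wait for all of them to finish
await Promise.all(
    benchmarkRun.pending_scenarios.map(async scenarioId => {
        const scenarioRun = await runloop.scenarios.startRunAndAwaitEnvReady({
            scenario_id: scenarioId,
            benchmark_run_id: benchmarkRun.id
        });

        // Run your agent against the problem statement
        const myAgent = new MyAgent({
            prompt: scenarioRun.scenario.context.problemStatement,
            tools: [runloop.devboxes.shellTools(scenarioRun.devbox_id)],
        });

        // Score and complete this scenario run
        return runloop.scenarios.runs.scoreAndComplete(scenarioRun.id);
    })
);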

Public Benchmarks make it fast and easy to start evaluating your agent against industry-standard coding evaluations. When you’re ready to expand or customize Benchmarks to meet your specific needs, move on to creating Custom Benchmarks.
