Overview of Benchmarking on Runloop
Make your agent better and more reliable with Runloop’s tools for benchmarking.
Your AI coding agent can perform numerous tasks, such as reading code, preparing patches, and submitting commits to code repositories.
A common problem with such agents is ensuring that they perform reliably: without monitoring, tuning, and optimization, your agent may make mistakes, regress over time, and generally fail to deliver the best user experience.
Runloop Benchmarking is a suite of tools to help you address these issues and stay focused on building the best possible agent.
Main Features
Runloop Benchmarking includes several tools to save you time while optimizing your agent:
- Run Public Benchmarks: Easily run your agent against a matrix of well-known and open source benchmarks, such as SWE-bench.
- Run Custom Benchmarks: Write custom scoring functions for each of your agent’s tasks, then evaluate the agent’s performance against them.
- Reports & Insights: As you run benchmarks over time, you will see how your agent’s performance changes in the Runloop dashboard.
Key Concepts
Whether you’re using public or custom benchmarks, you’ll want to keep the following key concepts in mind:
- Code Scenario: A single test case where an agent is given a problem and is expected to modify a target environment to solve it. Scenarios help test AI agents in realistic coding environments.
- Scoring Function: A script or function that runs after the agent completes its task to validate whether the solution works. These functions generate a final score between 0 and 1 to indicate performance.
- Benchmark: A collection of Code Scenarios designed to evaluate AI agents on a broader set of tasks. Benchmarks help measure agent capabilities systematically.
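To make the Scoring Function concept concrete, here is a minimal sketch of what one might look like. The function name, its input shape, and the check names are illustrative assumptions, not part of the Runloop API; the only requirement the docs describe is that the function validates the agent's solution and returns a score between 0 and 1.

```python
def score_scenario(checks: dict[str, bool]) -> float:
    """Hypothetical scoring function for a Code Scenario.

    `checks` maps each validation check run against the agent's
    solution (e.g. did the code compile, did the tests pass) to its
    pass/fail result. The returned score is the fraction of checks
    that passed, so it always falls between 0 and 1.
    """
    if not checks:
        # No checks ran: treat the scenario as unsolved.
        return 0.0
    return sum(checks.values()) / len(checks)


# Example: the agent's patch compiles but the test suite still fails.
score = score_scenario({"compiles": True, "tests_pass": False})
print(score)  # 0.5
```

A binary variant (1.0 if every check passes, 0.0 otherwise) works the same way; partial credit like the fraction above simply gives finer-grained signal when you compare benchmark runs over time.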
Next, learn how to run public benchmarks.