Orchestrated benchmarks are the recommended way to run benchmarks on Runloop.
For fine-grained control over individual scenario runs, see Interactive
Benchmarks.
Prerequisites
Before running orchestrated benchmarks, you need:
- Runloop CLI installed: Install via npm, yarn, or pnpm.
- API key configured: Set your Runloop API key.
- Agent configuration: Orchestrated benchmarks work with any agent that can run on a Runloop devbox. You have two options:
  - Bring your own agent: Deploy your own agent to run on Runloop devboxes. This is the most common approach for teams developing proprietary agents. Contact us at support@runloop.ai for help setting up your custom agent.
  - Use a supported public agent: Run benchmarks with popular open-source agents. Set up the required API keys as environment variables on your local machine, and the CLI will automatically create secrets:
| Agent | Required Environment Variables |
|---|---|
| claude-code | ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN |
| codex | OPENAI_API_KEY |
| opencode | ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY |
| goose | ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY |
| gemini-cli | GEMINI_API_KEY or GOOGLE_API_KEY |
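For example, the keys can be exported in your shell before invoking the CLI. The RUNLOOP_API_KEY variable name and the placeholder values below are illustrative; export whichever agent keys the table above requires:

```shell
# Runloop API key used by the CLI (placeholder value)
export RUNLOOP_API_KEY="<your-runloop-api-key>"

# Agent credentials, e.g. for claude-code (placeholder value)
export ANTHROPIC_API_KEY="<your-anthropic-api-key>"
```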
Quick Start
Run a benchmark with a single command, which:
- Creates a benchmark job with the specified agent and benchmark
- Provisions a devbox for each scenario in the benchmark
- Runs the agent on each scenario in parallel (by default, 10 scenarios are executed concurrently)
- Scores the results automatically
- Collects and aggregates all results into the UI
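A minimal sketch of that single command, using the claude-code agent and a benchmark name from the list below (both values are illustrative; a model can also be appended in agent:model format):

```shell
rli benchmark-job run \
  --agent claude-code \
  --benchmark "SWE-bench Verified" \
  -n my-first-benchmark-job
```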
Running Benchmark Jobs
Basic Usage
Run a single agent against a benchmark.

Comparing Multiple Agents
Compare multiple agents side-by-side by specifying multiple --agent flags:
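For instance, to compare two of the supported agents in one job (agent names from the table below; the benchmark name and job name are illustrative):

```shell
rli benchmark-job run \
  --agent claude-code \
  --agent codex \
  --benchmark "SWE-bench Verified" \
  -n claude-vs-codex
```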
Running Specific Scenarios
Instead of a full benchmark, you can run specific scenarios by ID.

Controlling Parallelism
By default, benchmark jobs run 10 scenarios concurrently. Increase parallelism for faster execution.

Setting Timeouts
Configure agent timeout (in seconds) for long-running scenarios.

Passing Environment Variables
Pass additional environment variables to the agent.

Using Secrets
Reference Runloop secrets for sensitive values.

Monitoring Jobs
Watch Live Progress
Monitor a running job with a full-screen progress display.

List Jobs
View recent benchmark jobs.

Viewing Results
Summary Report
Get a summary of results after a job completes.

Extended Results
View individual scenario results with the -e flag:
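For example (the job ID placeholder is illustrative; the command presumably targets a specific job):

```shell
rli benchmark-job summary <job-id> -e
```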
Output Formats
Export results as JSON or YAML for programmatic processing.

Downloading Logs
Download devbox logs for debugging.

Supported Agents
Orchestrated benchmarks support the following agents:

| Agent | Description |
|---|---|
| claude-code | Anthropic’s Claude Code agent |
| codex | OpenAI’s Codex agent |
| opencode | Open-source coding agent |
| goose | Block’s Goose agent |
| gemini-cli | Google’s Gemini CLI agent |
Specify an agent and, optionally, a model using the agent:model format:
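For example, pinning claude-code to a specific model (the model identifier and benchmark name are illustrative):

```shell
rli benchmark-job run \
  --agent claude-code:claude-sonnet-4-5 \
  --benchmark "SWE-bench Verified"
```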
Supported Benchmarks
Orchestrated benchmark jobs work with any benchmark available on Runloop, including:
- SWE-bench Verified
- Laude Institute/Terminal-Bench-2.0
- ScaleAI/SWE-Bench Pro
- AIME
- ARC-AGI-2
- bigcodebench
- BigCodeBench-Hard (instruct)
- BigCodeBench-Hard (Complete)
- ReplicationBench
- GPQA Diamond
- Aider/Polyglot
Command Reference
rli benchmark-job run
Create and run a benchmark job.
| Option | Description |
|---|---|
--agent <agent:model> | Agent to run. Format: agent:model. Can specify multiple. |
--benchmark <id-or-name> | Benchmark ID or name to run |
--scenarios <ids...> | Scenario IDs to run (alternative to --benchmark) |
-n, --job-name <name> | Name for this job |
--env-vars <vars...> | Environment variables (format: KEY=value) |
--secrets <secrets...> | Secrets to inject (format: ENV_VAR=SECRET_NAME) |
--timeout <seconds> | Agent timeout in seconds (default: 7200) |
--n-attempts <n> | Number of attempts per scenario (default: 1) |
--n-concurrent-trials <n> | Number of concurrent trials (default: 10) |
--timeout-multiplier <n> | Timeout multiplier (default: 1.0) |
-o, --output <format> | Output format: text, json, yaml |
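Putting several of these options together in one invocation (all IDs, names, and values below are illustrative):

```shell
rli benchmark-job run \
  --agent claude-code \
  --scenarios <scenario-id-1> <scenario-id-2> \
  -n tuned-run \
  --env-vars DEBUG=true \
  --secrets ANTHROPIC_API_KEY=MY_ANTHROPIC_SECRET \
  --timeout 3600 \
  --n-concurrent-trials 20 \
  -o json
```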
rli benchmark-job watch
Watch benchmark job progress in real-time.
rli benchmark-job summary
Get benchmark job results.
| Option | Description |
|---|---|
-e, --extended | Show individual scenario results |
-o, --output <format> | Output format: text, json, yaml |
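For example, exporting a completed job's results as JSON for downstream processing (the job ID placeholder is illustrative):

```shell
rli benchmark-job summary <job-id> -o json > results.json
```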
rli benchmark-job list
List benchmark jobs.
| Option | Description |
|---|---|
--days <n> | Show jobs from the last N days (default: 1) |
--all | Show all jobs (no time filter) |
--status <statuses> | Filter by status (comma-separated) |
-o, --output <format> | Output format: text, json, yaml |
Valid statuses for the --status filter: initializing, queued, running, completed, failed, cancelled, timeout.
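For example, listing jobs from the past week that are still running or have completed:

```shell
rli benchmark-job list --days 7 --status running,completed
```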
rli benchmark-job logs
Download devbox logs for a benchmark job.
| Option | Description |
|---|---|
-o, --output-dir <path> | Output directory for logs |
--run <id> | Download logs for a specific benchmark run only |
--scenario <id> | Download logs for a specific scenario run only |
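For example, downloading logs for a single scenario run into a local directory (the ID placeholders are illustrative):

```shell
rli benchmark-job logs <job-id> --output-dir ./benchmark-logs --scenario <scenario-run-id>
```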
Best Practices
- Start with a small subset: Test your configuration with a few scenarios before running a full benchmark.
- Use meaningful job names: Name your jobs descriptively to make them easy to find and reuse later.
- Monitor long-running jobs: Use rli benchmark-job watch to track progress, or check back with rli benchmark-job list.
- Export results: Use -o json to export results for analysis or CI/CD integration.
- Tune parallelism: Increase --n-concurrent-trials for faster execution, but be mindful of rate limits on external APIs.
Next Steps
- Create custom benchmarks to evaluate your agent on your own scenarios
- Build custom scorers to evaluate agent performance
- View results in the dashboard for detailed analysis
