Orchestrated benchmarks let you run full benchmark suites or sets of scenarios with a single command. Runloop handles all the complexity: provisioning devboxes for each scenario, running your agents, scoring results, and aggregating outputs. You can compare multiple agents side-by-side, run hundreds of scenarios in parallel, and walk away while the job completes in the cloud.
Orchestrated benchmarks are the recommended way to run benchmarks on Runloop.
For fine-grained control over individual scenario runs, see Interactive Benchmarks.
Prerequisites
Before running orchestrated benchmarks, you need:
- Runloop CLI installed: Install via npm, yarn, or pnpm:
npm install -g @runloop/rl-cli
- API key configured: Set your Runloop API key:
export RUNLOOP_API_KEY=your_api_key_here
- Agent configuration: Orchestrated benchmarks work with any agent that can run on a Runloop devbox. You have two options:
  - Bring your own agent: Deploy your own agent to run on Runloop devboxes. This is the most common approach for teams developing proprietary agents. Contact us at support@runloop.ai for help setting up your custom agent.
  - Use a supported public agent: Run benchmarks with popular, public open-source agents. Set up the required API keys as environment variables on your local machine, and the CLI will automatically create secrets:

| Agent | Required Environment Variables |
|---|---|
| claude-code | ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN |
| codex | OPENAI_API_KEY |
| opencode | ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY |
| goose | ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY |
| gemini-cli | GEMINI_API_KEY or GOOGLE_API_KEY |
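For example, to run benchmarks with claude-code, export the key locally before invoking the CLI (the value below is a placeholder):
export ANTHROPIC_API_KEY=your_anthropic_api_key_here
rli benchmark-job run --agent "claude-code:claude-sonnet-4-6" --benchmark "terminal-bench-2"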
Are we missing an agent you need? Contact us at support@runloop.ai to request support for a new public agent.
Quick Start
Run a benchmark with a single command:
rli benchmark-job run \
--agent "claude-code:claude-sonnet-4-6" \
--benchmark "terminal-bench-2" \
-n "my-first-benchmark-run"
This command:
- Creates a benchmark job with the specified agent and benchmark
- Provisions a devbox for each scenario in the benchmark
- Runs the agent on each scenario in parallel (by default, 10 scenarios are executed concurrently)
- Scores the results automatically
- Collects and aggregates all results for viewing in the UI
Running Benchmark Jobs
Basic Usage
Run a single agent against a benchmark:
rli benchmark-job run \
--agent "claude-code:claude-sonnet-4-6" \
--benchmark "terminal-bench-2"
Comparing Multiple Agents
Compare multiple agents side-by-side by specifying multiple --agent flags:
rli benchmark-job run \
--agent "claude-code:claude-sonnet-4-6" \
--agent "codex:gpt-4o" \
--benchmark "terminal-bench-2" \
-n "terminal-bench-agent-comparison"
Each agent runs independently against the full benchmark, and results are aggregated for easy comparison.
Running Specific Scenarios
Instead of a full benchmark, you can run specific scenarios by ID:
rli benchmark-job run \
--agent "claude-code:claude-sonnet-4-6" \
--scenarios scn_abc123 scn_def456 \
-n "specific-scenarios-run"
Controlling Parallelism
By default, benchmark jobs run 10 scenarios concurrently. Increase parallelism for faster execution:
rli benchmark-job run \
--agent "claude-code:claude-sonnet-4-6" \
--benchmark "terminal-bench-2" \
--n-concurrent-trials 50 \
-n "high-parallelism-run"
Setting Timeouts
Configure agent timeout (in seconds) for long-running scenarios:
rli benchmark-job run \
--agent "claude-code:claude-sonnet-4-6" \
--benchmark "terminal-bench-2" \
--timeout 3600 \
-n "long-timeout-run"
Passing Environment Variables
Pass additional environment variables to the agent:
rli benchmark-job run \
--agent "claude-code:claude-sonnet-4-6" \
--benchmark "terminal-bench-2" \
--env-vars "DEBUG=true" "LOG_LEVEL=verbose" \
-n "debug-run"
Using Secrets
Reference Runloop secrets for sensitive values:
rli benchmark-job run \
--agent "claude-code:claude-sonnet-4-6" \
--benchmark "terminal-bench-2" \
--secrets "GITHUB_TOKEN=my-github-secret" \
-n "with-secrets-run"
Monitoring Jobs
Watch Live Progress
Monitor a running job with a full-screen progress display:
rli benchmark-job watch <job_id>
This shows real-time updates as scenarios complete, including pass/fail status and running totals.
List Jobs
View recent benchmark jobs:
rli benchmark-job list
Filter by time range or status:
# Jobs from the last 7 days
rli benchmark-job list --days 7
# All jobs (no time filter)
rli benchmark-job list --all
# Only running jobs
rli benchmark-job list --status running
# Multiple statuses
rli benchmark-job list --status running,completed
Viewing Results
Summary Report
Get a summary of results after a job completes:
rli benchmark-job summary <job_id>
Extended Results
View individual scenario results with the -e flag:
rli benchmark-job summary -e <job_id>
Export results as JSON or YAML for programmatic processing:
rli benchmark-job summary <job_id> -o json
rli benchmark-job summary <job_id> -o yaml
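For instance, you can pipe the JSON into jq to inspect its structure or pull out specific fields (field names can vary with CLI version, so check the raw output first):
rli benchmark-job summary <job_id> -o json | jq '.'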
Downloading Logs
Download devbox logs for debugging:
# Download all logs for a job
rli benchmark-job logs <job_id>
# Download to a specific directory
rli benchmark-job logs <job_id> -o ./my-logs
# Download logs for a specific benchmark run
rli benchmark-job logs <job_id> --run <benchmark_run_id>
# Download logs for a specific scenario
rli benchmark-job logs <job_id> --scenario <scenario_run_id>
Supported Agents
Orchestrated benchmarks support the following agents:
| Agent | Description |
|---|---|
| claude-code | Anthropic’s Claude Code agent |
| codex | OpenAI’s Codex agent |
| opencode | Open-source coding agent |
| goose | Block’s Goose agent |
| gemini-cli | Google’s Gemini CLI agent |
Specify the agent and model in the format agent:model:
--agent "claude-code:claude-sonnet-4-6"
--agent "codex:gpt-4o"
--agent "gemini-cli:gemini-2.5-pro"
Supported Benchmarks
Orchestrated benchmark jobs work with any benchmark available on Runloop, including:
- SWE-bench Verified
- Laude Institute/Terminal-Bench-2.0
- ScaleAI/SWE-Bench Pro
- AIME
- ARC-AGI-2
- bigcodebench
- BigCodeBench-Hard (instruct)
- BigCodeBench-Hard (Complete)
- ReplicationBench
- GPQA Diamond
- Aider/Polyglot
View available benchmarks:
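# Assumes an initialized async Runloop SDK client named runloop, called inside an async function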
benchmarks = await runloop.api.benchmarks.list_public()
You can also run your own custom benchmarks via orchestrated mode.
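Pass a custom benchmark's ID or name just as you would a public one:
rli benchmark-job run \
--agent "claude-code:claude-sonnet-4-6" \
--benchmark <your_custom_benchmark_id> \
-n "custom-benchmark-run"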
Command Reference
rli benchmark-job run
Create and run a benchmark job.
| Option | Description |
|---|---|
| --agent <agent:model> | Agent to run. Format: agent:model. Can specify multiple. |
| --benchmark <id-or-name> | Benchmark ID or name to run |
| --scenarios <ids...> | Scenario IDs to run (alternative to --benchmark) |
| -n, --job-name <name> | Name for this job |
| --env-vars <vars...> | Environment variables (format: KEY=value) |
| --secrets <secrets...> | Secrets to inject (format: ENV_VAR=SECRET_NAME) |
| --timeout <seconds> | Agent timeout in seconds (default: 7200) |
| --n-attempts <n> | Number of attempts per scenario (default: 1) |
| --n-concurrent-trials <n> | Number of concurrent trials (default: 10) |
| --timeout-multiplier <n> | Timeout multiplier (default: 1.0) |
| -o, --output <format> | Output format: text, json, yaml |
rli benchmark-job watch
Watch benchmark job progress in real-time.
rli benchmark-job watch <job_id>
rli benchmark-job summary
Get benchmark job results.
| Option | Description |
|---|---|
| -e, --extended | Show individual scenario results |
| -o, --output <format> | Output format: text, json, yaml |
rli benchmark-job list
List benchmark jobs.
| Option | Description |
|---|---|
| --days <n> | Show jobs from the last N days (default: 1) |
| --all | Show all jobs (no time filter) |
| --status <statuses> | Filter by status (comma-separated) |
| -o, --output <format> | Output format: text, json, yaml |
Valid statuses: initializing, queued, running, completed, failed, cancelled, timeout
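For example, to list last week's unsuccessful jobs as JSON:
rli benchmark-job list --days 7 --status failed,timeout -o json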
rli benchmark-job logs
Download devbox logs for a benchmark job.
| Option | Description |
|---|---|
| -o, --output-dir <path> | Output directory for logs |
| --run <id> | Download logs for a specific benchmark run only |
| --scenario <id> | Download logs for a specific scenario run only |
Best Practices
- Start with a small subset: Test your configuration with a few scenarios before running a full benchmark.
- Use meaningful job names: Name your jobs descriptively to make them easy to find and reuse later.
- Monitor long-running jobs: Use rli benchmark-job watch to track progress, or check back with rli benchmark-job list.
- Export results: Use -o json to export results for analysis or CI/CD integration (see the sketch after this list).
- Tune parallelism: Increase --n-concurrent-trials for faster execution, but be mindful of rate limits on external APIs.
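As a sketch of CI/CD integration built on the JSON export (the pass_rate field name below is an assumption; inspect the summary output your CLI version actually emits before wiring this into a pipeline):
# Export a completed job's summary and fail the build below a pass-rate threshold.
rli benchmark-job summary <job_id> -o json > results.json
# NOTE: .pass_rate is an assumed field name -- check results.json for the real schema.
PASS_RATE=$(jq -r '.pass_rate // 0' results.json)
awk -v r="$PASS_RATE" 'BEGIN { if (r >= 0.8) exit 0; exit 1 }' || {
  echo "Pass rate $PASS_RATE is below the 0.8 threshold"
  exit 1
}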
Next Steps