Orchestrated benchmarks let you run full benchmark suites or sets of scenarios with a single command. Runloop handles all the complexity: provisioning devboxes for each scenario, running your agents, scoring results, and aggregating outputs. You can compare multiple agents side-by-side, run hundreds of scenarios in parallel, and walk away while the job completes in the cloud.
Orchestrated benchmarks are the recommended way to run benchmarks on Runloop. For fine-grained control over individual scenario runs, see Interactive Benchmarks.

Prerequisites

Before running orchestrated benchmarks, you need:
  1. Runloop CLI installed: Install via npm, yarn, or pnpm:
npm install -g @runloop/rl-cli
  2. API key configured: Set your Runloop API key:
export RUNLOOP_API_KEY=your_api_key_here
  3. Agent configuration: Orchestrated benchmarks work with any agent that can run on a Runloop devbox. You have two options:
  • Bring your own agent: Deploy your own agent to run on Runloop devboxes. This is the most common approach for teams developing proprietary agents. Contact us at support@runloop.ai for help setting up your custom agent.
  • Use a supported public agent: Run benchmarks with popular, public open-source agents. Set up the required API keys as environment variables on your local machine, and the CLI will automatically create secrets:
    Agent        Required Environment Variables
    claude-code  ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN
    codex        OPENAI_API_KEY
    opencode     ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY
    goose        ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY
    gemini-cli   GEMINI_API_KEY or GOOGLE_API_KEY
Are we missing an agent you need? Contact us at support@runloop.ai to request support for a new public agent.
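For a public agent, export the relevant key locally before launching a job and the CLI will upload it as a secret for you. A minimal sketch for claude-code (the key value below is a placeholder, not a real credential):

```shell
# Make claude-code credentials available to the CLI.
# Replace the placeholder with your real Anthropic API key.
export ANTHROPIC_API_KEY="sk-ant-placeholder"

# Sanity-check that the variable is set without printing the secret itself
[ -n "$ANTHROPIC_API_KEY" ] && echo "ANTHROPIC_API_KEY is set"
```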

Quick Start

Run a benchmark with a single command:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  -n "my-first-benchmark-run"
This command:
  1. Creates a benchmark job with the specified agent and benchmark
  2. Provisions a devbox for each scenario in the benchmark
  3. Runs the agent on each scenario in parallel (by default, 10 scenarios are executed concurrently)
  4. Scores the results automatically
  5. Collects and aggregates all results into the UI

Running Benchmark Jobs

Basic Usage

Run a single agent against a benchmark:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2"

Comparing Multiple Agents

Compare multiple agents side-by-side by specifying multiple --agent flags:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --agent "codex:gpt-4o" \
  --benchmark "terminal-bench-2" \
  -n "terminal-bench-agent-comparison"
Each agent runs independently against the full benchmark, and results are aggregated for easy comparison.

Running Specific Scenarios

Instead of a full benchmark, you can run specific scenarios by ID:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --scenarios scn_abc123 scn_def456 \
  -n "specific-scenarios-run"

Controlling Parallelism

By default, benchmark jobs run 10 scenarios concurrently. Increase parallelism for faster execution:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  --n-concurrent-trials 50 \
  -n "high-parallelism-run"

Setting Timeouts

Configure agent timeout (in seconds) for long-running scenarios:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  --timeout 3600 \
  -n "long-timeout-run"

Passing Environment Variables

Pass additional environment variables to the agent:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  --env-vars "DEBUG=true" "LOG_LEVEL=verbose" \
  -n "debug-run"

Using Secrets

Reference Runloop secrets for sensitive values:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  --secrets "GITHUB_TOKEN=my-github-secret" \
  -n "with-secrets-run"

Monitoring Jobs

Watch Live Progress

Monitor a running job with a full-screen progress display:
rli benchmark-job watch <job_id>
This shows real-time updates as scenarios complete, including pass/fail status and running totals.

List Jobs

View recent benchmark jobs:
rli benchmark-job list
Filter by time range or status:
# Jobs from the last 7 days
rli benchmark-job list --days 7

# All jobs (no time filter)
rli benchmark-job list --all

# Only running jobs
rli benchmark-job list --status running

# Multiple statuses
rli benchmark-job list --status running,completed

Viewing Results

Summary Report

Get a summary of results after a job completes:
rli benchmark-job summary <job_id>

Extended Results

View individual scenario results with the -e flag:
rli benchmark-job summary -e <job_id>

Output Formats

Export results as JSON or YAML for programmatic processing:
rli benchmark-job summary <job_id> -o json
rli benchmark-job summary <job_id> -o yaml

Downloading Logs

Download devbox logs for debugging:
# Download all logs for a job
rli benchmark-job logs <job_id>

# Download to a specific directory
rli benchmark-job logs <job_id> -o ./my-logs

# Download logs for a specific benchmark run
rli benchmark-job logs <job_id> --run <benchmark_run_id>

# Download logs for a specific scenario
rli benchmark-job logs <job_id> --scenario <scenario_run_id>

Supported Agents

Orchestrated benchmarks support the following agents:
Agent        Description
claude-code  Anthropic’s Claude Code agent
codex        OpenAI’s Codex agent
opencode     Open-source coding agent
goose        Block’s Goose agent
gemini-cli   Google’s Gemini CLI agent
Specify the agent and model in the format agent:model:
--agent "claude-code:claude-sonnet-4-6"
--agent "codex:gpt-4o"
--agent "gemini-cli:gemini-2.5-pro"

Supported Benchmarks

Orchestrated benchmark jobs work with any benchmark available on Runloop, including:
  • SWE-bench Verified
  • Laude Institute/Terminal-Bench-2.0
  • ScaleAI/SWE-Bench Pro
  • AIME
  • ARC-AGI-2
  • BigCodeBench
  • BigCodeBench-Hard (Instruct)
  • BigCodeBench-Hard (Complete)
  • ReplicationBench
  • GPQA Diamond
  • Aider/Polyglot
View available benchmarks programmatically with the Runloop SDK:
benchmarks = await runloop.api.benchmarks.list_public()
You can also run your own custom benchmarks via orchestrated mode.

Command Reference

rli benchmark-job run

Create and run a benchmark job.
Option                      Description
--agent <agent:model>       Agent to run. Format: agent:model. Can specify multiple.
--benchmark <id-or-name>    Benchmark ID or name to run
--scenarios <ids...>        Scenario IDs to run (alternative to --benchmark)
-n, --job-name <name>       Name for this job
--env-vars <vars...>        Environment variables (format: KEY=value)
--secrets <secrets...>      Secrets to inject (format: ENV_VAR=SECRET_NAME)
--timeout <seconds>         Agent timeout in seconds (default: 7200)
--n-attempts <n>            Number of attempts per scenario (default: 1)
--n-concurrent-trials <n>   Number of concurrent trials (default: 10)
--timeout-multiplier <n>    Timeout multiplier (default: 1.0)
-o, --output <format>       Output format: text, json, yaml

rli benchmark-job watch

Watch benchmark job progress in real-time.
rli benchmark-job watch <job_id>

rli benchmark-job summary

Get benchmark job results.
Option                  Description
-e, --extended          Show individual scenario results
-o, --output <format>   Output format: text, json, yaml

rli benchmark-job list

List benchmark jobs.
Option                  Description
--days <n>              Show jobs from the last N days (default: 1)
--all                   Show all jobs (no time filter)
--status <statuses>     Filter by status (comma-separated)
-o, --output <format>   Output format: text, json, yaml
Valid statuses: initializing, queued, running, completed, failed, cancelled, timeout

rli benchmark-job logs

Download devbox logs for a benchmark job.
Option                    Description
-o, --output-dir <path>   Output directory for logs
--run <id>                Download logs for a specific benchmark run only
--scenario <id>           Download logs for a specific scenario run only

Best Practices

  1. Start with a small subset: Test your configuration with a few scenarios before running a full benchmark.
  2. Use meaningful job names: Name your jobs descriptively to make them easy to find and reuse later.
  3. Monitor long-running jobs: Use rli benchmark-job watch to track progress, or check back with rli benchmark-job list.
  4. Export results: Use -o json to export results for analysis or CI/CD integration.
  5. Tune parallelism: Increase --n-concurrent-trials for faster execution, but be mindful of rate limits on external APIs.
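As a sketch of the export tip above, a CI step could gate on the JSON summary. The field names used here (scenarios_passed, scenarios_total) are illustrative assumptions, not the documented schema; inspect your actual -o json output to find the real keys:

```shell
# In CI you would export the real summary first, e.g.:
#   rli benchmark-job summary <job_id> -o json > summary.json
# For illustration, write a hand-made summary with assumed field names:
cat > summary.json <<'EOF'
{"job_id": "bjob_example", "scenarios_total": 10, "scenarios_passed": 9}
EOF

# Fail the step if the pass rate drops below 90%
python3 - <<'EOF'
import json

with open("summary.json") as f:
    summary = json.load(f)

rate = summary["scenarios_passed"] / summary["scenarios_total"]
print(f"pass rate: {rate:.0%}")
assert rate >= 0.9, "pass rate below CI threshold"
EOF
```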

Next Steps