# Orchestrated Benchmarks

> Run benchmarks at cloud scale with a single CLI command.

Orchestrated benchmarks let you run full benchmark suites or sets of scenarios with a single command. Runloop handles all the complexity: provisioning devboxes for each scenario, running your agents, scoring results, and aggregating outputs. You can compare multiple agents side-by-side, run hundreds of scenarios in parallel, and walk away while the job completes in the cloud.

<Info>
  Orchestrated benchmarks are the recommended way to run benchmarks on Runloop.
  For fine-grained control over individual scenario runs, see [Interactive
  Benchmarks](/docs/benchmarks/public-benchmarks).
</Info>

## Prerequisites

Before running orchestrated benchmarks, you need:

1. **Runloop CLI installed**: Install via npm, yarn, or pnpm:

```bash theme={null}
npm install -g @runloop/rl-cli
```

2. **API key configured**: Set your Runloop API key:

```bash theme={null}
export RUNLOOP_API_KEY=your_api_key_here
```

3. **Agent configuration**: Orchestrated benchmarks work with any agent that can run on a Runloop devbox. You have two options:

* **Bring your own agent**: Deploy your own agent to run on Runloop devboxes.
  This is the most common approach for teams developing proprietary agents.
  Contact us at [support@runloop.ai](mailto:support@runloop.ai) for help setting up your custom agent.

* **Use a supported public agent**: Run benchmarks with popular, publicly
  available agents. Set the required API keys as environment variables on
  your local machine, and the CLI automatically creates secrets from them:

  | Agent         | Required Environment Variables                             |
  | ------------- | ---------------------------------------------------------- |
  | `claude-code` | `ANTHROPIC_API_KEY` or `CLAUDE_CODE_OAUTH_TOKEN`           |
  | `codex`       | `OPENAI_API_KEY`                                           |
  | `opencode`    | `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `GOOGLE_API_KEY` |
  | `goose`       | `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `GOOGLE_API_KEY` |
  | `gemini-cli`  | `GEMINI_API_KEY` or `GOOGLE_API_KEY`                       |

Are we missing an agent you need? Contact us at [support@runloop.ai](mailto:support@runloop.ai) to request support for a new public agent.
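
The table above can be turned into a quick preflight check before starting a job. The following is a minimal sketch, not part of the Runloop CLI: the `REQUIRED_ENV_VARS` mapping and the `missing_credentials` helper are hypothetical names, and the sketch assumes that any one of the listed variables is enough for a given agent.

```python
import os

# Mirrors the table above. Assumption: setting any ONE of the listed
# variables is sufficient for that agent.
REQUIRED_ENV_VARS = {
    "claude-code": ("ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN"),
    "codex": ("OPENAI_API_KEY",),
    "opencode": ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY"),
    "goose": ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY"),
    "gemini-cli": ("GEMINI_API_KEY", "GOOGLE_API_KEY"),
}


def missing_credentials(agent, environ=None):
    """Return the candidate variables if none is set, else an empty tuple.

    Hypothetical helper for illustration; raises KeyError for an agent
    that is not in the table.
    """
    environ = os.environ if environ is None else environ
    candidates = REQUIRED_ENV_VARS[agent]
    # An empty string counts as "not set".
    if any(environ.get(name) for name in candidates):
        return ()
    return candidates
```

Calling `missing_credentials("codex")` before launching a job returns an empty tuple when `OPENAI_API_KEY` is set, and otherwise returns the variables you can choose from.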

## Quick Start

Run a benchmark with a single command:

```bash theme={null}
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  -n "my-first-benchmark-run"
```

This command:

1. Creates a benchmark job with the specified agent and benchmark
2. Provisions a devbox for each scenario in the benchmark
3. Runs the agent on each scenario in parallel (by default, 10 scenarios are executed concurrently)
4. Scores the results automatically
5. Aggregates the results across all scenarios and makes them available in the UI

## Running Benchmark Jobs

### Basic Usage

Run a single agent against a benchmark:

```bash theme={null}
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2"
```

### Comparing Multiple Agents

Compare multiple agents side-by-side by specifying multiple `--agent` flags:

```bash theme={null}
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --agent "codex:gpt-4o" \
  --benchmark "terminal-bench-2" \
  -n "terminal-bench-agent-comparison"
```

Each agent runs independently against the full benchmark, and results are aggregated for easy comparison.

### Running Specific Scenarios

Instead of a full benchmark, you can run specific scenarios by ID:

```bash theme={null}
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --scenarios scn_abc123 scn_def456 \
  -n "specific-scenarios-run"
```

### Controlling Parallelism

By default, benchmark jobs run 10 scenarios concurrently. Increase parallelism for faster execution:

```bash theme={null}
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  --n-concurrent-trials 50 \
  -n "high-parallelism-run"
```
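
To reason about the effect of the concurrency setting, the sketch below counts trials and the number of sequential "waves" needed at a given concurrency. It rests on two assumptions that this page does not confirm: every (agent, scenario, attempt) combination is one trial, and `--n-concurrent-trials` caps trials across the whole job. Treat the result as a back-of-the-envelope estimate, not a description of Runloop's scheduler. `estimate_waves` is a hypothetical helper name.

```python
import math


def estimate_waves(n_scenarios, n_agents=1, n_attempts=1, n_concurrent=10):
    """Return (total_trials, waves) under the assumptions stated above.

    n_concurrent defaults to 10, the documented default for
    --n-concurrent-trials.
    """
    if min(n_scenarios, n_agents, n_attempts, n_concurrent) < 1:
        raise ValueError("all counts must be at least 1")
    total = n_scenarios * n_agents * n_attempts
    # Each wave runs up to n_concurrent trials side by side.
    return total, math.ceil(total / n_concurrent)
```

For a hypothetical 200-scenario benchmark with one agent, `estimate_waves(200)` gives `(200, 20)`, while `estimate_waves(200, n_concurrent=50)` gives `(200, 4)`.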

### Setting Timeouts

Set the agent timeout in seconds with `--timeout` (default: 7200). Raise it for long-running scenarios:

```bash theme={null}
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  --timeout 14400 \
  -n "long-timeout-run"
```

### Passing Environment Variables

Pass additional environment variables to the agent:

```bash theme={null}
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  --env-vars "DEBUG=true" "LOG_LEVEL=verbose" \
  -n "debug-run"
```

### Using Secrets

Reference Runloop secrets for sensitive values:

```bash theme={null}
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  --secrets "GITHUB_TOKEN=my-github-secret" \
  -n "with-secrets-run"
```
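
When jobs are launched from a script or CI pipeline, it can help to assemble the flags shown in this section programmatically. The sketch below builds the argument list only and does not run anything. `build_run_command` is a hypothetical helper, not part of the CLI, and it assumes `--benchmark` and `--scenarios` are mutually exclusive, based on the command reference describing `--scenarios` as an alternative to `--benchmark`.

```python
def build_run_command(agents, benchmark=None, scenarios=None, job_name=None,
                      env_vars=None, secrets=None, timeout=None,
                      n_concurrent_trials=None):
    """Assemble the argv list for `rli benchmark-job run` (hypothetical helper)."""
    if not agents:
        raise ValueError("at least one agent is required")
    # Assumption: exactly one of --benchmark / --scenarios is given.
    if bool(benchmark) == bool(scenarios):
        raise ValueError("pass exactly one of benchmark or scenarios")

    argv = ["rli", "benchmark-job", "run"]
    for agent in agents:  # one --agent flag per "agent:model" spec
        argv += ["--agent", agent]
    if benchmark:
        argv += ["--benchmark", benchmark]
    else:
        argv += ["--scenarios", *scenarios]
    if env_vars:  # dict of KEY -> value
        argv += ["--env-vars", *(f"{k}={v}" for k, v in env_vars.items())]
    if secrets:  # dict of ENV_VAR -> SECRET_NAME
        argv += ["--secrets", *(f"{k}={v}" for k, v in secrets.items())]
    if timeout is not None:
        argv += ["--timeout", str(timeout)]
    if n_concurrent_trials is not None:
        argv += ["--n-concurrent-trials", str(n_concurrent_trials)]
    if job_name:
        argv += ["-n", job_name]
    return argv
```

The returned list can be passed to `subprocess.run(argv, check=True)`, or printed with `shlex.join(argv)` to review the command before running it.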

## Monitoring Jobs

### Watch Live Progress

Monitor a running job with a full-screen progress display:

```bash theme={null}
rli benchmark-job watch <job_id>
```

This shows real-time updates as scenarios complete, including pass/fail status and running totals.

### List Jobs

View recent benchmark jobs (by default, jobs from the last day):

```bash theme={null}
rli benchmark-job list
```

Filter by time range or status:

```bash theme={null}
# Jobs from the last 7 days
rli benchmark-job list --days 7

# All jobs (no time filter)
rli benchmark-job list --all

# Only running jobs
rli benchmark-job list --status running

# Multiple statuses
rli benchmark-job list --status running,completed
```

## Viewing Results

### Summary Report

Get a summary of results after a job completes:

```bash theme={null}
rli benchmark-job summary <job_id>
```

### Extended Results

View individual scenario results with the `-e` flag:

```bash theme={null}
rli benchmark-job summary -e <job_id>
```

### Output Formats

Export results as JSON or YAML for programmatic processing:

```bash theme={null}
rli benchmark-job summary <job_id> -o json
rli benchmark-job summary <job_id> -o yaml
```
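
For programmatic processing, the JSON output can be loaded from a script. The sketch below is a minimal example under stated assumptions: it assumes the command writes only the JSON document to stdout, and because the JSON schema is not documented on this page, it parses the output without relying on any particular field. `fetch_summary` and its `run` parameter are hypothetical names; `run` exists only so the subprocess call can be swapped out in tests.

```python
import json
import subprocess


def fetch_summary(job_id, run=subprocess.run):
    """Run `rli benchmark-job summary <job_id> -o json` and parse stdout.

    Assumption: stdout contains only the JSON document.
    """
    result = run(
        ["rli", "benchmark-job", "summary", job_id, "-o", "json"],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return json.loads(result.stdout)
```

Inspect the parsed object (for example, its top-level keys) on a real job before writing code that depends on specific fields.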

## Downloading Logs

Download devbox logs for debugging:

```bash theme={null}
# Download all logs for a job
rli benchmark-job logs <job_id>

# Download to a specific directory
rli benchmark-job logs <job_id> -o ./my-logs

# Download logs for a specific benchmark run
rli benchmark-job logs <job_id> --run <benchmark_run_id>

# Download logs for a specific scenario run
rli benchmark-job logs <job_id> --scenario <scenario_run_id>
```

## Supported Agents

Orchestrated benchmarks support the following agents:

| Agent         | Description                   |
| ------------- | ----------------------------- |
| `claude-code` | Anthropic's Claude Code agent |
| `codex`       | OpenAI's Codex agent          |
| `opencode`    | Open-source coding agent      |
| `goose`       | Block's Goose agent           |
| `gemini-cli`  | Google's Gemini CLI agent     |

Specify the agent and model in the format `agent:model`:

```bash theme={null}
--agent "claude-code:claude-sonnet-4-6"
--agent "codex:gpt-4o"
--agent "gemini-cli:gemini-2.5-pro"
```
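
If agent specs come from a config file or CI variable, they can be validated before a job is submitted. This is a small sketch, not CLI behavior: `parse_agent_spec` and `SUPPORTED_AGENTS` are hypothetical names, and the check only covers the public agents listed above, so a custom agent would need its name added to the set.

```python
# Public agents from the table above.
SUPPORTED_AGENTS = {"claude-code", "codex", "opencode", "goose", "gemini-cli"}


def parse_agent_spec(spec):
    """Split 'agent:model' into (agent, model), validating the agent name."""
    # Split on the first colon only, so a model name may itself contain colons.
    agent, sep, model = spec.partition(":")
    if not sep or not agent or not model:
        raise ValueError(f"expected 'agent:model', got {spec!r}")
    if agent not in SUPPORTED_AGENTS:
        raise ValueError(f"unsupported public agent: {agent!r}")
    return agent, model
```

For example, `parse_agent_spec("gemini-cli:gemini-2.5-pro")` returns `("gemini-cli", "gemini-2.5-pro")`, while a spec without a colon raises `ValueError`.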

## Supported Benchmarks

Orchestrated benchmark jobs work with any benchmark available on Runloop, including:

* **SWE-bench Verified**
* **Laude Institute/Terminal-Bench-2.0**
* **ScaleAI/SWE-Bench Pro**
* **AIME**
* **ARC-AGI-2**
* **BigCodeBench**
* **BigCodeBench-Hard (Instruct)**
* **BigCodeBench-Hard (Complete)**
* **ReplicationBench**
* **GPQA Diamond**
* **Aider/Polyglot**

View available public benchmarks with the Python SDK (here `runloop` is an already-initialized async client):

```python theme={null}
benchmarks = await runloop.api.benchmarks.list_public()
```

You can also run your own [custom benchmarks](/docs/benchmarks/custom-benchmarks) via orchestrated mode.

## Command Reference

### `rli benchmark-job run`

Create and run a benchmark job.

| Option                      | Description                                                |
| --------------------------- | ---------------------------------------------------------- |
| `--agent <agent:model>`     | Agent to run. Format: `agent:model`. Repeat for multiple.  |
| `--benchmark <id-or-name>`  | Benchmark ID or name to run                                |
| `--scenarios <ids...>`      | Scenario IDs to run (alternative to `--benchmark`)         |
| `-n, --job-name <name>`     | Name for this job                                          |
| `--env-vars <vars...>`      | Environment variables (format: `KEY=value`)                |
| `--secrets <secrets...>`    | Secrets to inject (format: `ENV_VAR=SECRET_NAME`)          |
| `--timeout <seconds>`       | Agent timeout in seconds (default: 7200)                   |
| `--n-attempts <n>`          | Number of attempts per scenario (default: 1)               |
| `--n-concurrent-trials <n>` | Number of concurrent trials (default: 10)                  |
| `--timeout-multiplier <n>`  | Timeout multiplier (default: 1.0)                          |
| `-o, --output <format>`     | Output format: `text`, `json`, `yaml`                      |

### `rli benchmark-job watch`

Watch benchmark job progress in real-time.

```bash theme={null}
rli benchmark-job watch <job_id>
```

### `rli benchmark-job summary`

Get benchmark job results.

| Option                  | Description                           |
| ----------------------- | ------------------------------------- |
| `-e, --extended`        | Show individual scenario results      |
| `-o, --output <format>` | Output format: `text`, `json`, `yaml` |

### `rli benchmark-job list`

List benchmark jobs.

| Option                  | Description                                 |
| ----------------------- | ------------------------------------------- |
| `--days <n>`            | Show jobs from the last N days (default: 1) |
| `--all`                 | Show all jobs (no time filter)              |
| `--status <statuses>`   | Filter by status (comma-separated)          |
| `-o, --output <format>` | Output format: `text`, `json`, `yaml`       |

Valid statuses: `initializing`, `queued`, `running`, `completed`, `failed`, `cancelled`, `timeout`
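
A script that builds the `--status` argument can check values against the list above before calling the CLI. The sketch below is illustrative only; `status_filter` and `VALID_STATUSES` are hypothetical names, not part of `rli`.

```python
# Statuses accepted by `rli benchmark-job list --status`, per the list above.
VALID_STATUSES = (
    "initializing", "queued", "running", "completed",
    "failed", "cancelled", "timeout",
)


def status_filter(*statuses):
    """Return a comma-separated value for --status, rejecting unknown names."""
    if not statuses:
        raise ValueError("at least one status is required")
    unknown = [s for s in statuses if s not in VALID_STATUSES]
    if unknown:
        raise ValueError(f"unknown statuses: {unknown}")
    # dict.fromkeys drops duplicates while keeping the original order.
    return ",".join(dict.fromkeys(statuses))
```

For example, `status_filter("running", "completed")` returns `"running,completed"`, which matches the multi-status form shown under List Jobs.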

### `rli benchmark-job logs`

Download devbox logs for a benchmark job.

| Option                    | Description                                     |
| ------------------------- | ----------------------------------------------- |
| `-o, --output-dir <path>` | Output directory for logs                       |
| `--run <id>`              | Download logs for a specific benchmark run only |
| `--scenario <id>`         | Download logs for a specific scenario run only  |

## Best Practices

1. **Start with a small subset**: Test your configuration with a few scenarios before running a full benchmark.
2. **Use meaningful job names**: Name your jobs descriptively to make them easy to find and reuse later.
3. **Monitor long-running jobs**: Use `rli benchmark-job watch` to track progress, or check back with `rli benchmark-job list`.
4. **Export results**: Use `-o json` to export results for analysis or CI/CD integration.
5. **Tune parallelism**: Increase `--n-concurrent-trials` for faster execution, but be mindful of rate limits on external APIs.

## Next Steps

* [Create custom benchmarks](/docs/benchmarks/custom-benchmarks) to evaluate your agent on your own scenarios
* [Build custom scorers](/docs/benchmarks/custom-scorers) to evaluate agent performance
* [View results in the dashboard](/docs/tools/dashboard) for detailed analysis
