Orchestrated benchmarks let you run full benchmark suites or sets of scenarios with a single command. Runloop handles all the complexity: provisioning devboxes for each scenario, running your agents, scoring results, and aggregating outputs. You can compare multiple agents side-by-side, run hundreds of scenarios in parallel, and walk away while the job completes in the cloud.
Orchestrated benchmarks are the recommended way to run benchmarks on Runloop. For fine-grained control over individual scenario runs, see Interactive Benchmarks.

Prerequisites

Before running orchestrated benchmarks, you need:
  1. Runloop CLI installed: Install via npm, yarn, or pnpm:
npm install -g @runloop/rl-cli
  2. API key configured: Set your Runloop API key:
export RUNLOOP_API_KEY=your_api_key_here
  3. Agent configuration: Orchestrated benchmarks work with any agent that can run on a Runloop devbox. You have two options:
  • Bring your own agent: Deploy your own agent to run on Runloop devboxes. This is the most common approach for teams developing proprietary agents. Contact us at support@runloop.ai for help setting up your custom agent.
  • Use a supported public agent: Run benchmarks with popular, public open-source agents. Set up the required API keys as environment variables on your local machine, and the CLI will automatically create secrets:
    Agent        Required Environment Variables
    claude-code  ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN
    codex        OPENAI_API_KEY
    opencode     ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY
    goose        ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY
    gemini-cli   GEMINI_API_KEY or GOOGLE_API_KEY
Are we missing an agent you need? Contact us at support@runloop.ai to request support for a new public agent.
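For a public agent, export the relevant key locally before launching a job and the CLI will upload it as a secret for you. A minimal sketch for claude-code (the key value below is a placeholder, not a real credential):

```shell
# Make claude-code credentials available to the CLI.
# Replace the placeholder with your real Anthropic API key.
export ANTHROPIC_API_KEY="sk-ant-placeholder"

# Sanity-check that the variable is set without printing the secret itself
[ -n "$ANTHROPIC_API_KEY" ] && echo "ANTHROPIC_API_KEY is set"
```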

Quick Start

Run a benchmark with a single command:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  -n "my-first-benchmark-run"
This command:
  1. Creates a benchmark job with the specified agent and benchmark
  2. Provisions a devbox for each scenario in the benchmark
  3. Runs the agent on each scenario in parallel (by default, 10 scenarios are executed concurrently)
  4. Scores the results automatically
  5. Collects and aggregates all results into the UI

Running Benchmark Jobs

Basic Usage

Run a single agent against a benchmark:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2"

Comparing Multiple Agents

Compare multiple agents side-by-side by specifying multiple --agent flags:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --agent "codex:gpt-4o" \
  --benchmark "terminal-bench-2" \
  -n "terminal-bench-agent-comparison"
Each agent runs independently against the full benchmark, and results are aggregated for easy comparison.

Running Specific Scenarios

Instead of a full benchmark, you can run specific scenarios by ID:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --scenarios scn_abc123 scn_def456 \
  -n "specific-scenarios-run"

Controlling Parallelism

By default, benchmark jobs run 10 scenarios concurrently. Increase parallelism for faster execution:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  --n-concurrent-trials 50 \
  -n "high-parallelism-run"

Setting Timeouts

Configure agent timeout (in seconds) for long-running scenarios:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  --timeout 3600 \
  -n "long-timeout-run"

Passing Environment Variables

Pass additional environment variables to the agent:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  --env-vars "DEBUG=true" "LOG_LEVEL=verbose" \
  -n "debug-run"

Using Secrets

Reference Runloop secrets for sensitive values:
rli benchmark-job run \
  --agent "claude-code:claude-sonnet-4-6" \
  --benchmark "terminal-bench-2" \
  --secrets "GITHUB_TOKEN=my-github-secret" \
  -n "with-secrets-run"

Monitoring Jobs

Watch Live Progress

Monitor a running job with a full-screen progress display:
rli benchmark-job watch <job_id>
This shows real-time updates as scenarios complete, including pass/fail status and running totals.

List Jobs

View recent benchmark jobs:
rli benchmark-job list
Filter by time range or status:
# Jobs from the last 7 days
rli benchmark-job list --days 7

# All jobs (no time filter)
rli benchmark-job list --all

# Only running jobs
rli benchmark-job list --status running

# Multiple statuses
rli benchmark-job list --status running,completed

Viewing Results

Summary Report

Get a summary of results after a job completes:
rli benchmark-job summary <job_id>

Extended Results

View individual scenario results with the -e flag:
rli benchmark-job summary -e <job_id>

Output Formats

Export results as JSON or YAML for programmatic processing:
rli benchmark-job summary <job_id> -o json
rli benchmark-job summary <job_id> -o yaml

Downloading Logs

Download devbox logs for debugging:
# Download all logs for a job
rli benchmark-job logs <job_id>

# Download to a specific directory
rli benchmark-job logs <job_id> -o ./my-logs

# Download logs for a specific benchmark run
rli benchmark-job logs <job_id> --run <benchmark_run_id>

# Download logs for a specific scenario
rli benchmark-job logs <job_id> --scenario <scenario_run_id>

Supported Agents

Orchestrated benchmarks support the following agents:
Agent        Description
claude-code  Anthropic’s Claude Code agent
codex        OpenAI’s Codex agent
opencode     Open-source coding agent
goose        Block’s Goose agent
gemini-cli   Google’s Gemini CLI agent
Specify the agent and model in the format agent:model:
--agent "claude-code:claude-sonnet-4-6"
--agent "codex:gpt-4o"
--agent "gemini-cli:gemini-2.5-pro"

Supported Benchmarks

Orchestrated benchmark jobs work with any benchmark available on Runloop, including:
  • SWE-bench Verified
  • Laude Institute/Terminal-Bench-2.0
  • ScaleAI/SWE-Bench Pro
  • AIME
  • ARC-AGI-2
  • BigCodeBench
  • BigCodeBench-Hard (Instruct)
  • BigCodeBench-Hard (Complete)
  • ReplicationBench
  • GPQA Diamond
  • Aider/Polyglot
View available benchmarks programmatically with the Runloop SDK:
benchmarks = await runloop.api.benchmarks.list_public()
You can also run your own custom benchmarks via orchestrated mode.

Command Reference

rli benchmark-job run

Create and run a benchmark job.
Option                      Description
--agent <agent:model>       Agent to run. Format: agent:model. Can specify multiple.
--benchmark <id-or-name>    Benchmark ID or name to run
--scenarios <ids...>        Scenario IDs to run (alternative to --benchmark)
-n, --job-name <name>       Name for this job
--env-vars <vars...>        Environment variables (format: KEY=value)
--secrets <secrets...>      Secrets to inject (format: ENV_VAR=SECRET_NAME)
--timeout <seconds>         Agent timeout in seconds (default: 7200)
--n-attempts <n>            Number of attempts per scenario (default: 1)
--n-concurrent-trials <n>   Number of concurrent trials (default: 10)
--timeout-multiplier <n>    Timeout multiplier (default: 1.0)
-o, --output <format>       Output format: text, json, yaml

rli benchmark-job watch

Watch benchmark job progress in real-time.
rli benchmark-job watch <job_id>

rli benchmark-job summary

Get benchmark job results.
Option                  Description
-e, --extended          Show individual scenario results
-o, --output <format>   Output format: text, json, yaml

rli benchmark-job list

List benchmark jobs.
Option                  Description
--days <n>              Show jobs from the last N days (default: 1)
--all                   Show all jobs (no time filter)
--status <statuses>     Filter by status (comma-separated)
-o, --output <format>   Output format: text, json, yaml
Valid statuses: initializing, queued, running, completed, failed, cancelled, timeout

rli benchmark-job logs

Download devbox logs for a benchmark job.
Option                    Description
-o, --output-dir <path>   Output directory for logs
--run <id>                Download logs for a specific benchmark run only
--scenario <id>           Download logs for a specific scenario run only

Best Practices

  1. Start with a small subset: Test your configuration with a few scenarios before running a full benchmark.
  2. Use meaningful job names: Name your jobs descriptively to make them easy to find and reuse later.
  3. Monitor long-running jobs: Use rli benchmark-job watch to track progress, or check back with rli benchmark-job list.
  4. Export results: Use -o json to export results for analysis or CI/CD integration.
  5. Tune parallelism: Increase --n-concurrent-trials for faster execution, but be mindful of rate limits on external APIs.
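As a sketch of the export tip above, a CI step could gate on the JSON summary. The field names used here (scenarios_passed, scenarios_total) are illustrative assumptions, not the documented schema; inspect your actual -o json output to find the real keys:

```shell
# In CI you would export the real summary first, e.g.:
#   rli benchmark-job summary <job_id> -o json > summary.json
# For illustration, write a hand-made summary with assumed field names:
cat > summary.json <<'EOF'
{"job_id": "bjob_example", "scenarios_total": 10, "scenarios_passed": 9}
EOF

# Fail the step if the pass rate drops below 90%
python3 - <<'EOF'
import json

with open("summary.json") as f:
    summary = json.load(f)

rate = summary["scenarios_passed"] / summary["scenarios_total"]
print(f"pass rate: {rate:.0%}")
assert rate >= 0.9, "pass rate below CI threshold"
EOF
```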

Next Steps