Using RLI and Cloud Orchestration for Benchmarks on Runloop
James Chainey
Benchmarks

Run & Compare Agents in one command

In part 1 of our Benchmark Cloud Orchestrator announcement, we promised to make it easy to run your favorite benchmarks at scale with a single command on Runloop.

Today we'll walk through the benchmark-job command: how to launch a run, watch its status live, compare agents, and inspect the results.

Running Benchmark Jobs with RLI

Running a Benchmark Job doesn’t require writing code. You can drive the entire process with Runloop’s command-line tool, rli. This is the same tool you can use to spin up Devboxes, build blueprints, create secrets, configure network policies, and much more.

Example: Using RLI to Run a Benchmark Job

In this example, we launch the popular math benchmark, AIME, and compare two agent configurations side-by-side:

  • claude-code running Claude Haiku 4.5
  • codex running GPT-4o

We configure the benchmark to run every test case in the benchmark simultaneously and submit the job.

# Create and run a benchmark job.
# --agent               : agent to run (repeat the flag to compare multiple agents)
# --benchmark           : benchmark to run against
# -n                    : name for this job
# --n-concurrent-trials : number of trials to run in parallel
rli benchmark-job run \
  --agent "claude-code:claude-haiku-4-5-20251001" \
  --agent "codex:gpt-4o" \
  --benchmark "AIME" \
  -n "aime-math-showdown-haiku-vs-4o" \
  --n-concurrent-trials 60

The entire benchmark will run once for each agent specified.

Once the job starts, Runloop provisions a devbox per test case, executes tasks in parallel, and aggregates results as the run progresses. At this point you can close your terminal and walk away; the job continues running in the cloud until completion. You can also check the status of a Benchmark Job and stream live updates with rli:

# Watch the live status of a BenchmarkJob:
rli benchmark-job watch <benchmark_job_id>

After the Benchmark Job completes, the full run state, including results for each scenario, is available for inspection.

Inspecting the Results

To see the full results, use rli to display the Benchmark Job summary:

# Display the full job result summary:
rli benchmark-job summary -e <benchmark_job_id>

Compatibility & Supported Benchmarks

Agent Support

Benchmark Jobs currently support the following agents:

  • claude-code
  • codex
  • gemini-cli
  • opencode
  • goose
  • mini-swe-agent

Support for custom agents is available upon request.

Benchmark Support

Benchmark Jobs work out of the box with many hosted benchmarks. You can get started today with the following:

  • Aider Polyglot
  • AIME
  • ARC-AGI-2
  • BigCodeBench (Full, Hard, and Instruct sets)
  • GPQA Diamond
  • Replication Bench
  • SWE-Bench Pro
  • Terminal Bench 2

API Support for Benchmark Jobs

Benchmark Jobs are also available through the Runloop API, allowing teams to integrate benchmark execution directly into their own workflows.

This makes it straightforward to:

  • Trigger benchmark runs programmatically
  • Automate large-scale model comparisons
  • Integrate benchmark execution into CI pipelines

In other words, benchmark execution becomes infrastructure rather than a fragile script.
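As a sketch of what programmatic triggering could look like, the snippet below builds a job request and shows where it would be sent. The JSON field names and endpoint path here are illustrative assumptions, not documented API details; consult the Runloop API reference for the actual schema.

```shell
# Sketch: trigger a benchmark job from a script or CI step.
# NOTE: the JSON field names and endpoint path are assumptions for
# illustration only -- check the Runloop API docs for the real schema.
payload=$(cat <<'EOF'
{
  "name": "nightly-aime-regression",
  "benchmark": "AIME",
  "agents": ["claude-code:claude-haiku-4-5-20251001", "codex:gpt-4o"],
  "n_concurrent_trials": 60
}
EOF
)
echo "$payload"

# A real pipeline would then send the payload with an API key, for example:
# curl -s -X POST "https://api.runloop.ai/v1/benchmark_jobs" \
#   -H "Authorization: Bearer $RUNLOOP_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$payload"
```

Because the request is just JSON over HTTP, the same payload works from any language or CI system that can make an HTTP call.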

Learn More

Find out more about Benchmarks and Benchmark Jobs on our docs site.

Need help? Our support team is happy to assist.

Are you interested in running benchmarks at scale inside your VPC? Get in touch: sales@runloop.ai