Using RLI and Cloud Orchestration for Benchmarks on Runloop
James Chainey
Benchmarks

Run & Compare Agents in one command

In part 1 of our Benchmark Cloud Orchestrator announcement, we promised to make it easy to run your favorite benchmarks at scale with a single command on Runloop.

Today we'll walk through the benchmark-job command: how to launch a run, watch its status live, compare agents, and inspect the results.

Running Benchmark Jobs with RLI

Running a Benchmark Job doesn’t require writing code. You can drive the entire process with Runloop’s command-line tool, rli. This is the same tool you can use to spin up Devboxes, build blueprints, create secrets, configure network policies, and much more.

Example: Using RLI to Run a Benchmark Job

In this example, we launch the popular math benchmark, AIME, and compare two agent configurations side-by-side:

  • claude-code running Claude Haiku 4.5
  • codex running GPT-4o

We configure the benchmark to run every test case in the benchmark simultaneously and submit the job.

# Create and run a benchmark job.
# --agent               : agent to run (repeat the flag to compare multiple agents)
# --benchmark           : benchmark to run against
# -n                    : name for this job
# --n-concurrent-trials : number of trials to run in parallel
rli benchmark-job run \
  --agent "claude-code:claude-haiku-4-5-20251001" \
  --agent "codex:gpt-4o" \
  --benchmark "AIME" \
  -n "aime-math-showdown-haiku-vs-4o" \
  --n-concurrent-trials 60

The entire benchmark will run once for each agent specified.

Once the job starts, Runloop provisions a devbox per test case, executes tasks in parallel, and aggregates results as the run progresses. At this point you can close your terminal and walk away; the job continues running in the cloud until completion. You can also check the status of a Benchmark Job and stream live updates with rli:

# Watch the live status of a BenchmarkJob:
rli benchmark-job watch <benchmark_job_id>

After the Benchmark Job completes, the full run state, including results for each scenario, is available for inspection.

Inspecting the Results

To see the full results, use rli to display the Benchmark Job summary:

# Display the full job result summary:
rli benchmark-job summary -e <benchmark_job_id>

Compatibility & Supported Benchmarks

Agent Support

Benchmark Jobs currently support the following agents:

  • claude-code
  • codex
  • gemini-cli
  • opencode
  • goose
  • mini-swe-agent

Support for custom agents is available upon request.

Benchmark Support

Benchmark Jobs work out of the box with many hosted benchmarks. You can get started today with the following:

  • Aider Polyglot
  • AIME
  • ARC-AGI-2
  • BigCodeBench (Full, Hard, and Instruct sets)
  • GPQA Diamond
  • Replication Bench
  • SWE-Bench Pro
  • Terminal Bench 2

API Support for Benchmark Jobs

Benchmark Jobs are also available through the Runloop API, allowing teams to integrate benchmark execution directly into their own workflows.

This makes it straightforward to:

  • Trigger benchmark runs programmatically
  • Automate large-scale model comparisons
  • Integrate benchmark execution into CI pipelines

In other words, benchmark execution becomes infrastructure rather than a fragile script.
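As a sketch of what programmatic triggering could look like, the snippet below builds a job request and shows where it would be sent. The JSON field names and endpoint path here are illustrative assumptions, not documented API details; consult the Runloop API reference for the actual schema.

```shell
# Sketch: trigger a benchmark job from a script or CI step.
# NOTE: the JSON field names and endpoint path are assumptions for
# illustration only -- check the Runloop API docs for the real schema.
payload=$(cat <<'EOF'
{
  "name": "nightly-aime-regression",
  "benchmark": "AIME",
  "agents": ["claude-code:claude-haiku-4-5-20251001", "codex:gpt-4o"],
  "n_concurrent_trials": 60
}
EOF
)
echo "$payload"

# A real pipeline would then send the payload with an API key, for example:
# curl -s -X POST "https://api.runloop.ai/v1/benchmark_jobs" \
#   -H "Authorization: Bearer $RUNLOOP_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$payload"
```

Because the request is just JSON over HTTP, the same payload works from any language or CI system that can make an HTTP call.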

Learn More

Find out more about Benchmarks and Benchmark Jobs on our docs site.

Need help? Our support team is happy to assist.

Are you interested in running benchmarks at scale inside your VPC? Get in touch: sales@runloop.ai