Using RLI and Cloud Orchestration for Benchmarks on Runloop
Run & Compare Agents in one command
In part 1 of our Benchmark Cloud Orchestrator announcement, we said we would make it easy to run your favorite benchmarks, at scale, with a single command on Runloop.
Today we'll walk you through the benchmark-job command, show you how to watch the status of a run and explore the results. You'll learn how to run many benchmarks, compare agents, and inspect the outputs.
Running Benchmark Jobs with RLI
Running a Benchmark Job doesn’t require writing code. You can drive the entire process using Runloop’s interactive terminal interface: rli. This is the same command line tool you can use to spin up Devboxes, build blueprints, create secrets, configure network policies, and much more.
Example: Using RLI to Run a Benchmark Job
In this example, we launch the popular math benchmark, AIME, and compare two agent configurations side-by-side:
Claude: Haiku 4.5
Codex: GPT-4o
We configure the job to run every test case in the benchmark concurrently, then submit it.
# Create and run a benchmark job.
#   --agent                 agent to run; repeat the flag to run as many agents as you want
#   --benchmark             benchmark to run against
#   -n                      name for this job
#   --n-concurrent-trials   number of concurrent trials
rli benchmark-job run \
  --agent "claude-code:claude-haiku-4-5-20251001" \
  --agent "codex:gpt-4o" \
  --benchmark "AIME" \
  -n "aime-math-showdown-haiku-vs-4o" \
  --n-concurrent-trials 60
The entire benchmark will run once for each agent specified.
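To run many benchmarks with the same pair of agents, the command above can be wrapped in a small loop. The following is a minimal sketch rather than an official workflow: `run_benchmark_jobs`, `RLI`, and `N_CONCURRENT` are hypothetical names introduced here, the job-naming scheme is invented for illustration, and it assumes the `--benchmark` flag accepts the benchmark names as listed in this post (only "AIME" is confirmed by the example above). By default it performs a dry run that prints each command instead of submitting it.

```shell
# Dry run by default: prints each command. Set RLI=rli to actually submit jobs.
RLI="${RLI:-echo rli}"
# Number of concurrent trials per job (60 matches the AIME example above).
N_CONCURRENT="${N_CONCURRENT:-60}"

# Hypothetical helper: submit one Benchmark Job per benchmark,
# each comparing the same two agents.
# Usage: run_benchmark_jobs <agent_a> <agent_b> <benchmark>...
run_benchmark_jobs() {
  agent_a="$1"
  agent_b="$2"
  shift 2
  for benchmark in "$@"; do
    # Derive a job name from the benchmark: lowercase, spaces to hyphens.
    job_name="compare-$(printf '%s' "$benchmark" | tr 'A-Z ' 'a-z-')"
    $RLI benchmark-job run \
      --agent "$agent_a" \
      --agent "$agent_b" \
      --benchmark "$benchmark" \
      -n "$job_name" \
      --n-concurrent-trials "$N_CONCURRENT"
  done
}

# Example (dry run): one job for AIME, one for GPQA Diamond.
run_benchmark_jobs "claude-code:claude-haiku-4-5-20251001" "codex:gpt-4o" \
  "AIME" "GPQA Diamond"
```

Because each job runs the whole benchmark once per agent, this loop submits one job per benchmark and each job covers both agents.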
Once the job starts, Runloop provisions a Devbox per test case. Runloop executes tasks in parallel, and results are aggregated as the run progresses. At this point you can close your terminal and walk away; the job keeps running in the cloud until completion. You can also check the status of a Benchmark Job and follow live updates with rli:
# Watch the live status of a Benchmark Job:
rli benchmark-job watch <benchmark_job_id>
After the Benchmark Job completes, the full run state, including results for each scenario, is available for inspection.
Inspecting the Results
To see full results, simply use rli to display the Benchmark Job summary:
# Display the full job result summary:
rli benchmark-job summary -e <benchmark_job_id>
Compatibility & Supported Benchmarks
Agent Support
Benchmark Jobs currently support the following agents:
claude-code
codex
gemini-cli
opencode
goose
mini-swe-agent
Support for custom agents is available upon request.
Benchmark Support
Benchmark Jobs work out of the box with many hosted, compatible benchmarks. You can get started today with the following:
Aider Polyglot
AIME
ARC-AGI-2
BigCodeBench (Full, Hard, and Instruct sets)
GPQA Diamond
Replication Bench
SWE-Bench Pro
Terminal Bench 2
API Support for Benchmark Jobs
Benchmark Jobs are also available through the Runloop API, allowing teams to integrate benchmark execution directly into their own workflows.
This makes it straightforward to:
Trigger benchmark runs programmatically
Automate large-scale model comparisons
Integrate benchmark execution into CI pipelines
In other words, benchmark execution becomes infrastructure rather than a fragile script.
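As a rough illustration of the CI use case, the sketch below chains the two rli commands shown earlier and saves the summary as a build artifact. It uses the CLI rather than the API, whose endpoints this post does not describe. `report_benchmark_job` and `RLI` are hypothetical names, and the sketch assumes that `rli benchmark-job watch` returns once the job has finished and that both commands exit non-zero on failure; neither behavior is confirmed here.

```shell
# Dry run by default: prints each command. Set RLI=rli to call the real CLI.
RLI="${RLI:-echo rli}"

# Hypothetical CI step: follow a Benchmark Job, then save its summary to a file.
# Usage: report_benchmark_job <benchmark_job_id> <output_file>
report_benchmark_job() {
  job_id="$1"
  out_file="$2"
  if [ -z "$job_id" ] || [ -z "$out_file" ]; then
    echo "usage: report_benchmark_job <benchmark_job_id> <output_file>" >&2
    return 2
  fi
  # Assumption: watch returns when the job reaches a terminal state.
  $RLI benchmark-job watch "$job_id" || return 1
  # Write the full result summary where the CI system can collect it.
  $RLI benchmark-job summary -e "$job_id" > "$out_file" || return 1
}

# Example: report_benchmark_job "<benchmark_job_id>" benchmark-summary.txt
```

Keeping the job ID as an explicit argument avoids guessing how the run command reports it; a pipeline would need to pass it in from whichever step submitted the job.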
Learn More
Find out more about Benchmarks and Benchmark Jobs on our docs site.