From Local Execution to Cloud Orchestration


Run & Compare Agents in one command
In part 1 of our Benchmark Cloud Orchestrator announcement we told you that we would make it easy to run your favorite benchmarks, at scale, with a single command on Runloop.
Today we'll walk you through the benchmark-job command, show you how to watch the status of a run and explore the results. You'll learn how to run many benchmarks, compare agents, and inspect the outputs.
Running a Benchmark Job doesn’t require writing code. You can drive the entire process using Runloop’s interactive terminal interface: rli. This is the same command line tool you can use to spin up Devboxes, build blueprints, create secrets, configure network policies, and much more.
In this example, we launch the popular math benchmark, AIME, and compare two agent configurations side by side.
We configure the job to run every test case in the benchmark simultaneously, then submit it:
# Create and run a benchmark job.
# Pass --agent once per agent (run as many agents as you want),
# --benchmark for the benchmark to run against, -n to name the job,
# and --n-concurrent-trials for the number of concurrent trials.
rli benchmark-job run \
--agent "claude-code:claude-haiku-4-5-20251001" \
--agent "codex:gpt-4o" \
--benchmark "AIME" \
-n "aime-math-showdown-haiku-vs-4o" \
--n-concurrent-trials 60
The entire benchmark will run once for each agent specified.
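When the list of agents grows, repeating `--agent` by hand gets error-prone. The sketch below builds the same command from a Bash array and prints it rather than running it. It uses only the flags shown above; the `AGENTS` array and the `build_benchmark_cmd` helper are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Agents to compare (the two from the example above); add more entries as needed.
AGENTS=(
  "claude-code:claude-haiku-4-5-20251001"
  "codex:gpt-4o"
)

# Hypothetical helper: prints the rli command with one --agent flag per
# entry in AGENTS. It only prints; it does not launch a job.
build_benchmark_cmd() {
  local cmd=(rli benchmark-job run)
  local agent
  for agent in "${AGENTS[@]}"; do
    cmd+=(--agent "$agent")
  done
  cmd+=(--benchmark "AIME" -n "aime-math-showdown-haiku-vs-4o" --n-concurrent-trials 60)
  # Space-joined output is safe here because no argument contains whitespace.
  echo "${cmd[*]}"
}

build_benchmark_cmd
```

Once the printed command looks right, paste it into your terminal to submit the job.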
Once the job starts, Runloop provisions a Devbox per test case. Tasks execute in parallel, and results are aggregated as the run progresses. At this point you can close your terminal and walk away; the job keeps running in the cloud until completion. You can also check the status of a Benchmark Job and get live updates using rli:
# Watch the live status of a BenchmarkJob:
rli benchmark-job watch <benchmark_job_id>
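As a rough sizing aid for `--n-concurrent-trials`: since the benchmark runs once per agent and each test case gets its own Devbox, the total number of trials is roughly test cases times agents. The sketch below estimates how many sequential batches that implies. It assumes the concurrency limit applies across all agents in the job, which the text above does not state; the `estimate_waves` helper and the test-case count of 30 are hypothetical placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: number of sequential batches needed to run every trial.
# Usage: estimate_waves <test_cases> <agents> <n_concurrent_trials>
estimate_waves() {
  local test_cases=$1 agents=$2 concurrency=$3
  # One trial per test case per agent.
  local total=$(( test_cases * agents ))
  # Ceiling division using integer arithmetic.
  echo $(( (total + concurrency - 1) / concurrency ))
}

# Placeholder numbers: 30 test cases is an assumed benchmark size.
estimate_waves 30 2 60   # prints 1: all 60 trials fit in a single batch
```

Under these assumptions, adding two more agents at the same concurrency (`estimate_waves 30 4 60`) would give 2 batches.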
After the Benchmark Job completes, the full run state, including results for each scenario, is available for inspection.
To see full results, simply use rli to display the Benchmark Job summary:
# Display the full job result summary:
rli benchmark-job summary -e <benchmark_job_id>
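The two commands above can be chained into one step. This is a minimal sketch that assumes `rli benchmark-job watch` returns once the job finishes, which is not confirmed here. The `follow_and_summarize` helper and the `RLI` override variable are hypothetical names, not part of rli.

```shell
#!/usr/bin/env bash
set -euo pipefail

# RLI is a hypothetical override for the CLI binary; set RLI=echo for a dry run.
RLI="${RLI:-rli}"

# Hypothetical helper: follow a job's live status, then print its summary.
# Assumes `watch` exits when the job completes.
follow_and_summarize() {
  local job_id="${1:-}"
  if [[ -z "$job_id" ]]; then
    echo "usage: follow_and_summarize <benchmark_job_id>" >&2
    return 2
  fi
  "$RLI" benchmark-job watch "$job_id"
  "$RLI" benchmark-job summary -e "$job_id"
}
```

Call it as `follow_and_summarize <benchmark_job_id>`. With `RLI=echo` it only prints the two commands it would run, which is a cheap way to check the job ID before touching a real job.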
Today, Benchmark Jobs support the following agents:
Support for custom agents is available upon request.
Benchmark Jobs work out of the box with many hosted, compatible benchmarks. You can get started today with the following benchmarks:
Benchmark Jobs are also available through the Runloop API, allowing teams to integrate benchmark execution directly into their own workflows.
This makes it straightforward to:
In other words, benchmark execution becomes infrastructure rather than a fragile script.
Find out more about Benchmarks and Benchmark Jobs on our docs site.
Need help? Our support team would love to help.
Are you interested in running benchmarks at scale inside your VPC? Get in touch: sales@runloop.ai