

Use Runloop and W&B Weave to answer these questions with real benchmark data. Read Part 3 of our series.
This is Part 3 of our series on using rli to orchestrate AI workloads on Runloop's cloud. Catch up on Part 1 and Part 2 if you haven't already. This post is a companion piece to an entry on the Weights & Biases blog.
Building an AI agent is one thing. Knowing whether it's actually getting better is another. In this final installment, we show how to combine Runloop's Cloud Orchestrator with Weights & Biases Weave to run benchmarks at scale and get deep, actionable insight into every decision your agent makes.
This lets you stop guessing whether your changes improve performance and start measuring instead.
In Parts 1 and 2, we walked through using rli to configure and execute AI workloads on Runloop's cloud infrastructure. We covered project setup, environment management, and running agents in isolated, reproducible containers.
For those just joining us: in Part 1 we introduced the central challenge of using benchmarks to train models and evaluate performance: running them at scale.

Benchmarks are more than simple tools for measuring performance; they are also huge training sets. Seen through this lens, benchmarks and evals are the engine driving the AI ecosystem: benchmarks are how models are trained, how agents are evaluated, and how errors are corrected. Running benchmarks at scale, however, is challenging. Most benchmarks use Docker containers as the standard way of parallelizing task execution. This works well on a single system, but there's a fundamental difference between running test cases on one local machine and running an enormous dataset with tens of thousands of test cases. At that scale, executing each test case in parallel in a secure, isolated environment requires specialized infrastructure. Runloop was designed to solve these kinds of problems: we already support execution of tens of thousands of concurrent test cases, each one using a different container image. The missing piece was excellent user ergonomics for running benchmarks at that scale.
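To make the scale problem concrete, here is a minimal stdlib-only sketch of the fan-out pattern such infrastructure has to support. `run_trial` is a hypothetical stand-in for launching one isolated test case; a real orchestrator would boot a container per trial rather than a thread.

```python
from concurrent.futures import ThreadPoolExecutor

def run_trial(task_id: str) -> dict:
    # Hypothetical stand-in: in reality this would launch an isolated
    # container, run the test case inside it, and collect the verdict.
    return {"task": task_id, "passed": hash(task_id) % 2 == 0}

def run_benchmark(task_ids: list[str], n_concurrent: int = 60) -> list[dict]:
    # Fan trials out across a bounded pool of workers, mirroring the
    # concurrency knob a real orchestrator exposes.
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        return list(pool.map(run_trial, task_ids))

results = run_benchmark([f"task-{i}" for i in range(10)])
```

The bounded pool is the key design choice: at tens of thousands of test cases, unbounded fan-out exhausts resources, so concurrency has to be an explicit, tunable limit.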
In part 2, we introduced Runloop's solution to this problem: Cloud Orchestrated Benchmarks. With our cloud orchestrator, you can run popular public benchmarks or create and execute your own custom benchmarks. We showed you how to use rli to create a benchmark job and use it to evaluate and contrast agent performance. Now you can make apples-to-apples comparisons between agents on a benchmark and run every single test case in the entire benchmark in mere minutes.
Spinning up sandboxes and running tests isn't where the story ends. Running a benchmark is only half the job; the execution needs to be paired with analysis. With fast, repeatable benchmarks, iterating on agent development becomes tangible and measurable.
Runloop benchmarks make it trivial to define and run many different tasks in parallel. By collecting structured results, Runloop turns agent development into an empirical discipline. You don't guess if a change helped. You measure it across dozens or hundreds of runs and know for sure.
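As a sketch of what "measure, don't guess" looks like in practice, assuming each run yields a structured list of per-task verdicts (the field names here are illustrative, not Runloop's actual result schema):

```python
def pass_rate(results: list[dict]) -> float:
    # Fraction of trials in one benchmark run that passed.
    return sum(r["passed"] for r in results) / len(results)

# Two hypothetical runs of the same 100-task benchmark: one with the
# baseline agent, one after a prompt change.
baseline = [{"task": i, "passed": i % 3 != 0} for i in range(100)]
candidate = [{"task": i, "passed": i % 4 != 0} for i in range(100)]

delta = pass_rate(candidate) - pass_rate(baseline)
```

A positive delta that holds up across many runs is evidence the change helped; a single run is just an anecdote.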
A benchmark is only useful if it's consistent and reproducible. Runloop provides both by executing every trial in an isolated cloud environment—same container image, same dependencies, same infrastructure—so that variance in results reflects actual agent behavior, not environmental drift.
Running the benchmark is a single command:
rli benchmark-job run \
  --agent "claude-code:claude-haiku-4-5-20251001" \
  --agent "codex:gpt-4o" \
  --benchmark "AIME" \
  -n "aime-math-showdown-haiku-vs-4o" \
  --n-concurrent-trials 60

This spins up the required cloud resources, executes your agent against every task in the benchmark definition, and collects the results. Need to test a prompt change against your baseline? Run it again with a different agent image or environment variable. Need statistical confidence? Run many trials. Runloop handles the orchestration so you can focus on the experiment.
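On the "statistical confidence" point: with n trials and k passes, a confidence interval on the pass rate tells you whether an observed difference between two runs is signal or noise. A minimal sketch using the standard Wilson score interval (not a Runloop feature, just the underlying statistics):

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score confidence interval for k passes in n trials.
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin, center + margin)

low, high = wilson_interval(66, 100)  # e.g. 66 of 100 tasks passed
```

If two agents' intervals overlap heavily, you need more trials before declaring a winner—which is exactly why cheap, parallel re-runs matter.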
A pass/fail score tells you what happened. Trajectories tell you why. That's where Weights & Biases Weave comes in.
Weave is a platform purpose-built for rich experiment logging, visualization, and comparison. When integrated with Runloop, it captures the full story of every benchmark run—not just the final score, but every intermediate step, tool call, reasoning trace, and token count along the way.
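To make "capturing the full story" concrete, here is a stdlib-only sketch of the kind of trajectory record such logging accumulates per trial. The field names are illustrative, not Weave's actual schema:

```python
import json

# Illustrative trajectory for one trial: every step the agent took,
# with tool calls, reasoning traces, and token counts attached.
trajectory = {
    "trial_id": "trial-001",
    "agent": "claude-code",
    "score": 1.0,
    "steps": [
        {"role": "assistant", "reasoning": "Search the repo for the bug.",
         "tool_call": {"name": "search", "args": {"query": "IndexError"}},
         "tokens": 142},
        {"role": "tool", "output": "src/parser.py:88", "tokens": 12},
    ],
}

# Structured records like this are what comparison tooling can slice:
total_tokens = sum(step["tokens"] for step in trajectory["steps"])
tool_calls = [s["tool_call"]["name"]
              for s in trajectory["steps"] if "tool_call" in s]
serialized = json.dumps(trajectory)
```

Because every step is structured rather than buried in a log file, questions like "which tools did the agent call, and how many tokens did it spend" become simple queries instead of grep sessions.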
This kind of structured logging feeds directly into Weave's visualization and comparison tools, giving you the ability to answer the questions that actually matter during development.
This goes well beyond a leaderboard number. It's the kind of instrumentation that makes the difference between "the score dropped 2%" and "the score dropped 2% because the agent stopped using the search tool on multi-hop questions after we changed the system prompt."
The real power of combining Runloop and Weave isn't in any single feature—it's in the workflow they enable together.
Runloop gives you fast, scalable, reproducible benchmark execution. Weave gives you deep, structured analysis of what happened during those runs. Together, they create a tight feedback loop: make a change, run your benchmarks, inspect the trajectories, and decide what to do next—all backed by data, not intuition.
Review your completed benchmark jobs directly from Runloop's CLI:
# Show summary of benchmark-job execution results
rli benchmark-job summary -e <benchmark_job_id>
Then dive into the details in Weave to compare different benchmark runs side-by-side, analyze trajectory diffs, and pinpoint exactly where performance shifted.
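Here is a sketch of the kind of trajectory-level comparison this enables, assuming each run's results are keyed by task id (hypothetical data, not an actual Weave API call):

```python
def flipped_tasks(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    # Tasks whose outcome changed between two runs: these are exactly
    # the trajectories worth inspecting side by side.
    return {
        "regressions": sorted(t for t in baseline
                              if baseline[t] and not candidate[t]),
        "fixes": sorted(t for t in baseline
                        if not baseline[t] and candidate[t]),
    }

baseline = {"task-1": True, "task-2": False, "task-3": True}
candidate = {"task-1": True, "task-2": True, "task-3": False}
diff = flipped_tasks(baseline, candidate)
# diff["regressions"] == ["task-3"], diff["fixes"] == ["task-2"]
```

The aggregate score between these two runs is unchanged (2 of 3 passing), which is precisely why per-task diffs matter: they surface the regression the leaderboard number hides.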
This combination is especially valuable when you're iterating quickly on prompts, tools, and agent configurations.
The future of agent development looks a lot like the best practices we already know from software engineering and ML training: measure, analyze, optimize, repeat. Runloop and Weave make that loop fast and convenient enough to actually use on every change.
If you've followed along with Parts 1 and 2, you already have everything you need to run benchmarks at scale.