

Use Runloop and W&B Weave to answer these questions with real benchmark data. Read Part 3 of our series.
This is Part 3 of our series on using rli to orchestrate AI workloads on Runloop's cloud. Catch up on Part 1 and Part 2 if you haven't already. This post is a companion piece to an entry on the Weights & Biases blog.
Building an AI agent is one thing. Knowing whether it's actually getting better is another. In this final installment, we show how to combine Runloop's Cloud Orchestrator with Weights & Biases Weave to run benchmarks at scale and get deep, actionable insight into every decision your agent makes.
This lets you stop guessing whether your changes improve performance and start measuring instead.
In Parts 1 and 2, we walked through using rli to configure and execute AI workloads on Runloop's cloud infrastructure. We covered project setup, environment management, and running agents in isolated, reproducible containers.
For those just joining us: in Part 1 we introduced the central challenge of using benchmarks to train models and evaluate performance: running them at scale.

Benchmarks are more than simple tools for measuring performance; they are also huge training sets. Seen through this lens, benchmarks and evals are the engine driving the AI ecosystem: benchmarks are how models are trained, how agents are evaluated, and how errors are corrected. Running benchmarks at scale, however, is challenging. Most benchmarks use Docker containers as the standard way of parallelizing task execution. This works well on a single system, but there's a fundamental difference between running test cases on one local machine and running an enormous dataset with tens of thousands of test cases. At that scale, executing each test case in parallel in a secure, isolated environment requires specialized infrastructure. Runloop was designed to solve these kinds of problems: we already support execution of tens of thousands of concurrent test cases, each one using a different container image. The missing piece was excellent user ergonomics for running benchmarks at that scale.
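To make the scale problem concrete, here is a minimal stdlib-only sketch of the fan-out pattern such infrastructure has to support. `run_trial` is a hypothetical stand-in for launching one isolated test case; a real orchestrator would boot a container per trial rather than a thread.

```python
from concurrent.futures import ThreadPoolExecutor

def run_trial(task_id: str) -> dict:
    # Hypothetical stand-in: in reality this would launch an isolated
    # container, run the test case inside it, and collect the verdict.
    return {"task": task_id, "passed": hash(task_id) % 2 == 0}

def run_benchmark(task_ids: list[str], n_concurrent: int = 60) -> list[dict]:
    # Fan trials out across a bounded pool of workers, mirroring the
    # concurrency knob a real orchestrator exposes.
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        return list(pool.map(run_trial, task_ids))

results = run_benchmark([f"task-{i}" for i in range(10)])
```

The bounded pool is the key design choice: at tens of thousands of test cases, unbounded fan-out exhausts resources, so concurrency has to be an explicit, tunable limit.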
In part 2, we introduced Runloop's solution to this problem: Cloud Orchestrated Benchmarks. With our cloud orchestrator, you can run popular public benchmarks or create and execute your own custom benchmarks. We showed you how to use rli to create a benchmark job and use it to evaluate and contrast agent performance. Now you can make apples-to-apples comparisons between agents on a benchmark and run every single test case in the entire benchmark in mere minutes.
Spinning up sandboxes and running tests isn't where the story ends. Running a benchmark is only half the job; the execution needs to be paired with analysis. With fast, repeatable benchmarks, iterating on agent development becomes tangible and measurable.
Runloop benchmarks make it trivial to define and run many different tasks in parallel. By collecting structured results, Runloop turns agent development into an empirical discipline. You don't guess if a change helped. You measure it across dozens or hundreds of runs and know for sure.
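As a sketch of what "measure, don't guess" looks like in practice, assuming each run yields a structured list of per-task verdicts (the field names here are illustrative, not Runloop's actual result schema):

```python
def pass_rate(results: list[dict]) -> float:
    # Fraction of trials in one benchmark run that passed.
    return sum(r["passed"] for r in results) / len(results)

# Two hypothetical runs of the same 100-task benchmark: one with the
# baseline agent, one after a prompt change.
baseline = [{"task": i, "passed": i % 3 != 0} for i in range(100)]
candidate = [{"task": i, "passed": i % 4 != 0} for i in range(100)]

delta = pass_rate(candidate) - pass_rate(baseline)
```

A positive delta that holds up across many runs is evidence the change helped; a single run is just an anecdote.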
A benchmark is only useful if it's consistent and reproducible. Runloop provides both by executing every trial in an isolated cloud environment—same container image, same dependencies, same infrastructure—so that variance in results reflects actual agent behavior, not environmental drift.
Running the benchmark is a single command:
rli benchmark-job run \
  --agent "claude-code:claude-haiku-4-5-20251001" \
  --agent "codex:gpt-4o" \
  --benchmark "AIME" \
  -n "aime-math-showdown-haiku-vs-4o" \
  --n-concurrent-trials 60

This spins up the required cloud resources, executes your agent against every task in the benchmark definition, and collects the results. Need to test a prompt change against your baseline? Run it again with a different agent image or environment variable. Need statistical confidence? Run many trials. Runloop handles the orchestration so you can focus on the experiment.
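On the "statistical confidence" point: with n trials and k passes, a confidence interval on the pass rate tells you whether an observed difference between two runs is signal or noise. A minimal sketch using the standard Wilson score interval (not a Runloop feature, just the underlying statistics):

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score confidence interval for k passes in n trials.
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin, center + margin)

low, high = wilson_interval(66, 100)  # e.g. 66 of 100 tasks passed
```

If two agents' intervals overlap heavily, you need more trials before declaring a winner—which is exactly why cheap, parallel re-runs matter.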
A pass/fail score tells you what happened. Trajectories tell you why. That's where Weights & Biases Weave comes in.
Weave is a platform purpose-built for rich experiment logging, visualization, and comparison. When integrated with Runloop, it captures the full story of every benchmark run—not just the final score, but every intermediate step, tool call, reasoning trace, and token count along the way.
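To make "capturing the full story" concrete, here is a stdlib-only sketch of the kind of trajectory record such logging accumulates per trial. The field names are illustrative, not Weave's actual schema:

```python
import json

# Illustrative trajectory for one trial: every step the agent took,
# with tool calls, reasoning traces, and token counts attached.
trajectory = {
    "trial_id": "trial-001",
    "agent": "claude-code",
    "score": 1.0,
    "steps": [
        {"role": "assistant", "reasoning": "Search the repo for the bug.",
         "tool_call": {"name": "search", "args": {"query": "IndexError"}},
         "tokens": 142},
        {"role": "tool", "output": "src/parser.py:88", "tokens": 12},
    ],
}

# Structured records like this are what comparison tooling can slice:
total_tokens = sum(step["tokens"] for step in trajectory["steps"])
tool_calls = [s["tool_call"]["name"]
              for s in trajectory["steps"] if "tool_call" in s]
serialized = json.dumps(trajectory)
```

Because every step is structured rather than buried in a log file, questions like "which tools did the agent call, and how many tokens did it spend" become simple queries instead of grep sessions.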
This kind of structured logging feeds directly into Weave's visualization and comparison tools, giving you the ability to answer the questions that actually matter during development.
This goes well beyond a leaderboard number. It's the kind of instrumentation that makes the difference between "the score dropped 2%" and "the score dropped 2% because the agent stopped using the search tool on multi-hop questions after we changed the system prompt."
The real power of combining Runloop and Weave isn't in any single feature—it's in the workflow they enable together.
Runloop gives you fast, scalable, reproducible benchmark execution. Weave gives you deep, structured analysis of what happened during those runs. Together, they create a tight feedback loop: make a change, run your benchmarks, inspect the trajectories, and decide what to do next—all backed by data, not intuition.
Review your completed benchmark jobs directly from Runloop's CLI:
# Show summary of benchmark-job execution results
rli benchmark-job summary -e <benchmark_job_id>
Then dive into the details in Weave to compare different benchmark runs side-by-side, analyze trajectory diffs, and pinpoint exactly where performance shifted.
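Here is a sketch of the kind of trajectory-level comparison this enables, assuming each run's results are keyed by task id (hypothetical data, not an actual Weave API call):

```python
def flipped_tasks(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    # Tasks whose outcome changed between two runs: these are exactly
    # the trajectories worth inspecting side by side.
    return {
        "regressions": sorted(t for t in baseline
                              if baseline[t] and not candidate[t]),
        "fixes": sorted(t for t in baseline
                        if not baseline[t] and candidate[t]),
    }

baseline = {"task-1": True, "task-2": False, "task-3": True}
candidate = {"task-1": True, "task-2": True, "task-3": False}
diff = flipped_tasks(baseline, candidate)
# diff["regressions"] == ["task-3"], diff["fixes"] == ["task-2"]
```

The aggregate score between these two runs is unchanged (2 of 3 passing), which is precisely why per-task diffs matter: they surface the regression the leaderboard number hides.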
This combination is especially valuable when you're iterating quickly on prompts, tools, and agent configurations.
The future of agent development looks a lot like the best practices we already know from software engineering and ML training: measure, analyze, optimize, repeat. Runloop and Weave make that loop fast and convenient enough to actually use on every change.
If you've followed along with Parts 1 and 2, you already have everything you need to run benchmarks at scale.