Hold every variable constant except the one you are measuring
Runloop's AI model evaluation infrastructure isolates a single variable and measures its effect. Every evaluation runs on identical scenario sets in isolated sandboxes, producing results that are directly comparable. You define the evaluation parameters through a BenchmarkJobDef template; the platform runs it and delivers structured, comparable output through the results API.
```python
import runloop

# Define a model comparison evaluation
job = runloop.benchmark_jobs.create(
    name="claude-vs-gpt4-codebase-eval",
    benchmark_def="internal-monorepo-suite",
    variants=[
        {"model": "claude-sonnet-4-5", "agent": "coding-agent-v3"},
        {"model": "gpt-4.1", "agent": "coding-agent-v3"},
    ],
    config={"concurrency": 100, "timeout_seconds": 600},
)

# Compare results side-by-side
comparison = runloop.benchmark_jobs.compare(job.id)
print(comparison.summary_table())
```

```typescript
import Runloop from "runloop";

const runloop = new Runloop();

// Define a model comparison evaluation
const job = await runloop.benchmarkJobs.create({
  name: "claude-vs-gpt4-codebase-eval",
  benchmarkDef: "internal-monorepo-suite",
  variants: [
    { model: "claude-sonnet-4-5", agent: "coding-agent-v3" },
    { model: "gpt-4.1", agent: "coding-agent-v3" },
  ],
  config: { concurrency: 100, timeoutSeconds: 600 },
});

// Compare results side-by-side
const comparison = await runloop.benchmarkJobs.compare(job.id);
console.log(comparison.summaryTable());
```

```bash
# Run a model comparison evaluation
runloop benchmark run \
  --name "claude-vs-gpt4-codebase-eval" \
  --benchmark-def "internal-monorepo-suite" \
  --variant "model=claude-sonnet-4-5,agent=coding-agent-v3" \
  --variant "model=gpt-4.1,agent=coding-agent-v3" \
  --concurrency 100 \
  --timeout 600

# Compare results side-by-side
runloop benchmark compare --job claude-vs-gpt4-codebase-eval --format table
```
---
Evaluate models across the dimensions that matter
Four measurement axes that drive deployment decisions.
Does the agent produce the right output? Test suite pass rates and diff matches against known-good solutions, scored consistently across all variants you compare.
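As a rough sketch, correctness scoring of this kind reduces to the two signals above. The `correctness_score` helper below is hypothetical, not part of Runloop's API; it only illustrates combining a test pass rate with a diff match against a known-good solution:

```python
# Hypothetical helper, not Runloop's API: combine the two correctness
# signals described above into one structured score.
def correctness_score(tests_passed: int, tests_total: int,
                      produced_diff: str, reference_diff: str) -> dict:
    # Fraction of the scenario's test suite that passed
    pass_rate = tests_passed / tests_total if tests_total else 0.0
    # Exact match against the known-good solution diff
    diff_match = produced_diff.strip() == reference_diff.strip()
    return {"pass_rate": pass_rate, "diff_match": diff_match}

score = correctness_score(9, 10, "+fix()", "+fix()")
```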

Is the result repeatable? Consistency is measured across repeated attempts on identical tasks, and score distributions surface variance that pass-rate averages conceal.

Duration, tool calls, and execution cost reveal how agents reach solutions. Two models with identical pass rates may have dramatically different cost profiles. Token usage and cost-per-task tracking are planned and depend on the LLM Proxy integration.

Which tools did the agent reach for, how did it recover from errors, and did it explore dead ends? Qualitative differences that aggregate scores obscure. These behavioral metrics are planned and depend on the LLM Proxy integration.
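The reliability point above is easy to see with hypothetical numbers: two variants can share an average score while one is far less consistent. The per-run scores below are illustrative only:

```python
from statistics import mean, pstdev

# Hypothetical per-run scores for two variants on the same scenario set,
# repeated ten times each. The averages match; the distributions do not.
steady = [0.8] * 10
volatile = [1.0, 0.6] * 5

averages = (mean(steady), mean(volatile))    # both approximately 0.8
spread = (pstdev(steady), pstdev(volatile))  # approximately (0.0, 0.2)
```

A dashboard that reports only the average would rank these variants identically; the score distribution is what separates them.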

Model vendor selection backed by your own data
When a new model drops or contract renewal approaches, the question is always the same: should we switch? Leaderboard rankings compress complex performance into a single number. Two models with identical SWE-Bench scores can behave very differently on a codebase with heavy TypeScript, custom linters, or non-standard build tooling.

Runloop makes vendor selection empirical. Build a representative benchmark suite from tasks in your codebase, then run the same suite against every model you are evaluating. The comparison dashboard shows side-by-side results: pass rate, duration, and score distribution across all scenarios.
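Once the side-by-side numbers are in hand, the decision logic can be as simple as the sketch below. The candidate figures and tie-breaking rule are illustrative, not dashboard output:

```python
# Hypothetical aggregates mirroring the dashboard columns above.
candidates = {
    "claude-sonnet-4-5": {"pass_rate": 0.84, "avg_duration_s": 210},
    "gpt-4.1": {"pass_rate": 0.84, "avg_duration_s": 340},
}

# Prefer the higher pass rate; break ties on average duration.
best = min(
    candidates,
    key=lambda m: (-candidates[m]["pass_rate"],
                   candidates[m]["avg_duration_s"]),
)
```

With identical pass rates, the faster model wins the tie-break, which is exactly the kind of call a single leaderboard number cannot make for your workload.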

Evaluate every model release, not just the first one
Model evaluation is not a one-time event. Models update. Agent frameworks evolve. Dependencies change. Performance validated three months ago may no longer hold, and a model upgrade that improves aggregate benchmarks may quietly regress on the specific task categories your production workload depends on. BenchmarkJobDef templates let you define a standard evaluation configuration and re-run it whenever a variable changes. The platform stores results across runs, building a longitudinal evaluation history that shows how performance on your workload changes across model generations.
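A longitudinal history makes regressions mechanical to spot. The records below are hypothetical stand-ins for stored runs of one template; the field names are illustrative, not the results API schema:

```python
# Hypothetical aggregate pass rates from repeated runs of the same
# BenchmarkJobDef template across model generations.
history = [
    {"run": "2024-01", "model": "model-v1", "pass_rate": 0.78},
    {"run": "2024-04", "model": "model-v2", "pass_rate": 0.83},
    {"run": "2024-07", "model": "model-v3", "pass_rate": 0.81},
]

# Flag any run that regresses against the previous baseline.
regressions = [later["run"]
               for earlier, later in zip(history, history[1:])
               if later["pass_rate"] < earlier["pass_rate"]]
```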

From test suite to deployment decision
Four steps from evaluation definition to actionable data.
Select from public benchmarks like SWE-Bench Verified, curated scenario sets from academic research partners, or custom scenarios from your own repositories. Each scenario specifies a task, environment, and scoring contract.

Specify model and agent variants. Set orchestration parameters: concurrency, retry policy, timeout limits. Save as a BenchmarkJobDef template for repeated evaluation runs.
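The configuration a saved template captures might look like the dictionary below. The shape is a sketch built from the parameters named above, not Runloop's actual BenchmarkJobDef schema:

```python
# Illustrative only: the pieces a BenchmarkJobDef template would capture.
benchmark_job_def = {
    "name": "internal-monorepo-eval",
    "benchmark_def": "internal-monorepo-suite",
    "variants": [
        {"model": "claude-sonnet-4-5", "agent": "coding-agent-v3"},
        {"model": "gpt-4.1", "agent": "coding-agent-v3"},
    ],
    # Orchestration parameters: concurrency, retry policy, timeout limits
    "config": {
        "concurrency": 100,
        "retry_policy": {"max_retries": 2},
        "timeout_seconds": 600,
    },
}
```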

Submit through the API, CLI, or dashboard. The platform schedules across isolated sandboxes, manages retries, and aggregates results as scenarios complete. Monitor progress in real time.

View comparative results in the side-by-side dashboard. Filter by scenario to identify divergence patterns. Export structured results through the API for downstream analysis.
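Downstream analysis of exported results can surface divergence patterns with a simple group-by. The rows below are hypothetical and the grouping logic is a sketch, not the results API:

```python
from collections import defaultdict

# Hypothetical exported rows: one record per scenario run per model.
rows = [
    {"model": "a", "category": "typescript", "passed": True},
    {"model": "a", "category": "build", "passed": False},
    {"model": "b", "category": "typescript", "passed": False},
    {"model": "b", "category": "build", "passed": True},
]

# Pass rate per (model, scenario category) pair.
counts = defaultdict(lambda: [0, 0])  # passed, total
for row in rows:
    group = counts[(row["model"], row["category"])]
    group[0] += row["passed"]
    group[1] += 1
pass_rates = {key: passed / total for key, (passed, total) in counts.items()}
```

Disagreement between models within one category, as in these made-up rows, is the divergence pattern worth drilling into.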

What other platforms cannot offer
Three capabilities you will not find together on competing platforms.
Side-by-side comparison is a native platform capability, not a third-party integration. Structured results API returns comparison data programmatically.

Public benchmarks for baseline screening, curated sets from academic researchers, and custom benchmarks from your proprietary codebase.

BenchmarkJobDef templates capture the full evaluation configuration. Re-run with a single parameter change. Evaluation history accumulates automatically.

Model evaluation questions
Common questions about evaluating AI models and agents with Runloop's benchmarking infrastructure.
Public leaderboards rank models on standardized, publicly available tasks. The results tell you how a model performs on that specific benchmark suite, which may have little correlation with your production workload. Runloop lets you run the same controlled evaluation infrastructure on your own tasks, with your own scoring criteria, in private environments. You can also run public benchmarks like SWE-Bench Verified through Runloop when you need a standardized baseline, but the platform is designed for evaluations where you control the test suite and the results stay private to your organization.
Yes. Runloop supports custom benchmark suites built from your own repositories and task definitions. Scenarios run in isolated sandboxes with credential protection through the Credential Gateway, so your proprietary code and API keys are never exposed. See the [Custom Benchmarks](/solutions/custom-benchmarks) solution page for the full workflow and security model for private evaluation suites.
The comparison dashboard currently reports pass rate, average duration, and score distribution across all scenarios in a job. You can filter results by scenario category, drill into individual runs to inspect execution details, and export structured data through the results API. Token usage, cost-per-task, and tool call pattern metrics are planned capabilities that will become available once the LLM Proxy integration ships.
Save your evaluation configuration as a BenchmarkJobDef template. The template captures everything: the benchmark suite, the orchestration parameters, and the variant definitions. When a new model version is released, re-run the template with the updated model parameter. The platform stores results from every run, so you can compare the new version against the previous baseline directly in the comparison dashboard. For teams that want to automate this as part of a CI/CD pipeline, see [Regression Testing](/solutions/regression-testing).
Yes. The evaluation variant configuration accepts both a model parameter and an agent parameter. This means you can hold the model constant and compare two agent framework versions, or compare a new model running through your existing agent against the incumbent. The platform treats the model-agent pair as the unit of evaluation, so any combination of model provider and agent codebase can be tested as a variant.
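Since the model-agent pair is the unit of evaluation, a comparison across both axes is just a cross product. The sketch below builds such a variant grid; the model and agent names are illustrative:

```python
from itertools import product

# Hypothetical candidates for a two-axis comparison.
models = ["claude-sonnet-4-5", "gpt-4.1"]
agents = ["coding-agent-v3", "coding-agent-v4"]

# One variant per model-agent pair, matching the variants shape
# used in the evaluation examples above.
variants = [{"model": m, "agent": a} for m, a in product(models, agents)]
```

This yields four variants in one job: each model held constant against both agent versions, and each agent held constant against both models.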