30,000+ concurrent environments

<10ms Credential Gateway latency

50ms command execution

<20ms MCP Hub routing
THE LEADERBOARD GAP
Public leaderboards rank models on standardized tasks and compress the result into a single number. A model that leads on SWE-Bench may underperform on your specific workload because your codebase uses different languages, repository structures, and tooling patterns. An agent's performance depends on the interaction between model, framework, tool configuration, and target environment. Without controlled evaluation infrastructure, teams make deployment decisions on incomplete data -- choosing models based on public rankings that may have no correlation with performance on the work that actually matters.
Learn how Runloop solves this
CONTROLLED EVALUATION INFRASTRUCTURE

Hold every variable constant except the one you are measuring

Runloop's AI model evaluation infrastructure isolates a single variable and measures its effect. Every evaluation runs on identical scenario sets in isolated sandboxes, producing results that are directly comparable. You define the evaluation parameters through a BenchmarkJobDef template; the platform runs it and delivers structured, comparable output through the results API.

```python
import runloop

# Define a model comparison evaluation
job = runloop.benchmark_jobs.create(
    name="claude-vs-gpt4-codebase-eval",
    benchmark_def="internal-monorepo-suite",
    variants=[
        {"model": "claude-sonnet-4-5", "agent": "coding-agent-v3"},
        {"model": "gpt-4.1", "agent": "coding-agent-v3"},
    ],
    config={"concurrency": 100, "timeout_seconds": 600}
)

# Compare results side-by-side
comparison = runloop.benchmark_jobs.compare(job.id)
print(comparison.summary_table())
```
npm install @runloop/api-client
```typescript
import Runloop from "@runloop/api-client";

const runloop = new Runloop();

// Define a model comparison evaluation
const job = await runloop.benchmarkJobs.create({
  name: "claude-vs-gpt4-codebase-eval",
  benchmarkDef: "internal-monorepo-suite",
  variants: [
    { model: "claude-sonnet-4-5", agent: "coding-agent-v3" },
    { model: "gpt-4.1", agent: "coding-agent-v3" },
  ],
  config: { concurrency: 100, timeoutSeconds: 600 },
});

// Compare results side-by-side
const comparison = await runloop.benchmarkJobs.compare(job.id);
console.log(comparison.summaryTable());
```
```bash
# Run a model comparison evaluation
runloop benchmark run \
  --name "claude-vs-gpt4-codebase-eval" \
  --benchmark-def "internal-monorepo-suite" \
  --variant "model=claude-sonnet-4-5,agent=coding-agent-v3" \
  --variant "model=gpt-4.1,agent=coding-agent-v3" \
  --concurrency 100 \
  --timeout 600

# Compare results side-by-side
runloop benchmark compare --job claude-vs-gpt4-codebase-eval --format table
```

---

Evaluate models across the dimensions that matter

Four measurement axes that drive deployment decisions.

Correctness

Does the agent produce the right output? Test suite pass rates and diff matches against known-good solutions, scored consistently across all variants you compare.
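As a minimal sketch (not Runloop's actual scorer), correctness reduces to two checks: a pass rate over scenario-level test results, and a diff match against a known-good solution. The function names here are illustrative:

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of scenarios whose test suite passed."""
    if not results:
        return 0.0
    return sum(results) / len(results)

def diff_matches(candidate: str, known_good: str) -> bool:
    """Exact-match check of a produced diff against a known-good solution,
    ignoring leading/trailing whitespace on each line."""
    def norm(s: str) -> list[str]:
        return [line.strip() for line in s.strip().splitlines()]
    return norm(candidate) == norm(known_good)

# Example: 3 of 4 scenarios pass
print(pass_rate([True, True, True, False]))  # 0.75
```

Scoring every variant with the same functions is what makes the resulting numbers comparable.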

Reliability

Measures consistency across repeated attempts on identical tasks. Score distributions surface variance that pass-rate averages conceal.
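To see why averages conceal variance, consider this small sketch (illustrative, not platform code): two agents with the same mean score but very different run-to-run spread.

```python
from statistics import mean, pstdev

def reliability_summary(scores: list[float]) -> dict:
    """Summarize repeated attempts on the same task: the mean alone can
    hide run-to-run variance, so report the spread as well."""
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),
        "min": min(scores),
        "max": max(scores),
    }

# Two agents with identical average scores but different consistency
steady = reliability_summary([0.8, 0.8, 0.8, 0.8])
flaky = reliability_summary([1.0, 1.0, 1.0, 0.2])
print(steady["stdev"], flaky["stdev"])
```

Both agents average 0.8, but only the distribution reveals that the second one occasionally fails badly.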

Efficiency

Duration, tool calls, and execution cost reveal how agents reach solutions. Two models with identical pass rates may have dramatically different cost profiles. Token usage and cost-per-task tracking are planned (requires LLM Proxy).
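A hypothetical aggregation over per-run metrics shows how identical pass rates can hide different cost profiles (field names are illustrative, not the platform's schema):

```python
def cost_profile(runs: list[dict]) -> dict:
    """Aggregate duration, tool usage, and execution cost across scenario
    runs. Token-level accounting is out of scope here."""
    n = len(runs)
    return {
        "avg_duration_s": sum(r["duration_s"] for r in runs) / n,
        "avg_tool_calls": sum(r["tool_calls"] for r in runs) / n,
        "cost_per_task": sum(r["cost_usd"] for r in runs) / n,
    }

# Same pass rate, very different cost profiles
model_a = cost_profile([{"duration_s": 40, "tool_calls": 6, "cost_usd": 0.12},
                        {"duration_s": 60, "tool_calls": 8, "cost_usd": 0.18}])
model_b = cost_profile([{"duration_s": 200, "tool_calls": 30, "cost_usd": 0.90},
                        {"duration_s": 240, "tool_calls": 34, "cost_usd": 1.10}])
print(model_a["cost_per_task"], model_b["cost_per_task"])
```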

Trajectory

Which tools did the agent reach for, how did it recover from errors, and did it explore dead ends? These are qualitative differences that aggregate scores obscure. Trajectory capture is planned (requires LLM Proxy).

From test suite to deployment decision

Four steps from evaluation definition to actionable data.

01
Define Your Test Suite

Select from public benchmarks like SWE-Bench Verified, curated scenario sets from academic research partners, or custom scenarios from your own repositories. Each scenario specifies a task, environment, and scoring contract.
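The three parts of a scenario can be sketched as a plain data structure. The field names below are hypothetical, chosen to illustrate the task/environment/scoring split rather than Runloop's actual schema:

```python
# Hypothetical scenario shape -- field names are illustrative, not Runloop's schema.
scenario = {
    "task": "Fix the failing date-parsing test in utils/dates.py",
    "environment": {
        "repo": "git@github.com:acme/internal-monorepo.git",
        "ref": "main",
        "setup": ["pip install -r requirements.txt"],
    },
    "scoring": {
        "contract": "test_suite",              # how success is judged
        "command": "pytest tests/test_dates.py",
        "pass_threshold": 1.0,                 # all tests must pass
    },
}
assert {"task", "environment", "scoring"} <= scenario.keys()
```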

02
Configure the Comparison

Specify model and agent variants. Set orchestration parameters: concurrency, retry policy, timeout limits. Save as a BenchmarkJobDef template for repeated evaluation runs.

03
Run the Evaluation

Submit through the API, CLI, or dashboard. The platform schedules across isolated sandboxes, manages retries, and aggregates results as scenarios complete. Monitor progress in real time.
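Monitoring a long-running job typically means polling its status until it reaches a terminal state. This generic helper is a sketch: `fetch_status` stands in for whatever call the benchmark-jobs API exposes (names here are assumptions, not the SDK's).

```python
import time

def poll_until_complete(fetch_status, interval_s=5.0, max_polls=120, sleep=time.sleep):
    """Poll a long-running evaluation job until it finishes.
    `fetch_status` is any callable returning a status string; in practice
    it would wrap a job-status API call."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in ("completed", "failed", "canceled"):
            return status
        sleep(interval_s)
    raise TimeoutError("evaluation did not finish within the polling budget")
```

Injecting `sleep` keeps the helper testable and lets callers swap in async-friendly waiting.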

04
Analyze and Compare

View comparative results in the side-by-side dashboard. Filter by scenario to identify divergence patterns. Export structured results through the API for downstream analysis.
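Once results are exported, finding divergence patterns is a simple filter. This sketch assumes a result shape of scenario id mapped to per-variant scores (an assumption for illustration, not the results API's actual payload):

```python
def divergent_scenarios(results: dict, threshold: float = 0.5) -> list[str]:
    """Flag scenarios where variants' scores differ by more than
    `threshold` -- the cases worth reading closely.
    `results` maps scenario id -> {variant name -> score}."""
    flagged = []
    for scenario_id, scores in results.items():
        values = list(scores.values())
        if max(values) - min(values) > threshold:
            flagged.append(scenario_id)
    return flagged

results = {
    "fix-flaky-test": {"claude-sonnet-4-5": 1.0, "gpt-4.1": 0.2},
    "rename-module": {"claude-sonnet-4-5": 1.0, "gpt-4.1": 0.9},
}
print(divergent_scenarios(results))  # ['fix-flaky-test']
```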


No other AI agent infrastructure platform offers CI/CD-integrated benchmark evaluation as a product capability. Regression testing for AI agents is not a feature competitors provide -- it is a category Runloop created.

What other platforms cannot offer

Three capabilities with zero equivalents across competing platforms.

Comparative Analysis

Side-by-side comparison is a native platform capability, not a third-party integration. Structured results API returns comparison data programmatically.

Three-Tier Library

Public benchmarks for baseline screening, curated sets from academic researchers, and custom benchmarks from your proprietary codebase.

Persistent Templates

BenchmarkJobDef templates capture the full evaluation configuration. Re-run with a single parameter change. Evaluation history accumulates automatically.
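The "one parameter change" workflow can be sketched as a shallow merge over a saved template (illustrative only, not the BenchmarkJobDef API itself):

```python
def rerun_config(template: dict, **overrides) -> dict:
    """Derive a new evaluation config from a saved template, changing
    only the named parameters. Shallow merge; the original template
    is left untouched."""
    return {**template, **overrides}

template = {
    "benchmark_def": "internal-monorepo-suite",
    "model": "claude-sonnet-4-5",
    "concurrency": 100,
}
# Re-run the same suite against a new model with a single change
next_run = rerun_config(template, model="gpt-4.1")
print(next_run["model"], next_run["benchmark_def"])
```

Because the template is immutable, every derived run stays traceable to the configuration it came from.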

F.A.Q

Model evaluation questions

Common questions about evaluating AI models and agents with Runloop's benchmarking infrastructure.

How is Runloop model evaluation different from public leaderboards?
What is a 'silent regression' in AI agents?
Can I evaluate models on my proprietary codebase?
What metrics does the comparison dashboard show?
How do I set up recurring evaluations for model updates?
Does Runloop support evaluating custom agent frameworks, not just models?
More questions? Visit our docs or send us a message