Benchmarks

The Evaluation Platform for AI Agents

Run public benchmarks, build private evaluation suites, and integrate regression testing into your CI/CD pipeline -- all on infrastructure purpose-built for agent execution.

Public & Custom Benchmarks
Comparative Analysis
Credential Gateway
Parallel Orchestration
Scoring Contracts
CI/CD Integration

Evaluation is not just accuracy

Leaderboard rankings tell you which model scores highest on a generic task set. Production decisions require measurement across four dimensions.

Performance
Does the agent solve the task? Pass rate, duration, and score distribution across scenarios.
Model Evaluation →
Cost Efficiency
Two models with identical pass rates can differ 10x in token consumption and API cost.
Model Evaluation →
Compliance & Safety
Does the agent respect credential boundaries? Stay within its permitted tool set? Handle PII according to policy?
Agent Testing →
Business Logic
Does the agent follow your conventions, coding standards, and domain-specific rules?
Custom Benchmarks →
Runloop + Weights & Biases

Execution meets observability

Runloop runs your agent benchmarks. Weights & Biases tracks every metric, artifact, and experiment across runs. Together: a single workflow from benchmark execution to model selection.

Benchmark results stream directly into W&B experiments
Compare model performance, cost, and safety metrics in W&B dashboards
Track evaluation lineage from scenario to training decision
Trigger Runloop benchmark jobs from W&B pipelines
Runloop is the only AI agent sandbox provider with an integrated evaluation and benchmarking platform. E2B, Daytona, Modal, and other sandbox providers offer execution environments. None of them help you measure whether your agent is improving.
30,000+
concurrent environments
2 modes
interactive & orchestrated
3 tiers
public, curated, custom
<10ms
Credential Gateway latency
Every benchmark runs on the same infrastructure: isolated sandboxes, orchestrated execution, Credential Gateway protection, and structured result aggregation.
Evaluation Ecosystem

Built with the evaluation community

Runloop partners with industry, non-profit, and academic organizations working to define evaluation standards for AI agents.

MLCommons
HELM
LMSys
Berkeley EECS
Operational Modes

Two ways to run benchmarks

Every evaluation workflow runs through one of two modes. Choose based on your feedback loop speed and scale requirements.

Interactive

Interactive Benchmark Runs

Real-time execution with streaming results. Submit a benchmark, watch scenarios complete, and consume structured output as it arrives.

Online reinforcement learning with environment-grounded reward signals
CI/CD pipeline gates with immediate pass/fail on code push
Ad-hoc model comparison during development
Debugging individual scenario failures in real time
Orchestrated

Orchestrated Benchmark Runs

Declarative job submission with Harbor-compatible configuration. The platform handles provisioning, distribution, retry, and aggregation at scale.

Mass generation of training data for SFT and RFT loops
Evaluating multiple models against a standard benchmark suite
Agent selection across configurations at production scale
Scheduled regression runs across thousands of scenarios
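As a sketch of what a declarative orchestrated job might look like, assuming a Harbor-style YAML layout -- the field names here are illustrative, not the actual Harbor schema:

```yaml
# Illustrative orchestrated job spec -- field names are hypothetical,
# not the actual Harbor schema.
name: nightly-regression
benchmark_job_def: agent-v3-full-suite
variants:
  - model: claude-sonnet-4-5
    agent: code-agent-v3
config:
  concurrency: 100
  timeout_seconds: 300
  max_retries: 2        # platform-managed retry on transient failures
schedule: "0 2 * * *"   # run nightly at 02:00
```

The point of the declarative form is that provisioning, distribution, retry, and aggregation are the platform's job, not the spec author's.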
Benchmark Tiers

Three tiers of evaluation scenarios

From industry-standard suites to your own proprietary codebase.

Run industry-standard evaluations on demand

Access established benchmarks like SWE-Bench Verified directly through the platform. Results are private to your organization, not submitted to a public leaderboard.

Use public benchmarks as a first-pass filter before deeper evaluation
Establish a competitive baseline on well-known tasks
No infrastructure setup -- submit a job, receive structured results
Model Evaluation →

Isolate and measure specific capabilities

Curated benchmarks are themed scenario collections developed with academic researchers. Run 40 targeted scenarios instead of 300 broad ones.

Debugging, multi-file refactoring, test generation -- each measured independently
Safety-specific suites for boundary respect and credential handling
Standardized conditions for research teams publishing results
Custom Benchmarks →

Build private benchmark suites on your own code

Custom benchmarks use your own repositories, task definitions, and scoring criteria. Credentials managed through the Credential Gateway.

Compare models against your actual workload -- generic benchmarks do not predict performance on your code
Score agents on data handling, PII rules, and access control alongside correctness
Proprietary code never leaves your security boundary
Custom Benchmarks →
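A custom scenario bundles a repository, a task definition, and scoring criteria. A minimal sketch of that shape in Python -- field names, repo paths, and weights are hypothetical, not the platform's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative shape of a custom benchmark scenario -- field names are
# hypothetical stand-ins, not the platform's actual schema.
@dataclass
class Scenario:
    name: str
    repo: str      # your private repository (never leaves your boundary)
    task: str      # instruction handed to the agent
    scoring: dict = field(default_factory=dict)  # criterion -> weight

suite = [
    Scenario(
        name="fix-flaky-test",
        repo="git@internal:platform/api.git",  # hypothetical repo
        task="Make the checkout retry test deterministic without loosening assertions.",
        scoring={"correctness": 0.6, "coding_standards": 0.2, "pii_handling": 0.2},
    ),
]

# Keep weights normalized so scores stay comparable across scenarios.
assert all(abs(sum(s.scoring.values()) - 1.0) < 1e-9 for s in suite)
```

Keeping business-logic criteria (coding standards, PII handling) as explicit weighted dimensions is what lets them be gated alongside correctness.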
Use Cases

What teams build on the platform

The two operational modes and three benchmark tiers combine to support these evaluation workflows.

Side-by-side model comparison

Hold one variable constant, change another, and compare results. The comparison dashboard surfaces where configurations diverge on per-scenario behavior.

Run identical scenarios across multiple configurations in a single Benchmark Job
Scenario-level drill-down to find where models actually differ
Model Evaluation →
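The scenario-level drill-down reduces to a diff over per-scenario outcomes. A minimal sketch in Python, using hand-written stand-in results rather than real platform output:

```python
# Per-scenario outcomes for two configurations -- hand-written stand-ins
# for the platform's structured results, not real output.
run_a = {"fix-auth-bug": "pass", "refactor-db": "pass", "gen-tests": "fail"}
run_b = {"fix-auth-bug": "pass", "refactor-db": "fail", "gen-tests": "fail"}

# Scenarios where the configurations diverge -- the interesting rows
# in a side-by-side comparison.
diverging = sorted(s for s in run_a if run_a[s] != run_b.get(s))
print(diverging)  # ['refactor-db']
```

Aggregate pass rates hide exactly this: both runs look similar in summary, but the divergence set shows where the models actually differ.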
Evaluation Metrics
METRIC                 STATUS
Pass rate              Available
Average duration       Available
Score distribution     Available
Token usage per task   Planned (LLM Proxy)
Cost per task          Planned (LLM Proxy)
Tool call patterns     Planned (LLM Proxy)

Regression testing in CI/CD

Define a baseline, then run it automatically when variables change: model updates, framework upgrades, prompt modifications. Results serve as CI/CD gates.

BenchmarkJobDef templates: reusable evaluation configurations
Gate on pass rate, duration, and compliance scores simultaneously
Regression Testing →
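The gate itself is a threshold check over aggregated results. A hedged sketch, assuming hypothetical thresholds and result fields rather than platform defaults:

```python
# Illustrative CI gate over aggregated benchmark results -- thresholds
# and result fields are hypothetical, not platform defaults.
def gate(results, min_pass_rate=0.95, max_avg_seconds=120, min_compliance=1.0):
    """Return True only if every gated metric clears its threshold."""
    pass_rate = sum(r["status"] == "pass" for r in results) / len(results)
    avg_seconds = sum(r["duration_s"] for r in results) / len(results)
    compliance = min(r["compliance_score"] for r in results)
    return (pass_rate >= min_pass_rate
            and avg_seconds <= max_avg_seconds
            and compliance >= min_compliance)

baseline = [
    {"status": "pass", "duration_s": 40, "compliance_score": 1.0},
    {"status": "pass", "duration_s": 55, "compliance_score": 1.0},
]
print(gate(baseline))  # True
```

In a pipeline the boolean maps to an exit code, e.g. `sys.exit(0 if gate(results) else 1)`, so a regression on any one dimension blocks the merge.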

Training data generation for RFT and SFT

Fine-tuning workloads use Runloop's sandbox infrastructure as the execution substrate for training loops. Scoring contracts encode correctness, compliance, safety, and business logic into the reward signal.

Environment-grounded scoring at training-loop throughput
Snapshot and restore at each step -- no rebuilding environments
Fine-Tuning →
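A scoring contract of this kind can be sketched as a weighted combination with a hard safety constraint. The weights and field names below are illustrative assumptions, not the platform's contract format:

```python
# Illustrative scoring contract -- folds correctness, compliance, safety,
# and business logic into one scalar reward. Weights are hypothetical.
WEIGHTS = {"correctness": 0.5, "compliance": 0.2, "safety": 0.2, "business_logic": 0.1}

def reward(scores: dict) -> float:
    """Weighted sum of per-dimension scores in [0, 1]."""
    if scores["safety"] == 0.0:  # hard constraint: unsafe rollouts earn nothing
        return 0.0
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

print(round(reward({"correctness": 1.0, "compliance": 1.0,
                    "safety": 1.0, "business_logic": 0.5}), 3))  # 0.95
```

Making safety a hard zero rather than a weighted term is one design choice: it prevents the optimizer from trading a safety violation for a correctness gain.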

Declarative benchmark job execution at scale

Submit a job specification. The platform handles everything between submission and structured results.

Submit Job Spec
Harbor YAML, BenchmarkJobDef references, or direct scenario lists.
Orchestrate Execution
Provision sandboxes, distribute scenarios, manage retry, enforce timeouts.
Monitor Progress
Real-time streaming through the Job Dashboard. Drill into individual scenarios.
Consume Results
Structured API output for programmatic consumption. Comparison dashboard for visual analysis.
Orchestration API

One API for every evaluation workflow

The same Benchmark Jobs API powers model evaluation, regression testing, and training data generation.

import runloop

# Submit a benchmark job -- same API for all evaluation workflows
job = runloop.benchmark_jobs.create(
    name="q1-model-evaluation",
    benchmark_job_def="agent-v3-full-suite",
    variants=[
        {"model": "claude-sonnet-4-5", "agent": "code-agent-v3"},
        {"model": "gpt-4.1", "agent": "code-agent-v3"},
    ],
    config={"concurrency": 100, "timeout_seconds": 300}
)

# Stream results as scenarios complete
for result in runloop.benchmark_jobs.stream_results(job.id):
    print(f"{result.scenario}: {result.status}")

import Runloop from 'runloop';

const runloop = new Runloop();

const job = await runloop.benchmarkJobs.create({
  name: 'q1-model-evaluation',
  benchmarkJobDef: 'agent-v3-full-suite',
  variants: [
    { model: 'claude-sonnet-4-5', agent: 'code-agent-v3' },
    { model: 'gpt-4.1', agent: 'code-agent-v3' },
  ],
  config: { concurrency: 100, timeoutSeconds: 300 }
});

for await (const result of runloop.benchmarkJobs.streamResults(job.id)) {
  console.log(`${result.scenario}: ${result.status}`);
}

# Submit a benchmark job
runloop benchmark create \
  --name "q1-model-evaluation" \
  --def "agent-v3-full-suite" \
  --variant "model=claude-sonnet-4-5,agent=code-agent-v3" \
  --variant "model=gpt-4.1,agent=code-agent-v3" \
  --concurrency 100

# Stream results
runloop benchmark stream <job-id>

From model selection to production monitoring

One evaluation platform for performance, cost, compliance, and safety -- across public benchmarks, private suites, and CI/CD integration.

Need enterprise deployment? Talk to Sales