Benchmarks

The Evaluation Platform for AI Agents

Run public benchmarks, build private evaluation suites, and integrate regression testing into your CI/CD pipeline -- all on infrastructure purpose-built for agent execution.

Public & Custom Benchmarks
Comparative Analysis
Credential Gateway
Parallel Orchestration
Scoring Contracts
CI/CD Integration

Evaluation is not just accuracy

Leaderboard rankings tell you which model scores highest on a generic task set. Production decisions require measurement across four dimensions.

Performance
Does the agent solve the task? Pass rate, duration, and score distribution across scenarios.
Model Evaluation →
Cost Efficiency
Two models with identical pass rates can differ 10x in token consumption and API cost.
Model Evaluation →
Compliance & Safety
Does the agent respect credential boundaries? Stay within its permitted tool set? Handle PII according to policy?
Agent Testing →
Business Logic
Does the agent follow your conventions, coding standards, and domain-specific rules?
Custom Benchmarks →
Runloop + Weights & Biases

Execution meets observability

Runloop runs your agent benchmarks. Weights & Biases tracks every metric, artifact, and experiment across runs. Together: a single workflow from benchmark execution to model selection.

Benchmark results stream directly into W&B experiments
Compare model performance, cost, and safety metrics in W&B dashboards
Track evaluation lineage from scenario to training decision
Trigger Runloop benchmark jobs from W&B pipelines
Runloop is the only AI agent sandbox provider with an integrated evaluation and benchmarking platform. E2B, Daytona, Modal, and other sandbox providers offer execution environments. None of them help you measure whether your agent is improving.
30,000+
concurrent environments
2 modes
interactive & orchestrated
3 tiers
public, curated, custom
<10ms
Credential Gateway latency
Every benchmark runs on the same infrastructure: isolated sandboxes, orchestrated execution, Credential Gateway protection, and structured result aggregation.
Evaluation Ecosystem

Built with the evaluation community

Runloop partners with industry, non-profit, and academic organizations working to define evaluation standards for AI agents.

MLCommons
HELM
LMSys
Berkeley EECS
Operational Modes

Two ways to run benchmarks

Every evaluation workflow runs through one of two modes. Choose based on your feedback loop speed and scale requirements.

Interactive

Interactive Benchmark Runs

Real-time execution with streaming results. Submit a benchmark, watch scenarios complete, and consume structured output as it arrives.

Online reinforcement learning with environment-grounded reward signals
CI/CD pipeline gates with immediate pass/fail on code push
Ad-hoc model comparison during development
Debugging individual scenario failures in real time
Orchestrated

Orchestrated Benchmark Runs

Declarative job submission with Harbor-compatible configuration. The platform handles provisioning, distribution, retry, and aggregation at scale.

Mass generation of training data for SFT and RFT loops
Evaluating multiple models against a standard benchmark suite
Agent selection across configurations at production scale
Scheduled regression runs across thousands of scenarios
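As a sketch of what a declarative orchestrated job might look like, assuming a Harbor-style YAML layout -- the field names here are illustrative, not the actual Harbor schema:

```yaml
# Illustrative orchestrated job spec -- field names are hypothetical,
# not the actual Harbor schema.
name: nightly-regression
benchmark_job_def: agent-v3-full-suite
variants:
  - model: claude-sonnet-4-5
    agent: code-agent-v3
config:
  concurrency: 100
  timeout_seconds: 300
  max_retries: 2        # platform-managed retry on transient failures
schedule: "0 2 * * *"   # run nightly at 02:00
```

The point of the declarative form is that provisioning, distribution, retry, and aggregation are the platform's job, not the spec author's.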
Benchmark Tiers

Three tiers of evaluation scenarios

From industry-standard suites to your own proprietary codebase.

Run industry-standard evaluations on demand

Access established benchmarks like SWE-Bench Verified directly through the platform. Results are private to your organization, not submitted to a public leaderboard.

Use public benchmarks as a first-pass filter before deeper evaluation
Establish a competitive baseline on well-known tasks
No infrastructure setup -- submit a job, receive structured results
Model Evaluation →

Isolate and measure specific capabilities

Curated benchmarks are themed scenario collections developed with academic researchers. Run 40 targeted scenarios instead of 300 broad ones.

Debugging, multi-file refactoring, test generation -- each measured independently
Safety-specific suites for boundary respect and credential handling
Standardized conditions for research teams publishing results
Custom Benchmarks →

Build private benchmark suites on your own code

Custom benchmarks use your own repositories, task definitions, and scoring criteria. Credentials managed through the Credential Gateway.

Compare models against your actual workload -- generic benchmarks do not predict performance on your code
Score agents on data handling, PII rules, and access control alongside correctness
Proprietary code never leaves your security boundary
Custom Benchmarks →
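A custom scenario bundles a repository, a task definition, and scoring criteria. A minimal sketch of that shape in Python -- field names, repo paths, and weights are hypothetical, not the platform's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative shape of a custom benchmark scenario -- field names are
# hypothetical stand-ins, not the platform's actual schema.
@dataclass
class Scenario:
    name: str
    repo: str      # your private repository (never leaves your boundary)
    task: str      # instruction handed to the agent
    scoring: dict = field(default_factory=dict)  # criterion -> weight

suite = [
    Scenario(
        name="fix-flaky-test",
        repo="git@internal:platform/api.git",  # hypothetical repo
        task="Make the checkout retry test deterministic without loosening assertions.",
        scoring={"correctness": 0.6, "coding_standards": 0.2, "pii_handling": 0.2},
    ),
]

# Keep weights normalized so scores stay comparable across scenarios.
assert all(abs(sum(s.scoring.values()) - 1.0) < 1e-9 for s in suite)
```

Keeping business-logic criteria (coding standards, PII handling) as explicit weighted dimensions is what lets them be gated alongside correctness.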
Use Cases

What teams build on the platform

The two operational modes and three benchmark tiers combine to support these evaluation workflows.

Side-by-side model comparison

Hold one variable constant, change another, and compare results. The comparison dashboard surfaces where configurations diverge on per-scenario behavior.

Run identical scenarios across multiple configurations in a single Benchmark Job
Scenario-level drill-down to find where models actually differ
Model Evaluation →
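The scenario-level drill-down reduces to a diff over per-scenario outcomes. A minimal sketch in Python, using hand-written stand-in results rather than real platform output:

```python
# Per-scenario outcomes for two configurations -- hand-written stand-ins
# for the platform's structured results, not real output.
run_a = {"fix-auth-bug": "pass", "refactor-db": "pass", "gen-tests": "fail"}
run_b = {"fix-auth-bug": "pass", "refactor-db": "fail", "gen-tests": "fail"}

# Scenarios where the configurations diverge -- the interesting rows
# in a side-by-side comparison.
diverging = sorted(s for s in run_a if run_a[s] != run_b.get(s))
print(diverging)  # ['refactor-db']
```

Aggregate pass rates hide exactly this: both runs look similar in summary, but the divergence set shows where the models actually differ.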
Evaluation Metrics
METRIC                 STATUS
Pass rate              Available
Average duration       Available
Score distribution     Available
Token usage per task   Planned (LLM Proxy)
Cost per task          Planned (LLM Proxy)
Tool call patterns     Planned (LLM Proxy)

Regression testing in CI/CD

Define a baseline, then run it automatically when variables change: model updates, framework upgrades, prompt modifications. Results serve as CI/CD gates.

BenchmarkJobDef templates: reusable evaluation configurations
Gate on pass rate, duration, and compliance scores simultaneously
Regression Testing →
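The gate itself is a threshold check over aggregated results. A hedged sketch, assuming hypothetical thresholds and result fields rather than platform defaults:

```python
# Illustrative CI gate over aggregated benchmark results -- thresholds
# and result fields are hypothetical, not platform defaults.
def gate(results, min_pass_rate=0.95, max_avg_seconds=120, min_compliance=1.0):
    """Return True only if every gated metric clears its threshold."""
    pass_rate = sum(r["status"] == "pass" for r in results) / len(results)
    avg_seconds = sum(r["duration_s"] for r in results) / len(results)
    compliance = min(r["compliance_score"] for r in results)
    return (pass_rate >= min_pass_rate
            and avg_seconds <= max_avg_seconds
            and compliance >= min_compliance)

baseline = [
    {"status": "pass", "duration_s": 40, "compliance_score": 1.0},
    {"status": "pass", "duration_s": 55, "compliance_score": 1.0},
]
print(gate(baseline))  # True
```

In a pipeline the boolean maps to an exit code, e.g. `sys.exit(0 if gate(results) else 1)`, so a regression on any one dimension blocks the merge.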

Training data generation for RFT and SFT

Fine-tuning workloads use Runloop's sandbox infrastructure as the execution substrate for training loops. Scoring contracts encode correctness, compliance, safety, and business logic into the reward signal.

Environment-grounded scoring at training-loop throughput
Snapshot and restore at each step -- no rebuilding environments
Fine-Tuning →
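A scoring contract of this kind can be sketched as a weighted combination with a hard safety constraint. The weights and field names below are illustrative assumptions, not the platform's contract format:

```python
# Illustrative scoring contract -- folds correctness, compliance, safety,
# and business logic into one scalar reward. Weights are hypothetical.
WEIGHTS = {"correctness": 0.5, "compliance": 0.2, "safety": 0.2, "business_logic": 0.1}

def reward(scores: dict) -> float:
    """Weighted sum of per-dimension scores in [0, 1]."""
    if scores["safety"] == 0.0:  # hard constraint: unsafe rollouts earn nothing
        return 0.0
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

print(round(reward({"correctness": 1.0, "compliance": 1.0,
                    "safety": 1.0, "business_logic": 0.5}), 3))  # 0.95
```

Making safety a hard zero rather than a weighted term is one design choice: it prevents the optimizer from trading a safety violation for a correctness gain.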

Declarative benchmark job execution at scale

Submit a job specification. The platform handles everything between submission and structured results.

Submit Job Spec
Harbor YAML, BenchmarkJobDef references, or direct scenario lists.
Orchestrate Execution
Provision sandboxes, distribute scenarios, manage retry, enforce timeouts.
Monitor Progress
Real-time streaming through the Job Dashboard. Drill into individual scenarios.
Consume Results
Structured API output for programmatic consumption. Comparison dashboard for visual analysis.
Orchestration API

One API for every evaluation workflow

The same Benchmark Jobs API powers model evaluation, regression testing, and training data generation.

import runloop

# Submit a benchmark job -- same API for all evaluation workflows
job = runloop.benchmark_jobs.create(
    name="q1-model-evaluation",
    benchmark_job_def="agent-v3-full-suite",
    variants=[
        {"model": "claude-sonnet-4-5", "agent": "code-agent-v3"},
        {"model": "gpt-4.1", "agent": "code-agent-v3"},
    ],
    config={"concurrency": 100, "timeout_seconds": 300}
)

# Stream results as scenarios complete
for result in runloop.benchmark_jobs.stream_results(job.id):
    print(f"{result.scenario}: {result.status}")

import Runloop from 'runloop';

const runloop = new Runloop();

const job = await runloop.benchmarkJobs.create({
  name: 'q1-model-evaluation',
  benchmarkJobDef: 'agent-v3-full-suite',
  variants: [
    { model: 'claude-sonnet-4-5', agent: 'code-agent-v3' },
    { model: 'gpt-4.1', agent: 'code-agent-v3' },
  ],
  config: { concurrency: 100, timeoutSeconds: 300 }
});

for await (const result of runloop.benchmarkJobs.streamResults(job.id)) {
  console.log(`${result.scenario}: ${result.status}`);
}

# Submit a benchmark job
runloop benchmark create \
  --name "q1-model-evaluation" \
  --def "agent-v3-full-suite" \
  --variant "model=claude-sonnet-4-5,agent=code-agent-v3" \
  --variant "model=gpt-4.1,agent=code-agent-v3" \
  --concurrency 100

# Stream results
runloop benchmark stream <job-id>

From model selection to production monitoring

One evaluation platform for performance, cost, compliance, and safety -- across public benchmarks, private suites, and CI/CD integration.

Need enterprise deployment? Talk to Sales