Evaluation is not just accuracy
Decisions about production AI agents require measurement across four dimensions. Runloop's evaluation infrastructure supports all of them.
Does the agent solve the task? Pass rate, duration, and score distribution across scenarios. The baseline question, but not the only one. [Model Evaluation](/solutions/model-evaluation) develops this fully.

What does each solved task cost? Two models with identical pass rates can differ 10x in token consumption and API cost. Cost-per-task metrics turn model selection from a capability question into an ROI decision. [Model Evaluation](/solutions/model-evaluation) covers cost-performance analysis.
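
A back-of-envelope illustration of how identical pass rates can hide a 10x cost gap. The token counts and per-token price below are invented for the example, not Runloop data:

```python
def cost_per_solved_task(tasks: int, passed: int, tokens_per_task: int,
                         price_per_1k_tokens: float) -> float:
    """Total token spend divided by the number of tasks actually solved."""
    total_cost = tasks * tokens_per_task * price_per_1k_tokens / 1000
    return total_cost / passed

# Two hypothetical models, both passing 80 of 100 tasks at the same token price.
model_a = cost_per_solved_task(100, 80, tokens_per_task=12_000, price_per_1k_tokens=0.015)
model_b = cost_per_solved_task(100, 80, tokens_per_task=120_000, price_per_1k_tokens=0.015)
print(f"A: ${model_a:.3f}/solved task, B: ${model_b:.3f}/solved task")  # B costs 10x A
```

Same capability on paper; very different ROI.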

Does the agent respect credential boundaries? Does it stay within its permitted tool set? Does it handle PII according to policy? Security and compliance evaluation uses the same benchmark infrastructure with different scoring contracts. [Agent Testing](/solutions/agent-testing) and [Security](/security) develop this fully.

Does the agent follow your organization's conventions, coding standards, and domain-specific rules? Custom benchmarks built on your codebase measure what generic benchmarks cannot. [Custom Benchmarks](/solutions/custom-benchmarks) develops this fully.

Run industry-standard AI agent evaluations on demand
Access established benchmarks like SWE-Bench Verified directly through the platform. Run your models and agents against the same test suites the industry uses for comparison, without building your own evaluation harness or managing benchmark infrastructure. Results are private to your organization -- not submitted to a public leaderboard. Run against any model or agent configuration through a single API call.

Isolate and measure specific AI agent capabilities
Curated benchmarks are themed scenario collections that isolate specific agent capabilities. Developed in partnership with academic researchers, they measure dimensions that broad benchmarks like SWE-Bench compress into a single aggregate score. Run 40 targeted scenarios instead of 300 broad ones. Tighter feedback loops identify exactly where a model or agent configuration excels or struggles -- across performance, safety behavior, and adherence to coding conventions.

Build private AI agent benchmark suites on your own code
Custom benchmarks let enterprises build evaluation suites using their own repositories, task definitions, and scoring criteria. Every scenario runs in an isolated sandbox configured to match your production environment. Credentials are managed through the Credential Gateway. This is where the four evaluation dimensions converge: performance on your actual tasks, cost efficiency against your budget constraints, compliance with your regulatory posture, and adherence to your organization's coding standards and business rules.
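
A scoring contract can be thought of as a function from a scenario's outputs to a verdict. The field names and checks below are hypothetical illustrations of how correctness, tool boundaries, and coding standards can combine into one verdict, not Runloop's contract schema:

```python
def score_scenario(output: dict) -> dict:
    """Hypothetical scoring contract: functional correctness plus policy checks."""
    checks = {
        "tests_pass": output.get("tests_passed", 0) == output.get("tests_total", -1),
        # Any tool used outside the permitted set fails the scenario.
        "no_forbidden_tools": not (set(output.get("tools_used", []))
                                   - set(output.get("allowed_tools", []))),
        "style_clean": output.get("lint_errors", 1) == 0,
    }
    return {"passed": all(checks.values()), "checks": checks}

result = score_scenario({
    "tests_passed": 12, "tests_total": 12,
    "tools_used": ["git", "pytest"], "allowed_tools": ["git", "pytest", "pip"],
    "lint_errors": 0,
})
print(result["passed"])  # True
```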

Side-by-side comparison across models and agent versions
Evaluation mode answers selection questions: which model should we use? How does the new agent version compare? Does changing the tool configuration improve performance? Hold one variable constant, change another, and compare results. The comparison dashboard surfaces where configurations diverge -- not just on aggregate pass rate, but on per-scenario behavior, duration distribution, and (when available) cost per task.
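
A minimal sketch of the per-scenario view that an aggregate number hides, using made-up scenario names: two variants with identical pass rates can disagree on which scenarios they actually solve.

```python
def divergent_scenarios(results_a: dict, results_b: dict) -> list:
    """Scenarios where two variants disagree on pass/fail -- the
    per-scenario behavior an aggregate pass rate compresses away."""
    return sorted(s for s in results_a.keys() & results_b.keys()
                  if results_a[s] != results_b[s])

variant_a = {"fix-auth-bug": True, "add-endpoint": True, "migrate-db": False}
variant_b = {"fix-auth-bug": True, "add-endpoint": False, "migrate-db": True}
# Both variants pass 2 of 3 scenarios, yet they overlap on only one.
print(divergent_scenarios(variant_a, variant_b))  # ['add-endpoint', 'migrate-db']
```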

Catch silent AI agent performance degradation in CI/CD
Regression testing mode integrates benchmark evaluation into your deployment pipeline. Define a baseline, then run it automatically when variables change: model updates, agent framework upgrades, dependency changes, prompt template modifications, or tool configuration changes. Results serve as CI/CD gates. This catches more than performance regressions -- scoring contracts can detect compliance drift (an agent that starts accessing tools it previously avoided) and business logic violations (an agent that stops following conventions after a prompt change).
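
The gate logic itself can be trivial; the value is in the benchmark run that feeds it. A minimal sketch of a pass-rate gate, where the 2% tolerance is an arbitrary example rather than a platform default:

```python
def regression_gate(baseline_pass_rate: float, current_pass_rate: float,
                    tolerance: float = 0.02) -> bool:
    """Block the deploy if pass rate drops more than `tolerance` below baseline."""
    return current_pass_rate >= baseline_pass_rate - tolerance

# A model update that costs one point still ships; a five-point drop blocks it.
assert regression_gate(0.85, 0.84)
assert not regression_gate(0.85, 0.80)
```

In a pipeline, the boolean maps to the job's exit code, which is what makes benchmark results usable as CI/CD gates.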

Execution infrastructure for RFT and SFT at training-loop speed
Fine-tuning workloads use Runloop's sandbox infrastructure as the execution substrate for training loops that require code execution and scoring. Reinforcement Fine-Tuning generates candidate solutions, executes them in isolated sandboxes, and feeds pass/fail signals back to the training process. Supervised Fine-Tuning validates training data quality at scale by executing and scoring examples in production-representative environments. Scoring contracts can encode not just functional correctness but also compliance criteria, safety boundaries, and business logic conformance into the reward signal.
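
As an illustration of folding policy checks into an RFT reward signal (the all-or-nothing weighting below is a made-up choice, not Runloop's scoring format): a candidate that passes tests but violates a compliance or safety boundary contributes no positive signal.

```python
def reward(passed: bool, compliance_ok: bool, safety_ok: bool) -> float:
    """Hypothetical reward: functional success counts only when the
    compliance and safety checks in the scoring contract also hold."""
    if not (compliance_ok and safety_ok):
        return 0.0
    return 1.0 if passed else 0.0

# Three candidate solutions from one RFT sampling step.
candidates = [
    {"passed": True,  "compliance_ok": True,  "safety_ok": True},
    {"passed": True,  "compliance_ok": False, "safety_ok": True},   # works, breaks policy
    {"passed": False, "compliance_ok": True,  "safety_ok": True},
]
print([reward(**c) for c in candidates])  # [1.0, 0.0, 0.0]
```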

Declarative benchmark job execution at scale
Submit a job specification. The platform handles everything between submission and structured results.
Three input formats: Harbor YAML for external configurations, BenchmarkJobDef references for established suite templates, or direct scenario lists for ad-hoc evaluations.
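
For the third format, an ad-hoc job built from a direct scenario list might look like the sketch below. Field names are illustrative, not the platform's schema:

```python
# Hypothetical ad-hoc job spec: scenarios listed inline, no YAML or job-def reference.
job_spec = {
    "name": "adhoc-smoke-eval",
    "scenarios": [
        {"id": "fix-failing-test", "repo": "acme/api", "timeout_seconds": 300},
        {"id": "add-pagination", "repo": "acme/api", "timeout_seconds": 300},
    ],
    "config": {"concurrency": 10, "retry_policy": {"max_attempts": 2}},
}
print(len(job_spec["scenarios"]))  # 2
```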

The platform provisions sandboxes, distributes scenarios at your concurrency limit, manages retry policies, and enforces timeouts. Individual failures do not collapse the job.
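
The failure-isolation behavior can be sketched in miniature: each scenario gets its own retry budget, and a scenario that exhausts it is recorded as an error instead of aborting the job. This is an illustrative sketch, not the platform's scheduler:

```python
def run_with_retries(scenario: str, runner, max_attempts: int = 3) -> dict:
    """Run one scenario with retries; exhausting the budget records an
    error result rather than collapsing the whole job."""
    for _ in range(max_attempts):
        try:
            return {"scenario": scenario,
                    "status": "passed" if runner(scenario) else "failed"}
        except Exception as err:
            last_err = err
    return {"scenario": scenario, "status": "error", "detail": str(last_err)}

state = {"flaky_attempts": 0}
def runner(s):
    if s == "flaky":  # succeeds on the second attempt
        state["flaky_attempts"] += 1
        if state["flaky_attempts"] < 2:
            raise RuntimeError("transient sandbox error")
        return True
    if s == "broken":
        raise RuntimeError("always fails")
    return True

results = [run_with_retries(s, runner) for s in ["ok", "flaky", "broken"]]
print([r["status"] for r in results])  # ['passed', 'passed', 'error']
```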

Real-time streaming through the Job Dashboard. Drill from job to benchmark run to individual scenario. Graceful interruption and deterministic resume survive platform deploys.

Structured API output for programmatic consumption. Comparison dashboard for visual analysis. Pass rate, duration, score distribution, and (planned) cost-per-task metrics.

One API for every evaluation workflow
The same Benchmark Jobs API powers model evaluation, regression testing, and fine-tuning signal generation.
```python
import runloop

# Submit a benchmark job -- same API for all evaluation workflows
job = runloop.benchmark_jobs.create(
    name="q1-model-evaluation",
    benchmark_job_def="agent-v3-full-suite",
    variants=[
        {"model": "claude-sonnet-4-5", "agent": "code-agent-v3"},
        {"model": "gpt-4.1", "agent": "code-agent-v3"},
    ],
    config={"concurrency": 100, "timeout_seconds": 300},
)

# Stream results as scenarios complete
for result in runloop.benchmark_jobs.stream_results(job.id):
    print(f"{result.scenario}: {result.status} (score: {result.score})")
```
```typescript
import Runloop from 'runloop';

const runloop = new Runloop();

// Submit a benchmark job -- same API for all evaluation workflows
const job = await runloop.benchmarkJobs.create({
  name: 'q1-model-evaluation',
  benchmarkJobDef: 'agent-v3-full-suite',
  variants: [
    { model: 'claude-sonnet-4-5', agent: 'code-agent-v3' },
    { model: 'gpt-4.1', agent: 'code-agent-v3' },
  ],
  config: { concurrency: 100, timeoutSeconds: 300 },
});

// Stream results as scenarios complete
for await (const result of runloop.benchmarkJobs.streamResults(job.id)) {
  console.log(`${result.scenario}: ${result.status} (score: ${result.score})`);
}
```

```bash
# Submit the same job from the CLI
runloop benchmark run \
  --name "q1-model-evaluation" \
  --job-def "agent-v3-full-suite" \
  --variant "model=claude-sonnet-4-5,agent=code-agent-v3" \
  --variant "model=gpt-4.1,agent=code-agent-v3" \
  --concurrency 100 \
  --timeout 300

# Fetch results once scenarios complete
runloop benchmark results --job q1-model-evaluation --format table
```