30,000+

concurrent environments

3 tiers

public, curated, custom benchmarks

3 modes

evaluation, regression, fine-tuning

<10ms

Credential Gateway latency

Runloop is the only AI agent sandbox provider with an integrated evaluation and benchmarking platform. E2B, Daytona, Modal, and other sandbox providers offer execution environments. None of them help you measure whether your agent is improving.


Evaluation is not just accuracy

Decisions about production AI agents require measurement across four dimensions. Runloop's evaluation infrastructure supports all of them.

Performance

Does the agent solve the task? Pass rate, duration, and score distribution across scenarios. The baseline question, but not the only one. [Model Evaluation](/solutions/model-evaluation) develops this fully.

Cost Efficiency

Two models with identical pass rates can differ 10x in token consumption and API cost. Cost-per-task metrics turn model selection from a capability question into an ROI decision. [Model Evaluation](/solutions/model-evaluation) covers cost-performance analysis.
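To make the ROI framing concrete, cost-per-task can be computed as spend divided by tasks actually solved. A minimal sketch (the function name, token counts, and pricing here are illustrative assumptions, not Runloop's metric definitions):

```python
# Hypothetical illustration of a cost-per-task metric: identical pass
# rates can hide a 10x difference in per-task spend.
def cost_per_solved_task(total_tokens, price_per_1k_tokens, tasks_passed):
    """API spend divided by the number of tasks the agent actually solved."""
    return (total_tokens / 1000) * price_per_1k_tokens / tasks_passed

# Two models, same pass rate (80 of 100), 10x difference in token use:
model_a = cost_per_solved_task(2_000_000, 0.003, 80)  # heavier model
model_b = cost_per_solved_task(200_000, 0.003, 80)    # lighter model
print(f"A: ${model_a:.4f}/task, B: ${model_b:.4f}/task")
```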

Compliance & Safety

Does the agent respect credential boundaries? Does it stay within its permitted tool set? Does it handle PII according to policy? Security and compliance evaluation uses the same benchmark infrastructure with different scoring contracts. [Agent Testing](/solutions/agent-testing) and [Security](/security) develop this fully.
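A "scoring contract" can be pictured as a function applied to a recorded run: the same transcript scores differently depending on whether the contract checks correctness or compliance. A hypothetical sketch (the tool names, transcript shape, and secret marker are assumptions for illustration, not Runloop SDK schema):

```python
# Hypothetical compliance scoring contract: score 1.0 only if every tool
# call stayed inside the permitted set and no credential values leaked
# into agent output. All names here are illustrative.
PERMITTED_TOOLS = {"read_file", "write_file", "run_tests"}

def compliance_score(transcript):
    for step in transcript:
        if step["tool"] not in PERMITTED_TOOLS:
            return 0.0  # used a tool outside the permitted set
        if "AWS_SECRET" in step.get("output", ""):
            return 0.0  # leaked a credential value
    return 1.0

transcript = [
    {"tool": "read_file", "output": "def main(): ..."},
    {"tool": "shell_exec", "output": "rm -rf /tmp/scratch"},  # out of bounds
]
print(compliance_score(transcript))  # → 0.0
```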

Business Logic

Does the agent follow your organization's conventions, coding standards, and domain-specific rules? Custom benchmarks built on your codebase measure what generic benchmarks cannot. [Custom Benchmarks](/solutions/custom-benchmarks) develops this fully.

FULL LIFECYCLE
Every benchmark runs on the same infrastructure: isolated sandboxes via the [Runloop Platform](/product), orchestrated execution through Benchmark Jobs, Credential Gateway protection, and structured result aggregation. Runloop partners with industry, non-profit, and academic groups working to set evaluation standards.

Declarative benchmark job execution at scale

Submit a job specification. The platform handles everything between submission and structured results.

01
Submit Job Spec

Three input formats: Harbor YAML for external configurations, BenchmarkJobDef references for established suite templates, or direct scenario lists for ad-hoc evaluations.
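The two programmatic shapes can be sketched as plain data; Harbor YAML covers the same ground in a configuration file. The field names below are assumptions for illustration, not the documented SDK schema:

```python
# 1. Reference an established suite template by name:
job_from_template = {
    "name": "nightly-regression",
    "benchmark_job_def": "agent-v3-full-suite",
}

# 2. Supply an ad-hoc scenario list directly:
job_from_scenarios = {
    "name": "ad-hoc-smoke-test",
    "scenarios": [
        {"id": "fix-failing-test", "timeout_seconds": 300},
        {"id": "add-endpoint", "timeout_seconds": 600},
    ],
}
```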

02
Orchestrate Execution

The platform provisions sandboxes, distributes scenarios at your concurrency limit, manages retry policies, and enforces timeouts. Individual failures do not collapse the job.

03
Monitor Progress

Real-time streaming through the Job Dashboard. Drill from job to benchmark run to individual scenario. Graceful interruption and deterministic resume survive platform deploys.

04
Consume Results

Structured API output for programmatic consumption. Comparison dashboard for visual analysis. Pass rate, duration, score distribution, and (planned) cost-per-task metrics.
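Programmatic consumption means the aggregate metrics fall out of a few lines of code. A hypothetical sketch, with field names mirroring the streaming example on this page (the sample data is invented for illustration):

```python
from statistics import mean, median

# Structured results as they might arrive from the API (illustrative data):
results = [
    {"scenario": "fix-failing-test", "status": "passed", "duration_s": 41.2},
    {"scenario": "add-endpoint", "status": "failed", "duration_s": 118.6},
    {"scenario": "refactor-module", "status": "passed", "duration_s": 73.9},
]

pass_rate = sum(r["status"] == "passed" for r in results) / len(results)
durations = [r["duration_s"] for r in results]
print(f"pass rate: {pass_rate:.0%}")           # pass rate: 67%
print(f"mean duration: {mean(durations):.1f}s")
print(f"median duration: {median(durations):.1f}s")
```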

ORCHESTRATION API

One API for every evaluation workflow

The same Benchmark Jobs API powers model evaluation, regression testing, and fine-tuning signal generation.

```python
import runloop

# Submit a benchmark job -- same API for all evaluation workflows
job = runloop.benchmark_jobs.create(
    name="q1-model-evaluation",
    benchmark_job_def="agent-v3-full-suite",
    variants=[
        {"model": "claude-sonnet-4-5", "agent": "code-agent-v3"},
        {"model": "gpt-4.1", "agent": "code-agent-v3"},
    ],
    config={"concurrency": 100, "timeout_seconds": 300}
)

# Stream results as scenarios complete
for result in runloop.benchmark_jobs.stream_results(job.id):
    print(f"{result.scenario}: {result.status} (score: {result.score})")
```
npm install @runloop/api-client
```typescript
import Runloop from '@runloop/api-client';

const runloop = new Runloop();

const job = await runloop.benchmarkJobs.create({
  name: 'q1-model-evaluation',
  benchmarkJobDef: 'agent-v3-full-suite',
  variants: [
    { model: 'claude-sonnet-4-5', agent: 'code-agent-v3' },
    { model: 'gpt-4.1', agent: 'code-agent-v3' },
  ],
  config: { concurrency: 100, timeoutSeconds: 300 }
});

for await (const result of runloop.benchmarkJobs.streamResults(job.id)) {
  console.log(`${result.scenario}: ${result.status} (score: ${result.score})`);
}
```
```bash
runloop benchmark run \
  --name "q1-model-evaluation" \
  --job-def "agent-v3-full-suite" \
  --variant "model=claude-sonnet-4-5,agent=code-agent-v3" \
  --variant "model=gpt-4.1,agent=code-agent-v3" \
  --concurrency 100 \
  --timeout 300

runloop benchmark results --job q1-model-evaluation --format table
```

No other AI agent infrastructure platform offers CI/CD-integrated benchmark evaluation as a product capability. Regression testing for AI agents is not a feature competitors provide -- it is a category Runloop created.