A private benchmark suite that reflects your actual environment
Runloop Custom Benchmarks let you define evaluation suites using your own repositories, task definitions, and scoring criteria. Every scenario runs in an isolated sandbox with the same toolchains and dependencies your agents encounter in production. Credentials are managed through the Credential Gateway -- API keys and tokens are never exposed to the evaluation environment.
```python
import runloop

# Define custom scenarios from your private repository
scenarios = runloop.scenarios.create_from_repo(
    repo_url="https://github.com/your-org/internal-monorepo",
    task_filter="issues/labeled/agent-benchmark",
    scoring="test_suite_pass",
)

# Run evaluation with two model configurations
job = runloop.benchmark_jobs.create(
    name="q1-vendor-evaluation",
    scenarios=scenarios,
    variants=[
        {"model": "claude-sonnet-4-5", "agent": "code-agent-v3"},
        {"model": "gpt-4.1", "agent": "code-agent-v3"},
    ],
    config={"concurrency": 100, "timeout_seconds": 600},
)
```

```typescript
import Runloop from 'runloop';

const runloop = new Runloop();

// Define custom scenarios from your private repository
const scenarios = await runloop.scenarios.createFromRepo({
  repoUrl: 'https://github.com/your-org/internal-monorepo',
  taskFilter: 'issues/labeled/agent-benchmark',
  scoring: 'test_suite_pass'
});

// Run evaluation with two model configurations
const job = await runloop.benchmarkJobs.create({
  name: 'q1-vendor-evaluation',
  scenarios,
  variants: [
    { model: 'claude-sonnet-4-5', agent: 'code-agent-v3' },
    { model: 'gpt-4.1', agent: 'code-agent-v3' },
  ],
  config: { concurrency: 100, timeoutSeconds: 600 }
});
```

```bash
# Create scenarios from a private repository
runloop scenarios create-from-repo \
  --repo "https://github.com/your-org/internal-monorepo" \
  --filter "issues/labeled/agent-benchmark" \
  --scoring "test_suite_pass"

# Run evaluation with two model configurations
runloop benchmark run \
  --name "q1-vendor-evaluation" \
  --scenarios ./scenarios.yaml \
  --variant "model=claude-sonnet-4-5,agent=code-agent-v3" \
  --variant "model=gpt-4.1,agent=code-agent-v3" \
  --concurrency 100 \
  --timeout 600
```

Enterprise benchmarking for proprietary code evaluation
Four primitives from the [Runloop Platform](/product) that make private benchmarking secure, repeatable, and scalable.
Define scoring criteria per scenario: test suite pass, diff match, or custom scoring functions. Structured results API returns machine-readable output for downstream analytics.

Save evaluation configurations as reusable templates. Standardize evaluations across teams. Re-run with a single parameter change when models update.

API keys injected as opaque, devbox-bound tokens. Real credentials never exist in the benchmark environment. Tokens expire when the sandbox terminates.

Monitor benchmark execution with live progress indicators across all scenarios. Drill from failed jobs into failed scenarios with full execution logs.

Choose the right model for your codebase, not the leaderboard
When a new model drops, the first question is always the same: does it actually perform better on our work? Two models with identical SWE-Bench scores can behave very differently on a codebase with heavy TypeScript, custom linters, or non-standard build tooling. Runloop makes AI vendor selection empirical. Build a benchmark suite from representative tasks in your codebase, then run the same suite against every model. The platform's comparative analysis produces side-by-side results: pass rate, duration, and score distribution.
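To illustrate the kind of side-by-side comparison this produces, the sketch below computes pass rate and mean duration per variant from a results payload. The payload shape and field names here are hypothetical, not the actual structured results schema.

```python
# Hypothetical per-scenario results, grouped by variant. Field names are
# illustrative only, not the actual Runloop results schema.
results = {
    "claude-sonnet-4-5": [
        {"scenario": "fix-lint-rule", "passed": True, "duration_s": 412},
        {"scenario": "migrate-build", "passed": True, "duration_s": 530},
        {"scenario": "patch-api", "passed": False, "duration_s": 600},
    ],
    "gpt-4.1": [
        {"scenario": "fix-lint-rule", "passed": True, "duration_s": 388},
        {"scenario": "migrate-build", "passed": False, "duration_s": 600},
        {"scenario": "patch-api", "passed": False, "duration_s": 600},
    ],
}

def summarize(runs):
    """Pass rate and mean duration for one variant's scenario runs."""
    passed = sum(r["passed"] for r in runs)
    return {
        "pass_rate": passed / len(runs),
        "mean_duration_s": sum(r["duration_s"] for r in runs) / len(runs),
    }

summary = {variant: summarize(runs) for variant, runs in results.items()}
```

The same aggregation extends naturally to score distributions; the point is that identical scenarios run under each variant, so differences in the summary are attributable to the model, not the task mix.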

Stop building benchmark harnesses -- start running benchmarks
Before Runloop, orchestrating a benchmark run meant writing custom scripts to manage parallel execution, handle retries, track failures, and aggregate results. Every team built its own harness. Every harness was fragile. Benchmark Jobs eliminate this overhead. Submit a declarative job specification and the platform handles concurrency limits, retry policies, timeout enforcement, and graceful resume across deploys. Individual scenario failures do not collapse the entire job; partial results are always captured.
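The resilience model can be sketched in a few lines: each scenario runs independently, failures are retried up to a limit, and a scenario that never succeeds is recorded as a failed result rather than aborting the job. This is a minimal illustration of the "partial results are always captured" behavior, not the platform's implementation; all names here are hypothetical.

```python
# Minimal sketch of resilient benchmark execution: retry each scenario,
# and record a failure as a result instead of aborting the whole job.
def run_job(scenarios, run_scenario, max_retries=2):
    results = []
    for scenario in scenarios:
        attempt, outcome = 0, None
        while attempt <= max_retries:
            try:
                outcome = {"scenario": scenario, "status": "passed",
                           "output": run_scenario(scenario)}
                break
            except Exception as exc:
                attempt += 1
                outcome = {"scenario": scenario, "status": "failed",
                           "error": str(exc)}
        results.append(outcome)  # captured even when every retry failed
    return results

def _flaky(scenario):
    """Stand-in scenario runner: one scenario always crashes."""
    if scenario == "crashing-scenario":
        raise RuntimeError("sandbox crash")
    return "ok"

results = run_job(["good-scenario", "crashing-scenario"], _flaky)
```

Here `results` contains one passed and one failed entry, and the failed entry carries its error for later drill-down.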

From scenario definition to comparative results
Four steps from evaluation design to actionable data.
Encode tasks from your codebase as scenario definitions. Each specifies the task, runtime environment, and scoring contract. Reference private repos through secure credential injection.
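A scenario definition bundles those three parts together. The sketch below shows one plausible shape as a plain Python structure; the field names and layout are hypothetical, not the exact Runloop scenario schema.

```python
# Illustrative scenario definition: task, runtime environment, and scoring
# contract. Field names are hypothetical, not the actual Runloop schema.
scenario = {
    "name": "fix-flaky-retry-test",
    "task": {
        "repo": "https://github.com/your-org/internal-monorepo",
        "prompt": "Make tests/test_retry.py pass deterministically.",
    },
    "environment": {
        "image": "python:3.12",
        "setup": ["pip install -r requirements.txt"],
    },
    "scoring": {
        "contract": "test_suite_pass",
        "command": "pytest tests/test_retry.py",
    },
}
```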

Select agents and models to test. Set concurrency, retry policy, and timeout parameters. Use a BenchmarkJobDef template for standardized evaluations or submit an ad-hoc specification.

Submit via API or CLI. The platform schedules scenarios across isolated sandboxes, manages the job lifecycle, and streams progress to the Job Dashboard.

View side-by-side comparisons across configurations. Filter by scenario to identify divergence. Export structured results through the API for internal reporting tools.
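As a sketch of the export step, the snippet below flattens structured results into CSV for an internal reporting tool, using only the standard library. The row shape is hypothetical; substitute the fields your actual results payload contains.

```python
import csv
import io

# Hypothetical structured-results rows as an export might return them;
# field names are illustrative, not the actual export schema.
rows = [
    {"variant": "claude-sonnet-4-5", "scenario": "fix-lint-rule",
     "passed": True, "score": 1.0, "duration_s": 412},
    {"variant": "gpt-4.1", "scenario": "fix-lint-rule",
     "passed": True, "score": 1.0, "duration_s": 388},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```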

What competing platforms do not offer
Three capabilities that separate Runloop from every alternative.
Other agent sandbox providers offer execution but no benchmarking. Runloop is the only platform where you evaluate models against your own codebase with credential protection.

Competing platforms leave job orchestration to you. Runloop provides declarative submission with managed concurrency, retry policies, timeout enforcement, and resilient execution.

No competing platform offers built-in side-by-side evaluation. Run scenarios across model variants and see pass rate, duration, and score distribution in a unified view.

Custom benchmarks questions
Common questions about building private AI benchmark suites with Runloop.
Yes. Custom benchmark scenarios reference your private repositories through the Credential Gateway, which injects authentication tokens as opaque, devbox-bound credentials. Your source code runs inside isolated sandboxes that are destroyed after each evaluation. Real credentials never exist in the benchmark environment, and results are private to your organization. For additional detail on credential protection and isolation architecture, see the [Security and Compliance](/security) page.
Runloop supports multiple scoring approaches through scoring contracts defined per scenario. The most common are test suite pass (the agent's output must pass a specified test suite), diff match (the agent's patch is compared against a known-good solution), and custom scoring functions (you define arbitrary scoring logic). Scoring contracts are part of the scenario definition, so different scenarios within the same benchmark job can use different scoring methods.
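As a sketch of what a custom scoring function might look like, suppose a callable receives the per-check outcomes observed in the sandbox and returns a score in [0, 1], awarding partial credit. The signature and input shape are hypothetical, not the actual scoring-contract interface.

```python
# Hypothetical custom scoring function: partial credit based on how many
# acceptance checks passed. Signature and input shape are illustrative,
# not the actual Runloop scoring-contract interface.
def score_scenario(check_results):
    """check_results: mapping of check name -> bool from the sandbox run."""
    checks = list(check_results.values())
    if not checks:
        return 0.0
    return sum(checks) / len(checks)
```

For example, `score_scenario({"unit_tests": True, "lint": True, "types": False})` yields 2/3, whereas a test-suite-pass contract would score the same run as a flat failure.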
SWE-Bench Verified is available on-demand through Runloop and is a strong baseline for broad model comparison. Custom benchmarks go further by testing models on tasks drawn from your own codebase -- your languages, your conventions, your internal tooling. Both can run through the same Benchmark Job infrastructure. Many teams use SWE-Bench Verified as a first-pass filter and then run custom benchmarks on shortlisted models to make the final decision. The [Model Evaluation](/solutions/model-evaluation) page describes the full evaluation workflow.
Yes. BenchmarkJobDef templates let you save a complete evaluation configuration -- scenario set, model and agent variants, orchestration parameters -- and re-run it whenever a variable changes. When a new model version is released, re-run the template with the updated model parameter. Over time, this builds a longitudinal performance history for your organization.
Runloop supports Deploy to VPC (Bring Your Own Cloud), which places the entire benchmark platform inside your own cloud boundary. Your code, your scenarios, and your results remain within your infrastructure perimeter. This is designed for organizations where sending proprietary code to a third-party evaluation service is not an option due to regulatory, contractual, or security policy constraints. See the [Deploy to VPC](/solutions/deploy-to-vpc) solution page for architecture details and supported cloud providers.