30,000+

concurrent environments

<10ms

Credential Gateway latency

50ms

command execution

<20ms

MCP Hub routing
THE LEADERBOARD GAP
Public benchmarks like SWE-Bench Verified test against open-source repositories with known patterns. Your codebase has its own conventions, dependency chains, internal tooling, and complexity profile. An agent that scores well on Django issues may struggle with your internal monorepo or custom build system. The gap between public benchmark scores and real-world performance on proprietary code is where enterprises lose time and money -- choosing models based on rankings that may have no correlation with the work that actually matters.
Learn how Runloop solves this
PRIVATE EVALUATION INFRASTRUCTURE

A private benchmark suite that reflects your actual environment

Runloop Custom Benchmarks let you define evaluation suites using your own repositories, task definitions, and scoring criteria. Every scenario runs in an isolated sandbox with the same toolchains and dependencies your agents encounter in production. Credentials are managed through the Credential Gateway -- API keys and tokens are never exposed to the evaluation environment.

```python
import runloop

# Define custom scenarios from your private repository
scenarios = runloop.scenarios.create_from_repo(
    repo_url="https://github.com/your-org/internal-monorepo",
    task_filter="issues/labeled/agent-benchmark",
    scoring="test_suite_pass"
)

# Run evaluation with two model configurations
job = runloop.benchmark_jobs.create(
    name="q1-vendor-evaluation",
    scenarios=scenarios,
    variants=[
        {"model": "claude-sonnet-4-5", "agent": "code-agent-v3"},
        {"model": "gpt-4.1", "agent": "code-agent-v3"},
    ],
    config={"concurrency": 100, "timeout_seconds": 600}
)
```
```typescript
import Runloop from '@runloop/api-client';

const runloop = new Runloop();

// Define custom scenarios from your private repository
const scenarios = await runloop.scenarios.createFromRepo({
  repoUrl: 'https://github.com/your-org/internal-monorepo',
  taskFilter: 'issues/labeled/agent-benchmark',
  scoring: 'test_suite_pass'
});

// Run evaluation with two model configurations
const job = await runloop.benchmarkJobs.create({
  name: 'q1-vendor-evaluation',
  scenarios,
  variants: [
    { model: 'claude-sonnet-4-5', agent: 'code-agent-v3' },
    { model: 'gpt-4.1', agent: 'code-agent-v3' },
  ],
  config: { concurrency: 100, timeoutSeconds: 600 }
});
```
npm install @runloop/api-client
```bash
# Create scenarios from a private repository
runloop scenarios create-from-repo \
  --repo "https://github.com/your-org/internal-monorepo" \
  --filter "issues/labeled/agent-benchmark" \
  --scoring "test_suite_pass"

# Run evaluation with two model configurations
runloop benchmark run \
  --name "q1-vendor-evaluation" \
  --scenarios ./scenarios.yaml \
  --variant "model=claude-sonnet-4-5,agent=code-agent-v3" \
  --variant "model=gpt-4.1,agent=code-agent-v3" \
  --concurrency 100 \
  --timeout 600
```

Enterprise benchmarking for proprietary code evaluation

Four primitives from the [Runloop Platform](/product) that make private benchmarking secure, repeatable, and scalable.

Scoring Contracts

Define scoring criteria per scenario: test suite pass, diff match, or custom scoring functions. Structured results API returns machine-readable output for downstream analytics.
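As an illustrative sketch of what a custom scoring function might look like (the function name and the result-dict fields here are assumptions for illustration, not Runloop's documented scoring-contract interface), a scorer reduces test outcomes to a single comparable number:

```python
# Hypothetical custom scorer; assumes the platform hands each scenario's
# outcome to us as a dict of per-test results and expects a float in [0, 1].
def score_scenario(result: dict) -> float:
    """Score a scenario by its test pass rate."""
    tests = result.get("tests", [])
    if not tests:
        # No tests ran: treat as a zero, never a divide-by-zero.
        return 0.0
    pass_rate = sum(1 for t in tests if t["status"] == "pass") / len(tests)
    # Round so scores stay stable across reruns in downstream analytics.
    return round(pass_rate, 4)
```

Because the return value is a plain float, results like this aggregate cleanly in the structured results API described above.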

BenchmarkJobDef Templates

Save evaluation configurations as reusable templates. Standardize evaluations across teams. Re-run with a single parameter change when models update.
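A reusable template can be modeled along these lines; the `BenchmarkJobDef` field names below are illustrative assumptions rather than the platform's schema, but they show how a saved definition supports the single-parameter re-run described above:

```python
from dataclasses import dataclass, replace

# Illustrative template shape; field names are assumptions, not the
# platform's documented BenchmarkJobDef schema.
@dataclass(frozen=True)
class BenchmarkJobDef:
    name: str
    scenarios_path: str
    variants: tuple
    concurrency: int = 100
    timeout_seconds: int = 600

q1_template = BenchmarkJobDef(
    name="q1-vendor-evaluation",
    scenarios_path="./scenarios.yaml",
    variants=(("claude-sonnet-4-5", "code-agent-v3"),),
)

# Re-run with a single parameter change when models update: the template
# is frozen, so every run is a fresh, auditable copy.
q2_job = replace(
    q1_template,
    name="q2-vendor-evaluation",
    variants=(
        ("claude-sonnet-4-5", "code-agent-v3"),
        ("gpt-4.1", "code-agent-v3"),
    ),
)
```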

Credential Gateway

API keys injected as opaque, devbox-bound tokens. Real credentials never exist in the benchmark environment. Tokens expire when the sandbox terminates.
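A minimal sketch of the opaque-token idea, assuming a gateway process outside the sandbox and not describing Runloop's actual implementation: the benchmark environment only ever holds a random token, while the gateway maps it to the real key and drops the mapping on teardown.

```python
import secrets

# Illustrative only: in this sketch the gateway lives outside the sandbox,
# so real keys never enter the benchmark environment.
class CredentialGateway:
    def __init__(self):
        self._tokens = {}  # opaque token -> (devbox_id, real_key)

    def issue(self, devbox_id: str, real_key: str) -> str:
        """Return the opaque, devbox-bound token the sandbox receives."""
        token = f"rl_tok_{secrets.token_hex(16)}"
        self._tokens[token] = (devbox_id, real_key)
        return token

    def resolve(self, token: str, devbox_id: str) -> str:
        """Swap a token for the real key, but only for its bound devbox."""
        bound_devbox, real_key = self._tokens[token]
        if bound_devbox != devbox_id:
            raise PermissionError("token is bound to a different devbox")
        return real_key

    def revoke_devbox(self, devbox_id: str) -> None:
        """Called on sandbox termination: all of its tokens expire."""
        self._tokens = {
            t: v for t, v in self._tokens.items() if v[0] != devbox_id
        }
```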

Job Dashboard

Monitor benchmark execution with live progress indicators across all scenarios. Drill from failed jobs into failed scenarios with full execution logs.

From scenario definition to comparative results

Four steps from evaluation design to actionable data.

01
Define Your Scenarios

Encode tasks from your codebase as scenario definitions. Each specifies the task, runtime environment, and scoring contract. Reference private repos through secure credential injection.
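A scenario definition might be sketched as a plain mapping covering the three parts named above; the field names here are assumptions for illustration, not the documented schema:

```python
# Hypothetical scenario shape: every definition must name its task, its
# runtime environment, and its scoring contract.
REQUIRED_FIELDS = {"name", "task", "environment", "scoring"}

def validate_scenario(spec: dict) -> dict:
    """Reject a scenario definition missing any required part."""
    missing = REQUIRED_FIELDS - spec.keys()
    if missing:
        raise ValueError(f"scenario missing fields: {sorted(missing)}")
    return spec

scenario = validate_scenario({
    "name": "internal-monorepo-build-fix",
    "task": "Make the failing integration suite pass after the dependency bump.",
    "environment": {"image": "your-org/ci-base", "repo": "internal-monorepo"},
    "scoring": "test_suite_pass",
})
```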

02
Configure Your Evaluation

Select agents and models to test. Set concurrency, retry policy, and timeout parameters. Use a BenchmarkJobDef template for standardized evaluations or submit an ad-hoc specification.

03
Run and Monitor

Submit via API or CLI. The platform schedules across isolated sandboxes, manages the lifecycle, and streams progress to the Job Dashboard.
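The submit-then-monitor pattern reduces to a polling loop; in this sketch, `get_status` stands in for whatever job-status call the SDK exposes (an assumption, not a documented endpoint), and the sleep function is injectable so the loop can be exercised without waiting.

```python
import time

TERMINAL_STATES = {"completed", "failed", "canceled"}

def wait_for_job(get_status, poll_seconds=5.0, timeout_seconds=3600,
                 sleep=time.sleep):
    """Poll a status callable until the job reaches a terminal state."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"job still {status!r} after {timeout_seconds}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

In practice the Job Dashboard covers live monitoring; a loop like this is only needed when gating a CI pipeline on benchmark completion.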

04
Analyze and Compare

View side-by-side comparisons across configurations. Filter by scenario to identify divergence. Export structured results through the API for internal reporting tools.
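Working from a structured results export (field names assumed for illustration), the side-by-side comparison this step describes reduces to a small grouping step:

```python
from collections import defaultdict
from statistics import mean

def summarize(results):
    """Group scenario-level results by variant and report pass rate and
    mean duration. Each row mimics an assumed export shape with
    'variant', 'passed', and 'duration_s' fields."""
    by_variant = defaultdict(list)
    for row in results:
        by_variant[row["variant"]].append(row)
    return {
        variant: {
            "pass_rate": sum(r["passed"] for r in rows) / len(rows),
            "mean_duration_s": round(mean(r["duration_s"] for r in rows), 1),
        }
        for variant, rows in by_variant.items()
    }
```

The same grouping feeds internal reporting tools once the results are exported through the API.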


No other AI agent infrastructure platform offers custom benchmark suites on proprietary code with integrated orchestration, comparative analysis, and a full security model.

What competing platforms do not offer

Three capabilities that separate Runloop from every alternative.

Private Benchmarks

Other agent sandbox providers offer execution but no benchmarking. Runloop is the only platform where you evaluate models against your own codebase with credential protection.

Managed Orchestration

Competing platforms leave job orchestration to you. Runloop provides declarative submission with managed concurrency, retry policies, timeout enforcement, and resilient execution.

Built-In Comparison

No competing platform offers built-in side-by-side evaluation. Run scenarios across model variants and see pass rate, duration, and score distribution in a unified view.

F.A.Q

Custom benchmark questions

Common questions about building private AI benchmark suites with Runloop.

Can I use my proprietary codebase for benchmarks without exposing source code?
What scoring methods does Runloop support for custom benchmarks?
How do custom benchmarks compare to using SWE-Bench Verified?
Can I reuse benchmark configurations across multiple evaluation runs?
What deployment options exist for organizations with strict data residency requirements?
More questions? Visit our docs or send us a message