A private benchmark suite that reflects your actual environment
Runloop Custom Benchmarks let you define evaluation suites using your own repositories, task definitions, and scoring criteria. Every scenario runs in an isolated sandbox with the same toolchains and dependencies your agents encounter in production. Credentials are managed through the Credential Gateway -- API keys and tokens are never exposed to the evaluation environment.
```python
import runloop

# Define custom scenarios from your private repository
scenarios = runloop.scenarios.create_from_repo(
    repo_url="https://github.com/your-org/internal-monorepo",
    task_filter="issues/labeled/agent-benchmark",
    scoring="test_suite_pass",
)

# Run evaluation with two model configurations
job = runloop.benchmark_jobs.create(
    name="q1-vendor-evaluation",
    scenarios=scenarios,
    variants=[
        {"model": "claude-sonnet-4-5", "agent": "code-agent-v3"},
        {"model": "gpt-4.1", "agent": "code-agent-v3"},
    ],
    config={"concurrency": 100, "timeout_seconds": 600},
)
```

```typescript
import Runloop from 'runloop';

const runloop = new Runloop();

// Define custom scenarios from your private repository
const scenarios = await runloop.scenarios.createFromRepo({
  repoUrl: 'https://github.com/your-org/internal-monorepo',
  taskFilter: 'issues/labeled/agent-benchmark',
  scoring: 'test_suite_pass'
});

// Run evaluation with two model configurations
const job = await runloop.benchmarkJobs.create({
  name: 'q1-vendor-evaluation',
  scenarios,
  variants: [
    { model: 'claude-sonnet-4-5', agent: 'code-agent-v3' },
    { model: 'gpt-4.1', agent: 'code-agent-v3' },
  ],
  config: { concurrency: 100, timeoutSeconds: 600 }
});
```

```bash
# Create scenarios from a private repository
runloop scenarios create-from-repo \
  --repo "https://github.com/your-org/internal-monorepo" \
  --filter "issues/labeled/agent-benchmark" \
  --scoring "test_suite_pass"

# Run evaluation with two model configurations
runloop benchmark run \
  --name "q1-vendor-evaluation" \
  --scenarios ./scenarios.yaml \
  --variant "model=claude-sonnet-4-5,agent=code-agent-v3" \
  --variant "model=gpt-4.1,agent=code-agent-v3" \
  --concurrency 100 \
  --timeout 600
```

Enterprise benchmarking for proprietary code evaluation
Four primitives from the [Runloop Platform](/product) that make private benchmarking secure, repeatable, and scalable.
Define scoring criteria per scenario: test suite pass, diff match, or custom scoring functions. Structured results API returns machine-readable output for downstream analytics.

Save evaluation configurations as reusable templates. Standardize evaluations across teams. Re-run with a single parameter change when models update.

API keys injected as opaque, devbox-bound tokens. Real credentials never exist in the benchmark environment. Tokens expire when the sandbox terminates.

Monitor benchmark execution with live progress indicators across all scenarios. Drill from failed jobs into failed scenarios with full execution logs.

Choose the right model for your codebase, not the leaderboard
When a new model drops, the first question is always the same: does it actually perform better on our work? Two models with identical SWE-Bench scores can behave very differently on a codebase with heavy TypeScript, custom linters, or non-standard build tooling. Runloop makes AI vendor selection empirical. Build a benchmark suite from representative tasks in your codebase, then run the same suite against every model. The platform's comparative analysis produces side-by-side results: pass rate, duration, and score distribution.
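To illustrate the kind of side-by-side comparison this produces, the sketch below computes pass rate and mean duration per variant from a results payload. The payload shape and field names here are hypothetical, not the actual structured results schema.

```python
# Hypothetical per-scenario results, grouped by variant. Field names are
# illustrative only, not the actual Runloop results schema.
results = {
    "claude-sonnet-4-5": [
        {"scenario": "fix-lint-rule", "passed": True, "duration_s": 412},
        {"scenario": "migrate-build", "passed": True, "duration_s": 530},
        {"scenario": "patch-api", "passed": False, "duration_s": 600},
    ],
    "gpt-4.1": [
        {"scenario": "fix-lint-rule", "passed": True, "duration_s": 388},
        {"scenario": "migrate-build", "passed": False, "duration_s": 600},
        {"scenario": "patch-api", "passed": False, "duration_s": 600},
    ],
}

def summarize(runs):
    """Pass rate and mean duration for one variant's scenario runs."""
    passed = sum(r["passed"] for r in runs)
    return {
        "pass_rate": passed / len(runs),
        "mean_duration_s": sum(r["duration_s"] for r in runs) / len(runs),
    }

summary = {variant: summarize(runs) for variant, runs in results.items()}
```

The same aggregation extends naturally to score distributions; the point is that identical scenarios run under each variant, so differences in the summary are attributable to the model, not the task mix.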

Stop building benchmark harnesses -- start running benchmarks
Before Runloop, orchestrating a benchmark run meant writing custom scripts to manage parallel execution, handle retries, track failures, and aggregate results. Every team built its own harness. Every harness was fragile. Benchmark Jobs eliminate this overhead. Submit a declarative job specification and the platform handles concurrency limits, retry policies, timeout enforcement, and graceful resume across deploys. Individual scenario failures do not collapse the entire job; partial results are always captured.
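The resilience model can be sketched in a few lines: each scenario runs independently, failures are retried up to a limit, and a scenario that never succeeds is recorded as a failed result rather than aborting the job. This is a minimal illustration of the "partial results are always captured" behavior, not the platform's implementation; all names here are hypothetical.

```python
# Minimal sketch of resilient benchmark execution: retry each scenario,
# and record a failure as a result instead of aborting the whole job.
def run_job(scenarios, run_scenario, max_retries=2):
    results = []
    for scenario in scenarios:
        attempt, outcome = 0, None
        while attempt <= max_retries:
            try:
                outcome = {"scenario": scenario, "status": "passed",
                           "output": run_scenario(scenario)}
                break
            except Exception as exc:
                attempt += 1
                outcome = {"scenario": scenario, "status": "failed",
                           "error": str(exc)}
        results.append(outcome)  # captured even when every retry failed
    return results

def _flaky(scenario):
    """Stand-in scenario runner: one scenario always crashes."""
    if scenario == "crashing-scenario":
        raise RuntimeError("sandbox crash")
    return "ok"

results = run_job(["good-scenario", "crashing-scenario"], _flaky)
```

Here `results` contains one passed and one failed entry, and the failed entry carries its error for later drill-down.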

From scenario definition to comparative results
Four steps from evaluation design to actionable data.
Encode tasks from your codebase as scenario definitions. Each specifies the task, runtime environment, and scoring contract. Reference private repos through secure credential injection.
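A scenario definition bundles those three parts together. The sketch below shows one plausible shape as a plain Python structure; the field names and layout are hypothetical, not the exact Runloop scenario schema.

```python
# Illustrative scenario definition: task, runtime environment, and scoring
# contract. Field names are hypothetical, not the actual Runloop schema.
scenario = {
    "name": "fix-flaky-retry-test",
    "task": {
        "repo": "https://github.com/your-org/internal-monorepo",
        "prompt": "Make tests/test_retry.py pass deterministically.",
    },
    "environment": {
        "image": "python:3.12",
        "setup": ["pip install -r requirements.txt"],
    },
    "scoring": {
        "contract": "test_suite_pass",
        "command": "pytest tests/test_retry.py",
    },
}
```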

Select agents and models to test. Set concurrency, retry policy, and timeout parameters. Use a BenchmarkJobDef template for standardized evaluations or submit an ad-hoc specification.

Submit via API or CLI. The platform schedules scenarios across isolated sandboxes, manages the job lifecycle, and streams progress to the Job Dashboard.

View side-by-side comparisons across configurations. Filter by scenario to identify divergence. Export structured results through the API for internal reporting tools.
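As a sketch of the export step, the snippet below flattens structured results into CSV for an internal reporting tool, using only the standard library. The row shape is hypothetical; substitute the fields your actual results payload contains.

```python
import csv
import io

# Hypothetical structured-results rows as an export might return them;
# field names are illustrative, not the actual export schema.
rows = [
    {"variant": "claude-sonnet-4-5", "scenario": "fix-lint-rule",
     "passed": True, "score": 1.0, "duration_s": 412},
    {"variant": "gpt-4.1", "scenario": "fix-lint-rule",
     "passed": True, "score": 1.0, "duration_s": 388},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```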

What competing platforms do not offer
Three capabilities that separate Runloop from every alternative.
Other agent sandbox providers offer execution but no benchmarking. Runloop is the only platform where you evaluate models against your own codebase with credential protection.

Competing platforms leave job orchestration to you. Runloop provides declarative submission with managed concurrency, retry policies, timeout enforcement, and resilient execution.

No competing platform offers built-in side-by-side evaluation. Run scenarios across model variants and see pass rate, duration, and score distribution in a unified view.

Custom benchmarks questions
Common questions about building private AI benchmark suites with Runloop.
Yes. Custom benchmark scenarios reference your private repositories through the Credential Gateway, which injects authentication tokens as opaque, devbox-bound credentials. Your source code runs inside isolated sandboxes that are destroyed after each evaluation. Real credentials never exist in the benchmark environment, and results are private to your organization. For additional detail on credential protection and isolation architecture, see the [Security and Compliance](/security) page.
Runloop supports multiple scoring approaches through scoring contracts defined per scenario. The most common are test suite pass (the agent's output must pass a specified test suite), diff match (the agent's patch is compared against a known-good solution), and custom scoring functions (you define arbitrary scoring logic). Scoring contracts are part of the scenario definition, so different scenarios within the same benchmark job can use different scoring methods.
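As a sketch of what a custom scoring function might look like, suppose a callable receives the per-check outcomes observed in the sandbox and returns a score in [0, 1], awarding partial credit. The signature and input shape are hypothetical, not the actual scoring-contract interface.

```python
# Hypothetical custom scoring function: partial credit based on how many
# acceptance checks passed. Signature and input shape are illustrative,
# not the actual Runloop scoring-contract interface.
def score_scenario(check_results):
    """check_results: mapping of check name -> bool from the sandbox run."""
    checks = list(check_results.values())
    if not checks:
        return 0.0
    return sum(checks) / len(checks)
```

For example, `score_scenario({"unit_tests": True, "lint": True, "types": False})` yields 2/3, whereas a test-suite-pass contract would score the same run as a flat failure.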
SWE-Bench Verified is available on-demand through Runloop and is a strong baseline for broad model comparison. Custom benchmarks go further by testing models on tasks drawn from your own codebase -- your languages, your conventions, your internal tooling. Both can run through the same Benchmark Job infrastructure. Many teams use SWE-Bench Verified as a first-pass filter and then run custom benchmarks on shortlisted models to make the final decision. The [Model Evaluation](/solutions/model-evaluation) page describes the full evaluation workflow.
Yes. BenchmarkJobDef templates let you save a complete evaluation configuration -- scenario set, model and agent variants, orchestration parameters -- and re-run it whenever a variable changes. When a new model version is released, re-run the template with the updated model parameter. Over time, this builds a longitudinal performance history for your organization.
Runloop supports Deploy to VPC (Bring Your Own Cloud), which places the entire benchmark platform inside your own cloud boundary. Your code, your scenarios, and your results remain within your infrastructure perimeter. This is designed for organizations where sending proprietary code to a third-party evaluation service is not an option due to regulatory, contractual, or security policy constraints. See the [Deploy to VPC](/solutions/deploy-to-vpc) solution page for architecture details and supported cloud providers.