Evaluation is not just accuracy
Decisions about production AI agents require measurement across four dimensions. Runloop's evaluation infrastructure supports all of them.
Does the agent solve the task? Pass rate, duration, and score distribution across scenarios. The baseline question, but not the only one. [Model Evaluation](/solutions/model-evaluation) develops this fully.

What does each solved task cost? Two models with identical pass rates can differ 10x in token consumption and API cost. Cost-per-task metrics turn model selection from a capability question into an ROI decision. [Model Evaluation](/solutions/model-evaluation) covers cost-performance analysis.
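
A back-of-envelope illustration of how identical pass rates can hide a 10x cost gap. The token counts and per-token price below are invented for the example, not Runloop data:

```python
def cost_per_solved_task(tasks: int, passed: int, tokens_per_task: int,
                         price_per_1k_tokens: float) -> float:
    """Total token spend divided by the number of tasks actually solved."""
    total_cost = tasks * tokens_per_task * price_per_1k_tokens / 1000
    return total_cost / passed

# Two hypothetical models, both passing 80 of 100 tasks at the same token price.
model_a = cost_per_solved_task(100, 80, tokens_per_task=12_000, price_per_1k_tokens=0.015)
model_b = cost_per_solved_task(100, 80, tokens_per_task=120_000, price_per_1k_tokens=0.015)
print(f"A: ${model_a:.3f}/solved task, B: ${model_b:.3f}/solved task")  # B costs 10x A
```

Same capability on paper; very different ROI.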

Does the agent respect credential boundaries? Does it stay within its permitted tool set? Does it handle PII according to policy? Security and compliance evaluation uses the same benchmark infrastructure with different scoring contracts. [Agent Testing](/solutions/agent-testing) and [Security](/security) develop this fully.

Does the agent follow your organization's conventions, coding standards, and domain-specific rules? Custom benchmarks built on your codebase measure what generic benchmarks cannot. [Custom Benchmarks](/solutions/custom-benchmarks) develops this fully.

Run industry-standard AI agent evaluations on demand
Access established benchmarks like SWE-Bench Verified directly through the platform. Run your models and agents against the same test suites the industry uses for comparison, without building your own evaluation harness or managing benchmark infrastructure. Results are private to your organization -- not submitted to a public leaderboard. Run against any model or agent configuration through a single API call.

Isolate and measure specific AI agent capabilities
Curated benchmarks are themed scenario collections that isolate specific agent capabilities. Developed in partnership with academic researchers, they measure dimensions that broad benchmarks like SWE-Bench compress into a single aggregate score. Run 40 targeted scenarios instead of 300 broad ones. Tighter feedback loops identify exactly where a model or agent configuration excels or struggles -- across performance, safety behavior, and adherence to coding conventions.

Build private AI agent benchmark suites on your own code
Custom benchmarks let enterprises build evaluation suites using their own repositories, task definitions, and scoring criteria. Every scenario runs in an isolated sandbox configured to match your production environment. Credentials are managed through the Credential Gateway. This is where the four evaluation dimensions converge: performance on your actual tasks, cost efficiency against your budget constraints, compliance with your regulatory posture, and adherence to your organization's coding standards and business rules.
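
A scoring contract can be thought of as a function from a scenario's outputs to a verdict. The field names and checks below are hypothetical illustrations of how correctness, tool boundaries, and coding standards can combine into one verdict, not Runloop's contract schema:

```python
def score_scenario(output: dict) -> dict:
    """Hypothetical scoring contract: functional correctness plus policy checks."""
    checks = {
        "tests_pass": output.get("tests_passed", 0) == output.get("tests_total", -1),
        # Any tool used outside the permitted set fails the scenario.
        "no_forbidden_tools": not (set(output.get("tools_used", []))
                                   - set(output.get("allowed_tools", []))),
        "style_clean": output.get("lint_errors", 1) == 0,
    }
    return {"passed": all(checks.values()), "checks": checks}

result = score_scenario({
    "tests_passed": 12, "tests_total": 12,
    "tools_used": ["git", "pytest"], "allowed_tools": ["git", "pytest", "pip"],
    "lint_errors": 0,
})
print(result["passed"])  # True
```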

Side-by-side comparison across models and agent versions
Evaluation mode answers selection questions: which model should we use? How does the new agent version compare? Does changing the tool configuration improve performance? Hold one variable constant, change another, and compare results. The comparison dashboard surfaces where configurations diverge -- not just on aggregate pass rate, but on per-scenario behavior, duration distribution, and (when available) cost per task.
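
A minimal sketch of the per-scenario view that an aggregate number hides, using made-up scenario names: two variants with identical pass rates can disagree on which scenarios they actually solve.

```python
def divergent_scenarios(results_a: dict, results_b: dict) -> list:
    """Scenarios where two variants disagree on pass/fail -- the
    per-scenario behavior an aggregate pass rate compresses away."""
    return sorted(s for s in results_a.keys() & results_b.keys()
                  if results_a[s] != results_b[s])

variant_a = {"fix-auth-bug": True, "add-endpoint": True, "migrate-db": False}
variant_b = {"fix-auth-bug": True, "add-endpoint": False, "migrate-db": True}
# Both variants pass 2 of 3 scenarios, yet they overlap on only one.
print(divergent_scenarios(variant_a, variant_b))  # ['add-endpoint', 'migrate-db']
```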

Catch silent AI agent performance degradation in CI/CD
Regression testing mode integrates benchmark evaluation into your deployment pipeline. Define a baseline, then run it automatically when variables change: model updates, agent framework upgrades, dependency changes, prompt template modifications, or tool configuration changes. Results serve as CI/CD gates. This catches more than performance regressions -- scoring contracts can detect compliance drift (an agent that starts accessing tools it previously avoided) and business logic violations (an agent that stops following conventions after a prompt change).
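
The gate logic itself can be trivial; the value is in the benchmark run that feeds it. A minimal sketch of a pass-rate gate, where the 2% tolerance is an arbitrary example rather than a platform default:

```python
def regression_gate(baseline_pass_rate: float, current_pass_rate: float,
                    tolerance: float = 0.02) -> bool:
    """Block the deploy if pass rate drops more than `tolerance` below baseline."""
    return current_pass_rate >= baseline_pass_rate - tolerance

# A model update that costs one point still ships; a five-point drop blocks it.
assert regression_gate(0.85, 0.84)
assert not regression_gate(0.85, 0.80)
```

In a pipeline, the boolean maps to the job's exit code, which is what makes benchmark results usable as CI/CD gates.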

Execution infrastructure for RFT and SFT at training-loop speed
Fine-tuning workloads use Runloop's sandbox infrastructure as the execution substrate for training loops that require code execution and scoring. Reinforcement Fine-Tuning generates candidate solutions, executes them in isolated sandboxes, and feeds pass/fail signals back to the training process. Supervised Fine-Tuning validates training data quality at scale by executing and scoring examples in production-representative environments. Scoring contracts can encode not just functional correctness but also compliance criteria, safety boundaries, and business logic conformance into the reward signal.
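
As an illustration of folding policy checks into an RFT reward signal (the all-or-nothing weighting below is a made-up choice, not Runloop's scoring format): a candidate that passes tests but violates a compliance or safety boundary contributes no positive signal.

```python
def reward(passed: bool, compliance_ok: bool, safety_ok: bool) -> float:
    """Hypothetical reward: functional success counts only when the
    compliance and safety checks in the scoring contract also hold."""
    if not (compliance_ok and safety_ok):
        return 0.0
    return 1.0 if passed else 0.0

# Three candidate solutions from one RFT sampling step.
candidates = [
    {"passed": True,  "compliance_ok": True,  "safety_ok": True},
    {"passed": True,  "compliance_ok": False, "safety_ok": True},   # works, breaks policy
    {"passed": False, "compliance_ok": True,  "safety_ok": True},
]
print([reward(**c) for c in candidates])  # [1.0, 0.0, 0.0]
```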

Declarative benchmark job execution at scale
Submit a job specification. The platform handles everything between submission and structured results.
Three input formats: Harbor YAML for external configurations, BenchmarkJobDef references for established suite templates, or direct scenario lists for ad-hoc evaluations.
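
For the third format, an ad-hoc job built from a direct scenario list might look like the sketch below. Field names are illustrative, not the platform's schema:

```python
# Hypothetical ad-hoc job spec: scenarios listed inline, no YAML or job-def reference.
job_spec = {
    "name": "adhoc-smoke-eval",
    "scenarios": [
        {"id": "fix-failing-test", "repo": "acme/api", "timeout_seconds": 300},
        {"id": "add-pagination", "repo": "acme/api", "timeout_seconds": 300},
    ],
    "config": {"concurrency": 10, "retry_policy": {"max_attempts": 2}},
}
print(len(job_spec["scenarios"]))  # 2
```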

The platform provisions sandboxes, distributes scenarios at your concurrency limit, manages retry policies, and enforces timeouts. Individual failures do not collapse the job.
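
The failure-isolation behavior can be sketched in miniature: each scenario gets its own retry budget, and a scenario that exhausts it is recorded as an error instead of aborting the job. This is an illustrative sketch, not the platform's scheduler:

```python
def run_with_retries(scenario: str, runner, max_attempts: int = 3) -> dict:
    """Run one scenario with retries; exhausting the budget records an
    error result rather than collapsing the whole job."""
    for _ in range(max_attempts):
        try:
            return {"scenario": scenario,
                    "status": "passed" if runner(scenario) else "failed"}
        except Exception as err:
            last_err = err
    return {"scenario": scenario, "status": "error", "detail": str(last_err)}

state = {"flaky_attempts": 0}
def runner(s):
    if s == "flaky":  # succeeds on the second attempt
        state["flaky_attempts"] += 1
        if state["flaky_attempts"] < 2:
            raise RuntimeError("transient sandbox error")
        return True
    if s == "broken":
        raise RuntimeError("always fails")
    return True

results = [run_with_retries(s, runner) for s in ["ok", "flaky", "broken"]]
print([r["status"] for r in results])  # ['passed', 'passed', 'error']
```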

Real-time streaming through the Job Dashboard. Drill from job to benchmark run to individual scenario. Graceful interruption and deterministic resume survive platform deploys.

Structured API output for programmatic consumption. Comparison dashboard for visual analysis. Pass rate, duration, score distribution, and (planned) cost-per-task metrics.

One API for every evaluation workflow
The same Benchmark Jobs API powers model evaluation, regression testing, and fine-tuning signal generation.
```python
import runloop

# Submit a benchmark job -- same API for all evaluation workflows
job = runloop.benchmark_jobs.create(
    name="q1-model-evaluation",
    benchmark_job_def="agent-v3-full-suite",
    variants=[
        {"model": "claude-sonnet-4-5", "agent": "code-agent-v3"},
        {"model": "gpt-4.1", "agent": "code-agent-v3"},
    ],
    config={"concurrency": 100, "timeout_seconds": 300},
)

# Stream results as scenarios complete
for result in runloop.benchmark_jobs.stream_results(job.id):
    print(f"{result.scenario}: {result.status} (score: {result.score})")
```
```typescript
import Runloop from 'runloop';

const runloop = new Runloop();

// Submit a benchmark job -- same API for all evaluation workflows
const job = await runloop.benchmarkJobs.create({
  name: 'q1-model-evaluation',
  benchmarkJobDef: 'agent-v3-full-suite',
  variants: [
    { model: 'claude-sonnet-4-5', agent: 'code-agent-v3' },
    { model: 'gpt-4.1', agent: 'code-agent-v3' },
  ],
  config: { concurrency: 100, timeoutSeconds: 300 },
});

// Stream results as scenarios complete
for await (const result of runloop.benchmarkJobs.streamResults(job.id)) {
  console.log(`${result.scenario}: ${result.status} (score: ${result.score})`);
}
```

```bash
# Submit the same job from the CLI
runloop benchmark run \
  --name "q1-model-evaluation" \
  --job-def "agent-v3-full-suite" \
  --variant "model=claude-sonnet-4-5,agent=code-agent-v3" \
  --variant "model=gpt-4.1,agent=code-agent-v3" \
  --concurrency 100 \
  --timeout 300

# Fetch results once scenarios complete
runloop benchmark results --job q1-model-evaluation --format table
```