30,000+ concurrent environments

<10ms Credential Gateway latency

50ms command execution

<20ms MCP Hub routing
THE SILENT REGRESSION PROBLEM
Traditional software regressions are loud -- a broken test, a failed build, a crashed service. Agent regressions are silent. A model provider pushes an update and your pass rate on a specific task category drops from 78% to 61%. A dependency upgrade subtly changes a tool's output format. A prompt template change improves average performance but creates a new failure mode on edge cases. None of these show up in unit tests. The agent still runs, still produces output, still looks functional. The degradation only becomes visible through systematic evaluation against a known baseline.
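The baseline comparison at the heart of this can be sketched in a few lines of plain Python. The category names and pass rates below are illustrative stand-ins, not real measurements:

```python
# Minimal sketch: compare current pass rates to a stored baseline, per category.
# Categories, rates, and tolerance are illustrative, not real data.
BASELINE = {"code-edit": 0.78, "web-nav": 0.82, "tool-use": 0.90}
TOLERANCE = 0.05  # allowed absolute drop before we call it a regression


def find_regressions(current: dict[str, float]) -> list[str]:
    """Return the categories whose pass rate dropped beyond the tolerance."""
    return [
        cat for cat, rate in current.items()
        if BASELINE.get(cat, 0.0) - rate > TOLERANCE
    ]


# A provider update drops one category from 78% to 61% -- the agent still runs,
# but comparing against the baseline makes the regression visible.
print(find_regressions({"code-edit": 0.61, "web-nav": 0.81, "tool-use": 0.90}))
# -> ['code-edit']
```

The point is not the arithmetic but the prerequisite: without a stored baseline and a systematic re-run, there is nothing to diff against.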
-- Engineering Lead, Series B AI Startup
EVALUATION IN YOUR CI/CD PIPELINE

Run your regression suite on every change that matters

Runloop integrates benchmark evaluation into your deployment pipeline as a first-class CI/CD check. Define a benchmark suite that captures your agent's expected capability profile, then run it automatically whenever a variable changes. Results become a gate -- changes that degrade performance are flagged before they reach production.

```python
import runloop

# Define regression baseline from a stored template
job = runloop.benchmark_jobs.create(
    benchmark_job_def="agent-v3-regression-baseline",
    params={"model": "claude-sonnet-4-5", "agent_build": "pr-4521"},
    config={"concurrency": 50, "timeout_seconds": 300}
)

# Wait for results and check against thresholds
results = runloop.benchmark_jobs.wait(job.id)
assert results.pass_rate >= 0.75, f"Regression: pass rate {results.pass_rate}"
assert results.avg_duration_ms <= 45000, f"Regression: avg duration {results.avg_duration_ms}ms"
```
npm install @runloop/api-client
```typescript
import Runloop from '@runloop/api-client';

const runloop = new Runloop();
// Define regression baseline from a stored template
const job = await runloop.benchmarkJobs.create({
  benchmarkJobDef: 'agent-v3-regression-baseline',
  params: { model: 'claude-sonnet-4-5', agentBuild: 'pr-4521' },
  config: { concurrency: 50, timeoutSeconds: 300 }
});

// Wait for results and check against thresholds
const results = await runloop.benchmarkJobs.wait(job.id);
if (results.passRate < 0.75) throw new Error(`Regression: pass rate ${results.passRate}`);
if (results.avgDurationMs > 45000) throw new Error(`Regression: too slow`);
```
```bash
# Run regression suite from a stored template
runloop benchmark run \
  --job-def "agent-v3-regression-baseline" \
  --param "model=claude-sonnet-4-5" \
  --param "agent_build=pr-4521" \
  --concurrency 50 \
  --timeout 300

# Check results against thresholds
runloop benchmark results --job agent-v3-regression-baseline \
  --assert "pass_rate >= 0.75" \
  --assert "avg_duration_ms <= 45000"
```

Continuous agent evaluation infrastructure

Four primitives from the [Runloop Platform](/product) that make regression detection systematic, not manual.

BenchmarkJobDef Templates

Define regression baselines as stored, reusable evaluation configurations. Fixed parameters codify the test suite; variable parameters inject the model version or agent build at run time.
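As a rough illustration of the fixed/variable split, the sketch below merges run-time values into a stored template. The field names and `resolve` helper are hypothetical, not the actual BenchmarkJobDef schema:

```python
# Hypothetical illustration of a template's fixed vs. variable parameters.
# Field names and the resolve() helper are made up for this sketch.
TEMPLATE = {
    "name": "agent-v3-regression-baseline",
    # Fixed: codifies the test suite itself, identical on every run.
    "fixed": {"scenarios": ["edge-case-07", "long-context-02"], "scoring": "rubric-v2"},
    # Variable: declared here, injected at run time.
    "variables": ["model", "agent_build"],
}


def resolve(template: dict, overrides: dict) -> dict:
    """Merge run-time values into the template, rejecting undeclared variables."""
    unknown = set(overrides) - set(template["variables"])
    if unknown:
        raise ValueError(f"undeclared variables: {sorted(unknown)}")
    return {**template["fixed"], **overrides}


job_params = resolve(TEMPLATE, {"model": "claude-sonnet-4-5", "agent_build": "pr-4521"})
```

Keeping the suite fixed while only the model or build varies is what makes run-over-run comparisons meaningful.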

Scoring Contracts

Define what passing means per scenario: binary pass/fail, graded rubrics, or composite scoring functions. Harbor YAML support for scenario definition and import.

Parallel Orchestration

Configurable concurrency, retry policies, and timeout enforcement per job. Individual scenario failures do not block the suite. Real-time progress streaming.
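The "individual failures do not block the suite" behavior can be sketched with the standard library. The scenario names and runner below are stand-ins for the platform's own orchestration:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Sketch of fault-isolated parallel execution: one scenario raising does not
# stop the rest of the suite. Scenarios and the runner are illustrative.
def run_scenario(name: str) -> bool:
    if name == "flaky-scenario":
        raise RuntimeError("tool call failed")
    return True  # scenario passed

scenarios = ["scenario-a", "flaky-scenario", "scenario-b"]
results: dict[str, str] = {}

with ThreadPoolExecutor(max_workers=50) as pool:  # cf. "concurrency": 50 above
    futures = {pool.submit(run_scenario, s): s for s in scenarios}
    for fut in as_completed(futures):
        name = futures[fut]
        try:
            results[name] = "pass" if fut.result(timeout=300) else "fail"
        except Exception:
            results[name] = "error"  # recorded per scenario; suite continues

# results -> {'scenario-a': 'pass', 'flaky-scenario': 'error', 'scenario-b': 'pass'}
```

The per-future `try`/`except` is the key design choice: an exception is data about one scenario, not a reason to abort the run.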

Comparative Analysis

Side-by-side results between baseline and current run. Filter by scenario to find exactly where performance diverged. Structured JSON for CI pipeline integration.
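A CI step consuming structured results might look like the sketch below. The JSON shape is a guess for illustration, not the platform's actual results schema:

```python
import json

# Illustrative CI gate over structured benchmark results. The payload shape
# is assumed for this sketch, not the real results schema.
payload = json.loads("""
{
  "pass_rate": 0.71,
  "avg_duration_ms": 38000,
  "scenarios": [
    {"name": "edge-case-07", "baseline": "pass", "current": "fail"},
    {"name": "long-context-02", "baseline": "pass", "current": "pass"}
  ]
}
""")

# Scenario-level diff against baseline: exactly where performance diverged.
diverged = [s["name"] for s in payload["scenarios"] if s["current"] != s["baseline"]]

# Threshold gate the pipeline can act on directly.
gate_ok = payload["pass_rate"] >= 0.75 and payload["avg_duration_ms"] <= 45000
print(diverged, gate_ok)  # -> ['edge-case-07'] False
```

Because the output is machine-readable, the gate is just a comparison in the pipeline rather than a human reading a dashboard.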

From baseline definition to automated regression gates

Four steps from regression baseline to deployment confidence.

01
Define Your Regression Baseline

Select scenarios that represent your agent's expected capability profile. Define scoring contracts. Save the complete configuration as a BenchmarkJobDef template.

02
Configure Evaluation Parameters

Set concurrency, retry policy, and timeout limits. Specify which variables will be injected at run time: model version, agent build ID, prompt variant. Configure pass/fail thresholds.

03
Run on Every Change

Trigger from CI/CD, on a schedule, or manually via API/CLI. The platform executes across isolated sandboxes and streams progress in real time through the Job Dashboard.

04
Analyze and Gate Deployments

View comparative results against baseline. Structured results API returns JSON for CI evaluation. Filter by scenario to isolate regressions. Build longitudinal history across changes.


No other AI agent infrastructure platform offers CI/CD-integrated benchmark evaluation as a product capability. Regression testing for AI agents is not a feature competitors provide -- it is a category Runloop created.

Why teams choose Runloop for agent performance monitoring

Three reasons no alternative matches this capability

Product Primitive

Other platforms provide execution environments but no evaluation layer. Runloop treats benchmark evaluation as infrastructure. BenchmarkJobDef templates, scoring contracts, and structured results are built-in primitives.

Cross-Use-Case Baselines

A scenario suite built for agent testing becomes your regression baseline. An evaluation from model selection becomes a regression template. Every scenario serves multiple purposes.

Pipeline-Native Results

Results return as structured JSON, not dashboards you check manually. Your CI system evaluates pass rate, duration, and score distribution against thresholds. No custom harness, no parsing scripts.

F.A.Q

Regression testing questions

Common questions about detecting and preventing AI agent performance regressions.

What counts as a 'silent regression' in AI agents?
How do I integrate Runloop regression testing into GitHub Actions or other CI pipelines?
How often should I run regression suites?
What is a BenchmarkJobDef template and how does it define baselines?
How does regression testing relate to model evaluation and agent testing?
More questions? Visit our docs or send us a message