Run your regression suite on every change that matters
Runloop integrates benchmark evaluation into your deployment pipeline as a first-class CI/CD check. Define a benchmark suite that captures your agent's expected capability profile, then run it automatically whenever a relevant variable changes: agent code, prompt templates, dependencies, or the underlying model. Results become a gate -- changes that degrade performance are flagged before they reach production.
```python
import runloop
# Define regression baseline from a stored template
job = runloop.benchmark_jobs.create(
    benchmark_job_def="agent-v3-regression-baseline",
    params={"model": "claude-sonnet-4-5", "agent_build": "pr-4521"},
    config={"concurrency": 50, "timeout_seconds": 300},
)
# Wait for results and check against thresholds
results = runloop.benchmark_jobs.wait(job.id)
assert results.pass_rate >= 0.75, f"Regression: pass rate {results.pass_rate}"
assert results.avg_duration_ms <= 45000, f"Regression: avg duration {results.avg_duration_ms}ms"
```

```typescript
import Runloop from 'runloop';
// Define regression baseline from a stored template
const job = await runloop.benchmarkJobs.create({
  benchmarkJobDef: 'agent-v3-regression-baseline',
  params: { model: 'claude-sonnet-4-5', agentBuild: 'pr-4521' },
  config: { concurrency: 50, timeoutSeconds: 300 }
});
// Wait for results and check against thresholds
const results = await runloop.benchmarkJobs.wait(job.id);
if (results.passRate < 0.75) throw new Error(`Regression: pass rate ${results.passRate}`);
if (results.avgDurationMs > 45000) throw new Error(`Regression: avg duration ${results.avgDurationMs}ms`);
```

```bash
# Run regression suite from a stored template
runloop benchmark run \
  --job-def "agent-v3-regression-baseline" \
  --param "model=claude-sonnet-4-5" \
  --param "agent_build=pr-4521" \
  --concurrency 50 \
  --timeout 300
# Check results against thresholds
runloop benchmark results --job agent-v3-regression-baseline \
  --assert "pass_rate >= 0.75" \
  --assert "avg_duration_ms <= 45000"
```

Continuous agent evaluation infrastructure
Four primitives from the [Runloop Platform](/product) that make regression detection systematic, not manual.
Define regression baselines as stored, reusable evaluation configurations. Fixed parameters codify the test suite; variable parameters inject the model version or agent build at run time.

Define what passing means per scenario: binary pass/fail, graded rubrics, or composite scoring functions. Harbor YAML support for scenario definition and import.
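The three scoring shapes can be sketched as plain functions. This is a minimal illustration of the idea, not Runloop's scoring API; the scenario-output and rubric structures here are hypothetical:

```python
# Illustrative scoring-contract shapes: binary, rubric-graded, and composite.
# The data structures are examples, not Runloop's actual schema.

def binary_score(output: str, expected: str) -> float:
    """Binary pass/fail: 1.0 if the output matches, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def rubric_score(output: str, rubric: dict[str, float]) -> float:
    """Graded rubric: award each criterion's weight if the output satisfies it."""
    earned = sum(w for phrase, w in rubric.items() if phrase in output)
    return earned / sum(rubric.values())

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Composite: weighted average of named sub-scores."""
    return sum(scores[n] * w for n, w in weights.items()) / sum(weights.values())

rubric = {"imports fixed": 0.5, "tests pass": 0.5}
overall = composite_score(
    {"correctness": binary_score("ok", "ok"), "quality": rubric_score("tests pass", rubric)},
    {"correctness": 0.7, "quality": 0.3},
)
print(overall)  # weighted average of the two sub-scores
```

Whatever the shape, the contract's job is the same: reduce a scenario run to a number the pipeline can compare against a threshold.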

Configurable concurrency, retry policies, and timeout enforcement per job. Individual scenario failures do not block the suite. Real-time progress streaming.

Side-by-side results between baseline and current run. Filter by scenario to find exactly where performance diverged. Structured JSON for CI pipeline integration.

Benchmark evaluation as a pull request gate
A pull request that modifies agent code, prompt templates, or dependency versions triggers a Benchmark Job as part of the PR check workflow. The job runs the regression suite defined in a BenchmarkJobDef template, and the structured results are evaluated against the thresholds your pipeline defines. Merges are blocked if performance drops below the threshold. The feedback loop is the same one your team already uses for linting, type checking, and unit tests -- except now it covers agent behavior.
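The gate step itself reduces to threshold checks over the structured results. A minimal sketch, assuming a results payload with the `pass_rate` and `avg_duration_ms` fields shown in the snippets above; the helper and threshold values are illustrative:

```python
# Illustrative CI gate over a structured results payload. Field names mirror
# the snippets above; the thresholds are examples your pipeline would define.
THRESHOLDS = {"pass_rate": ("min", 0.75), "avg_duration_ms": ("max", 45000)}

def gate(results: dict) -> list[str]:
    """Return threshold violations; an empty list means the PR may merge."""
    failures = []
    for field, (kind, limit) in THRESHOLDS.items():
        value = results[field]
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            failures.append(f"{field}={value} violates {kind} {limit}")
    return failures

# Stand-in for the JSON a results API would return.
for failure in gate({"pass_rate": 0.71, "avg_duration_ms": 39000}):
    print(f"Regression: {failure}")
```

In a real pipeline the list of failures would drive the step's exit code, which is what actually blocks the merge.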

Detect LLM regression when model providers push updates
Model version changes happen outside your codebase. Your LLM provider ships an update, and your agent's behavior shifts in ways no code review would catch. Some changes improve performance; others introduce subtle regressions on specific task categories that aggregate metrics can mask. The BenchmarkJobDef template makes monitoring straightforward: re-run the same template with the new model parameter and compare against the stored baseline. The comparison dashboard shows where the current run diverges.
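The per-scenario comparison is the key step, since aggregate metrics can mask a category-specific drop. A sketch of that idea under assumed result shapes (the dicts here are hypothetical, not a Runloop response format):

```python
# Sketch: compare per-scenario pass rates between a stored baseline and a
# fresh run against an updated model. Data shapes are illustrative.

def diverging_scenarios(baseline: dict[str, float], current: dict[str, float],
                        tolerance: float = 0.05) -> dict[str, float]:
    """Return scenarios whose pass rate dropped by more than `tolerance`."""
    return {
        name: current.get(name, 0.0) - baseline[name]
        for name in baseline
        if baseline[name] - current.get(name, 0.0) > tolerance
    }

baseline = {"code-edit": 0.90, "tool-use": 0.85, "long-context": 0.80}
current  = {"code-edit": 0.92, "tool-use": 0.60, "long-context": 0.78}
# The aggregate barely moves, but one category regressed sharply.
print(diverging_scenarios(baseline, current))
```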

From baseline definition to automated regression gates
Four steps from regression baseline to deployment confidence.
Select scenarios that represent your agent's expected capability profile. Define scoring contracts. Save the complete configuration as a BenchmarkJobDef template.

Set concurrency, retry policy, and timeout limits. Specify which variables will be injected at run time: model version, agent build ID, prompt variant. Configure pass/fail thresholds.

Trigger from CI/CD, on a schedule, or manually via API/CLI. The platform executes across isolated sandboxes and streams progress in real time through the Job Dashboard.

View comparative results against baseline. Structured results API returns JSON for CI evaluation. Filter by scenario to isolate regressions. Build longitudinal history across changes.
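Longitudinal history amounts to persisting each run's summary and trending it across changes. A minimal sketch, assuming an append-only JSON-lines log; the file name and record fields are examples, not a Runloop format:

```python
import json
import os
import tempfile

# Illustrative longitudinal history: append each run's summary as one JSON
# line, then scan the log to trend pass rate across changes.

def record_run(path: str, run: dict) -> None:
    with open(path, "a") as f:
        f.write(json.dumps(run) + "\n")

def pass_rate_trend(path: str) -> list[float]:
    with open(path) as f:
        return [json.loads(line)["pass_rate"] for line in f]

path = os.path.join(tempfile.mkdtemp(), "benchmark-history.jsonl")
record_run(path, {"build": "pr-4519", "pass_rate": 0.82})
record_run(path, {"build": "pr-4521", "pass_rate": 0.74})
print(pass_rate_trend(path))  # [0.82, 0.74]
```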

Why teams choose Runloop for agent performance monitoring
Three reasons no alternative matches this capability
Other platforms provide execution environments but no evaluation layer. Runloop treats benchmark evaluation as infrastructure. BenchmarkJobDef templates, scoring contracts, and structured results are built-in primitives.

A scenario suite built for agent testing becomes your regression baseline. An evaluation from model selection becomes a regression template. Every scenario serves multiple purposes.

Results return as structured JSON, not dashboards you check manually. Your CI system evaluates pass rate, duration, and score distribution against thresholds. No custom harness, no parsing scripts.

Regression testing questions
Common questions about detecting and preventing AI agent performance regressions.
A silent regression is a performance degradation that does not trigger traditional test failures. The agent still runs, still produces output, and still appears functional, but its success rate on specific task categories has dropped. Common causes include model provider updates that change generation behavior, dependency upgrades that alter tool output formats, and prompt changes that improve average performance while introducing new failure modes on edge cases. Because agent behavior is probabilistic, these regressions are invisible without systematic evaluation against a known baseline.
The integration is a single API call. Your CI pipeline triggers a Benchmark Job using a stored BenchmarkJobDef template, waits for completion, and evaluates the structured JSON response against your defined thresholds. For GitHub Actions, this is a step in your workflow YAML that calls the Runloop API and asserts on the returned pass rate, duration, and score metrics. The same pattern works with GitLab CI, CircleCI, Jenkins, or any pipeline system that can make HTTP requests and evaluate JSON responses. See the [Runloop documentation](https://docs.runloop.ai) for pipeline integration examples.
Frequency depends on your change velocity and risk tolerance. Most teams run focused regression suites on every pull request that modifies agent code, prompt templates, or dependency versions. For model provider updates, which happen outside your codebase, teams typically run the suite manually when an update is announced or on a recurring schedule (daily or weekly). Broader, more comprehensive suites that take longer to execute are often run nightly or on release branches rather than on every PR.
A BenchmarkJobDef is a stored, reusable evaluation configuration. It codifies the fixed parameters of your regression test: which scenarios to run, what scoring criteria to apply, which orchestration settings to use (concurrency, retry policy, timeouts). Variable parameters like model version, agent build ID, or prompt variant are left as injection points that get specified at run time. The template ensures that every regression run uses identical evaluation infrastructure, so results are directly comparable. The baseline is established by running the template against a known-good configuration, and subsequent runs compare against that reference point.
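The fixed/variable split a template implies can be sketched in a few lines. This is a conceptual illustration, assuming hypothetical dict shapes rather than Runloop's actual BenchmarkJobDef schema:

```python
# Illustrative template with fixed evaluation settings and variable
# injection points. Shapes are examples, not Runloop's actual schema.
TEMPLATE = {
    "scenarios": ["code-edit", "tool-use", "long-context"],  # fixed
    "config": {"concurrency": 50, "timeout_seconds": 300},   # fixed
    "params": {"model": None, "agent_build": None},          # injected at run time
}

def instantiate(template: dict, **injected) -> dict:
    """Fill the template's variable slots; reject missing or unknown params."""
    missing = [k for k in template["params"] if k not in injected]
    unknown = [k for k in injected if k not in template["params"]]
    if missing or unknown:
        raise ValueError(f"missing={missing} unknown={unknown}")
    return {**template, "params": dict(injected)}

job = instantiate(TEMPLATE, model="claude-sonnet-4-5", agent_build="pr-4521")
print(job["params"]["model"])  # claude-sonnet-4-5
```

Because everything outside `params` is frozen in the template, two runs differ only in what was injected, which is what makes their results directly comparable.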
All three use the same underlying infrastructure: Benchmark Jobs, scenario definitions, isolated sandboxes, scoring contracts, and comparative analysis. The difference is the question being asked. [Model evaluation](/solutions/model-evaluation) asks which model to use and produces a selection decision. [Agent testing](/solutions/agent-testing) asks whether the agent works correctly and safely, producing a capability assessment. Regression testing asks whether a specific change made things worse, producing a go/no-go signal in a deployment pipeline. Because all three share the same platform primitives, a scenario suite built for agent testing becomes a regression baseline, and an evaluation template built for vendor selection becomes a regression monitor. All evaluation runs execute inside isolated sandboxes with [Credential Gateway and MCP Hub security boundaries](/security), so regression testing inherits the same credential protection and tool-level access control as production.