Run your regression suite on every change that matters
Runloop integrates benchmark evaluation into your deployment pipeline as a first-class CI/CD check. Define a benchmark suite that captures your agent's expected capability profile, then run it automatically whenever a relevant variable changes: agent code, prompt templates, dependencies, or the underlying model. Results become a gate -- changes that degrade performance are flagged before they reach production.
```python
import runloop
# Define regression baseline from a stored template
job = runloop.benchmark_jobs.create(
    benchmark_job_def="agent-v3-regression-baseline",
    params={"model": "claude-sonnet-4-5", "agent_build": "pr-4521"},
    config={"concurrency": 50, "timeout_seconds": 300},
)
# Wait for results and check against thresholds
results = runloop.benchmark_jobs.wait(job.id)
assert results.pass_rate >= 0.75, f"Regression: pass rate {results.pass_rate}"
assert results.avg_duration_ms <= 45000, f"Regression: avg duration {results.avg_duration_ms}ms"
```

```typescript
import Runloop from 'runloop';
// Define regression baseline from a stored template
const job = await runloop.benchmarkJobs.create({
  benchmarkJobDef: 'agent-v3-regression-baseline',
  params: { model: 'claude-sonnet-4-5', agentBuild: 'pr-4521' },
  config: { concurrency: 50, timeoutSeconds: 300 }
});
// Wait for results and check against thresholds
const results = await runloop.benchmarkJobs.wait(job.id);
if (results.passRate < 0.75) throw new Error(`Regression: pass rate ${results.passRate}`);
if (results.avgDurationMs > 45000) throw new Error(`Regression: avg duration ${results.avgDurationMs}ms`);
```

```bash
# Run regression suite from a stored template
runloop benchmark run \
  --job-def "agent-v3-regression-baseline" \
  --param "model=claude-sonnet-4-5" \
  --param "agent_build=pr-4521" \
  --concurrency 50 \
  --timeout 300
# Check results against thresholds
runloop benchmark results --job agent-v3-regression-baseline \
  --assert "pass_rate >= 0.75" \
  --assert "avg_duration_ms <= 45000"
```

Continuous agent evaluation infrastructure
Four primitives from the [Runloop Platform](/product) that make regression detection systematic, not manual.
Define regression baselines as stored, reusable evaluation configurations. Fixed parameters codify the test suite; variable parameters inject the model version or agent build at run time.

Define what passing means per scenario: binary pass/fail, graded rubrics, or composite scoring functions. Harbor YAML support for scenario definition and import.
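The three scoring shapes can be sketched as plain functions. This is a minimal illustration of the idea, not Runloop's scoring API; the scenario-output and rubric structures here are hypothetical:

```python
# Illustrative scoring-contract shapes: binary, rubric-graded, and composite.
# The data structures are examples, not Runloop's actual schema.

def binary_score(output: str, expected: str) -> float:
    """Binary pass/fail: 1.0 if the output matches, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def rubric_score(output: str, rubric: dict[str, float]) -> float:
    """Graded rubric: award each criterion's weight if the output satisfies it."""
    earned = sum(w for phrase, w in rubric.items() if phrase in output)
    return earned / sum(rubric.values())

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Composite: weighted average of named sub-scores."""
    return sum(scores[n] * w for n, w in weights.items()) / sum(weights.values())

rubric = {"imports fixed": 0.5, "tests pass": 0.5}
overall = composite_score(
    {"correctness": binary_score("ok", "ok"), "quality": rubric_score("tests pass", rubric)},
    {"correctness": 0.7, "quality": 0.3},
)
print(overall)  # weighted average of the two sub-scores
```

Whatever the shape, the contract's job is the same: reduce a scenario run to a number the pipeline can compare against a threshold.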

Configurable concurrency, retry policies, and timeout enforcement per job. Individual scenario failures do not block the suite. Real-time progress streaming.

Side-by-side results between baseline and current run. Filter by scenario to find exactly where performance diverged. Structured JSON for CI pipeline integration.

Benchmark evaluation as a pull request gate
A pull request that modifies agent code, prompt templates, or dependency versions triggers a Benchmark Job as part of the PR check workflow. The job runs the regression suite defined in a BenchmarkJobDef template, and the structured results are evaluated against the thresholds your pipeline defines. Merges are blocked if performance drops below the threshold. The feedback loop is the same one your team already uses for linting, type checking, and unit tests -- except now it covers agent behavior.
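The gate step itself reduces to threshold checks over the structured results. A minimal sketch, assuming a results payload with the `pass_rate` and `avg_duration_ms` fields shown in the snippets above; the helper and threshold values are illustrative:

```python
# Illustrative CI gate over a structured results payload. Field names mirror
# the snippets above; the thresholds are examples your pipeline would define.
THRESHOLDS = {"pass_rate": ("min", 0.75), "avg_duration_ms": ("max", 45000)}

def gate(results: dict) -> list[str]:
    """Return threshold violations; an empty list means the PR may merge."""
    failures = []
    for field, (kind, limit) in THRESHOLDS.items():
        value = results[field]
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            failures.append(f"{field}={value} violates {kind} {limit}")
    return failures

# Stand-in for the JSON a results API would return.
for failure in gate({"pass_rate": 0.71, "avg_duration_ms": 39000}):
    print(f"Regression: {failure}")
```

In a real pipeline the list of failures would drive the step's exit code, which is what actually blocks the merge.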

Detect LLM regression when model providers push updates
Model version changes happen outside your codebase. Your LLM provider ships an update, and your agent's behavior shifts in ways no code review would catch. Some changes improve performance; others introduce subtle regressions on specific task categories that aggregate metrics can mask. The BenchmarkJobDef template makes monitoring straightforward: re-run the same template with the new model parameter and compare against the stored baseline. The comparison dashboard shows where the current run diverges.
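The per-scenario comparison is the key step, since aggregate metrics can mask a category-specific drop. A sketch of that idea under assumed result shapes (the dicts here are hypothetical, not a Runloop response format):

```python
# Sketch: compare per-scenario pass rates between a stored baseline and a
# fresh run against an updated model. Data shapes are illustrative.

def diverging_scenarios(baseline: dict[str, float], current: dict[str, float],
                        tolerance: float = 0.05) -> dict[str, float]:
    """Return scenarios whose pass rate dropped by more than `tolerance`."""
    return {
        name: current.get(name, 0.0) - baseline[name]
        for name in baseline
        if baseline[name] - current.get(name, 0.0) > tolerance
    }

baseline = {"code-edit": 0.90, "tool-use": 0.85, "long-context": 0.80}
current  = {"code-edit": 0.92, "tool-use": 0.60, "long-context": 0.78}
# The aggregate barely moves, but one category regressed sharply.
print(diverging_scenarios(baseline, current))
```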

From baseline definition to automated regression gates
Four steps from regression baseline to deployment confidence.
Select scenarios that represent your agent's expected capability profile. Define scoring contracts. Save the complete configuration as a BenchmarkJobDef template.

Set concurrency, retry policy, and timeout limits. Specify which variables will be injected at run time: model version, agent build ID, prompt variant. Configure pass/fail thresholds.

Trigger from CI/CD, on a schedule, or manually via API/CLI. The platform executes across isolated sandboxes and streams progress in real time through the Job Dashboard.

View comparative results against baseline. Structured results API returns JSON for CI evaluation. Filter by scenario to isolate regressions. Build longitudinal history across changes.
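Longitudinal history amounts to persisting each run's summary and trending it across changes. A minimal sketch, assuming an append-only JSON-lines log; the file name and record fields are examples, not a Runloop format:

```python
import json
import os
import tempfile

# Illustrative longitudinal history: append each run's summary as one JSON
# line, then scan the log to trend pass rate across changes.

def record_run(path: str, run: dict) -> None:
    with open(path, "a") as f:
        f.write(json.dumps(run) + "\n")

def pass_rate_trend(path: str) -> list[float]:
    with open(path) as f:
        return [json.loads(line)["pass_rate"] for line in f]

path = os.path.join(tempfile.mkdtemp(), "benchmark-history.jsonl")
record_run(path, {"build": "pr-4519", "pass_rate": 0.82})
record_run(path, {"build": "pr-4521", "pass_rate": 0.74})
print(pass_rate_trend(path))  # [0.82, 0.74]
```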

Why teams choose Runloop for agent performance monitoring
Three reasons no alternative matches this capability
Other platforms provide execution environments but no evaluation layer. Runloop treats benchmark evaluation as infrastructure. BenchmarkJobDef templates, scoring contracts, and structured results are built-in primitives.

A scenario suite built for agent testing becomes your regression baseline. An evaluation from model selection becomes a regression template. Every scenario serves multiple purposes.

Results return as structured JSON, not dashboards you check manually. Your CI system evaluates pass rate, duration, and score distribution against thresholds. No custom harness, no parsing scripts.

Regression testing questions
Common questions about detecting and preventing AI agent performance regressions.
A silent regression is a performance degradation that does not trigger traditional test failures. The agent still runs, still produces output, and still appears functional, but its success rate on specific task categories has dropped. Common causes include model provider updates that change generation behavior, dependency upgrades that alter tool output formats, and prompt changes that improve average performance while introducing new failure modes on edge cases. Because agent behavior is probabilistic, these regressions are invisible without systematic evaluation against a known baseline.
The integration is a single API call. Your CI pipeline triggers a Benchmark Job using a stored BenchmarkJobDef template, waits for completion, and evaluates the structured JSON response against your defined thresholds. For GitHub Actions, this is a step in your workflow YAML that calls the Runloop API and asserts on the returned pass rate, duration, and score metrics. The same pattern works with GitLab CI, CircleCI, Jenkins, or any pipeline system that can make HTTP requests and evaluate JSON responses. See the [Runloop documentation](https://docs.runloop.ai) for pipeline integration examples.
Frequency depends on your change velocity and risk tolerance. Most teams run focused regression suites on every pull request that modifies agent code, prompt templates, or dependency versions. For model provider updates, which happen outside your codebase, teams typically run the suite manually when an update is announced or on a recurring schedule (daily or weekly). Broader, more comprehensive suites that take longer to execute are often run nightly or on release branches rather than on every PR.
A BenchmarkJobDef is a stored, reusable evaluation configuration. It codifies the fixed parameters of your regression test: which scenarios to run, what scoring criteria to apply, which orchestration settings to use (concurrency, retry policy, timeouts). Variable parameters like model version, agent build ID, or prompt variant are left as injection points that get specified at run time. The template ensures that every regression run uses identical evaluation infrastructure, so results are directly comparable. The baseline is established by running the template against a known-good configuration, and subsequent runs compare against that reference point.
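The fixed/variable split a template implies can be sketched in a few lines. This is a conceptual illustration, assuming hypothetical dict shapes rather than Runloop's actual BenchmarkJobDef schema:

```python
# Illustrative template with fixed evaluation settings and variable
# injection points. Shapes are examples, not Runloop's actual schema.
TEMPLATE = {
    "scenarios": ["code-edit", "tool-use", "long-context"],  # fixed
    "config": {"concurrency": 50, "timeout_seconds": 300},   # fixed
    "params": {"model": None, "agent_build": None},          # injected at run time
}

def instantiate(template: dict, **injected) -> dict:
    """Fill the template's variable slots; reject missing or unknown params."""
    missing = [k for k in template["params"] if k not in injected]
    unknown = [k for k in injected if k not in template["params"]]
    if missing or unknown:
        raise ValueError(f"missing={missing} unknown={unknown}")
    return {**template, "params": dict(injected)}

job = instantiate(TEMPLATE, model="claude-sonnet-4-5", agent_build="pr-4521")
print(job["params"]["model"])  # claude-sonnet-4-5
```

Because everything outside `params` is frozen in the template, two runs differ only in what was injected, which is what makes their results directly comparable.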
All three use the same underlying infrastructure: Benchmark Jobs, scenario definitions, isolated sandboxes, scoring contracts, and comparative analysis. The difference is the question being asked. [Model evaluation](/solutions/model-evaluation) asks which model to use and produces a selection decision. [Agent testing](/solutions/agent-testing) asks whether the agent works correctly and safely, producing a capability assessment. Regression testing asks whether a specific change made things worse, producing a go/no-go signal in a deployment pipeline. Because all three share the same platform primitives, a scenario suite built for agent testing becomes a regression baseline, and an evaluation template built for vendor selection becomes a regression monitor. All evaluation runs execute inside isolated sandboxes with [Credential Gateway and MCP Hub security boundaries](/security), so regression testing inherits the same credential protection and tool-level access control as production.