Isolated environments instrumented for evaluation
Every test runs in its own isolated sandbox with production-identical toolchains. The Credential Gateway injects API keys as opaque tokens bound to each sandbox. Benchmark Jobs orchestrate evaluation at scale with structured results that feed directly into CI/CD.
```python
import runloop

# Create an isolated sandbox for agent testing
devbox = runloop.devboxes.create(
    blueprint_id="agent-test-env",
    metadata={"test_suite": "pr-validation"},
)

# Run a benchmark job across scenarios
job = runloop.benchmark_jobs.create(
    name="agent-v2.1-safety-suite",
    scenarios=scenario_suite,
    config={"concurrency": 50, "timeout_seconds": 300},
)

# Stream results as scenarios complete
for result in runloop.benchmark_jobs.stream_results(job.id):
    print(f"{result.scenario}: {result.status} ({result.duration_ms}ms)")
```

```typescript
import Runloop from 'runloop';

// Instantiate the client
const runloop = new Runloop();

// Create an isolated sandbox for agent testing
const devbox = await runloop.devboxes.create({
  blueprintId: 'agent-test-env',
  metadata: { testSuite: 'pr-validation' }
});

// Run a benchmark job across scenarios
const job = await runloop.benchmarkJobs.create({
  name: 'agent-v2.1-safety-suite',
  scenarios: scenarioSuite,
  config: { concurrency: 50, timeoutSeconds: 300 }
});

// Stream results as scenarios complete
for await (const result of runloop.benchmarkJobs.streamResults(job.id)) {
  console.log(`${result.scenario}: ${result.status} (${result.durationMs}ms)`);
}
```

```bash
# Create a sandbox from a blueprint
runloop devbox create --blueprint agent-test-env
# Submit a benchmark job
runloop benchmark run \
  --name "agent-v2.1-safety-suite" \
  --scenarios ./scenario-suite.yaml \
  --concurrency 50 \
  --timeout 300
# View results
runloop benchmark results --job agent-v2.1-safety-suite --format table
```
---

Agent testing with security built in
Four infrastructure primitives that make agent evaluation safe, repeatable, and scalable.
Full Linux environment per test via Blueprints. Ephemeral -- no state leakage between runs. Same isolation as production devboxes.

Opaque tokens bound to each sandbox. Under 10ms latency. Agents call external services during evaluation without credential exposure.

Tool-level ACL with pattern matching. Restricted tools are invisible to the agent. Under 20ms routing with full audit trail.

Parallel execution with configurable concurrency and retry. BenchmarkJobDef templates for repeatable evaluations. Structured results API.

Run thousands of agent test scenarios in parallel
Agent behavior is stochastic. A single evaluation run tells you very little about actual capability. Statistical confidence requires volume: running the same scenario multiple times, running many different scenarios, and doing both across every agent change. Runloop's orchestration layer manages this through Benchmark Jobs -- you define scenario suites covering the capability dimensions you care about, and the platform handles parallel execution across hundreds of concurrent sandboxes with configurable concurrency, timeout enforcement, and retry logic.
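The statistical argument is easy to make concrete. This is a plain-Python sketch, not part of the Runloop SDK: the Wilson score interval around an observed pass rate narrows roughly with the square root of the trial count, which is why 10 runs and 100 runs with the same pass rate tell very different stories.

```python
import math

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a pass rate over repeated trials."""
    if trials == 0:
        return (0.0, 1.0)
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (center - half, center + half)

# Same 80% point estimate, very different certainty:
print(wilson_interval(8, 10))    # wide interval
print(wilson_interval(80, 100))  # much tighter interval
```

This is the reason scenario suites are typically run with multiple trials per scenario rather than a single pass.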

Validate what agents cannot do, not just what they can
Agent testing is not only about capability -- it is about containment. Before deploying an agent to production, you need to know that it respects the boundaries you have set. Does it stay within its permitted tool set? Does it attempt to access services it should not? Runloop's security infrastructure doubles as testing infrastructure for these questions. MCP Hub's tool-level access control lets you define exactly which tools an agent can see and call, and the audit trail records every invocation and every boundary the agent attempted to cross. The Credential Gateway ensures that even during safety testing, real credentials never exist on the sandbox.
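Downstream of the audit trail, containment checking reduces to a set-membership pass over recorded invocations. A minimal sketch, assuming the trail can be exported as a list of records (the record shape and tool names here are hypothetical, not the actual MCP Hub schema):

```python
# Hypothetical audit record shape: {"tool": str, ...}
ALLOWED_TOOLS = {"read_file", "run_tests", "git_diff"}

def boundary_violations(audit_records: list[dict]) -> list[str]:
    """Return the tools the agent attempted to call outside its permitted set."""
    return [r["tool"] for r in audit_records if r["tool"] not in ALLOWED_TOOLS]

records = [
    {"tool": "read_file"},
    {"tool": "send_email"},  # attempted boundary crossing
    {"tool": "run_tests"},
]
print(boundary_violations(records))  # ['send_email']
```

An empty violation list across many trials is the containment signal you are looking for; a nonempty one is a finding regardless of whether the call succeeded.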

From scenarios to structured results
Four steps from evaluation definition to actionable data.
Write scenario definitions with tasks, constraints, and scoring criteria. Import existing definitions via Harbor YAML or define them through the SDK.

Set up sandboxes via Blueprints. Attach Credential Gateway tokens and MCP Hub configurations. Define network access controls.

Submit a Benchmark Job. The platform provisions sandboxes, distributes scenarios at your concurrency limit, and streams progress in real time.

Review pass rates, duration distributions, and score breakdowns. Compare agent versions on the same scenarios side by side via API or dashboard.

What other platforms cannot offer
Three capabilities without direct equivalents on competing platforms.
Other sandboxes expose API keys as env vars. Credential Gateway uses opaque, sandbox-bound tokens that cannot be decoded or reused.

Other platforms offer all-or-nothing tool access. MCP Hub enforces per-tool permissions and makes restricted tools invisible to the agent.

Other providers offer compute but no evaluation. Runloop includes orchestration, templates, and comparative analysis as native capabilities.

Agent testing questions
Common questions about testing AI agents with Runloop's evaluation infrastructure.
Traditional testing validates deterministic input-output behavior. Agent testing evaluates stochastic, multi-step decision-making where the same agent may take different paths to correct solutions. This requires observing full execution trajectories -- which tools were invoked, in what order, with what arguments, and how the agent adapted when something went wrong. It requires running enough trials to distinguish genuine capability from statistical noise. And it requires testing security boundaries: whether the agent respects credential access rules, tool permissions, and network reach. Runloop provides infrastructure for all three dimensions: isolated sandbox environments for safe execution, Benchmark Jobs for scale, and Credential Gateway plus MCP Hub for security boundary validation.
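A trajectory here is just the ordered sequence of tool calls with their arguments and outcomes. As an illustrative sketch (the step dictionaries are hypothetical stand-ins, not a Runloop API), two of the questions above become small functions over that sequence:

```python
def tool_sequence(trajectory: list[dict]) -> list[str]:
    """Flatten a trajectory into the ordered list of tools invoked."""
    return [step["tool"] for step in trajectory]

def adapted_after_error(trajectory: list[dict]) -> bool:
    """Did the agent make at least one more call after its first failed step?"""
    for i, step in enumerate(trajectory):
        if step.get("error"):
            return i < len(trajectory) - 1
    return False

run = [
    {"tool": "read_file", "args": {"path": "app.py"}},
    {"tool": "run_tests", "error": "2 tests failed"},
    {"tool": "edit_file", "args": {"path": "app.py"}},  # adapted after the failure
    {"tool": "run_tests"},
]
print(tool_sequence(run))
print(adapted_after_error(run))  # True
```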
Yes. Benchmark Jobs can be triggered via the Runloop API or CLI from any CI/CD system. The typical integration point is a pull request check: when a PR modifies agent code, prompt templates, or dependency versions, a benchmark job runs the regression suite and posts structured results back as a status check. Merges can be gated on performance thresholds you define. BenchmarkJobDef templates make this repeatable without reconfiguring the evaluation each time. For the full CI/CD integration pattern, see the [Regression Testing](/solutions/regression-testing) solution page.
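The gating step itself is a small piece of glue. A hedged sketch, with an illustrative result shape (a real integration would read these from the Benchmark Jobs results API): compute the pass rate and return a nonzero exit code when it falls below the threshold, so the CI system fails the check.

```python
def gate(results: list[dict], min_pass_rate: float = 0.9) -> int:
    """Return a process exit code: 0 if the suite meets the threshold, 1 otherwise."""
    passed = sum(1 for r in results if r["status"] == "pass")
    rate = passed / len(results) if results else 0.0
    print(f"pass rate: {rate:.1%} (threshold {min_pass_rate:.0%})")
    return 0 if rate >= min_pass_rate else 1

results = [{"status": "pass"}] * 18 + [{"status": "fail"}] * 2
exit_code = gate(results)  # 18/20 = 90%, meets the default threshold
```

In a CI script this would end with `sys.exit(gate(results))` so the status check reflects the outcome.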
The Credential Gateway replaces real API keys with opaque tokens that are bound to a specific sandbox and expire when that sandbox terminates. The agent uses these tokens exactly as it would use real credentials -- same SDK calls, same API patterns -- but the actual credentials are injected server-side by the gateway and never exist on the sandbox. Even if an attacker extracts a token through prompt injection, it cannot be decoded or reused outside the originating environment. The gateway supports bearer, header, basic, and query parameter authentication types with less than 10ms of added latency. For the complete security architecture, see the [Security and Compliance](/security) page.
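The four authentication types map to four standard request shapes. A plain-Python sketch of the substitution the gateway performs server-side (illustrative only; in practice this step never runs inside the sandbox, which is the point):

```python
import base64

def apply_auth(auth_type: str, credential: str, headers: dict, params: dict,
               header_name: str = "X-Api-Key", param_name: str = "api_key") -> None:
    """Attach a credential to a request using one of the four supported auth shapes."""
    if auth_type == "bearer":
        headers["Authorization"] = f"Bearer {credential}"
    elif auth_type == "header":
        headers[header_name] = credential
    elif auth_type == "basic":
        # Basic auth is the base64-encoded "user:password" pair
        encoded = base64.b64encode(credential.encode()).decode()
        headers["Authorization"] = f"Basic {encoded}"
    elif auth_type == "query":
        params[param_name] = credential
    else:
        raise ValueError(f"unknown auth type: {auth_type}")

headers, params = {}, {}
apply_auth("bearer", "real-key-123", headers, params)
print(headers["Authorization"])  # Bearer real-key-123
```

The header and query-parameter names are hypothetical defaults; real services define their own.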
The platform currently tracks pass rate, average duration, and score distribution at the job, run, and individual scenario levels. Comparative analysis shows where a candidate diverges from a baseline: which scenarios changed status, how aggregate metrics shifted, and trajectory differences between agent versions. Token usage, cost per task, and tool call patterns are planned metrics that will ship with the upcoming LLM Proxy integration.
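The comparative piece can be sketched as a per-scenario diff between two runs (the result dicts here are hypothetical stand-ins for the results API payload, keyed by scenario name):

```python
def diff_runs(baseline: dict, candidate: dict) -> dict:
    """Per-scenario status changes between a baseline run and a candidate run."""
    changes = {
        s: (baseline[s], candidate[s])
        for s in baseline
        if s in candidate and baseline[s] != candidate[s]
    }
    regressions = [s for s, (old, new) in changes.items() if old == "pass" and new == "fail"]
    improvements = [s for s, (old, new) in changes.items() if old == "fail" and new == "pass"]
    return {"changed": changes, "regressions": regressions, "improvements": improvements}

baseline = {"fix-bug": "pass", "add-feature": "fail", "refactor": "pass"}
candidate = {"fix-bug": "pass", "add-feature": "pass", "refactor": "fail"}
report = diff_runs(baseline, candidate)
print(report["regressions"])   # ['refactor']
print(report["improvements"])  # ['add-feature']
```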
Yes. MCP Hub lets you define exactly which tools an agent can access using pattern-based permissions, and restricted tools are invisible to the agent. The audit trail records every tool invocation, so you can verify whether the agent attempted to exceed its permitted boundaries. Network access controls restrict which external services the sandbox can reach. The Credential Gateway ensures credentials are never exposed during safety testing. Together, these primitives let you test both what an agent can do and what it cannot do under realistic production constraints.