30,000+ concurrent environments
<10ms Credential Gateway latency
50ms command execution
<20ms MCP Hub routing
THE AGENT TESTING GAP
Traditional software testing validates deterministic behavior: given input X, expect output Y. Agent testing is fundamentally different. Agents make sequences of decisions about which tools to call, what code to write, and how to recover from errors. The same agent given the same task may take a different path each time. Testing them requires observing the full execution trajectory, running enough trials for statistical confidence, and validating that agents respect credential, tool, and network boundaries -- none of which existing tools address.
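"Enough trials for statistical confidence" can be made concrete with a standard Wilson score interval over repeated runs. This is generic statistics, not part of any Runloop SDK:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an agent's pass rate.

    A non-deterministic agent's single run says little; repeated trials
    give a pass rate whose uncertainty this interval bounds.
    """
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

# 42 passes out of 50 trials: the point estimate is 84%, but the interval is wide
lo, hi = wilson_interval(42, 50)
print(f"pass rate 84% (95% CI: {lo:.0%}-{hi:.0%})")
```

The wide interval at 50 trials is exactly why single-run agent tests mislead: two versions with identical point estimates can have non-overlapping confidence intervals at higher trial counts.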
Learn how Runloop solves this
PURPOSE-BUILT AGENT EVALUATION

Isolated environments instrumented for evaluation

Every test runs in its own isolated sandbox with production-identical toolchains. The Credential Gateway injects API keys as opaque tokens bound to each sandbox. Benchmark Jobs orchestrate evaluation at scale with structured results that feed directly into CI/CD.

```python
from runloop_api_client import Runloop

runloop = Runloop()  # reads RUNLOOP_API_KEY from the environment

# Create an isolated sandbox for agent testing
devbox = runloop.devboxes.create(
    blueprint_id="agent-test-env",
    metadata={"test_suite": "pr-validation"}
)

# Run a benchmark job across scenarios
job = runloop.benchmark_jobs.create(
    name="agent-v2.1-safety-suite",
    scenarios=scenario_suite,
    config={"concurrency": 50, "timeout_seconds": 300}
)

# Stream results as scenarios complete
for result in runloop.benchmark_jobs.stream_results(job.id):
    print(f"{result.scenario}: {result.status} ({result.duration_ms}ms)")
```
npm install @runloop/api-client
```typescript
import Runloop from '@runloop/api-client';

const runloop = new Runloop(); // reads RUNLOOP_API_KEY from the environment

// Create an isolated sandbox for agent testing
const devbox = await runloop.devboxes.create({
  blueprintId: 'agent-test-env',
  metadata: { testSuite: 'pr-validation' }
});

// Run a benchmark job across scenarios
const job = await runloop.benchmarkJobs.create({
  name: 'agent-v2.1-safety-suite',
  scenarios: scenarioSuite,
  config: { concurrency: 50, timeoutSeconds: 300 }
});

// Stream results as scenarios complete
for await (const result of runloop.benchmarkJobs.streamResults(job.id)) {
  console.log(`${result.scenario}: ${result.status} (${result.durationMs}ms)`);
}
```
```bash
# Create a sandbox from a blueprint
runloop devbox create --blueprint agent-test-env

# Submit a benchmark job
runloop benchmark run \
  --name "agent-v2.1-safety-suite" \
  --scenarios ./scenario-suite.yaml \
  --concurrency 50 \
  --timeout 300

# View results
runloop benchmark results --job agent-v2.1-safety-suite --format table
```

---

Agent testing with security built in

Four infrastructure primitives that make agent evaluation safe, repeatable, and scalable.

Sandbox Isolation

Full Linux environment per test via Blueprints. Ephemeral -- no state leakage between runs. Same isolation as production devboxes.

Credential Gateway

Opaque tokens bound to each sandbox. Under 10ms latency. Agents call external services during evaluation without credential exposure.

MCP Hub

Tool-level ACL with pattern matching. Restricted tools are invisible to the agent. Under 20ms routing with full audit trail.
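The pattern-matched, invisible-by-default ACL can be sketched with Python's `fnmatch`; this is an illustration of the concept, not the Hub's actual code:

```python
from fnmatch import fnmatch

def visible_tools(all_tools: list[str], allow_patterns: list[str]) -> list[str]:
    """Filter the tool list an agent is shown. Tools matching no allow
    pattern are omitted entirely, so the agent never learns they exist."""
    return [t for t in all_tools if any(fnmatch(t, p) for p in allow_patterns)]

tools = ["fs.read", "fs.write", "git.commit", "payments.refund"]
print(visible_tools(tools, ["fs.*", "git.*"]))  # payments.refund never appears
```

Hiding restricted tools (rather than returning "permission denied") matters for testing: an agent cannot probe for, or be prompt-injected into calling, a tool it cannot see.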

Benchmark Jobs

Parallel execution with configurable concurrency and retry. BenchmarkJobDef templates for repeatable evaluations. Structured results API.
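The orchestration pattern in miniature, using only Python's standard library (a generic sketch of bounded concurrency with retry, not the platform's implementation):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_with_retry(scenario, run_fn, max_retries=2):
    """Run one scenario, retrying transient failures up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return scenario, run_fn(scenario)
        except Exception:
            if attempt == max_retries:
                return scenario, "error"

def run_suite(scenarios, run_fn, concurrency=4):
    """Fan scenarios out across a bounded worker pool and collect statuses."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(run_with_retry, s, run_fn) for s in scenarios]
        return dict(f.result() for f in as_completed(futures))
```

In the real service each `run_fn` would provision a sandbox and execute the agent; the shape of the problem, bounded parallelism plus per-scenario retry, is the same.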

From scenarios to structured results

Four steps from evaluation definition to actionable data.

01
Define Your Scenarios

Write scenario definitions with tasks, constraints, and scoring criteria. Import existing definitions via Harbor YAML or define them through the SDK.
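A scenario definition might look like the following. The field names here are illustrative of the task/constraints/scoring shape described above, not the SDK's actual schema:

```python
# Hypothetical scenario shape -- field names are illustrative, not the SDK schema.
scenario = {
    "name": "fix-failing-test",
    "task": "Make `pytest tests/test_auth.py` pass without editing the tests.",
    "constraints": {
        "allowed_tools": ["fs.*", "shell.run"],  # tool ACL patterns
        "network": "deny",                        # no outbound access
    },
    "scoring": {
        "type": "command_exit_code",
        "command": "pytest tests/test_auth.py",  # exit 0 => pass
    },
}
```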

02
Configure the Environment

Set up sandboxes via Blueprints. Attach Credential Gateway tokens and MCP Hub configurations. Define network access controls.

03
Run the Evaluation

Submit a Benchmark Job. The platform provisions sandboxes, distributes scenarios at your concurrency limit, and streams progress in real time.

04
Analyze Results

Review pass rates, duration distributions, and score breakdowns. Compare agent versions on the same scenarios side by side via API or dashboard.


No other AI agent infrastructure platform offers custom benchmark suites on proprietary code with integrated orchestration, comparative analysis, and a full security model.

What other platforms cannot offer

Three capabilities with zero equivalents across competing platforms.

Credential Safety

Other sandboxes expose API keys as env vars. Credential Gateway uses opaque, sandbox-bound tokens that cannot be decoded or reused.

Tool-Level ACL

Other platforms offer all-or-nothing tool access. MCP Hub enforces per-tool permissions and makes restricted tools invisible to the agent.

Built-In Benchmarks

Other providers offer compute but no evaluation. Runloop includes orchestration, templates, and comparative analysis as native capabilities.

F.A.Q

Agent testing questions

Common questions about testing AI agents with Runloop's evaluation infrastructure.

How is AI agent testing different from traditional software testing?
Can I integrate Runloop agent testing into my CI/CD pipeline?
How does Runloop prevent credential exposure during agent testing?
What metrics does the agent testing platform track?
Can I test agent safety and containment boundaries?
More questions? Visit our docs or send us a message