Isolated environments instrumented for evaluation
Every test runs in its own isolated sandbox with production-identical toolchains. The Credential Gateway injects API keys as opaque tokens bound to each sandbox. Benchmark Jobs orchestrate evaluation at scale with structured results that feed directly into CI/CD.
```python
import runloop

# Create an isolated sandbox for agent testing
devbox = runloop.devboxes.create(
    blueprint_id="agent-test-env",
    metadata={"test_suite": "pr-validation"},
)

# Run a benchmark job across scenarios
job = runloop.benchmark_jobs.create(
    name="agent-v2.1-safety-suite",
    scenarios=scenario_suite,
    config={"concurrency": 50, "timeout_seconds": 300},
)

# Stream results as scenarios complete
for result in runloop.benchmark_jobs.stream_results(job.id):
    print(f"{result.scenario}: {result.status} ({result.duration_ms}ms)")
```

```typescript
import Runloop from 'runloop';

// Instantiate the client
const runloop = new Runloop();

// Create an isolated sandbox for agent testing
const devbox = await runloop.devboxes.create({
  blueprintId: 'agent-test-env',
  metadata: { testSuite: 'pr-validation' }
});

// Run a benchmark job across scenarios
const job = await runloop.benchmarkJobs.create({
  name: 'agent-v2.1-safety-suite',
  scenarios: scenarioSuite,
  config: { concurrency: 50, timeoutSeconds: 300 }
});

// Stream results as scenarios complete
for await (const result of runloop.benchmarkJobs.streamResults(job.id)) {
  console.log(`${result.scenario}: ${result.status} (${result.durationMs}ms)`);
}
```

```bash
# Create a sandbox from a blueprint
runloop devbox create --blueprint agent-test-env
# Submit a benchmark job
runloop benchmark run \
  --name "agent-v2.1-safety-suite" \
  --scenarios ./scenario-suite.yaml \
  --concurrency 50 \
  --timeout 300
# View results
runloop benchmark results --job agent-v2.1-safety-suite --format table
```
---

Agent testing with security built in
Four infrastructure primitives that make agent evaluation safe, repeatable, and scalable.
Full Linux environment per test via Blueprints. Ephemeral -- no state leakage between runs. Same isolation as production devboxes.

Opaque tokens bound to each sandbox. Under 10ms latency. Agents call external services during evaluation without credential exposure.

Tool-level ACL with pattern matching. Restricted tools are invisible to the agent. Under 20ms routing with full audit trail.

Parallel execution with configurable concurrency and retry. BenchmarkJobDef templates for repeatable evaluations. Structured results API.

Run thousands of agent test scenarios in parallel
Agent behavior is stochastic. A single evaluation run tells you very little about actual capability. Statistical confidence requires volume: running the same scenario multiple times, running many different scenarios, and doing both across every agent change. Runloop's orchestration layer manages this through Benchmark Jobs -- you define scenario suites covering the capability dimensions you care about, and the platform handles parallel execution across hundreds of concurrent sandboxes with configurable concurrency, timeout enforcement, and retry logic.
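The statistical argument is easy to make concrete. This is a plain-Python sketch, not part of the Runloop SDK: the Wilson score interval around an observed pass rate narrows roughly with the square root of the trial count, which is why 10 runs and 100 runs with the same pass rate tell very different stories.

```python
import math

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a pass rate over repeated trials."""
    if trials == 0:
        return (0.0, 1.0)
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (center - half, center + half)

# Same 80% point estimate, very different certainty:
print(wilson_interval(8, 10))    # wide interval
print(wilson_interval(80, 100))  # much tighter interval
```

This is the reason scenario suites are typically run with multiple trials per scenario rather than a single pass.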

Validate what agents cannot do, not just what they can
Agent testing is not only about capability -- it is about containment. Before deploying an agent to production, you need to know that it respects the boundaries you have set. Does it stay within its permitted tool set? Does it attempt to access services it should not? Runloop's security infrastructure doubles as testing infrastructure for these questions. MCP Hub's tool-level access control lets you define exactly which tools an agent can see and call, and the audit trail records every invocation and every boundary the agent attempted to cross. The Credential Gateway ensures that even during safety testing, real credentials never exist on the sandbox.
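Downstream of the audit trail, containment checking reduces to a set-membership pass over recorded invocations. A minimal sketch, assuming the trail can be exported as a list of records (the record shape and tool names here are hypothetical, not the actual MCP Hub schema):

```python
# Hypothetical audit record shape: {"tool": str, ...}
ALLOWED_TOOLS = {"read_file", "run_tests", "git_diff"}

def boundary_violations(audit_records: list[dict]) -> list[str]:
    """Return the tools the agent attempted to call outside its permitted set."""
    return [r["tool"] for r in audit_records if r["tool"] not in ALLOWED_TOOLS]

records = [
    {"tool": "read_file"},
    {"tool": "send_email"},  # attempted boundary crossing
    {"tool": "run_tests"},
]
print(boundary_violations(records))  # ['send_email']
```

An empty violation list across many trials is the containment signal you are looking for; a nonempty one is a finding regardless of whether the call succeeded.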

From scenarios to structured results
Four steps from evaluation definition to actionable data.
Write scenario definitions with tasks, constraints, and scoring criteria. Import existing definitions via Harbor YAML or define them through the SDK.

Set up sandboxes via Blueprints. Attach Credential Gateway tokens and MCP Hub configurations. Define network access controls.

Submit a Benchmark Job. The platform provisions sandboxes, distributes scenarios at your concurrency limit, and streams progress in real time.

Review pass rates, duration distributions, and score breakdowns. Compare agent versions on the same scenarios side by side via API or dashboard.

What other platforms cannot offer
Three capabilities without direct equivalents on competing platforms.
Other sandboxes expose API keys as env vars. Credential Gateway uses opaque, sandbox-bound tokens that cannot be decoded or reused.

Other platforms offer all-or-nothing tool access. MCP Hub enforces per-tool permissions and makes restricted tools invisible to the agent.

Other providers offer compute but no evaluation. Runloop includes orchestration, templates, and comparative analysis as native capabilities.

Agent testing questions
Common questions about testing AI agents with Runloop's evaluation infrastructure.
Traditional testing validates deterministic input-output behavior. Agent testing evaluates stochastic, multi-step decision-making where the same agent may take different paths to correct solutions. This requires observing full execution trajectories -- which tools were invoked, in what order, with what arguments, and how the agent adapted when something went wrong. It requires running enough trials to distinguish genuine capability from statistical noise. And it requires testing security boundaries: whether the agent respects credential access rules, tool permissions, and network reach. Runloop provides infrastructure for all three dimensions: isolated sandbox environments for safe execution, Benchmark Jobs for scale, and Credential Gateway plus MCP Hub for security boundary validation.
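A trajectory here is just the ordered sequence of tool calls with their arguments and outcomes. As an illustrative sketch (the step dictionaries are hypothetical stand-ins, not a Runloop API), two of the questions above become small functions over that sequence:

```python
def tool_sequence(trajectory: list[dict]) -> list[str]:
    """Flatten a trajectory into the ordered list of tools invoked."""
    return [step["tool"] for step in trajectory]

def adapted_after_error(trajectory: list[dict]) -> bool:
    """Did the agent make at least one more call after its first failed step?"""
    for i, step in enumerate(trajectory):
        if step.get("error"):
            return i < len(trajectory) - 1
    return False

run = [
    {"tool": "read_file", "args": {"path": "app.py"}},
    {"tool": "run_tests", "error": "2 tests failed"},
    {"tool": "edit_file", "args": {"path": "app.py"}},  # adapted after the failure
    {"tool": "run_tests"},
]
print(tool_sequence(run))
print(adapted_after_error(run))  # True
```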
Yes. Benchmark Jobs can be triggered via the Runloop API or CLI from any CI/CD system. The typical integration point is a pull request check: when a PR modifies agent code, prompt templates, or dependency versions, a benchmark job runs the regression suite and posts structured results back as a status check. Merges can be gated on performance thresholds you define. BenchmarkJobDef templates make this repeatable without reconfiguring the evaluation each time. For the full CI/CD integration pattern, see the [Regression Testing](/solutions/regression-testing) solution page.
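The gating step itself is a small piece of glue. A hedged sketch, with an illustrative result shape (a real integration would read these from the Benchmark Jobs results API): compute the pass rate and return a nonzero exit code when it falls below the threshold, so the CI system fails the check.

```python
def gate(results: list[dict], min_pass_rate: float = 0.9) -> int:
    """Return a process exit code: 0 if the suite meets the threshold, 1 otherwise."""
    passed = sum(1 for r in results if r["status"] == "pass")
    rate = passed / len(results) if results else 0.0
    print(f"pass rate: {rate:.1%} (threshold {min_pass_rate:.0%})")
    return 0 if rate >= min_pass_rate else 1

results = [{"status": "pass"}] * 18 + [{"status": "fail"}] * 2
exit_code = gate(results)  # 18/20 = 90%, meets the default threshold
```

In a CI script this would end with `sys.exit(gate(results))` so the status check reflects the outcome.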
The Credential Gateway replaces real API keys with opaque tokens that are bound to a specific sandbox and expire when that sandbox terminates. The agent uses these tokens exactly as it would use real credentials -- same SDK calls, same API patterns -- but the actual credentials are injected server-side by the gateway and never exist on the sandbox. Even if an attacker extracts a token through prompt injection, it cannot be decoded or reused outside the originating environment. The gateway supports bearer, header, basic, and query parameter authentication types with less than 10ms of added latency. For the complete security architecture, see the [Security and Compliance](/security) page.
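The four authentication types map to four standard request shapes. A plain-Python sketch of the substitution the gateway performs server-side (illustrative only; in practice this step never runs inside the sandbox, which is the point):

```python
import base64

def apply_auth(auth_type: str, credential: str, headers: dict, params: dict,
               header_name: str = "X-Api-Key", param_name: str = "api_key") -> None:
    """Attach a credential to a request using one of the four supported auth shapes."""
    if auth_type == "bearer":
        headers["Authorization"] = f"Bearer {credential}"
    elif auth_type == "header":
        headers[header_name] = credential
    elif auth_type == "basic":
        # Basic auth is the base64-encoded "user:password" pair
        encoded = base64.b64encode(credential.encode()).decode()
        headers["Authorization"] = f"Basic {encoded}"
    elif auth_type == "query":
        params[param_name] = credential
    else:
        raise ValueError(f"unknown auth type: {auth_type}")

headers, params = {}, {}
apply_auth("bearer", "real-key-123", headers, params)
print(headers["Authorization"])  # Bearer real-key-123
```

The header and query-parameter names are hypothetical defaults; real services define their own.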
The platform currently tracks pass rate, average duration, and score distribution at the job, run, and individual scenario levels. Comparative analysis shows where a candidate diverges from a baseline: which scenarios changed status, how aggregate metrics shifted, and trajectory differences between agent versions. Token usage, cost per task, and tool call patterns are planned metrics that will ship with the upcoming LLM Proxy integration.
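The comparative piece can be sketched as a per-scenario diff between two runs (the result dicts here are hypothetical stand-ins for the results API payload, keyed by scenario name):

```python
def diff_runs(baseline: dict, candidate: dict) -> dict:
    """Per-scenario status changes between a baseline run and a candidate run."""
    changes = {
        s: (baseline[s], candidate[s])
        for s in baseline
        if s in candidate and baseline[s] != candidate[s]
    }
    regressions = [s for s, (old, new) in changes.items() if old == "pass" and new == "fail"]
    improvements = [s for s, (old, new) in changes.items() if old == "fail" and new == "pass"]
    return {"changed": changes, "regressions": regressions, "improvements": improvements}

baseline = {"fix-bug": "pass", "add-feature": "fail", "refactor": "pass"}
candidate = {"fix-bug": "pass", "add-feature": "pass", "refactor": "fail"}
report = diff_runs(baseline, candidate)
print(report["regressions"])   # ['refactor']
print(report["improvements"])  # ['add-feature']
```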
Yes. MCP Hub lets you define exactly which tools an agent can access using pattern-based permissions, and restricted tools are invisible to the agent. The audit trail records every tool invocation, so you can verify whether the agent attempted to exceed its permitted boundaries. Network access controls restrict which external services the sandbox can reach. The Credential Gateway ensures credentials are never exposed during safety testing. Together, these primitives let you test both what an agent can do and what it cannot do under realistic production constraints.