Thousands of sandboxes, orchestrated for training signal generation
Runloop provisions isolated sandboxes at the scale training loops demand -- fast enough that environment provisioning does not bottleneck the experiment, isolated enough that concurrent executions do not interfere, and instrumented enough that every execution produces clean scoring signals. Snapshots enable efficient environment reuse across training steps.
```python
import runloop
# Configure execution environments for RFT training loop
blueprint = runloop.blueprints.get("python-ml-training-env")
# Submit scenario batch for reward signal generation
job = runloop.benchmark_jobs.create(
    name="rft-epoch-42-reward-signals",
    scenarios=candidate_solutions,
    blueprint_id=blueprint.id,
    config={
        "concurrency": 200,
        "timeout_seconds": 120,
        "retry_attempts": 1,
    },
)
# Collect scoring signals for training loop
results = runloop.benchmark_jobs.wait(job.id)
reward_signals = [(r.scenario_id, r.score) for r in results.runs]
```
```typescript
import Runloop from 'runloop';

const runloop = new Runloop(); // instantiate the API client
// Configure execution environments for RFT training loop
const blueprint = await runloop.blueprints.get('python-ml-training-env');
// Submit scenario batch for reward signal generation
const job = await runloop.benchmarkJobs.create({
  name: 'rft-epoch-42-reward-signals',
  scenarios: candidateSolutions,
  blueprintId: blueprint.id,
  config: {
    concurrency: 200,
    timeoutSeconds: 120,
    retryAttempts: 1
  }
});
// Collect scoring signals for training loop
const results = await runloop.benchmarkJobs.wait(job.id);
const rewardSignals = results.runs.map(r => [r.scenarioId, r.score]);
```

```bash
# Submit scenario batch for reward signal generation
runloop benchmark run \
  --name "rft-epoch-42-reward-signals" \
  --scenarios ./candidate-solutions.yaml \
  --blueprint "python-ml-training-env" \
  --concurrency 200 \
  --timeout 120 \
  --retry-attempts 1
# Collect scoring signals
runloop benchmark results --job rft-epoch-42-reward-signals --format json
```

Infrastructure primitives for RFT and SFT at scale
Four capabilities from the [Runloop Platform](/product) that eliminate the gap between training framework and execution environment.
Define execution environments as code. Language runtimes, dependencies, and toolchains specified declaratively. Same Blueprints work for training, evaluation, and production.

Define scoring criteria per scenario: pass/fail, numeric scores, or custom evaluation functions. Structured results API returns clean signal for every execution.
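As a concrete illustration of a per-scenario scoring contract, here is a minimal Python sketch -- the field names and the pass-fraction rule are assumptions for illustration, not the Runloop results schema:

```python
# Hypothetical scoring contract: map a raw execution result to a bounded
# numeric score. Field names are illustrative, not the Runloop API.
from dataclasses import dataclass

@dataclass
class ExecutionResult:
    exit_code: int      # process exit status from the sandbox
    tests_passed: int   # test cases that succeeded
    tests_total: int    # test cases executed

def score(result: ExecutionResult) -> float:
    """Return 0.0 on hard failure, else the fraction of tests passed."""
    if result.exit_code != 0 or result.tests_total == 0:
        return 0.0
    return result.tests_passed / result.tests_total
```

The same shape generalizes to pass/fail (score is 0 or 1) or to a custom evaluation function that inspects sandbox output.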

Configurable concurrency limits match your training throughput. Retry policies handle transient failures without corrupting signal. Submit a BenchmarkJobDef and the platform manages the fleet.
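The retry semantics matter: transient infrastructure failures should be retried, while genuine scenario failures must be scored as-is so they do not corrupt the signal. A minimal sketch of that policy, with all names hypothetical:

```python
# Sketch of the retry semantics described above: transient infrastructure
# errors are retried; genuine scenario failures propagate normally so the
# reward signal is not corrupted. All names are illustrative.
class TransientError(Exception):
    """Infrastructure hiccup (e.g. provisioning timeout), safe to retry."""

def run_with_retries(execute, retry_attempts: int = 1):
    """Call `execute()` with up to `retry_attempts` retries on transient errors."""
    for attempt in range(retry_attempts + 1):
        try:
            return execute()
        except TransientError:
            if attempt == retry_attempts:
                raise  # exhausted retries; surface the infrastructure failure
```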

Opaque tokens injected into sandboxes. Real credentials never exposed. Devbox-bound tokens expire on termination. Secure access to private repos during training.

Environment-grounded reward signals at training scale
RFT uses execution outcomes as reward signals. The training loop generates candidate solutions, executes them in sandboxed environments, scores them against defined criteria, and feeds that signal back to the model. The infrastructure requirements are demanding: thousands of scenarios per epoch, each requiring a clean, isolated environment. Execution must be fast -- a training loop that waits minutes for each sandbox spends its wall-clock budget idling instead of learning. Scoring must be reliable -- noisy signals from flaky environments corrupt the training process.
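The loop described above can be sketched in a few lines, with the execute-and-score step standing in for submission to sandboxes; none of these names are the Runloop API:

```python
# Minimal sketch of the RFT outer-loop shape. `execute_and_score` stands in
# for sandboxed execution plus scoring; `update` stands in for the gradient
# step. All names are illustrative, not a real training framework.
def rft_epoch(generate, execute_and_score, update, tasks):
    """One epoch: sample a candidate per task, score it by execution,
    then apply one weight update over the collected batch."""
    batch = []
    for task in tasks:
        candidate = generate(task)                    # model proposes a solution
        reward = execute_and_score(task, candidate)   # sandboxed execution -> scalar
        batch.append((task, candidate, reward))
    update(batch)                                     # e.g. a policy-gradient step
    return batch
```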

Generate and validate SFT training data through execution
SFT uses curated input-output pairs to teach a model specific behaviors. For code tasks, generating high-quality training data often requires execution -- running candidate solutions to verify they actually work before including them in the training set. Runloop supports SFT workflows by providing the execution environments needed to validate training examples at scale. Generate candidates with a base model, execute in sandboxes to verify correctness, and filter for high-quality examples.
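The generate-execute-filter flow can be sketched as follows; `generate` and `execute` are stand-ins for a base-model call and a sandboxed run, not Runloop APIs:

```python
# Sketch of the SFT data-validation flow described above: generate
# candidates, execute them, and keep only verified pairs. `generate` and
# `execute` are hypothetical stand-ins.
def build_sft_dataset(prompts, generate, execute, n_candidates=4):
    """Keep (prompt, solution) pairs whose solution passes execution."""
    dataset = []
    for prompt in prompts:
        for _ in range(n_candidates):
            solution = generate(prompt)
            if execute(solution):              # True iff the solution verifies
                dataset.append((prompt, solution))
                break                          # one verified example per prompt
    return dataset
```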

From candidate solutions to training signals
Four steps from evaluation design to reward signal collection.
Encode training tasks as scenario definitions with candidate solution, runtime environment (via Blueprint), and scoring contract. Same format for both RFT reward generation and SFT data validation.

Set concurrency, retry policy, and timeouts to match training loop throughput. Use a BenchmarkJobDef template for standardized runs or submit ad-hoc for experimental iterations.

Submit via API. The platform schedules across isolated sandboxes, manages the lifecycle, and streams progress. Each scenario runs in its own environment -- no shared state, no cross-contamination.

Retrieve structured results for every scenario: identifier, score, duration, pass/fail status. Feed signals directly into your training loop for weight updates.
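One common way to turn the structured results into training-ready rewards is to mean-center scores within each scenario group before the weight update -- a baseline choice made on the training-loop side, not something Runloop prescribes. A minimal sketch:

```python
# Sketch of post-processing structured results into rewards: mean-center
# scores within each scenario group so the update compares candidates for
# the same task. The (scenario_id, score) shape mirrors the examples above.
from collections import defaultdict

def centered_rewards(runs):
    """runs: iterable of (scenario_id, score) pairs.
    Returns (scenario_id, score - group_mean) in the original order."""
    groups = defaultdict(list)
    for scenario_id, score in runs:
        groups[scenario_id].append(score)
    return [
        (scenario_id, score - sum(groups[scenario_id]) / len(groups[scenario_id]))
        for scenario_id, score in runs
    ]
```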

The execution substrate that fine-tuning requires
Three capabilities no alternative provides
Standard container solutions require you to build scheduling, concurrency control, retry handling, and result aggregation from scratch. Runloop provides this as infrastructure via the benchmark job orchestration layer.

Other sandboxes provide execution but not scoring infrastructure. Runloop combines execution with scoring contracts, structured results, and scenario definitions. Reward signals come from the platform.

Credential Gateway prevents candidate solutions from exfiltrating API keys. For regulated environments, VPC deployment keeps training data inside your boundary.

Fine-tuning infrastructure questions
Common questions about using Runloop as the execution layer for RFT, SFT, and RL experiments.
Runloop is an execution environment, not a training framework. It does not manage model weights, compute gradients, or run training loops. Runloop provides the infrastructure layer that training frameworks call when they need to execute and score code. Your training loop generates candidate solutions, submits them to Runloop for execution and scoring via the API, and receives structured results back. This separation means Runloop works alongside any training framework: custom PyTorch training loops, provider-hosted fine-tuning programs, or proprietary research infrastructure.
Runloop provisions isolated sandboxes quickly, and snapshot-and-restore further reduces setup time for environments with large dependency trees. For RFT workloads running thousands of scenarios per epoch, the orchestration layer manages a pool of concurrent environments so that provisioning latency does not become the bottleneck in the training loop. Concurrency is configurable per job to match your training throughput requirements.
Runloop's fine-tuning infrastructure focuses on the execution and scoring side of the training loop -- running candidate code solutions in isolated environments and returning reward signals. The GPU-intensive portion (forward pass, gradient computation, weight updates) runs on your existing training infrastructure. Runloop handles the part that GPU clusters cannot: provisioning thousands of isolated code execution environments, running and scoring candidate solutions, and returning structured signals to your training loop.
Runloop is designed to serve as the execution substrate for provider fine-tuning programs that require environment-grounded evaluation. When a provider's RFT program needs to execute candidate solutions and collect scoring signals, Runloop provides the sandboxed environments where that execution happens. The platform's scoring contracts, scenario definitions, and structured results API produce the signal format these programs consume. Contact sales to discuss integration with specific provider programs.
Fine-tuning on proprietary code requires that training data, candidate solutions, and scoring results stay within your control boundary. Runloop executes each scenario in an ephemeral sandbox destroyed after execution. The Credential Gateway injects repository access tokens as opaque, devbox-bound credentials. For organizations with strict data governance requirements, Runloop supports deployment inside your own cloud infrastructure.