How Trajectory ran 10,000 concurrent devboxes on Runloop, and what we shipped to support them
Tony Deng
Developer Relations
AI Ecosystem

Trajectory, the continual learning platform, regularly runs bursts of 10,000+ concurrent devboxes on Runloop for training and fine-tuning models. Their workload looked nothing like a demo: many concurrent sessions, long-poll control loops, and older Ubuntu blueprints with pinned dependency graphs on top. It was a new usage pattern for us at this scale, and within a few runs, things that worked smoothly at smaller volumes started behaving differently. Simple operations timed out. Builds hung. Setup scripts exited on devboxes that had booted cleanly.

None of it was the infrastructure. All of it was defaults that hadn't been shaped by a workload like Trajectory's yet.

This is the story of what surfaced, what we shipped, and why it matters if you're running agentic or evaluation workloads on anyone's infrastructure.

The SDK was the first thing to evolve

The first signal was slowness that backend health didn't explain. Operations were timing out even when the control plane looked fine.

The cause was in the SDK, not the platform. HTTP/1.1 connection pooling combined with long-poll semantics and high concurrency creates head-of-line blocking: a slow response on one connection holds up unrelated requests waiting behind it, and the connection pool quietly becomes the throttle. From Trajectory's harness it looked like the platform was flaky. From our side it looked healthy.

The fix shipped in SDK 1.20.2 as HTTP/2 multiplexing. More work shares fewer connections, slow responses stop blocking unrelated requests, and the tail behavior smooths out. The interesting takeaway isn't a better p50. It's that transport defaults matter a lot when a well-behaved client is operating at scale, and shaping those defaults around real workloads is part of the product.
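To make the head-of-line effect concrete, here is a toy simulation. It is not the Runloop SDK: it assumes a pool of HTTP/1.1-style connections that each carry one request at a time, and compares that with an idealized multiplexed model in which no request waits behind another. The function names, pool size, and durations are all illustrative.

```python
import heapq


def pooled_latencies(durations, pool_size):
    """HTTP/1.1-style model: each connection carries one request at a time.

    All requests arrive at t=0 and are dispatched in order to whichever
    connection frees up first. Returns the completion time of each request.
    """
    free_at = [0.0] * pool_size  # min-heap: when each connection becomes idle
    heapq.heapify(free_at)
    finished = []
    for duration in durations:
        start = heapq.heappop(free_at)  # wait for the earliest idle connection
        end = start + duration
        heapq.heappush(free_at, end)
        finished.append(end)
    return finished


def multiplexed_latencies(durations):
    """Idealized HTTP/2-style model: every request gets its own stream,
    so a slow response never delays an unrelated request."""
    return list(durations)


# Four 30-second long-polls, then four 0.1-second calls, over a pool of four.
durations = [30.0] * 4 + [0.1] * 4
pooled = pooled_latencies(durations, pool_size=4)
muxed = multiplexed_latencies(durations)

print("short calls, pooled:     ", pooled[4:])
print("short calls, multiplexed:", muxed[4:])
```

In this toy run the four short calls finish at roughly 30.1 seconds when they queue behind the long-polls, and at 0.1 seconds when multiplexed. The backend did the same amount of work in both cases, which is why the control plane can look healthy while the client sees timeouts.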

Blueprints, image variance, and the dependency stack

The transport story was the sharpest example of defaults evolving with usage, but it wasn't the only one. Trajectory's blueprints were built on Ubuntu base images from 2018 and 2020. That's normal for evaluation work: when you care about reproducibility, you pin everything, including the OS. It's also where the most interesting work happened.

The sudoers story. Our base image setup adds an include line to /etc/sudoers. It worked smoothly on the Ubuntu versions in our test matrix. On a 2018 base, the file layout was different enough that the include needed adjusting, and on Trajectory's specific blueprint it produced a syntax error on line 31.

The setup script exited on the first sudo. From outside, this looked like a devbox that booted cleanly and then stopped for no reason. The runtime classified it as an internal error, which made it look more platform-side than it really was. The root cause was a file layout that has been stable for years on current images but differs on a 2018 one. We've since expanded our test matrix to cover the full range of Ubuntu versions our customers actually run, which is a meaningful upgrade to how we ship base image changes.
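One general way to guard against this class of failure is to validate a sudoers change before it goes live. The sketch below is an assumption about how such a guard could look, not Runloop's actual setup code: it appends an include line to a copy of the file, checks the copy with `visudo -c -f`, and swaps it in only if the check passes. The helper names and the example include line are hypothetical; the source does not say which line was added.

```python
import os
import subprocess
import tempfile


def visudo_check(path):
    """Return True if `visudo -c -f PATH` accepts the file (needs sudo's visudo)."""
    result = subprocess.run(["visudo", "-c", "-f", path], capture_output=True)
    return result.returncode == 0


def add_include_safely(sudoers_path, include_line, validate=visudo_check):
    """Append include_line only if the resulting file validates.

    Returns "unchanged" (line already present), "updated", or "rejected".
    """
    with open(sudoers_path) as f:
        original = f.read()
    if include_line in original.splitlines():
        return "unchanged"  # idempotent: safe to run on every boot

    candidate = original
    if candidate and not candidate.endswith("\n"):
        candidate += "\n"
    candidate += include_line + "\n"

    # Write the candidate next to the target so os.replace stays on one filesystem.
    target_dir = os.path.dirname(os.path.abspath(sudoers_path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(candidate)
        if not validate(tmp_path):
            return "rejected"  # the live file is never touched
        os.chmod(tmp_path, 0o440)  # conventional sudoers permissions
        os.replace(tmp_path, sudoers_path)
        tmp_path = None
        return "updated"
    finally:
        if tmp_path and os.path.exists(tmp_path):
            os.remove(tmp_path)
```

A call such as `add_include_safely("/etc/sudoers", "#includedir /etc/sudoers.d")` would then report "rejected" on an image where the line does not parse, instead of leaving a file that breaks the first sudo.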

The lesson is that "we support Ubuntu" is a starting point. "We support these specific Ubuntu versions and we know what's different about each one" is the contract our customers actually need, and it's the contract we now run against.
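A version-specific contract can also be checked mechanically. The following is a minimal sketch, assuming the platform reads `/etc/os-release` from a base image and compares it against a list of tested releases. The `SUPPORTED_UBUNTU` set is a made-up example, not Runloop's actual matrix.

```python
# Hypothetical test matrix; the real list of supported releases is not given in this post.
SUPPORTED_UBUNTU = {"18.04", "20.04", "22.04", "24.04"}


def parse_os_release(text):
    """Parse KEY=VALUE lines from /etc/os-release into a dict, stripping quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def check_base_image(os_release_text, supported=SUPPORTED_UBUNTU):
    """Return (ok, message) describing whether the image is in the tested matrix."""
    fields = parse_os_release(os_release_text)
    distro, version = fields.get("ID"), fields.get("VERSION_ID")
    if distro != "ubuntu":
        return False, f"unsupported distro: {distro}"
    if version not in supported:
        return False, f"ubuntu {version} is not in the tested matrix"
    return True, f"ubuntu {version} is in the tested matrix"


sample = 'NAME="Ubuntu"\nID=ubuntu\nVERSION_ID="18.04"\n'
print(check_base_image(sample))
```

Running a check like this at blueprint build time turns "we support Ubuntu" into an explicit yes or no per image, before any setup script runs.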

The npm mirror story. Pinned dependency graphs in evaluation environments need more than the public npm registry can reliably provide: rate limits, WAN variability, and cost all push toward a mirror. npm makes this interesting because its lockfiles record the full resolved URL of each artifact alongside the version and integrity hash, rather than the version and hash alone. Those URLs become the source of truth.
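To see which hosts a lockfile actually pins, a short script can tally the `resolved` URLs. This sketch assumes a `package-lock.json` in the lockfileVersion 2 or 3 layout, where entries live under a top-level `packages` map; the package names in the sample are placeholders.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def resolved_hosts(lockfile_text):
    """Count the hosts named in the `resolved` URLs of a package-lock.json."""
    lock = json.loads(lockfile_text)
    hosts = Counter()
    for entry in lock.get("packages", {}).values():
        url = entry.get("resolved")
        # Skip the root entry and non-registry sources such as git or file paths.
        if url and url.startswith(("http://", "https://")):
            hosts[urlsplit(url).hostname] += 1
    return hosts


# Placeholder lockfile: package names and versions are illustrative only.
sample_lock = json.dumps({
    "lockfileVersion": 3,
    "packages": {
        "": {"name": "example-app", "version": "1.0.0"},
        "node_modules/example-a": {
            "version": "1.2.3",
            "resolved": "https://registry.npmjs.org/example-a/-/example-a-1.2.3.tgz",
        },
        "node_modules/example-b": {
            "version": "4.5.6",
            "resolved": "https://registry.npmjs.org/example-b/-/example-b-4.5.6.tgz",
        },
    },
})
print(resolved_hosts(sample_lock))
```

Every host in that tally is one the install will contact directly unless something redirects it.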

So the mirror has to rewrite manifest URLs to point at itself, and the client has to trust those rewrites. That means terminating TLS, injecting certs, and configuring DNS so the client treats the mirror as the registry. For public packages this looks like a transparent proxy serving binary blobs directly. For private packages it's a more conventional HTTPS proxy. The architecture takes some real engineering, because npm's design forces it to.
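The URL-rewriting step on its own is small. Below is a minimal sketch, assuming a registry manifest in the usual npm shape (a `versions` map whose entries carry `dist.tarball`). It covers only the rewrite, not the TLS termination, certificate injection, or DNS configuration described above. The function name and mirror hostname are hypothetical.

```python
import copy
from urllib.parse import urlsplit, urlunsplit


def rewrite_tarball_urls(manifest, mirror_base):
    """Return a copy of a registry manifest whose dist.tarball URLs point at
    mirror_base. Paths and integrity hashes are left untouched."""
    mirror = urlsplit(mirror_base)
    rewritten = copy.deepcopy(manifest)  # never mutate the upstream response
    for meta in rewritten.get("versions", {}).values():
        dist = meta.get("dist", {})
        tarball = dist.get("tarball")
        if not tarball:
            continue
        parts = urlsplit(tarball)
        dist["tarball"] = urlunsplit(
            (mirror.scheme, mirror.netloc, parts.path, parts.query, parts.fragment)
        )
    return rewritten


# Placeholder manifest and mirror host, for illustration only.
upstream = {
    "name": "example-pkg",
    "versions": {
        "1.0.0": {
            "dist": {
                "tarball": "https://registry.npmjs.org/example-pkg/-/example-pkg-1.0.0.tgz",
                "integrity": "sha512-PLACEHOLDER",
            }
        }
    },
}
mirrored = rewrite_tarball_urls(upstream, "https://npm-mirror.internal.example")
print(mirrored["versions"]["1.0.0"]["dist"]["tarball"])
```

Because the integrity hash is preserved, the client can still verify that the bytes served by the mirror match what the upstream registry published.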

What the mirror buys you isn't speed. Most install time is client-side unpacking and signature checking, which a mirror can't help with. What it buys you is reliability under load. The user doesn't see "the mirror made things faster." The user sees a platform that holds steady when 200 devboxes try to install the same lockfile at once.
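One common technique for holding steady under that kind of burst is request coalescing, sometimes called single-flight: concurrent requests for the same artifact share one upstream fetch. The post does not say the mirror works this way, so treat the following as an illustrative sketch; `SingleFlightCache` and `fake_upstream` are made-up names.

```python
import asyncio


class SingleFlightCache:
    """Coalesce concurrent requests for the same key into one upstream fetch."""

    def __init__(self, fetch):
        self._fetch = fetch   # async callable: key -> bytes
        self._cache = {}      # completed results
        self._inflight = {}   # key -> Task shared by concurrent callers

    async def get(self, key):
        if key in self._cache:
            return self._cache[key]
        task = self._inflight.get(key)
        if task is None:
            task = asyncio.ensure_future(self._fetch_and_store(key))
            self._inflight[key] = task
        return await task

    async def _fetch_and_store(self, key):
        try:
            value = await self._fetch(key)
            self._cache[key] = value  # only successful fetches are cached
            return value
        finally:
            self._inflight.pop(key, None)


async def demo():
    calls = 0

    async def fake_upstream(key):
        # Stand-in for the public registry; counts how often it is hit.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return b"tarball-bytes-for-" + key.encode()

    cache = SingleFlightCache(fake_upstream)
    results = await asyncio.gather(
        *(cache.get("example-pkg-1.0.0.tgz") for _ in range(200))
    )
    return calls, len(results)


upstream_calls, served = asyncio.run(demo())
print(upstream_calls, served)
```

Here 200 simulated devboxes are served from a single upstream request, which is the kind of behavior that shows up as stability, not as a faster individual install.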

The registry timeout story. Tying these together, during the early runs we saw some blueprint builds fail with errors like:

    ERROR: failed to do request:
    Head "https://<registry>/v2/library/ubuntu/manifests/latest"
    dial tcp <registry>: i/o timeout

From the user's side: a build was slow, or hanging. From our side: a container registry mirror behaving unpredictably under concurrent pulls. Same shape as the npm case. A piece of infrastructure that exists to make dependency fetching reliable, learning a new load pattern and needing to adapt to it.

For an evaluation platform, the dependency stack is the workload. Not the compute, not the orchestration. Telling a customer where the time went (DNS, registry, mirror, unpack, signature check, disk) and which layer needs attention is part of what a platform owes you. That observability is what we keep investing in.
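A per-phase timing breakdown does not need much machinery. Here is a minimal sketch, assuming each stage of a build or install can be wrapped in a context manager. `PhaseTimer` and the phase names are illustrative and not part of any Runloop API.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock time per named phase."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the timer can be tested deterministically
        self.durations = {}

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the phase raised, so failures stay visible.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self):
        """Name of the phase with the most accumulated time, or None."""
        if not self.durations:
            return None
        return max(self.durations, key=self.durations.get)


timer = PhaseTimer()
with timer.phase("dns"):
    time.sleep(0.001)   # stand-in for a DNS lookup
with timer.phase("registry"):
    time.sleep(0.02)    # stand-in for a manifest fetch
with timer.phase("unpack"):
    time.sleep(0.005)   # stand-in for extracting layers or tarballs

print(timer.slowest(), timer.durations)
```

With real work in place of the sleeps, the output answers the question "where did the time go" per layer instead of as a single opaque build duration.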

Where these problems actually live

The fastest way to mis-debug a benchmark failure is to assume it's infrastructure. Most of these signals cross layers. A transport issue can look like scheduler slowness, a registry stall can look like compute, an image quirk can look like an SDK flake. A platform that wants to earn trust at scale has to help customers localize across those boundaries, not just within them.
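Even a crude first pass at localization helps. The sketch below maps error text to the layer it most likely came from, using a few heuristic patterns. The rules and layer names are assumptions for illustration, not the classification Runloop runs, and real error strings vary by client and version.

```python
import re

# Ordered heuristics: the first matching rule wins. DNS is checked before the
# generic connect rule because DNS failures can also mention "dial tcp".
RULES = [
    ("dns", re.compile(r"no such host|\blookup\b")),
    ("tls", re.compile(r"x509:|tls:")),
    ("connect", re.compile(r"dial tcp.*(i/o timeout|connection refused)")),
    ("image-config", re.compile(r"sudoers.*syntax error|syntax error.*sudoers")),
]


def classify(error_text):
    """Return the name of the layer an error most likely belongs to."""
    for layer, pattern in RULES:
        if pattern.search(error_text):
            return layer
    return "unclassified"


print(classify("dial tcp <registry>: i/o timeout"))
print(classify("dial tcp: lookup registry.example: no such host"))
```

A label like "connect" or "dns" does not fix anything by itself, but it tells the person debugging which boundary to look at first.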

What this means

Most teams can build the happy path for agent infrastructure quickly. What takes longer is everything after: lifecycle semantics that hold up under retries, timeouts that don't amplify failures, blueprint behavior across years of Ubuntu releases, mirror architectures for package managers that hard-code URLs into lockfiles, and observability that turns variability into actionable signal.
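As one example of a timeout policy that avoids amplifying failures, here is a sketch of a retry helper with capped exponential backoff, full jitter, and an overall deadline. It is a generic pattern, not the behavior of any particular SDK. The function name and default values are assumptions chosen for illustration.

```python
import random
import time


def call_with_retries(op, *, attempts=5, base_delay=0.5, max_delay=8.0,
                      deadline=30.0, sleep=time.sleep, clock=time.monotonic,
                      rng=random.random):
    """Call op(), retrying on TimeoutError or ConnectionError.

    Waits use capped exponential backoff with full jitter, so many clients
    retrying at once spread out instead of hammering the backend in lockstep.
    Gives up after `attempts` tries or when the next wait would cross `deadline`.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    start = clock()
    for attempt in range(attempts):
        try:
            return op()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = rng() * min(max_delay, base_delay * 2 ** attempt)
            if clock() - start + delay > deadline:
                raise  # the next wait would exceed the overall budget
            sleep(delay)
```

The `sleep`, `clock`, and `rng` parameters exist so the policy can be tested without real waiting. In production code the defaults would be left as they are.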

Agents accelerate every part of this. They retry, they parallelize, they loop. If a devbox platform sits inside an agent loop or an evaluation harness, reliability becomes a throughput feature. It determines how fast you can iterate on the thing you actually care about, which is rarely the sandbox.

Runloop exists so that someone else owns this work. The Trajectory case showed us where defaults (transport, base images, mirrors, classification) matter most at scale, and shaped what we've shipped since. If you're running similar workloads and seeing similar patterns, that's not a sign your harness is wrong. It's a sign you're pushing the platform hard enough for us to learn alongside you. We'd love to hear from you!