[Image: Runloop dashboard showing the Public Benchmarks page, with a sidebar menu on the left and several green progress cards in the main area.]

Benchmarks

Benchmarks are structured evaluations made up of scenarios (individual test cases) that measure how well an AI agent performs on given tasks.
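To make the structure concrete, here is a minimal sketch of a benchmark as a collection of scenarios with aggregated scoring. The names (`Scenario`, `Benchmark`, `score`) are illustrative assumptions for this sketch, not the Runloop SDK.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a benchmark is a set of scenarios (individual
# test cases), each contributing to an overall performance score.
@dataclass
class Scenario:
    name: str
    prompt: str            # the task given to the agent
    max_score: float = 1.0

@dataclass
class Benchmark:
    name: str
    scenarios: list[Scenario] = field(default_factory=list)

    def score(self, results: dict[str, float]) -> float:
        """Aggregate per-scenario scores into one normalized benchmark score."""
        total = sum(s.max_score for s in self.scenarios)
        earned = sum(results.get(s.name, 0.0) for s in self.scenarios)
        return earned / total if total else 0.0

bench = Benchmark("bug-fixing", [
    Scenario("fix-null-deref", "Fix the crash in parser.c"),
    Scenario("add-tests", "Add unit tests for utils.py"),
])
print(bench.score({"fix-null-deref": 1.0, "add-tests": 0.5}))  # 0.75
```

Normalizing against the total possible score keeps results comparable across benchmarks of different sizes.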


Performance

Run 10k+ parallel sandboxes.
Start 10 GB images in under 2 seconds.
All with leading reliability guarantees.

Scalability

Automatically scale sandbox CPU and memory up or down in real time based on your agent's needs. Pay only for what you use.

Custom Benchmarks

Monitoring, observability, and logs, plus first-class support.


Benchmark Types

Run industry-standard benchmarks or create custom ones to measure what matters most.

Public Benchmarks

Evaluate your agents against ready-made, industry-standard datasets to quickly measure baseline performance.

Learn About Public Benchmarks
[Image: Runloop create scenario screen]

Curated Benchmarks

Turn your domain expertise into automated, high-margin AI verification standards across critical industry tasks.

Custom Benchmarks

Validate your agents with standard datasets or design tailored evaluations for your needs.

Learn About Custom Benchmarks

Features

Discover the tools that make building and testing easier.

Standardized Evaluation

Consistently measure AI agent performance across multiple tasks and scenarios.

Customizable Scenarios

[PENDING]

Actionable Scoring

Design benchmarks tailored to your unique workflows, domains, or edge cases.

Comparative Insights

Track results over time and compare agents against industry or internal baselines.

Automated Runs

Easily execute scenarios with built-in environment setup and result collection.
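The setup-execute-collect flow described above can be sketched as a simple run loop. All names here (`run_benchmark`, `setup`, `teardown`) are hypothetical placeholders for illustration, not Runloop's actual API.

```python
# Hypothetical sketch of an automated run: provision an environment per
# scenario, run the agent, collect the result, and always clean up.
def run_benchmark(scenarios, agent, setup, teardown):
    """Execute each scenario in a fresh environment and collect results."""
    results = {}
    for scenario in scenarios:
        env = setup(scenario)                 # built-in environment setup
        try:
            results[scenario["name"]] = agent(env, scenario["prompt"])
        finally:
            teardown(env)                     # release the sandbox even on failure
    return results

# Usage with stub callables:
scenarios = [{"name": "s1", "prompt": "Fix the failing test"}]
results = run_benchmark(
    scenarios,
    agent=lambda env, prompt: 1.0,            # stub agent: perfect score
    setup=lambda s: {"sandbox": s["name"]},   # stub environment
    teardown=lambda env: None,
)
print(results)  # {'s1': 1.0}
```

Tearing down inside `finally` ensures a crashed agent run cannot leak sandboxes.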

Scalable Testing

Evaluate small experiments or large suites of scenarios with the same framework.
