Evaluate & Train with Benchmarks

Benchmarks are structured evaluations that measure how well an AI agent performs on given tasks. Use the very best public benchmarks or create your own.

Benchmarks With Runloop AI

Benchmarks at Scale

Launch training runs to drive agent performance or evaluate current capabilities

Measure

Measure & compare agent performance
Detect regressions and automate testing
Automate prompt and context improvements

Train

Train at scale with state-of-the-art benchmarks
Use the latest post-training algorithms, such as RFT and SFT, to take your agent to the next level
Run tens of thousands of test cases simultaneously

Customize

Create, clone & modify test cases for your business
Isolated sandboxes keep your data secure

Benchmark Types

Run industry-standard benchmarks or create custom evals to measure what matters most

Public Benchmarks

Evaluate your agents against ready-made, industry-standard datasets to quickly measure baseline performance.

Learn About Public Benchmarks

Curated Benchmarks

Run test cases drawn from multiple benchmarks that match the criteria you care about. From security & compliance to tool use, Runloop has a targeted benchmark for you.

Custom Benchmarks

Turn your domain expertise into automated tests that run at massive scale with low latency.

Learn About Custom Benchmarks

Features

Discover the tools that make building and testing easier.

Standardized Evaluation

Consistently measure AI agent performance across multiple tasks and scenarios.

Customizable Scenarios

Teach your agent to solve real problems. Build evaluation sets from pull requests, or create synthetic problems to maximize training efficacy.

Dynamic Scoring

Define what success looks like by tailoring scoring rewards, so that "correct" also means low-cost and secure.
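
As a rough illustration of the idea, here is a minimal Python sketch of a reward that combines correctness with cost and security. It does not use the Runloop API: the `RunResult` fields, the `score` function, the budget, and the weights are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one agent run on one test case (hypothetical fields)."""
    tests_passed: bool       # did the agent's solution pass the checks?
    cost_usd: float          # spend attributed to this run
    security_findings: int   # issues flagged by a security scan


def score(result: RunResult, cost_budget_usd: float = 1.0) -> float:
    """Return a reward in [0, 1]. Weights are illustrative assumptions."""
    # An incorrect solution earns nothing, however cheap or clean it is.
    if not result.tests_passed:
        return 0.0
    # Fraction of the budget left unspent, clamped so overspend scores 0.
    cost_score = max(0.0, 1.0 - result.cost_usd / cost_budget_usd)
    # All-or-nothing credit for a run with no security findings.
    security_score = 1.0 if result.security_findings == 0 else 0.0
    # Half the reward for correctness, a quarter each for cost and security.
    return 0.5 + 0.25 * cost_score + 0.25 * security_score
```

With these assumed weights, a correct run that spends half its budget and has no findings scores higher than a correct run that overspends and trips the scanner, so the reward separates solutions that would otherwise all count as "passed".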

Comparative Insights

Track results over time and compare agents against industry or internal baselines.
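
One way to picture a baseline comparison is the short Python sketch below, which flags scenarios whose pass rate dropped relative to a baseline run. The data shape (scenario name mapped to a list of pass/fail outcomes), the function names, and the 5% tolerance are assumptions for illustration, not part of any Runloop interface.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of passing outcomes; an empty list counts as 0.0."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def find_regressions(
    baseline: dict[str, list[bool]],
    candidate: dict[str, list[bool]],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return {scenario: drop} where the pass rate fell by more than tolerance."""
    regressions: dict[str, float] = {}
    for name, base_outcomes in baseline.items():
        # Only compare scenarios that both runs actually executed.
        if name not in candidate:
            continue
        drop = pass_rate(base_outcomes) - pass_rate(candidate[name])
        if drop > tolerance:
            regressions[name] = drop
    return regressions


# Hypothetical results for two scenarios, four and two attempts each.
baseline_run = {"fix-bug": [True, True, True, True], "add-test": [True, False]}
candidate_run = {"fix-bug": [True, True, False, False], "add-test": [True, False]}
regressed = find_regressions(baseline_run, candidate_run)
```

Here only `fix-bug` is reported, because its pass rate fell from 1.0 to 0.5 while `add-test` stayed the same.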

Optimize Context

Improve your agents' performance by testing changes to prompt or context against benchmarks.

Train Your Agent

Use training methods such as Reinforcement Fine-Tuning (RFT) and Supervised Fine-Tuning (SFT) to improve agent performance.
