Evaluate & Train with Benchmarks
Benchmarks are structured evaluations that measure how well an AI agent performs on given tasks. Use the very best public benchmarks or create your own.

Launch training runs to drive agent performance or evaluate current capabilities
Measure & compare agent performance
Detect regressions and automate testing
Automate prompt and context improvements

Train at scale with state-of-the-art benchmarks
Use the latest post-training algorithms like reinforcement fine-tuning (RFT) and supervised fine-tuning (SFT) to take your agent to the next level
Run tens of thousands of test cases simultaneously

Create, clone & modify test cases for your business
Secure, isolated sandboxes protect your data

Run industry-standard benchmarks or create custom evals to measure what matters most
Evaluate your agents against ready-made, industry-standard datasets to quickly measure baseline performance.
Run test cases from across multiple benchmarks that match the criteria you care about. From security & compliance to tool use, Runloop has a targeted benchmark for you.
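
As a concrete illustration, filtering test cases and launching a run could look like the sketch below. The `runloop_client` module, its methods, and the tag names are hypothetical placeholders for illustration, not Runloop's actual SDK.

```python
# Hypothetical sketch: filter scenarios by tag and launch one benchmark run.
# The `runloop_client` module and every call on it are illustrative placeholders.
from runloop_client import Runloop  # hypothetical client library

client = Runloop(api_key="YOUR_API_KEY")

# Select only the test cases that match the criteria you care about.
scenarios = client.scenarios.list(tags=["security-compliance", "tool-use"])

# Launch a single run covering every matching scenario.
run = client.benchmarks.run(
    agent="my-coding-agent",
    scenario_ids=[s.id for s in scenarios],
)
print(run.id, run.status)
```
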
Discover the tools that make building and testing easier.
Consistently measure AI agent performance across multiple tasks and scenarios.

Teach your agent to solve real problems. Build evaluation sets from pull requests, or create synthetic problems to maximize training efficacy.
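
One way to picture an evaluation case built from a pull request: the pre-merge commit is the starting state, the PR description is the task, the human-written diff is the reference, and the repo's tests decide pass/fail. The schema below is a hypothetical illustration, not Runloop's data model.

```python
# Hypothetical sketch: an evaluation case derived from a merged pull request.
# `EvalCase` and its fields are illustrative, not a real Runloop schema.
from dataclasses import dataclass

@dataclass
class EvalCase:
    repo: str               # repository under test
    base_commit: str        # code state before the human fix was merged
    problem_statement: str  # the task given to the agent (e.g. the PR description)
    reference_patch: str    # the human-written diff, kept aside for scoring
    test_command: str       # command whose exit code decides pass/fail

case = EvalCase(
    repo="github.com/acme/widgets",
    base_commit="a1b2c3d",
    problem_statement="Fix the off-by-one error in pagination.",
    reference_patch="diff --git a/pagination.py b/pagination.py ...",
    test_command="pytest tests/test_pagination.py",
)
```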

Define what success looks like by tailoring scoring rewards, so that "correct" also means low-cost and secure.
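
A minimal sketch of such a composite scorer, assuming a result object that reports test outcomes, spend, and security findings; the field names and weights are illustrative assumptions only.

```python
# Hypothetical sketch: a reward where "correct" also means low-cost and secure.
# The result fields and the 0.6/0.2/0.2 weights are illustrative assumptions.
def score(result) -> float:
    correctness = 1.0 if result.tests_passed else 0.0
    # Scale spend into [0, 1] against a $1.00 per-run budget.
    cost_penalty = min(result.dollars_spent / 1.00, 1.0)
    # Any flagged issue (leaked secret, unsafe call) zeroes the security term.
    security_penalty = 1.0 if result.security_findings else 0.0
    return 0.6 * correctness - 0.2 * cost_penalty - 0.2 * security_penalty
```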

Track results over time and compare agents against industry or internal baselines.

Improve your agents' performance by testing changes to prompts or context against benchmarks.
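
For example, an A/B comparison of two system prompts over the same benchmark could be sketched as below; `run_benchmark` is a stand-in for launching a real run, not an actual API.

```python
# Hypothetical sketch: A/B test two system prompts against one benchmark.
from dataclasses import dataclass

@dataclass
class RunResult:
    mean_score: float

def run_benchmark(agent: str, system_prompt: str) -> RunResult:
    # Stand-in: a real version would launch the run and poll for its score.
    return RunResult(mean_score=0.0)

prompts = {
    "baseline": "You are a careful coding agent.",
    "candidate": "You are a careful coding agent. Run the tests before answering.",
}

results = {name: run_benchmark("my-coding-agent", p) for name, p in prompts.items()}
best = max(results, key=lambda name: results[name].mean_score)
print(f"better prompt: {best} ({results[best].mean_score:.2f})")
```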

Use sophisticated training methods like Reinforcement Fine-Tuning (RFT) and Supervised Fine-Tuning (SFT) to improve agent performance.
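
Conceptually, the two methods differ in what drives the update: SFT imitates known-good solutions, while RFT samples attempts and reinforces the highly scored ones. The sketch below assumes a model object exposing `loss`, `update`, `generate`, and `update_with_rewards`; it illustrates the idea and is not a real training API.

```python
# Hypothetical sketch of the two post-training loops; `model` is a stand-in
# object, not a real API.

# Supervised fine-tuning (SFT): imitate a known-good reference solution.
def sft_step(model, example):
    loss = model.loss(prompt=example.prompt, target=example.reference_solution)
    model.update(loss)

# Reinforcement fine-tuning (RFT): sample several attempts, then reinforce
# in proportion to the reward each attempt earns from the scorer.
def rft_step(model, scenario, scorer, n_samples=4):
    attempts = [model.generate(scenario.prompt) for _ in range(n_samples)]
    rewards = [scorer(attempt) for attempt in attempts]
    model.update_with_rewards(attempts, rewards)
```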
