Benchmarks

Introducing Public Benchmarks by Runloop

Evaluate AI coding agents with precision using Runloop's Public Benchmarks. Our platform offers standardized performance metrics that help developers and researchers assess capabilities across different tasks and domains.

Use Cases

Turn your domain expertise into automated, high-margin AI verification standards across critical industry tasks.

BigCodeBench

Evaluates LLMs on practical and challenging programming tasks with diverse function calls and complex instructions across 139 Python libraries (a loading sketch follows this entry).

Scenarios

1,140

Release Dates

2024

Attributions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim
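
To see what a BigCodeBench task looks like before running it at scale, the minimal sketch below loads the dataset with the Hugging Face `datasets` library and inspects one record. The dataset id "bigcode/bigcodebench" and the schema noted in the comments are assumptions based on the public release; verify them against the Hub listing.

```python
from datasets import load_dataset

# Assumption: the benchmark is published on the Hugging Face Hub under
# "bigcode/bigcodebench"; adjust the id if the listing differs.
bcb = load_dataset("bigcode/bigcodebench")

print(bcb)                         # show the available splits and row counts
first_split = next(iter(bcb))      # pick whichever split is listed first
example = bcb[first_split][0]
print(sorted(example.keys()))      # inspect the task schema (prompt, tests, libraries, ...)
```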

SWE-bench

Evaluates AI agents' ability to solve real-world GitHub issues by producing code edits as patch files. Uses authentic software engineering problems from popular open-source repositories (a patch-format sketch follows this entry).

Scenarios

2,294 / 500

SWE-bench full / SWE-bench Verified

Release Dates

2023

Attributions

Carlos E. Jimenez, John Yang, Alexander Wettig
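
To make the patch-based protocol concrete, the sketch below loads one SWE-bench instance and writes a prediction in the JSONL shape commonly used by the evaluation harness. The dataset id, field names, and prediction keys are assumptions drawn from the public release; check them against the SWE-bench documentation before an official run.

```python
import json
from datasets import load_dataset

# Assumption: the benchmark is published on the Hugging Face Hub under
# "princeton-nlp/SWE-bench" with a "test" split; verify against the dataset card.
swe = load_dataset("princeton-nlp/SWE-bench", split="test")
task = swe[0]

print(task["repo"], task["instance_id"])
print(task["problem_statement"][:300])   # the GitHub issue the agent must resolve

# The agent answers with a unified-diff patch keyed by instance id; the keys below
# mirror the predictions format the harness expects (assumed, not verified here).
prediction = {
    "instance_id": task["instance_id"],
    "model_name_or_path": "my-agent",                    # hypothetical agent name
    "model_patch": "diff --git a/pkg.py b/pkg.py\n...",  # the agent's proposed edit
}
with open("predictions.jsonl", "w") as f:
    f.write(json.dumps(prediction) + "\n")
```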

SWE-Smith

Automated pipeline for generating large-scale software engineering training data, creating synthetic bug-fixing tasks from real codebases (a toy illustration follows this entry).

Scenarios

50,000+

Release Dates

2025

Attributions

John Yang, Kyle Leret, Carlos E. Jimenez
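
The toy sketch below illustrates the general idea behind synthetic bug-fixing tasks rather than the actual SWE-Smith pipeline: mutate working code to produce a broken variant, keep a failing test, and store the reverse edit as the gold patch an agent must reproduce.

```python
import difflib

# Toy illustration only (not the SWE-Smith implementation): derive a synthetic
# bug-fix task from a known-good snippet by injecting a small mutation.
original = "def is_adult(age):\n    return age >= 18\n"
buggy = original.replace(">=", ">")   # off-by-one bug: 18 is no longer an adult

# The gold patch is simply the edit that restores the original behavior.
gold_patch = "".join(
    difflib.unified_diff(
        buggy.splitlines(keepends=True),
        original.splitlines(keepends=True),
        fromfile="a/ages.py",
        tofile="b/ages.py",
    )
)

task = {
    "broken_code": buggy,
    "gold_patch": gold_patch,
    "test": "assert is_adult(18)",    # fails on the buggy variant, passes when fixed
}
print(task["gold_patch"])
```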

SWE-bench Verified

Human-validated subset of SWE-bench with 500 carefully verified samples, providing more reliable evaluation of AI models' software engineering capabilities.

Scenarios

500

Release Dates

2024

Attributions

Carlos E. Jimenez, John Yang, Alexander Wettig (original SWE-bench authors), OpenAI Preparedness Team

DS-1000

Data science code generation benchmark with 1,000 problems spanning seven Python libraries, including NumPy, Pandas, and Matplotlib (an illustrative problem sketch follows this entry).

Scenarios

1,000

Release Dates

2022

Attributions

Yuhang Lai, Chengxi Li, Yiming Wang
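
The sketch below shows the general shape of a DS-1000-style problem using a made-up Pandas task: a natural-language prompt, a reference solution, and a functional check that accepts any behaviorally equivalent completion. It is illustrative only, not an actual benchmark item.

```python
import pandas as pd

# Hypothetical DS-1000-style problem: prompt, reference solution, functional check.
prompt = "Given df with columns 'a' and 'b', add a column 'total' with their row-wise sum."
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def reference(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    out["total"] = out["a"] + out["b"]
    return out

def candidate(frame: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for a model-generated completion; any equivalent solution passes.
    return frame.assign(total=frame["a"] + frame["b"])

# Completions are scored by functional equivalence rather than string match.
assert candidate(df).equals(reference(df))
print("candidate passes the functional check")
```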

Multi-SWE-bench

The first multilingual code-repair benchmark, covering seven programming languages and designed to evaluate large models' self-debugging and code-repair capabilities across diverse codebases.

Scenarios

1,632

Release Dates

2025

Attributions

Daoguang Zan, Zhirong Huang, Wei Liu

MarsCode Agent (ByteDance)

AI-native automated bug fixing agent that achieves state-of-the-art performance on SWE-bench, demonstrating advanced software engineering capabilities.

Scenarios evaluated on SWE-bench

39.33%

Success rate

Release Dates

2024

Attributions

Yuntong Liu, Peng Gao, Xinyu Wang

OpenAI HumanEval

The original OpenAI benchmark for evaluating large language models trained on code, featuring carefully crafted evaluation sets that measure functional correctness (a usage sketch follows this entry).

Scenarios

164

Release Dates

2021

Attributions

Mark Chen, Jerry Tworek, Heewoo Jun
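
A minimal sketch of producing HumanEval samples with OpenAI's reference `human-eval` package (github.com/openai/human-eval); `generate_completion` is a hypothetical stand-in for whatever model or agent is being evaluated.

```python
from human_eval.data import read_problems, write_jsonl

def generate_completion(prompt: str) -> str:
    # Hypothetical stand-in: call your model or agent here.
    return "    return True  # placeholder completion\n"

problems = read_problems()   # the 164 HumanEval tasks, keyed by task_id
samples = [
    {"task_id": task_id, "completion": generate_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Score functional correctness (pass@k) with the package's bundled CLI:
#   evaluate_functional_correctness samples.jsonl
```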

And That’s Not It

We have over 150,000 use cases ready for you to explore.
Talk to an Expert
Case Study

The Evolution to Verification

Fermatix.ai, renowned for creating expert-level training data for industry-critical tasks with annotators who are practicing industry experts, partnered with Runloop.ai to strategically evolve its offering.

Challenge

Fermatix.ai needed to move beyond providing one-time training data and establish ongoing testing standards and verification for its enterprise clients, ensuring AI agent performance against each client's specific proprietary logic.

Solution: Runloop Custom Benchmarks

By leveraging Runloop.ai’s Custom Benchmarks infrastructure, Fermatix.ai is now able to offer custom, in-house verification for its clients. This allows them to build specialized, private benchmarks that accurately measure and refine AI agents on unique codebases and business logic.

“This partnership... represents a strategic evolution—moving beyond one-time data labeling to creating reusable benchmarks that deliver ongoing value to our clients. By leveraging our domain expertise and Runloop’s infrastructure, we’re not just providing data anymore; we’re building the testing standards that will define how enterprises evaluate their AI agents across industry-critical tasks.”

Sergey Anchutin, CEO and Founder, Fermatix.ai

Outcome

Fermatix.ai strategically expanded its capabilities, using its domain expertise to create high-fidelity, multilingual benchmarks on a secure, scalable platform. They are now positioned to offer a new level of assurance and become the verification layer for their clients' AI agent deployments.

Pricing

Powerful Features at Low Costs

Benchmark          | Scenarios | Total Time (mins) | Runloop Time (mins) | Cost
HumanEval          | 164       | 440               | 8                   | $1.17
BigCodeBench       | 1,125     | 2,934             | 60                  | $7.81
DS-1000            | 1,000     | 1,730             | 22                  | $9.31
CruxEval           | 800       | 1,924             | 24                  | $5.12
SWE-bench Verified | 500       | 2,647             | 190                 | $18.66
You’ll be charged only for compute time, with nothing billed for idle resources.
View Pricing Plans