Evaluate AI coding agents with precision using Runloop's Public Benchmarks. Our platform offers standardized performance metrics that help developers and researchers assess capabilities across different tasks and domains.


Turn your domain expertise into automated, high-margin AI verification standards across critical industry tasks.

BigCodeBench evaluates LLMs on practical and challenging programming tasks with diverse function calls and complex instructions across 139 Python libraries (a brief data-loading sketch follows this entry).
Attributions: Terry Yue Zhuo, Minh Chien Vu, Jenny Chim
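
To make the task format concrete, here is a minimal sketch of pulling the public BigCodeBench release from Hugging Face and inspecting one task. This is not Runloop's harness: the dataset id bigcode/bigcodebench and the field names are assumptions based on the public dataset card.

from datasets import load_dataset

# Load the public BigCodeBench release (a DatasetDict keyed by release version).
dsets = load_dataset("bigcode/bigcodebench")
latest = dsets[sorted(dsets.keys())[-1]]           # pick one release split

task = latest[0]
print(sorted(task.keys()))                         # inspect the schema before relying on it
print(task.get("task_id"))                         # assumed field, e.g. "BigCodeBench/0"
print(str(task.get("instruct_prompt", ""))[:300])  # assumed field: the instruction-style prompt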
SWE-bench evaluates AI agents' ability to solve real-world GitHub issues by producing code edits as patch files, using authentic software engineering problems from popular open-source repositories (a brief data-loading sketch follows this entry).
Scenarios: SWE-bench Full / SWE-bench Verified
Attributions: Carlos E. Jimenez, John Yang, Alexander Wettig
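
The sketch below shows what a SWE-bench instance looks like and the rough shape of a prediction an agent submits. It is illustrative only: the dataset id princeton-nlp/SWE-bench and the prediction keys are assumptions drawn from the public release, not Runloop-specific.

import json
from datasets import load_dataset

# Load the full SWE-bench test split and look at one instance.
ds = load_dataset("princeton-nlp/SWE-bench", split="test")
inst = ds[0]
print(inst["repo"], inst["instance_id"])   # source repository and instance identifier
print(inst["problem_statement"][:300])     # the GitHub issue text the agent must resolve

# The agent's answer is a unified diff; harnesses typically expect one record per
# instance in roughly this shape (key names are an assumption):
prediction = {
    "instance_id": inst["instance_id"],
    "model_name_or_path": "my-agent",           # hypothetical agent name
    "model_patch": "diff --git a/... b/...\n",  # placeholder, not a real fix
}
with open("predictions.jsonl", "w") as f:
    f.write(json.dumps(prediction) + "\n")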
SWE-smith is an automated pipeline for generating large-scale software engineering training data, creating synthetic bug-fixing tasks from real codebases (the core idea is sketched after this entry).
Attributions: John Yang, Kilian Lieret, Carlos E. Jimenez
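
The following sketch illustrates the core idea behind synthetic task generation rather than SWE-smith's actual code: perturb a repository that starts with a green test suite, and keep the perturbation as a bug-fixing task only if it makes tests fail. The apply_perturbation hook and the pytest invocation are hypothetical choices for this sketch.

import subprocess


def tests_pass(repo_dir: str) -> bool:
    # Run the repository's test suite and report whether it passes.
    result = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir)
    return result.returncode == 0


def make_candidate_task(repo_dir: str, apply_perturbation) -> bool:
    # Keep a perturbation as a task only if it flips the suite from green to red.
    if not tests_pass(repo_dir):
        return False                 # only perturb repositories that start green
    apply_perturbation(repo_dir)     # e.g. invert a branch condition, drop a call
    return not tests_pass(repo_dir)  # the failing tests define the bug to be fixed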
SWE-bench Verified is a human-validated subset of SWE-bench with 500 carefully verified samples, providing a more reliable evaluation of AI models' software engineering capabilities.
Attributions: Carlos E. Jimenez, John Yang, Alexander Wettig (original SWE-bench authors), OpenAI Preparedness Team
DS-1000 is a data science code generation benchmark with 1,000 problems spanning seven Python libraries, including NumPy, Pandas, and Matplotlib (an evaluation-style sketch follows this entry).
Attributions: Yuhang Lai, Chengxi Li, Yiming Wang
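
As an illustration of execution-based functional checking in the DS-1000 style (a toy problem, not the benchmark's own data or harness), the sketch below runs a model-written snippet and a reference snippet in the same prepared context and compares their results.

import pandas as pd

# A prepared problem context the snippet is allowed to use.
context = {"df": pd.DataFrame({"a": [1, 2, 2, 3], "b": [10, 20, 30, 40]})}

model_snippet = "result = df.groupby('a', as_index=False)['b'].sum()"      # hypothetical model output
reference_snippet = "result = df.groupby('a', as_index=False)['b'].sum()"  # reference solution


def run(snippet: str) -> pd.DataFrame:
    env = {key: value.copy() for key, value in context.items()}  # isolate each execution
    exec(snippet, {}, env)                                       # execute the snippet in the context
    return env["result"]


assert run(model_snippet).equals(run(reference_snippet)), "functional mismatch"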
The first multilingual code-fix benchmark, covering seven programming languages and designed to evaluate large models' self-debugging and code repair capabilities across diverse codebases.
Attributions: Daoguang Zan, Zhirong Huang, Wei Liu
AI-native automated bug fixing agent that achieves state-of-the-art performance on SWE-bench, demonstrating advanced software engineering capabilities.
Scenarios: Evaluated on SWE-bench (success rate)
Attributions: Yuntong Liu, Peng Gao, Xinyu Wang
HumanEval is the original OpenAI benchmark for evaluating large language models trained on code, featuring carefully crafted evaluation sets that measure functional correctness (a pass@k scoring sketch follows this entry).
Attributions: Mark Chen, Jerry Tworek, Heewoo Jun
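
Functional correctness on HumanEval is usually reported as pass@k. The sketch below implements the unbiased pass@k estimator from the original paper: with n samples per task, c of which pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. The per-task pass counts here are made up for illustration.

from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of the chance that at least one of k samples passes.
    if n - c < k:
        return 1.0                   # too few failing samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: three tasks, 20 samples each, with 0, 5, and 20 passing samples.
per_task_passes = [0, 5, 20]
score = sum(pass_at_k(20, c, k=1) for c in per_task_passes) / len(per_task_passes)
print(f"pass@1 = {score:.3f}")       # (0.0 + 0.25 + 1.0) / 3 ≈ 0.417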