MBPP: The Benchmark That Democratized Code Generation
Abigail Wall
Product Manager
Benchmarks


MBPP made evaluating AI code generation more accessible by grounding it in simple Python tasks. It changed how models are judged and lowered the barrier to entry for working with coding benchmarks.

TL;DR:

Released in August 2021 by Google Research [Austin et al., 2021], MBPP (Mostly Basic Python Problems) introduced 974 entry-level programming tasks designed to be solvable by novice programmers. Unlike HumanEval's function-level challenges, MBPP focused on fundamental programming concepts with visible test cases, creating a more accessible yet comprehensive evaluation framework. Today's models achieve 95%+ accuracy on MBPP, but the benchmark's true legacy lies in its democratic approach to measuring programming competence.

Introduction

One month after OpenAI released HumanEval, a team at Google Research was putting the finishing touches on their own contribution to the emerging field of code generation evaluation. But where HumanEval had focused on the kind of problems that might appear in a technical interview, Jacob Austin, Augustus Odena, and their colleagues had a different vision in mind [Austin et al., 2021].

They called it MBPP—Mostly Basic Python Problems—and the name captured something essential about their approach. These weren't the algorithmic puzzles that computer science students might encounter in advanced courses. They were the fundamental building blocks of programming: string manipulation, list processing, basic mathematical operations, and simple logical reasoning. The kind of problems that anyone learning to program would need to master.

The timing was perfect. In the summer of 2021, the AI research community was grappling with a fundamental question: how do you evaluate whether an artificial intelligence system can truly code? HumanEval had provided one answer, but Austin and his team recognized that a single benchmark, no matter how well-designed, couldn't capture the full spectrum of programming competence.

Their insight was both simple and profound: if you want to understand whether AI can program, start with the basics. Don't begin with complex algorithms or esoteric data structures. Begin with the fundamental skills that every programmer must master—the ability to read a problem description, understand what's being asked, and translate that understanding into working code.

MBPP's 974 problems were crowd-sourced, reflecting the collective wisdom of the programming community about what constituted essential programming skills. Each problem came with a clear description, visible test cases, and a focus on practical programming tasks rather than theoretical computer science concepts.

But MBPP's true innovation wasn't in its problem selection—it was in its philosophy. Where other benchmarks might test the limits of AI capabilities, MBPP tested their foundations. It asked a deceptively simple question: can AI systems master the basic building blocks of programming that human beginners learn first?

Background and Methodology

The story of MBPP begins with a recognition that would reshape how we think about AI evaluation: the importance of fundamentals. In August 2021, as the AI research community was still absorbing the implications of HumanEval, the Google Research team was taking a different approach to the same fundamental question [Austin et al., 2021].

Where HumanEval had focused on function-level programming tasks that might appear in technical interviews, MBPP took inspiration from a different source: the problems that programming instructors use to teach beginners. The team crowd-sourced 974 programming problems, each designed to be solvable by entry-level programmers and covering the fundamental concepts that form the foundation of programming competence.

The methodology was deliberately inclusive. Rather than hand-crafting problems to test specific algorithmic knowledge, the team collected problems written by crowd contributors with basic Python knowledge: the kinds of exercises instructors routinely use to teach fundamental programming concepts.

Each MBPP problem followed a consistent structure that differed meaningfully from HumanEval's approach. Where HumanEval provided function signatures and docstrings, MBPP problems included the test cases directly in the problem description. This transparency was intentional—it reflected how programming is actually taught, with clear examples of expected behavior rather than hidden test cases.

Consider this example from MBPP:
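The task below is the dataset's well-known "similar elements" problem; the statement and asserts follow the public release, while the solution shown is a representative sketch rather than the dataset's verbatim reference code.

```python
# MBPP-style problem statement:
#   "Write a function to find the similar elements from the given two tuple lists."
# The three asserts are the kind of visible test cases MBPP ships with each
# problem (compared as sets here so the check does not depend on element order).

def similar_elements(test_tup1, test_tup2):
    # Intersect the two tuples as sets and return the result as a tuple.
    return tuple(set(test_tup1) & set(test_tup2))

assert set(similar_elements((3, 4, 5, 6), (5, 7, 4, 10))) == {4, 5}
assert set(similar_elements((1, 2, 3, 4), (5, 4, 3, 7))) == {3, 4}
assert set(similar_elements((11, 12, 14, 13), (17, 15, 14, 13))) == {13, 14}
```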

The elegance was in the accessibility. The problem statement was clear and direct. The test cases provided concrete examples of expected behavior. The task required understanding basic programming concepts—sets, tuples, iteration—without demanding specialized algorithmic knowledge.

MBPP's evaluation methodology built on HumanEval's innovations while adding its own insights. The team used the same pass@k metrics that had proven effective for measuring code generation performance, but they applied them to a much larger and more diverse set of problems. With 974 problems compared to HumanEval's 164, MBPP provided a more comprehensive assessment of programming competence.
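Pass@k is normally reported with the unbiased estimator introduced alongside HumanEval: generate n completions per problem, count the c that pass, and estimate the chance that at least one of k randomly chosen completions is correct. A minimal sketch (not the official MBPP harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions drawn from n generations (c of which pass) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of them passing.
print(pass_at_k(200, 37, 1))   # ~0.185
print(pass_at_k(200, 37, 10))  # ~0.88
```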

The benchmark's data splits were carefully designed to support both few-shot learning and fine-tuning approaches [Google Research MBPP GitHub]. Task IDs 11-510 were reserved for testing, IDs 1-10 for few-shot prompting, IDs 511-600 for validation during fine-tuning, and IDs 601-974 for training. This structure allowed researchers to evaluate models in different learning regimes while maintaining clear separation between training and evaluation data.
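A minimal sketch of carving those splits out of the released mbpp.jsonl file; the field names (task_id, text, code, test_list) follow the public release, so adjust them if your copy differs:

```python
import json

def load_mbpp(path="mbpp.jsonl"):
    # Each line of the public release is one JSON record with fields such as
    # task_id, text (problem statement), code (reference solution), and
    # test_list (the visible asserts).
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

problems = load_mbpp()

# Split by task_id, following the ranges described above.
few_shot   = [p for p in problems if 1 <= p["task_id"] <= 10]
test       = [p for p in problems if 11 <= p["task_id"] <= 510]
validation = [p for p in problems if 511 <= p["task_id"] <= 600]
train      = [p for p in problems if p["task_id"] >= 601]

print(len(few_shot), len(test), len(validation), len(train))
```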

The team's approach to prompt engineering reflected their focus on accessibility and clarity. Their standard prompt format was straightforward: "You are an expert Python programmer, and here is your task: {prompt} Your code should pass these tests: {tests}." This direct approach avoided the complex prompt engineering that some benchmarks required, making the evaluation more accessible to researchers with different levels of prompt optimization expertise.
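Reusing the hypothetical record format from the loading sketch above, assembling that prompt for a single problem looks roughly like this (exact whitespace and any delimiter tokens vary by setup):

```python
def build_prompt(problem: dict) -> str:
    # Follows the straightforward template quoted above: task description,
    # then the visible test cases the generated code must pass.
    tests = "\n".join(problem["test_list"])
    return (
        "You are an expert Python programmer, and here is your task: "
        f"{problem['text']} "
        "Your code should pass these tests:\n\n"
        f"{tests}\n"
    )

# Few-shot evaluation simply prepends the prompts (and reference solutions)
# for task IDs 1-10 before the prompt of the problem under test.
```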

MBPP's focus on entry-level problems served a crucial purpose in the broader landscape of AI evaluation. While other benchmarks pushed the boundaries of what AI systems could accomplish, MBPP established a foundation—a set of basic competencies that any system claiming programming ability should master. This complementary role proved invaluable as the field evolved.

Current State-of-the-Art Results

The evolution of MBPP performance tells a story of remarkable progress in AI's mastery of fundamental programming concepts. When the benchmark was first released in August 2021, the largest models could solve 59.6% of the problems using few-shot learning—a score that demonstrated genuine programming competence but left substantial room for improvement [Austin et al., 2021].

Today's landscape presents a dramatically different picture. OpenAI's O1 Preview achieves 95.5% pass@1 on the original MBPP benchmark, with O1 Mini close behind at 93.1% [EvalPlus Leaderboard]. That is roughly a 60% relative improvement (about 36 percentage points) over the original results—a testament to the rapid advancement in AI code generation capabilities.

The current leaderboard showcases the breadth of high-performing systems. Qwen2.5-Coder-32B-Instruct reaches 90.5%, while Gemini 1.5 Pro 002 achieves 89.7%. DeepSeek-Coder-V2-Instruct and Claude Sonnet 3.5 both score 89.4%, demonstrating that multiple research groups have achieved near-human performance on these fundamental programming tasks.

However, the EvalPlus results reveal the familiar pattern of performance degradation under more rigorous testing. The same O1 Preview model that achieves 95.5% on base MBPP tests scores 80.2% on EvalPlus tests [EvalPlus Leaderboard]. This 15-percentage-point gap highlights the ongoing challenge of robustness in AI code generation.

Technical Analysis and Future Outlook

MBPP's contribution to the field extends beyond its specific problems and metrics. The benchmark established the principle that AI evaluation should encompass the full spectrum of programming competence, from basic skills to advanced capabilities. Its focus on entry-level problems provided a foundation that other benchmarks could build upon.

The performance patterns on MBPP reveal important insights about current AI capabilities. Models consistently perform slightly worse on MBPP than on HumanEval, despite MBPP's focus on simpler problems. This suggests that the larger problem set and different evaluation approach of MBPP may provide a more challenging assessment of programming competence.

The benchmark's influence on model development has been significant. The pursuit of high MBPP scores has driven improvements in models' ability to handle basic programming concepts, string manipulation, and logical reasoning. These fundamental skills form the foundation for more advanced programming capabilities.

Looking forward, MBPP's legacy lies in its demonstration that comprehensive evaluation requires diverse benchmarks testing different aspects of programming competence. As AI systems achieve near-perfect scores on both MBPP and HumanEval, the field is moving toward more complex evaluation frameworks that test real-world programming scenarios.

The benchmark's emphasis on accessibility and transparency has influenced the design of subsequent evaluation frameworks. The practice of including test cases in problem descriptions, pioneered by MBPP, has become common in many newer benchmarks.

Challenges and Limitations

Like HumanEval, MBPP faces the contamination problem, with its problems widely available since 2021. The benchmark's focus on basic programming concepts, while valuable for establishing foundational competence, may not capture the full complexity of real-world programming tasks.

The evaluation methodology's binary pass/fail approach doesn't account for code quality, efficiency, or maintainability. Additionally, the benchmark's Python-specific focus limits its generalizability to other programming languages and paradigms.
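Concretely, binary pass/fail means a completion scores 1 only if every visible assert runs without error, regardless of how slow or unreadable the code is. A deliberately simplified checker (illustrative only; real harnesses sandbox execution in a subprocess with timeouts rather than calling exec on untrusted model output) might look like this:

```python
def passes_tests(candidate_code: str, test_list: list[str]) -> bool:
    """Return True only if the candidate passes every visible assert.
    Simplified illustration: no sandboxing, no timeouts, no quality checks."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        for test in test_list:
            exec(test, namespace)         # each assert must hold
        return True
    except Exception:
        return False
```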

References

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., & Sutton, C. (2021). Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732. https://arxiv.org/abs/2108.07732

EvalPlus Leaderboard. (2024). EvalPlus evaluates AI Coders with rigorous tests. https://evalplus.github.io/leaderboard.html

Google Research MBPP GitHub Repository. (2021). Mostly Basic Python Problems Dataset. https://github.com/google-research/google-research/blob/master/mbpp/README.md