Background: What Is AlphaEvolve?

Ask an LLM to write a sorting function. You'll get correct, textbook code — maybe a clean merge sort or a well-structured quicksort. It compiles, passes tests, handles edge cases. But "correct" is not "optimal." For a specific input distribution — say, nearly-sorted arrays with a few outliers — a hand-tuned heuristic can be 3–10x faster. The LLM doesn't know your distribution. It optimizes for plausibility given its training data, not for your metric.

This is the gap between generation and optimization. Google DeepMind's AlphaEvolve (announced May 2025) closes it — and the pattern it uses changes how you should think about applying AI to hard engineering problems.

In this tutorial, we'll build a Claude Code skill inspired by AlphaEvolve's core ideas. But first, you need to understand what AlphaEvolve is, what it can do, and why the "single LLM call" approach has a hard ceiling.

The Single-Call Ceiling

A single LLM call is next-token prediction: the model produces a likely continuation given the prompt and its training distribution. This gives you working code, reasonable designs, coherent explanations. But it has a fundamental limitation: nothing in that single pass runs the code or measures the result, so it cannot optimize for a custom metric that wasn't in the training objective.

Want to minimize p99 latency for your specific workload? Reduce memory allocations in a hot loop? Find a matrix multiplication algorithm that uses fewer operations than anything known? Next-token prediction cannot get you there in one shot. It produces plausible code, not optimal code.

[Diagrams: Single LLM Call vs. Evolutionary Optimization]

A single LLM call is asking a smart friend for advice once. AlphaEvolve is running a tournament where thousands of candidates compete, mutate, and the best survive across generations — with a scoreboard that measures exactly what you care about.

What AlphaEvolve Does

AlphaEvolve is an evolutionary coding agent built by Google DeepMind. It evolves entire programs — classes, methods, configurations — in any programming language by wrapping LLMs in an automated optimization loop. It's the successor to FunSearch (2023), which could only evolve single functions.

The key insight: use LLMs as mutation operators within an evolutionary algorithm, not as one-shot generators. The LLM proposes changes; an automated evaluator scores them; the best survive to become parents of the next generation.

Architecture: Five Components in a Loop

[Diagram: AlphaEvolve Architecture]

  1. Prompt Sampler — Assembles rich context for each generation attempt: problem description, parent programs, their scores, relevant code context, and meta-prompts that the system itself co-evolves over time.

  2. LLM Ensemble — Uses multiple models with different strengths. Google's implementation pairs Gemini Flash (fast, cheap — maximizes breadth of exploration) with Gemini Pro (slower, stronger — provides depth for breakthroughs).

  3. Evaluator Pool — Executes candidates and scores them on user-defined metrics. Supports evaluation cascades: cheap checks first (syntax, type-check, fast tests), expensive checks last (full benchmark suite). Can also use LLM-as-judge for properties that are hard to measure programmatically.

  4. Programs Database — Maintains the population using an approach inspired by MAP-Elites (a quality-diversity algorithm) combined with island-based population models. MAP-Elites partitions the solution space into behavioral niches and keeps the best candidate per niche, so the population stays diverse rather than collapsing to a single local optimum.

  5. Controller — An async distributed orchestrator that maximizes throughput across the pipeline.
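
To make the Programs Database idea concrete, here is a minimal sketch of a MAP-Elites-style archive: one slot per niche, and a newcomer replaces the incumbent only if it scores higher. The `Candidate` and `MapElitesArchive` names and the `describe` descriptor (code-length bucket and loop count) are assumptions made for illustration, not AlphaEvolve's actual data model, and the island model is left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Niche = Tuple[int, ...]


@dataclass(frozen=True)
class Candidate:
    code: str      # program source
    score: float   # higher is better
    niche: Niche   # behavioral descriptor, already bucketed


def describe(code: str) -> Niche:
    """Hypothetical descriptor: (length bucket, number of for-loops), both capped."""
    return (min(len(code) // 40, 4), min(code.count("for "), 3))


class MapElitesArchive:
    """Keeps only the best-scoring candidate seen so far in each niche."""

    def __init__(self) -> None:
        self.cells: Dict[Niche, Candidate] = {}

    def add(self, cand: Candidate) -> bool:
        """Insert cand if its niche is empty or it beats the incumbent."""
        incumbent = self.cells.get(cand.niche)
        if incumbent is None or cand.score > incumbent.score:
            self.cells[cand.niche] = cand
            return True
        return False

    def best(self) -> Optional[Candidate]:
        """Return the highest-scoring candidate across all niches, if any."""
        return max(self.cells.values(), key=lambda c: c.score, default=None)
```

A lower-scoring candidate is still kept when it lands in an empty niche, which is what stops the population from collapsing onto a single design.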

The Evolutionary Loop

1. Sample parent from population (biased toward quality, with diversity)
2. Build prompt: parent code + context + scores + mutation hints
3. LLM generates a code diff (or a full rewrite)
4. Apply changes to parent → child program
5. Evaluate child on automated metric
6. Add to population if it passes quality threshold
7. Repeat

Each iteration is cheap — a single LLM call plus an evaluation run. The power comes from running hundreds of iterations, accumulating improvements that no single call could produce.
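
The seven steps above can be sketched as a short loop. This is a toy under stated assumptions: the "program" is a small parameter vector rather than source code, `propose_mutation` is a random perturbation standing in for the LLM call, and `evaluate` is a made-up metric with a hidden target. None of these names come from AlphaEvolve.

```python
import random
from typing import List, Tuple

Program = List[float]            # stand-in for source code
Scored = Tuple[float, Program]   # (score, program), higher score is better


def evaluate(program: Program) -> float:
    """Toy metric: negative squared distance to a hidden target."""
    target = [3.0, -1.0, 2.0]
    return -sum((p - t) ** 2 for p, t in zip(program, target))


def propose_mutation(parent: Program, rng: random.Random) -> Program:
    """Stand-in for the LLM call: perturb one coordinate of the parent."""
    child = list(parent)
    i = rng.randrange(len(child))
    child[i] += rng.gauss(0.0, 0.5)
    return child


def sample_parent(population: List[Scored], rng: random.Random, k: int = 3) -> Program:
    """Tournament selection: the best of k random members."""
    contenders = rng.sample(population, min(k, len(population)))
    return max(contenders, key=lambda item: item[0])[1]


def evolve(seed: Program, iterations: int = 300,
           max_population: int = 20, rng_seed: int = 0) -> Scored:
    rng = random.Random(rng_seed)
    population: List[Scored] = [(evaluate(seed), seed)]  # kept sorted, best first
    for _ in range(iterations):
        parent = sample_parent(population, rng)      # step 1
        child = propose_mutation(parent, rng)        # steps 2-4
        score = evaluate(child)                      # step 5
        # Step 6: admit the child if there is room or it beats the weakest member.
        if len(population) < max_population or score > population[-1][0]:
            population.append((score, child))
            population.sort(key=lambda item: item[0], reverse=True)
            del population[max_population:]
    return population[0]                             # step 7 is the loop itself
```

Calling `evolve([0.0, 0.0, 0.0])` returns the best `(score, program)` pair found. Because the best member is never evicted, the returned score can only match or improve on the seed's.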

Why We Need This Pattern

Real-World Results

AlphaEvolve isn't only a research prototype. Google DeepMind reports results that are deployed in Google's own infrastructure, alongside a new mathematical result:

  • Google Borg Scheduler: A 7-line heuristic discovered by AlphaEvolve continuously recovers, on average, 0.7% of Google's worldwide compute resources. It was chosen over a deep RL solution in part because the evolved heuristic is interpretable: engineers can read and reason about 7 lines of code.
  • Gemini Training: 23% speedup on a critical training kernel, yielding a 1% reduction in overall Gemini training time. At Google's scale, 1% of training compute is enormous.
  • FlashAttention: Up to 32.5% speedup on XLA-generated FlashAttention code.
  • Matrix Multiplication: Found an algorithm for 4×4 complex matrix multiplication using 48 scalar multiplications — the first improvement over Strassen's 1969 algorithm in this setting.
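
For context on the matrix multiplication result, the scalar multiplication counts for a 4×4 product compare as follows. The first two figures are standard (the schoolbook method, and Strassen's 7-multiplication 2×2 scheme applied recursively to 2×2 blocks); the third is the count reported above.

$$
\underbrace{4^3 = 64}_{\text{schoolbook}}
\;>\;
\underbrace{7^2 = 49}_{\text{Strassen, applied recursively}}
\;>\;
\underbrace{48}_{\text{AlphaEvolve, complex-valued}}
$$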

When It Applies (and When It Doesn't)

| Good Fit | Bad Fit |
| --- | --- |
| Clear objective to maximize/minimize | No measurable "better" (feature implementation, one-off scripts) |
| Automated, quantifiable evaluation metric | Requires physical experiments or subjective human judgment |
| Runtime, accuracy, resource usage, proof verification | No clear scoring function exists |
| Search space is large enough to benefit from iteration | The correct answer is known and straightforward |

Applying the Pattern to Everyday Engineering

You don't need Google-scale infrastructure to use this approach. The core idea — generate variants, evaluate on a real metric, select, repeat — works at any scale:

  • Performance optimization: Define a benchmark, generate code variants, measure each one, keep the fastest.
  • Algorithm selection: Instead of picking the textbook algorithm, evolve a hybrid tuned to your actual data distribution.
  • Configuration tuning: Treat config parameters as the "genome" — evolve database pool sizes, cache TTLs, batch sizes that maximize throughput under realistic load.
  • Code quality: Use static analysis scores as the evaluator — evolve code that minimizes complexity while maintaining correctness.
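
As one way to picture the performance-optimization case, the sketch below scores sorting candidates with a two-stage cascade: a cheap correctness check first, then a timed benchmark on nearly-sorted inputs. All names here (`evaluate`, `select_fastest`, `nearly_sorted`, the two sample candidates) are hypothetical, and wall-clock timing is only one possible metric.

```python
import random
import time
from typing import Callable, Dict, List, Optional, Tuple

SortFn = Callable[[List[int]], List[int]]


def insertion_sort(xs: List[int]) -> List[int]:
    out = list(xs)
    for i in range(1, len(out)):
        key, j = out[i], i - 1
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out


def broken_sort(xs: List[int]) -> List[int]:
    return list(xs)  # deliberately wrong: returns the input unsorted


def nearly_sorted(n: int, swaps: int, rng: random.Random) -> List[int]:
    """Sorted range with a few random pairs swapped (the outliers)."""
    xs = list(range(n))
    for _ in range(swaps):
        i, j = rng.randrange(n), rng.randrange(n)
        xs[i], xs[j] = xs[j], xs[i]
    return xs


def evaluate(candidate: SortFn, benchmark: List[List[int]],
             repeats: int = 3) -> Optional[float]:
    """Return a score (negative best time, higher is better) or None if rejected."""
    smoke = [3, 1, 2, 2, 0]
    try:
        # Stage 1 (cheap): reject wrong or crashing candidates before timing them.
        if candidate(list(smoke)) != sorted(smoke):
            return None
        # Stage 2 (expensive): time the full benchmark, keep the best of several runs.
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            outputs = [candidate(list(xs)) for xs in benchmark]
            best = min(best, time.perf_counter() - start)
    except Exception:
        return None
    if any(out != sorted(xs) for out, xs in zip(outputs, benchmark)):
        return None
    return -best


def select_fastest(candidates: Dict[str, SortFn], benchmark: List[List[int]]
                   ) -> Tuple[Optional[str], Dict[str, Optional[float]]]:
    """Score every candidate and return the name of the fastest valid one."""
    scored = {name: evaluate(fn, benchmark) for name, fn in candidates.items()}
    valid = {name: s for name, s in scored.items() if s is not None}
    winner = max(valid, key=valid.get) if valid else None
    return winner, scored
```

The benchmark inputs should mirror your real data distribution, since which candidate wins depends on the machine and on those inputs.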

What We'll Build

In the following sections of this tutorial, we'll build a Claude Code skill that implements a simplified version of this evolutionary optimization loop:

  1. Define an evaluator — a scoring function that measures the property you care about
  2. Set up the evolution target — which code to optimize and the initial seed
  3. Assemble rich context — parent code, evaluation history, mutation hints
  4. Run the optimization loop — generate variants, evaluate, select, repeat

The result won't have AlphaEvolve's full MAP-Elites population model or island-based diversity, but it captures the core principles: population-based search, automated evaluation, and iterative refinement driven by real measurements rather than LLM guesswork.
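
As a preview of step 3 in the list above, here is a minimal sketch of context assembly: it combines the task description, the parent code, the best-scoring previous attempts, and mutation hints into one prompt string. The `build_prompt` function, its parameters, and the section headings are assumptions for illustration. They are not the format the later sections necessarily use, and not AlphaEvolve's prompt template.

```python
from typing import List, Sequence, Tuple


def build_prompt(task: str,
                 parent_code: str,
                 history: Sequence[Tuple[str, float]],
                 hints: Sequence[str],
                 top_k: int = 3) -> str:
    """Assemble a mutation prompt from the parent, scored history, and hints.

    history holds (summary, score) pairs, where a higher score is better.
    Only the top_k best attempts are included, to keep the prompt short.
    """
    best_first = sorted(history, key=lambda item: item[1], reverse=True)[:top_k]
    lines: List[str] = [
        "## Task", task, "",
        "## Parent program", parent_code, "",
        "## Previous attempts (best first)",
    ]
    lines += [f"- score={score:.3f}: {summary}" for summary, score in best_first]
    lines += ["", "## Mutation hints"]
    lines += [f"- {hint}" for hint in hints]
    lines += ["", "Propose one focused change to the parent program."]
    return "\n".join(lines)
```

Showing scores next to earlier attempts gives the model a signal about which directions have paid off, which is the feedback a single unguided call lacks.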