AlphaEvolve: Why One LLM Call Isn't Enough

Ask an LLM to write a sorting function. You'll get correct, textbook code: maybe a clean merge sort or a well-structured quicksort. It compiles, passes tests, handles edge cases. But "correct" is not "optimal". For a specific input distribution (say, nearly-sorted arrays with a few outliers), a hand-tuned heuristic can be several times faster. The LLM doesn't know your distribution. It optimizes for plausibility given its training data, not for your metric.

This is the gap between generation and optimization. Google DeepMind's AlphaEvolve (announced May 2025) closes it — and the way it works changes how you should think about applying AI to hard engineering problems.

The Single-Call Ceiling#

A single LLM call is next-token prediction. The model picks the most likely continuation given the prompt and its training distribution. This is powerful — it gives you working code, reasonable designs, coherent explanations. But it has a fundamental limitation: it cannot optimize for a custom metric that wasn't in the training objective.

Want to minimize p99 latency for your specific workload? Reduce memory allocations in a hot loop? Find a matrix multiplication algorithm that uses fewer scalar operations than anything known? Next-token prediction cannot get you there in one shot. It produces plausible code, not optimal code.

[Diagram: Single LLM Call vs. Evolutionary Optimization]

Think of it this way: a single LLM call is asking a smart friend for advice once. AlphaEvolve is running a tournament where thousands of candidates compete, mutate, and the best survive across generations — with a scoreboard that measures exactly what you care about.

What Is AlphaEvolve?#

AlphaEvolve is an evolutionary coding agent built by Google DeepMind. It evolves entire programs (classes, methods, configurations) in any programming language by wrapping LLMs in an automated optimization loop. DeepMind described it in a white paper, and it is the successor to FunSearch (2023), which could only evolve single functions.

The key insight: use LLMs as mutation operators within an evolutionary algorithm, not as one-shot generators. The LLM proposes changes; an automated evaluator scores them; the best survive to become parents of the next generation.

Architecture: Five Components in a Loop#

[Diagram: AlphaEvolve Architecture]

1. Prompt Sampler — Assembles rich context for each generation attempt: problem description, parent programs, their scores, relevant code context (used classes, methods, imported packages), and meta-prompts that the system itself co-evolves over time.

2. LLM Ensemble — Uses multiple models with different strengths. Google's implementation pairs Gemini Flash (fast, cheap — maximizes breadth of exploration) with Gemini Pro (slower, stronger — provides depth for breakthroughs). The key idea: a fast model explores the search space broadly while a stronger model makes targeted, insightful improvements.

3. Evaluator Pool — Executes candidates and scores them on user-defined metrics. Supports evaluation cascades: cheap checks first (syntax, type-check, fast tests), expensive checks last (full benchmark suite). Can also use LLM-as-judge for properties that are hard to measure programmatically.

4. Programs Database — Maintains the population using MAP-Elites (a quality-diversity algorithm) combined with an island model. MAP-Elites partitions the solution space into behavioral niches and keeps the best candidate per niche, ensuring the population stays diverse rather than collapsing to a single local optimum. The island model runs isolated sub-populations that periodically exchange top candidates.

5. Controller — An async distributed orchestrator that maximizes throughput across the entire pipeline.
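
To make item 4 more concrete, here is a minimal sketch of a MAP-Elites-style archive with an island layer on top. It is an illustration only, not AlphaEvolve's actual database: the `Candidate` type, the tuple used as a niche descriptor, and the ring-shaped migration rule are all assumptions chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Candidate:
    code: str
    score: float   # higher is better
    niche: tuple   # hypothetical behavioral descriptor, e.g. (size bucket, strategy tag)


@dataclass
class Island:
    """One isolated sub-population that keeps only the best candidate per niche."""
    elites: dict = field(default_factory=dict)

    def add(self, cand: Candidate) -> bool:
        current = self.elites.get(cand.niche)
        if current is None or cand.score > current.score:
            self.elites[cand.niche] = cand
            return True
        return False

    def best(self) -> Candidate:
        # Assumes the island is non-empty.
        return max(self.elites.values(), key=lambda c: c.score)


def migrate(islands: list) -> None:
    """Copy each island's best candidate to the next island (ring topology)."""
    bests = [island.best() for island in islands]  # snapshot before mutating
    for i, cand in enumerate(bests):
        islands[(i + 1) % len(islands)].add(cand)


islands = [Island(), Island()]
islands[0].add(Candidate("v1", 1.0, ("small", "loop")))
islands[0].add(Candidate("v2", 3.0, ("small", "loop")))    # replaces v1 in its niche
islands[0].add(Candidate("v3", 2.0, ("large", "vector")))  # different niche, kept too
islands[1].add(Candidate("w1", 0.5, ("small", "loop")))
migrate(islands)
```

Because a weaker candidate in a different niche is never evicted by a stronger one elsewhere, the archive keeps alternative approaches alive that a plain "keep the top N" list would discard.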

The Evolutionary Loop#

1. Sample parent from population (biased toward quality, with diversity)
2. Build prompt: parent code + context + scores + mutation hints
3. LLM generates a code diff (or a full rewrite)
4. Apply the changes to parent → child program
5. Evaluate child on automated metric
6. Add to population if it passes quality threshold
7. Repeat from step 1 until the iteration or compute budget runs out

Each iteration is cheap: a single LLM call plus an evaluation run. The power comes from running hundreds or thousands of these iterations, accumulating improvements that no single call could produce.
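
The seven steps can be sketched in a few lines. In this toy version the "program" is just a list of numbers and `mutate` is a random perturbation standing in for the LLM call. Both are assumptions made so the loop runs without any model or API, and the tournament selection and population cap are illustrative choices rather than AlphaEvolve's actual policy.

```python
import random


def evaluate(program):
    """Toy metric: higher is better, with a maximum of 0 when every value equals 3."""
    return -sum((x - 3.0) ** 2 for x in program)


def mutate(parent, rng):
    """Stand-in for the LLM: perturb one randomly chosen element."""
    child = list(parent)
    i = rng.randrange(len(child))
    child[i] += rng.gauss(0.0, 0.5)
    return child


def evolve(seed, iterations, rng, population_cap=20):
    population = [(evaluate(seed), seed)]
    for _ in range(iterations):
        # Step 1: tournament selection biases sampling toward quality
        # while still giving weaker candidates a chance.
        contenders = rng.sample(population, k=min(3, len(population)))
        parent_score, parent = max(contenders, key=lambda entry: entry[0])
        # Steps 2-4: derive the child from the parent.
        child = mutate(parent, rng)
        # Step 5: score it.
        child_score = evaluate(child)
        # Step 6: admit only children at least as good as their parent.
        if child_score >= parent_score:
            population.append((child_score, child))
            population.sort(key=lambda entry: entry[0], reverse=True)
            del population[population_cap:]
    return population[0]


best_score, best_program = evolve([0.0, 0.0, 0.0], iterations=500, rng=random.Random(0))
```

Swapping `mutate` for a model call and `evaluate` for a real benchmark turns the same skeleton into the loop described above.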

When AlphaEvolve Applies (and When It Doesn't)#

| Good Fit | Bad Fit |
| --- | --- |
| Clear objective to maximize/minimize | No measurable "better" (feature implementation, one-off scripts) |
| Automated, quantifiable evaluation metric | Requires physical experiments or subjective human judgment |
| Runtime, accuracy, resource usage, proof verification | No clear scoring function exists |
| Search space is large enough to benefit from iteration | The correct answer is known and straightforward |

Real-World Results#

AlphaEvolve isn't just a research prototype: solutions it discovered are running in production at Google.

Google Borg Scheduler: A 7-line heuristic discovered by AlphaEvolve continuously saves 0.7% of Google's global compute resources. It was chosen over a deep RL solution because the evolved heuristic is interpretable — engineers can read and reason about 7 lines of code. It has been running in production for over a year.

Gemini Training: 23% speedup on a critical training kernel, yielding a 1% reduction in overall Gemini training time. At Google's scale, 1% of training compute is enormous.

FlashAttention: Up to 32.5% speedup on XLA-generated FlashAttention code.

Matrix Multiplication: Found an algorithm for 4×4 complex-valued matrix multiplication using 48 scalar multiplications, down from the 49 needed when Strassen's 1969 algorithm is applied recursively. It is the first improvement in this setting in 56 years.

Applying AlphaEvolve Thinking to Your Code#

You don't need Google-scale infrastructure to apply this pattern. The core idea — generate variants, evaluate on a real metric, select, repeat — works at any scale.

Pattern 1: CPU/Memory Optimization#

Define a benchmark (your evaluation function), generate code variants, measure each one, select the best. Instead of asking an LLM to "optimize this function" once, run it in a loop where each iteration sees what scored well and what didn't.
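
As a minimal sketch of the measure-and-select step, the snippet below times a few variants of one function with `timeit` and keeps the fastest one that still matches a reference. The variants are hand-written here purely for illustration. In a real loop they would come from the model, and the repeat counts are arbitrary assumptions.

```python
import timeit


def reference(values):
    return sum(v * v for v in values)


def variant_loop(values):
    total = 0
    for v in values:
        total += v * v
    return total


def variant_wrong(values):
    return sum(values)  # fast but incorrect: must be filtered out


def select_fastest(variants, reference_fn, data, repeats=3, number=10):
    """Time every variant that matches the reference output; return (winner, timings)."""
    expected = reference_fn(data)
    timings = {}
    for fn in variants:
        if fn(data) != expected:  # correctness gate before any timing
            continue
        # Taking the minimum over repeats reduces noise from other processes.
        timings[fn.__name__] = min(
            timeit.repeat(lambda: fn(data), repeat=repeats, number=number)
        )
    winner = min(timings, key=timings.get)
    return winner, timings


data = list(range(10_000))
winner, timings = select_fastest([reference, variant_loop, variant_wrong], reference, data)
print(winner, timings)
```

The timings feed back into the next prompt, so the model sees which variants were faster and which were rejected.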

Pattern 2: Algorithm Selection#

Instead of picking the textbook algorithm, evolve a hybrid tuned to your actual data distribution. A merge sort is O(n log n) in theory, but for your specific data (mostly sorted with occasional outliers), a modified insertion sort with a fallback might be 5x faster in practice.
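
One possible shape for such a hybrid is sketched below: an insertion sort that gives up and falls back to the built-in sort once it has done more work than a budget allows. The `max_shifts_per_item` threshold is a hypothetical tuning knob, exactly the kind of value an evolutionary loop could adjust. Whether the hybrid is actually faster on your data is something only a benchmark can tell you.

```python
import random


def hybrid_sort(values, max_shifts_per_item=16):
    """Insertion sort that falls back to sorted() when the input is not nearly sorted.

    Returns (sorted_list, used_fallback).
    """
    out = list(values)
    budget = max_shifts_per_item * len(out)
    shifts = 0
    for i in range(1, len(out)):
        item = out[i]
        j = i - 1
        while j >= 0 and out[j] > item:
            out[j + 1] = out[j]
            j -= 1
            shifts += 1
            if shifts > budget:
                # Too much disorder: abandon insertion sort and start over.
                return sorted(values), True
        out[j + 1] = item
    return out, False


nearly_sorted = list(range(1000))
nearly_sorted[10], nearly_sorted[20] = nearly_sorted[20], nearly_sorted[10]
nearly_sorted[500], nearly_sorted[510] = nearly_sorted[510], nearly_sorted[500]

shuffled = list(range(1000))
random.Random(0).shuffle(shuffled)

print(hybrid_sort(nearly_sorted)[1], hybrid_sort(shuffled)[1])
```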

Pattern 3: Configuration Tuning#

Treat config parameters as the "genome." Evolve database pool sizes, cache TTLs, batch sizes, and thread counts that maximize throughput under realistic load — measured by actual benchmarks, not guesswork.
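
A small sketch of the config-as-genome idea follows. It uses a simple hill climb (one parent, one child per generation) rather than a full population, and `simulated_throughput` is a made-up placeholder for a real load test. The parameter names and allowed values are likewise assumptions.

```python
import random

SEARCH_SPACE = {
    "pool_size": [4, 8, 16, 32, 64],
    "batch_size": [10, 50, 100, 500],
    "cache_ttl_s": [1, 10, 60, 300],
}


def simulated_throughput(config):
    """Placeholder for a real load test; peaks at pool_size=16, batch_size=100, cache_ttl_s=60."""
    return (
        -abs(config["pool_size"] - 16)
        - abs(config["batch_size"] - 100) / 10
        - abs(config["cache_ttl_s"] - 60) / 10
    )


def mutate(config, rng):
    """Change one parameter to a different allowed value."""
    child = dict(config)
    key = rng.choice(sorted(SEARCH_SPACE))
    child[key] = rng.choice([v for v in SEARCH_SPACE[key] if v != config[key]])
    return child


def tune(seed_config, generations, rng):
    best, best_score = seed_config, simulated_throughput(seed_config)
    for _ in range(generations):
        child = mutate(best, rng)
        score = simulated_throughput(child)
        if score > best_score:  # keep the child only if it measurably improves
            best, best_score = child, score
    return best, best_score


seed = {"pool_size": 4, "batch_size": 10, "cache_ttl_s": 1}
best, best_score = tune(seed, generations=200, rng=random.Random(1))
print(best, best_score)
```

With a real benchmark in place of the placeholder, measurements are noisy, so it is worth averaging several runs before accepting a child.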

Pattern 4: Code Quality Evolution#

Use static analysis scores as the evaluator. Evolve code that minimizes linting violations, reduces cyclomatic complexity, or improves test coverage — while maintaining correctness (all tests must still pass).
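
As an illustration of a static-analysis evaluator, the sketch below approximates cyclomatic complexity with Python's `ast` module and gates the score on a test result. The node list is a rough approximation chosen for brevity, not a replacement for a dedicated complexity tool, and `tests_pass` is assumed to come from your test runner.

```python
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.BoolOp)


def approx_complexity(source):
    """Rough cyclomatic complexity: 1 + number of branching constructs."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))


def quality_score(source, tests_pass):
    """Correctness gate first, then prefer lower complexity (higher score is better)."""
    if not tests_pass:
        return float("-inf")
    return -approx_complexity(source)


nested = """
def sign(x):
    if x > 0:
        return 1
    else:
        if x < 0:
            return -1
        else:
            return 0
"""

flat = """
def sign(x):
    return (x > 0) - (x < 0)
"""

print(approx_complexity(nested), approx_complexity(flat))
```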

Building an Evolutionary Optimization Loop with Claude Code#

You can implement an approximation of this pattern as a Claude Code skill. What follows is a pattern overview — the architecture and key steps, not a step-by-step implementation guide. It won't have the full MAP-Elites population model or island-based diversity, but the core principles — population-based search, automated evaluation, iterative refinement — still apply.

Step 1: Define the Evaluator#

Write a standalone script that takes a file path and outputs a numeric score:

#!/bin/bash
# evaluate.sh — scores a candidate implementation
# Run benchmark, report p99 latency in ms (lower is better)
python bench.py --target "$1" --metric p99 2>/dev/null

Your evaluator could measure anything: wall-clock time, memory allocations, static analysis violations, or even call an LLM-as-judge API for subjective qualities.
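
The loop needs to call the evaluator and turn its output into a number. One way to do that is sketched below. The convention that the last line of stdout holds the score, the timeout, and the `run_evaluator` name are assumptions, and the stand-in command exists only so the example runs without `evaluate.sh`.

```python
import subprocess
import sys


def run_evaluator(command, candidate_path, timeout_s=60):
    """Run an evaluator command on one candidate and return its score, or None on failure."""
    try:
        result = subprocess.run(
            [*command, candidate_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    if not lines:
        return None
    try:
        # Convention assumed here: the last line of stdout is the numeric score.
        return float(lines[-1])
    except ValueError:
        return None


# Stand-in evaluator so the sketch runs anywhere; replace with ["./evaluate.sh"].
fake_evaluator = [sys.executable, "-c", "print(12.5)"]
print(run_evaluator(fake_evaluator, "src/query_engine.py"))
```

Returning `None` for crashes, timeouts, and unparsable output lets the loop reject a broken candidate instead of aborting the whole run.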

Step 2: Define the Evolution Target#

Specify which files/functions to evolve and provide the initial implementation as the seed population:

# evolve_config.py
target_files = ["src/query_engine.py"]
target_functions = ["execute_batch_query"]
seed_implementation = "src/query_engine.py"  # initial candidate

Step 3: Assemble Rich Context#

The prompt to Claude should include:

  • The parent code being evolved
  • Surrounding context: imports, callers, type definitions
  • Change summaries from top candidates: "Top candidate switched from HashMap to TreeMap for ordered iteration"
  • Evaluation history: "Attempt 3 scored 85 — reduced allocations but introduced a cache miss. Attempt 5 scored 92 — used a pre-allocated buffer."
  • Mutation directives: "Try a different data structure," "Reduce branching," "Batch I/O operations"
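
A prompt with those ingredients can be assembled mechanically. The sketch below shows one possible layout. The `Attempt` record, the section titles, and the `top_k` cutoff are all assumptions rather than a prescribed format.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    number: int
    score: float
    summary: str


def build_prompt(parent_code, context, history, directives, top_k=3):
    """Assemble the mutation prompt from parent code, context, history, and hints."""
    best = sorted(history, key=lambda a: a.score, reverse=True)[:top_k]
    history_lines = [f"- Attempt {a.number} scored {a.score:g}: {a.summary}" for a in best]
    directive_lines = [f"- {d}" for d in directives]
    sections = [
        ("Parent code", parent_code.strip()),
        ("Surrounding context", context.strip()),
        ("Evaluation history (best first)", "\n".join(history_lines) or "- none yet"),
        ("Mutation directives", "\n".join(directive_lines) or "- none"),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


prompt = build_prompt(
    parent_code="def execute_batch_query(queries): ...",
    context="from db import connection",
    history=[
        Attempt(3, 85, "reduced allocations but introduced a cache miss"),
        Attempt(5, 92, "used a pre-allocated buffer"),
    ],
    directives=["Try a different data structure", "Batch I/O operations"],
)
print(prompt)
```

Listing the best attempts first keeps the most useful evidence near the top if the prompt has to be truncated.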

Step 4: The Optimization Loop (Skill Logic)#

population = [initial_implementation]  # maintain multiple candidates, not just one

for iteration in 1..N:
    1. Sample parent from population (not always the best — occasionally pick diverse/weaker
       candidates to maintain exploration)
    2. Build prompt with: parent code, surrounding context (imports, callers, types),
       change summaries from top/diverse programs, evaluation history (what scored well/poorly and why)
    3. Ask Claude to generate K variants (mutations) with diverse strategies
    4. For each variant:
       a. Apply the variant to the codebase
       b. Run evaluator → get score
       c. Add to population if score passes a quality threshold (not just if best)
       d. Revert the variant (keep codebase clean between evaluations)
    5. Prune population: keep top-P candidates + some structurally diverse ones
       (even if not top-scoring) to avoid converging on local optima
    6. Log: iteration, scores, population diversity, best so far, strategy that worked

Key difference from a single "optimize this function" prompt: The skill closes the feedback loop — real measurements drive the next generation, and the prompt evolves with accumulated knowledge of what works and what doesn't.
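
For readers who prefer code to pseudocode, here is a compact Python sketch of the same loop. Candidates are plain strings held in memory, so the apply and revert steps are omitted, and `propose` is a stand-in for the call to Claude. The `Candidate` type, the strategy tag used as a diversity signal, and the default parameters are assumptions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    code: str
    score: float    # higher is better
    strategy: str   # short tag describing the approach, used as a diversity signal


def sample_parent(population, rng, explore_prob=0.2):
    """Usually pick the best candidate; sometimes pick any candidate to keep exploring."""
    if rng.random() < explore_prob:
        return rng.choice(population)
    return max(population, key=lambda c: c.score)


def prune(population, top_p=3, diverse_extra=2):
    """Keep the top-P by score plus the best candidates of strategies not yet represented."""
    ranked = sorted(population, key=lambda c: c.score, reverse=True)
    kept = ranked[:top_p]
    seen = {c.strategy for c in kept}
    for cand in ranked[top_p:]:
        if len(kept) >= top_p + diverse_extra:
            break
        if cand.strategy not in seen:
            kept.append(cand)
            seen.add(cand.strategy)
    return kept


def optimize(seed, propose, evaluate, iterations, rng, threshold=float("-inf")):
    """propose(parent, population) -> list of (code, strategy); evaluate(code) -> score."""
    population = [Candidate(seed, evaluate(seed), "seed")]
    for _ in range(iterations):
        parent = sample_parent(population, rng)
        for code, strategy in propose(parent, population):
            score = evaluate(code)
            if score > threshold:
                population.append(Candidate(code, score, strategy))
        population = prune(population)
    return max(population, key=lambda c: c.score)


def propose(parent, population):
    # Stand-in for the model call: one variant grows the code, one shrinks it.
    return [(parent.code + "x", "grow"), (parent.code[:-1], "shrink")]


def evaluate(code):
    return -abs(len(code) - 10)  # toy metric: best possible score is 0 at length 10


best = optimize("abc", propose, evaluate, iterations=30, rng=random.Random(0))
print(best)
```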

Designing Proper Evaluators#

The evaluator is the hardest and most important part. It determines the ceiling of what optimization can achieve.

Why Unit Tests Aren't Sufficient#

Unit tests check correctness (binary pass/fail), not quality (continuous metric). Passing tests doesn't guarantee behavioral equivalence for edge cases not covered. Tests don't measure non-functional properties: latency, memory usage, readability, maintainability. You need both: tests as a correctness gate, plus a scoring function for the property you're actually optimizing.

Evaluation Strategies#

| Strategy | What It Measures | When to Use |
| --- | --- | --- |
| Micro-benchmarks | Wall-clock time, allocations | Performance optimization |
| Static analysis scores | Complexity, coupling, code smells | Code quality evolution |
| LLM-as-judge | Subjective qualities (readability, inefficiency patterns) | When no programmatic metric exists |
| Evaluation cascade | Multiple metrics at increasing cost | Large search spaces |

Designing a Robust Evaluator: The Checklist#

[Diagram: Evaluation Cascade]
  1. Correctness gate: Must pass all existing tests. Binary filter before scoring — if tests fail, the candidate is rejected (not stored in the population database) regardless of other metrics.
  2. Primary metric: The property you're optimizing (e.g., p99 latency, CPU usage, throughput).
  3. Guard metrics: Properties that must NOT regress (e.g., memory usage must stay within 2x baseline).
  4. Diversity signal: Reward structurally different solutions to avoid collapsing to a local optimum.
  5. Cascade ordering: Cheap checks first (syntax, type-check, compile), expensive checks last (full benchmark suite). Prune early to save compute.
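
The cascade ordering in item 5 can be sketched as a list of stages that short-circuits on the first failure. The stage names, the `defines_solve` check, and the placeholder benchmark stage are hypothetical. Note that `exec` is only acceptable for trusted code, so real candidates should run in a sandbox.

```python
def syntax_ok(source):
    """Cheapest stage: does the candidate even parse?"""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def defines_solve(source):
    """Hypothetical fast test: the candidate must define a function named solve."""
    namespace = {}
    exec(source, namespace)  # only safe for trusted code; sandbox real candidates
    return callable(namespace.get("solve"))


def run_cascade(source, stages):
    """Run (name, check) stages in order and stop at the first failure.

    Returns (passed, name_of_failed_stage_or_None, stages_executed).
    """
    executed = []
    for name, check in stages:
        executed.append(name)
        if not check(source):
            return False, name, executed
    return True, None, executed


stages = [
    ("syntax", syntax_ok),
    ("fast_test", defines_solve),
    ("benchmark", lambda source: True),  # placeholder for the expensive suite
]

print(run_cascade("def solve(:", stages))
print(run_cascade("def solve(): return 1", stages))
```

The list of executed stages makes it easy to log how much compute each rejected candidate actually consumed.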

Example: Performance Evaluator#

def evaluate(candidate_path):
    # Gate: must produce identical results to reference
    if not differential_test(candidate_path, reference_impl, test_inputs):
        return float('-inf')

    # Primary: query latency on realistic workload
    latency_p99 = benchmark(candidate_path, production_trace)

    # Guard: memory must not exceed 2x baseline
    memory_peak = measure_memory(candidate_path, production_trace)
    if memory_peak > 2 * baseline_memory:
        return float('-inf')

    # Score: lower latency is better (negate for maximization)
    return -latency_p99
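
The evaluator above relies on helpers it does not define. As a rough sketch of what two of them might look like, the versions below operate on Python callables rather than file paths, which is a simplifying assumption. The names `differential_test` and `p99_latency_ms`, and the `rounds` parameter, are illustrative.

```python
import statistics
import time


def differential_test(candidate, reference, inputs):
    """True if candidate and reference return equal results on every input."""
    for args in inputs:
        try:
            if candidate(*args) != reference(*args):
                return False
        except Exception:
            return False  # a crash in either implementation fails the gate
    return True


def p99_latency_ms(fn, inputs, rounds=200):
    """Time fn over the inputs repeatedly and return the 99th-percentile latency in ms."""
    samples = []
    for _ in range(rounds):
        for args in inputs:
            start = time.perf_counter()
            fn(*args)
            samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    return statistics.quantiles(samples, n=100)[-1]


inputs = [([3, 1, 2],), ([],), ([5, 5, 1],)]
good = lambda xs: sorted(xs)
bad = lambda xs: list(xs)  # returns the input unsorted

print(differential_test(good, sorted, inputs), differential_test(bad, sorted, inputs))
print(p99_latency_ms(good, inputs))
```

Tail latency is sensitive to noise, so a real benchmark would typically pin the workload to a quiet machine and use many more samples.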

Example: LLM-as-Judge Evaluator#

When no programmatic metric exists (e.g., code readability, design quality), you can use an LLM as the scoring function. The LLM analyzes the candidate code and returns a structured assessment:

SYSTEM_PROMPT = """You are an expert code review agent specializing in
performance analysis and code quality assessment. Analyze the code for:
- Memory management (excessive allocations, leaks, collection misuse)
- Algorithmic efficiency (redundant computations, poor data structures)
- I/O and resource management (blocking I/O, unclosed resources)
- Concurrency issues (thread safety, lock contention)

Score on a 1-10 scale. Return JSON:
{
  "performance_summary": "...",
  "quality_score": 7,
  "justification": "..."
}"""

import json
from pathlib import Path

def evaluate_with_llm(candidate_path):
    code = Path(candidate_path).read_text()

    # call_llm stands for whatever LLM client you use; it should return the raw JSON reply
    response = call_llm(system=SYSTEM_PROMPT, user=code)
    result = json.loads(response)

    # Clamp to the 1-10 scale, then normalize to the 0-1 range
    score = min(max(float(result["quality_score"]), 1.0), 10.0)
    return (score - 1.0) / 9.0

This approach works well for properties like code readability, adherence to design patterns, or domain-specific idioms — anything a human reviewer would assess qualitatively.
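
Judge output is noisy and not always valid JSON, so in practice it helps to parse defensively and aggregate several calls. The sketch below shows one way to do that. The function names, the choice of the median, and the rule of discarding out-of-range scores are assumptions, not part of any particular API.

```python
import json
import statistics


def parse_quality_score(response_text):
    """Extract quality_score from a judge reply; return None if it is missing or malformed."""
    start, end = response_text.find("{"), response_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(response_text[start:end + 1])
        score = float(payload["quality_score"])
    except (ValueError, KeyError, TypeError):
        return None
    if not 1.0 <= score <= 10.0:
        return None  # outside the scale the prompt asked for
    return score


def aggregate_judgments(responses):
    """Median of the valid scores across several judge calls, normalized to 0-1."""
    scores = [s for s in map(parse_quality_score, responses) if s is not None]
    if not scores:
        return None
    return (statistics.median(scores) - 1.0) / 9.0


responses = [
    'Here is my review: {"quality_score": 7, "justification": "ok"}',
    '{"quality_score": 9}',
    "not json",
    '{"quality_score": 42}',
]
print(aggregate_judgments(responses))
```

The median is less sensitive than the mean to a single outlying judgment, which matters when the score drives selection.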

Key Takeaways#

LLMs are generators, not optimizers. A single call produces plausible code. The magic is in the evaluate-select-iterate loop that turns generation into optimization.

The evaluator is the ceiling. The quality of your scoring function determines the maximum quality your optimization can achieve. Invest here first.

Close the feedback loop. The difference between "ask an LLM to optimize" and "run an optimization loop" is that the loop accumulates knowledge. Each generation's prompt includes what worked, what didn't, and why — information a single call can never have.

AlphaEvolve's recursive self-improvement hints at the future. The system co-evolves its own meta-prompts and mutation strategies — the prompts that drive optimization are themselves being optimized. This is where AI-assisted development is heading: not just generating code, but generating the process that generates better code.