Building an AlphaEvolve Skill for Claude Code

June 18, 2026

You ask Claude Code to "make this function faster." It produces a reasonable rewrite. You benchmark it — maybe it's faster, maybe it's not. If not, you paste the benchmark results back and ask again. You repeat this loop manually, five or six times, until you give up or get lucky.

This is the gap between generation and optimization. In our previous post on AlphaEvolve, we explored how Google DeepMind closes this gap with evolutionary computation, using LLMs as mutation operators within an automated evaluation loop. The insight: a single LLM call produces plausible code, but it takes iterative, metric-driven selection to produce code that is measurably better.

This post introduces alphaevolve-skill — an open-source Claude Code skill that brings this evolutionary optimization pattern to your everyday workflow. We'll cover why we built it, how to use it, and the high-level design behind it. For a full implementation walkthrough, check out the step-by-step tutorial.

Why Build This Skill?

The Manual Optimization Loop Is Broken

Every developer who has tried to optimize code with an LLM hits the same wall:

No feedback loop — the LLM doesn't know if its suggestion actually helped. You benchmark manually, then rephrase your prompt with new context. The LLM starts fresh each time.
Single-shot exploration — you get one idea per prompt. Maybe it's a good direction, maybe not. There's no breadth of search.
No accumulated knowledge — attempt 3's failure doesn't inform attempt 7. You carry the context in your head, losing track of what was tried.
No population diversity — you follow a single line of reasoning. If it leads to a local optimum, you're stuck.

What Evolutionary Optimization Gives You

The AlphaEvolve skill turns Claude Code from a one-shot generator into an iterative optimizer. Instead of you manually closing the feedback loop, the skill does it automatically:

Real measurements drive selection — not LLM intuition, but actual scores from your benchmarks or a structured evaluation rubric
Knowledge accumulates — each generation's prompt includes what worked, what didn't, and why
Multiple candidates compete — a population of diverse solutions evolves in parallel across isolated "islands"
The process improves itself — successful strategies inform future mutations

The result? You invoke one command and get back measurably better code — with a full audit trail of every attempt, score, and strategy.

When to Reach for This Skill

Good Fit	Not Ideal
Functions with clear performance characteristics (loops, algorithms, data transforms)	Trivial getters/setters with nothing to optimize
Code with room to improve — unoptimized first drafts, known bottlenecks	Already near-optimal code that's been hand-tuned
Self-contained logic where you can measure success (latency, memory, quality score)	Pure I/O wrappers with no computational logic
Problems where multiple valid strategies exist	Cases where the correct answer is known and straightforward

How to Use It

Installation

One command, depending on your coding agent:

# Claude Code
npx skills add https://github.com/yangwenz/alphaevolve-skill --skill alphaevolve-skill -a claude-code

# Codex
npx skills add https://github.com/yangwenz/alphaevolve-skill --skill alphaevolve-skill -a codex

No external dependencies — the database and CLI are plain ESM JavaScript that runs on Node.js 18+.

Invoking the Skill

Trigger it with natural language. The skill activates when you say "evolve", "optimize", or "run alphaevolve":

# Basic — optimize a function for efficiency
evolve the function processItems in src/pipeline.py

# With a specific goal
optimize the render method in components/Chart.tsx — reduce memory allocations

# With a correctness gate
evolve parseJSON in src/parser.ts with test command "npm test"

# With an empirical benchmark
optimize batchQuery in src/db.py — eval with "python bench/query_bench.py", test with "pytest tests/"

The skill infers parameters from your request and asks clarifying questions if anything is missing.

What Happens During a Run

The skill validates inputs, establishes a baseline score, then runs the evolutionary loop:

Target: processItems in src/pipeline.py (lines 45-89)
Goal: optimize code efficiency
Iterations: 10
Baseline score: {"efficiency-score": 0.5}

Iteration 1/10 | efficiency-score: 0.6 | best: 0.6
  Replaced repeated list.index() calls with a pre-built dictionary lookup

Iteration 2/10 | efficiency-score: 0.5 | best: 0.6
  Attempted generator-based streaming but increased complexity without gain

Iteration 3/10 | efficiency-score: 0.7 | best: 0.7
  Combined two sequential passes into a single loop with accumulator

Iteration 4/10 | DISCARDED (tests failed)
  Changed return type which broke downstream callers

Iteration 5/10 | efficiency-score: 0.8 | best: 0.8
  Used collections.Counter instead of manual counting loop, added early exit

Not every mutation improves. Iteration 2 explored a dead end. Iteration 4 was discarded by the correctness gate. This is expected — the power comes from selection across many attempts, not from any single generation being perfect.

After all iterations, the skill presents the best implementation and asks whether to apply it to your source file.
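The loop behind a log like the one above can be sketched in a few lines. This is a simplified, single-population illustration and not the skill's actual implementation: Candidate, evolve, and the mutate, passes_tests, and score callables are hypothetical stand-ins for the mutation subagent, the correctness gate, and the evaluator, and the sketch always mutates the current best instead of sampling from islands.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    code: str
    score: float
    summary: str


def evolve(
    baseline: Candidate,
    mutate: Callable[[Candidate, list], tuple],
    passes_tests: Callable[[str], bool],
    score: Callable[[str], float],
    iterations: int = 10,
):
    """Run a simplified evolutionary loop and return (best, history)."""
    best = baseline
    history = [baseline]
    for i in range(1, iterations + 1):
        # The mutation step sees the parent plus everything scored so far.
        code, summary = mutate(best, history)

        # Correctness gate: a failing candidate is dropped before scoring.
        if not passes_tests(code):
            print(f"Iteration {i}/{iterations} | DISCARDED (tests failed)")
            continue

        candidate = Candidate(code, score(code), summary)
        history.append(candidate)

        # Selection: only a strictly better score replaces the current best.
        if candidate.score > best.score:
            best = candidate
        print(f"Iteration {i}/{iterations} | score: {candidate.score} | best: {best.score}")
    return best, history
```

Failed candidates never enter the history, which is what keeps a broken rewrite such as iteration 4 from becoming a parent for later mutations.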

Tips for Better Results

Be specific about your goal — "reduce memory allocations, the function creates too many intermediate lists" beats "make this faster"
Always provide tests when available — they prevent incorrect mutations from surviving
Use more iterations for hard problems — 5 for quick checks, 10-20 for standard optimization, 30+ for deep search
Inspect failed iterations — sometimes a good idea was executed poorly and is worth salvaging manually

Example: Optimizing a Shortest-Path Algorithm

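Here's a real example. We started with a naive shortest_paths implementation that uses a linear scan to find the minimum-distance unvisited node each iteration. With a constant-time visited check that is the classic O(V^2) Dijkstra; here visited is a list, so each membership test against it is itself O(V) and the loop is closer to O(V^3) in the worst case. The command: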

use alphaevolve-skill to optimize function shortest_paths in solver.py, 2 iterations, run unit tests via `python3 test_solver.py`

The original code:

def shortest_paths(graph, source):
    import math

    nodes = list(graph.keys())
    dist = {node: math.inf for node in nodes}
    dist[source] = 0
    visited = []

    while len(visited) < len(nodes):
        # Linear scan for minimum: O(V) nodes, and each "not in visited" check on a list is also O(V)
        min_dist = math.inf
        min_node = None
        for node in nodes:
            if node not in visited and dist[node] < min_dist:
                min_dist = dist[node]
                min_node = node

        if min_node is None:
            break

        visited.append(min_node)

        for neighbor, weight in graph.get(min_node, []):
            new_dist = dist[min_node] + weight
            if new_dist < dist[neighbor]:
                dist[neighbor] = new_dist

    return dist

After running the skill, the evolutionary loop discovered the textbook optimization, replacing the linear scan with a min-heap and using lazy deletion of stale heap entries instead of a visited list:

import heapq
import math

def shortest_paths(graph, source):
    inf = math.inf
    dist = {node: inf for node in graph}
    dist[source] = 0

    heap = [(0, source)]
    heappush = heapq.heappush
    heappop = heapq.heappop
    graph_get = graph.get

    while heap:
        d, u = heappop(heap)

        if d > dist[u]:
            continue

        for v, w in graph_get(u, ()):
            new_dist = d + w
            if new_dist < dist[v]:
                dist[v] = new_dist
                heappush(heap, (new_dist, v))

    return dist

The benchmark results speak for themselves:

Graph Size	Original	Optimized	Speedup
100 nodes, 500 edges	2.3 ms	0.1 ms	23x
500 nodes, 5K edges	95.8 ms	0.5 ms	192x
1K nodes, 10K edges	738.8 ms	1.0 ms	739x
2K nodes, 20K edges	6.18 s	2.3 ms	2,687x
3K nodes, 50K edges	21.03 s	4.2 ms	5,007x

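The skill went from a scan-based loop that is O(V^2) at best (and closer to O(V^3) as written, since visited is a list) to O((V + E) log V), a fundamental algorithmic improvement that no amount of micro-optimization on the original structure could achieve. The timings are consistent with that: going from 1K to 2K nodes made the original about 8x slower. This is the kind of leap that population-based search enables: the mutation subagent explored data structure swaps (heap), and the evaluator confirmed the improvement empirically.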
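The post does not show the benchmark script or graph generator behind these numbers, so the following is only a sketch of how such a comparison might be reproduced. The names random_graph, dijkstra_heap, bellman_ford, and time_call are illustrative; dijkstra_heap is a compact stand-in for the optimized shortest_paths, and Bellman-Ford serves as an independent reference so that a fast but wrong candidate is caught before its timing is trusted.

```python
import heapq
import math
import random
import time


def random_graph(num_nodes, num_edges, seed=0):
    """Random directed graph as {node: [(neighbor, weight), ...]}."""
    rng = random.Random(seed)
    graph = {node: [] for node in range(num_nodes)}
    for _ in range(num_edges):
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        graph[u].append((v, rng.randint(1, 100)))
    return graph


def dijkstra_heap(graph, source):
    """Heap-based Dijkstra with lazy deletion of stale heap entries."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale entry: a shorter path to u was already found
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def bellman_ford(graph, source):
    """Slow reference implementation, used only to check correctness."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0
    for _ in range(len(graph) - 1):
        changed = False
        for u, edges in graph.items():
            for v, w in edges:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    changed = True
        if not changed:
            break  # distances have converged
    return dist


def time_call(fn, *args, repeats=3):
    """Best wall-clock time in seconds over a few repeats."""
    best = math.inf
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


graph = random_graph(num_nodes=200, num_edges=1000)
assert dijkstra_heap(graph, 0) == bellman_ford(graph, 0)
print(f"heap Dijkstra: {time_call(dijkstra_heap, graph, 0) * 1000:.2f} ms")
```

Absolute timings will differ by machine and by graph shape, so a reproduction should compare ratios between implementations, not the millisecond values in the table.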

High-Level Design

The skill splits work across four components (an orchestrator, a deterministic database script, and two kinds of subagents), each playing to its strengths:

[Figure: Architecture Overview]

Why This Split?

SKILL.md (orchestrator) handles the coordination logic — what to do next, how to assemble prompts, when to retry, how to report progress. These are judgment-heavy, context-dependent decisions that benefit from LLM reasoning.

scripts/database.mjs (deterministic computation) manages the population of candidates using an island model. Sampling a parent, pruning islands, and handling migration are algorithms that need precision and speed — not LLM interpretation. The CLI wrapper (db-cli.mjs) exposes this as shell commands with JSON output.

Subagents (focused creative work) handle mutation and evaluation in isolation. The mutation subagent gets a rich prompt (parent code + context + history) and edits a candidate file. The evaluator subagent scores code against a standardized rubric. Neither has memory of the other or of the broader loop state — this keeps them focused and objective.

The Island Model

Rather than a single flat population, the database maintains multiple sub-populations ("islands") that evolve semi-independently. This prevents premature convergence — if all candidates competed in one pool, early high-scorers would dominate and exploration would stop.

3 islands, each holding up to 40 candidates
70/30 exploitation/exploration — mostly build on the best, occasionally try random starts
Periodic migration — top performers from each island are copied to others for cross-pollination
Automatic pruning — lowest scorers are removed when islands fill up
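A rough sketch of how these four rules might fit together. The real logic lives in scripts/database.mjs and is not shown in this post, so the IslandPopulation class, its method names, and details such as "explore by picking a random island member" are assumptions made for illustration. Only the island count, the capacity, and the 70/30 split come from the list above.

```python
import random
from dataclasses import dataclass, field, replace


@dataclass
class Candidate:
    code: str
    score: float


@dataclass
class IslandPopulation:
    num_islands: int = 3
    capacity: int = 40
    exploit_ratio: float = 0.7
    seed: int = 0
    islands: list = field(init=False)
    rng: random.Random = field(init=False)

    def __post_init__(self):
        self.islands = [[] for _ in range(self.num_islands)]
        self.rng = random.Random(self.seed)

    def add(self, island_idx, candidate):
        """Insert a candidate, pruning the lowest scorer if the island overflows."""
        island = self.islands[island_idx]
        island.append(candidate)
        if len(island) > self.capacity:
            island.remove(min(island, key=lambda c: c.score))

    def sample_parent(self, island_idx):
        """Exploit the island's best most of the time, otherwise explore.

        Raises ValueError if the island is empty.
        """
        island = self.islands[island_idx]
        if self.rng.random() < self.exploit_ratio:
            return max(island, key=lambda c: c.score)
        return self.rng.choice(island)

    def migrate(self):
        """Copy each island's current best into every other island."""
        bests = [
            (idx, max(island, key=lambda c: c.score))
            for idx, island in enumerate(self.islands)
            if island
        ]
        for src, best in bests:
            for dst in range(self.num_islands):
                if dst != src and best not in self.islands[dst]:
                    self.add(dst, replace(best))
```

Collecting the bests before copying keeps migration order-independent: an island never re-exports a candidate it received in the same migration round.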

Evaluation Pipeline

Each candidate goes through a multi-stage evaluation:

Correctness gate — run the user's test command; if it fails, discard immediately
LLM-as-judge — a separate subagent scores code quality across six dimensions (algorithmic efficiency, idiomatic usage, anti-patterns, memory, I/O, concurrency)
Benchmark score (optional) — if the user provides a benchmark command, its output is combined with the LLM score

The evaluator uses a standardized rubric defined in references/evaluator.md, so every candidate across all iterations is judged by the same criteria.
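How the gate and the two scores are wired together is not spelled out here, so this is one plausible arrangement. The helpers command_passes and evaluate, and the benchmark_weight blend, are assumptions: the post only says the benchmark output is "combined" with the LLM score, and a weighted average is just one way to do that. Both scores are assumed to be normalized to the range 0 to 1.

```python
import shlex
import subprocess
from typing import Callable, Optional, Sequence, Union


def command_passes(command: Union[str, Sequence[str]], timeout: float = 300.0) -> bool:
    """Run a test command and report whether it exited with status 0."""
    argv = shlex.split(command) if isinstance(command, str) else list(command)
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False  # a missing binary or a hang counts as a failed gate
    return result.returncode == 0


def evaluate(
    passes_tests: Callable[[], bool],
    llm_judge: Callable[[], float],
    benchmark: Optional[Callable[[], float]] = None,
    benchmark_weight: float = 0.5,
) -> Optional[float]:
    """Return a combined score, or None if the candidate is discarded."""
    # Stage 1: correctness gate. Nothing else runs for a failing candidate.
    if not passes_tests():
        return None

    # Stage 2: rubric-based score from the judge.
    llm_score = llm_judge()

    # Stage 3 (optional): blend in the empirical benchmark score.
    if benchmark is None:
        return llm_score
    return (1 - benchmark_weight) * llm_score + benchmark_weight * benchmark()
```

A call such as evaluate(lambda: command_passes("pytest tests/"), llm_judge=my_judge) would mirror the first two stages; my_judge is a placeholder for whatever invokes the evaluator subagent.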

Context Assembly

Before the loop starts, the skill extracts the target function's dependencies — imports, called functions, type signatures — into a reusable context file. Every mutation prompt includes this context so the subagent knows what's available without reading the entire codebase each iteration.

The mutation prompt also evolves: each iteration includes evaluation history from top-performing candidates ("Attempt 3 scored 0.8 — used dict lookup instead of linear search"), so the next subagent can learn from what worked.
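As an illustration of the idea, here is a minimal sketch of a prompt builder. The Attempt record, the build_mutation_prompt function, and the section wording are all hypothetical; the skill's real prompt template is not shown in this post.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    number: int
    score: float
    summary: str


def build_mutation_prompt(goal, parent_code, context, history, top_k=3):
    """Assemble one mutation prompt from static context and scored attempts."""
    # Only the highest-scoring attempts are quoted, best first.
    top = sorted(history, key=lambda a: a.score, reverse=True)[:top_k]
    history_lines = [
        f"Attempt {a.number} scored {a.score}: {a.summary}" for a in top
    ] or ["No scored attempts yet."]

    sections = [
        f"Goal: {goal}",
        # Extracted once before the loop and reused in every iteration.
        "Dependencies and signatures:\n" + context,
        "What has worked so far:\n" + "\n".join(history_lines),
        "Current implementation to improve:\n" + parent_code,
    ]
    return "\n\n".join(sections)


history = [
    Attempt(1, 0.6, "pre-built dictionary lookup"),
    Attempt(2, 0.5, "generator-based streaming"),
    Attempt(3, 0.8, "used dict lookup instead of linear search"),
]
print(build_mutation_prompt("reduce memory allocations", "def f(): ...", "import math", history, top_k=2))
```

Capping the history at top_k keeps the prompt size roughly constant as the run gets longer, at the cost of hiding lower-scoring attempts from the subagent.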

Possible Extensions

The skill is intentionally minimal — a working foundation you can build on. Here are directions worth exploring:

Language-specific evaluators. The current rubric is generic. A Python-specific evaluator that knows about comprehensions, generators, __slots__, and GIL implications would produce sharper scores. Same for TypeScript (narrowing, const assertions), Rust (unnecessary clones, borrow checker patterns), or Java (StringBuilder, stream API). Implementation: just add a new .md file to references/ and select it by file extension.

Mutation strategy diversity. Currently every iteration gets the same prompt style: "improve this for efficiency." Rotating through explicit strategies — data structure swap, algorithm change, micro-optimization, idiom upgrade, parallelism — prevents the population from collapsing to a single optimization style. The orchestrator could track which strategies produce the best improvements and favor them in later iterations.
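One way to favor strategies that have paid off is an epsilon-greedy picker. This is a sketch of the extension, not existing skill code: StrategyPicker, its explore_rate parameter, and the choice of average score gain as the signal are all assumptions.

```python
import random
from collections import defaultdict

STRATEGIES = [
    "data structure swap",
    "algorithm change",
    "micro-optimization",
    "idiom upgrade",
    "parallelism",
]


class StrategyPicker:
    def __init__(self, strategies, explore_rate=0.3, seed=0):
        self.strategies = list(strategies)
        self.explore_rate = explore_rate
        self.rng = random.Random(seed)
        self.gains = defaultdict(list)  # strategy -> observed score improvements

    def record(self, strategy, improvement):
        """Store how much the score changed after using a strategy."""
        self.gains[strategy].append(improvement)

    def pick(self):
        """Try every strategy once, then mostly reuse the best average gain."""
        untried = [s for s in self.strategies if s not in self.gains]
        if untried:
            return untried[0]
        if self.rng.random() < self.explore_rate:
            return self.rng.choice(self.strategies)
        return max(
            self.strategies,
            key=lambda s: sum(self.gains[s]) / len(self.gains[s]),
        )
```

The orchestrator would call pick() before each mutation, name the chosen strategy in the prompt, and call record() with the score delta once the evaluator returns.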

Built-in benchmark templates. Many users skip empirical scoring because writing a good benchmark is non-trivial. Reusable templates (python-time.py, node-memory.mjs, generic-time.sh) that accept a module path and function name would lower the barrier significantly.
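A timing template along these lines could be quite small. The sketch below is a guess at what something like python-time.py might contain; load_function, seconds_per_call, run_benchmark, and the JSON output shape are all assumptions, since the post does not define how the skill parses benchmark output.

```python
import importlib.util
import json
import timeit


def load_function(module_path, function_name):
    """Import a function from a file path without touching sys.path."""
    spec = importlib.util.spec_from_file_location("bench_target", module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {module_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, function_name)


def seconds_per_call(fn, args=(), repeats=5, number=100):
    """Best-of-N average time per call, which dampens scheduler noise."""
    timer = timeit.Timer(lambda: fn(*args))
    return min(timer.repeat(repeat=repeats, number=number)) / number


def run_benchmark(module_path, function_name, args=()):
    """Return the JSON line a benchmark command could print to stdout."""
    fn = load_function(module_path, function_name)
    return json.dumps({"seconds_per_call": seconds_per_call(fn, args)})
```

A real template would also need representative inputs for the target function, which is the part that usually cannot be made generic.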

Multi-metric Pareto selection. Real optimization involves trade-offs: faster code might use more memory. Instead of a single score, track multiple independent metrics and maintain a Pareto front, the set of candidates that no other candidate dominates (a candidate is dominated when another is at least as good on every metric and strictly better on at least one). Present the trade-off points to the user at the end.
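The selection rule itself is short. In this sketch, dominates and pareto_front are illustrative names, and every metric is assumed to be expressed so that higher is better (a latency would be inverted or negated first).

```python
from typing import Mapping


def dominates(a: Mapping[str, float], b: Mapping[str, float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(a[m] >= b[m] for m in b) and any(a[m] > b[m] for m in b)


def pareto_front(candidates: Mapping[str, Mapping[str, float]]) -> list:
    """Names of candidates that no other candidate dominates."""
    return [
        name
        for name, metrics in candidates.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


candidates = {
    "heap": {"speed": 0.9, "memory": 0.4},
    "cached": {"speed": 0.6, "memory": 0.8},
    "naive": {"speed": 0.5, "memory": 0.7},  # worse than "cached" on both metrics
}
print(pareto_front(candidates))  # ['heap', 'cached']
```

This pairwise check is quadratic in the number of candidates, which is fine for populations of the size described above.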

Self-evolving meta-prompts. Track which mutation prompts produce the best improvements, then summarize effective strategies into a meta-prompts.md file that gets included in future iterations. The prompts that drive optimization are themselves optimized — a lightweight version of AlphaEvolve's recursive self-improvement.

Population visualization. A generated HTML report showing score progression over time, family trees (which candidates descend from which), and island diversity would help users decide whether to run more iterations or stop.

Key Takeaways

Skills turn patterns into reusable workflows. The manual optimize-benchmark-retry loop is something every developer does. Encoding it as a Claude Code skill means you run it with one command instead of managing the cycle yourself — and the skill does it more consistently than you would manually.

Let each component do what it's best at. The orchestrator (SKILL.md) handles judgment and coordination. Scripts handle deterministic computation and state. Subagents handle focused creative and analytical work in isolation. This split keeps the system reliable, fast, and creative where it needs to be.

Population diversity prevents local optima. A single line of reasoning gets stuck. The island model maintains multiple semi-independent sub-populations that evolve different strategies, and periodic migration cross-pollinates breakthroughs across islands. This is why 10 iterations with a population can outperform 10 sequential rewrites.

The evaluator determines the ceiling. The skill is only as good as its scoring function. A specific optimization goal, a correctness gate via tests, and an empirical benchmark give the evolutionary loop the selection pressure it needs. Without measurement, you're just generating — not optimizing.

Context assembly is what makes iteration productive. Each mutation prompt includes the target's dependencies, evaluation history from top candidates, and change summaries of what strategies worked. This accumulated context is why attempt 8 can build on attempt 3's insight — the loop has memory that a single prompt doesn't.