Building an AlphaEvolve Skill for Claude Code
You ask Claude Code to "make this function faster." It produces a reasonable rewrite. You benchmark it — maybe it's faster, maybe it's not. If not, you paste the benchmark results back and ask again. You repeat this loop manually, five or six times, until you give up or get lucky.
This is the gap between generation and optimization. In our previous post on AlphaEvolve, we explored how Google DeepMind closes this gap with evolutionary computation: LLMs act as mutation operators inside an automated evaluation loop. The insight: a single LLM call produces plausible code, but it takes iterative, metric-driven selection to produce measurably better code.
This post introduces alphaevolve-skill — an open-source Claude Code skill that brings this evolutionary optimization pattern to your everyday workflow. We'll cover why we built it, how to use it, and the high-level design behind it. For a full implementation walkthrough, check out the step-by-step tutorial.
Why Build This Skill?
The Manual Optimization Loop Is Broken
Every developer who has tried to optimize code with an LLM hits the same wall:
- No feedback loop — the LLM doesn't know if its suggestion actually helped. You benchmark manually, then rephrase your prompt with new context. The LLM starts fresh each time.
- Single-shot exploration — you get one idea per prompt. Maybe it's a good direction, maybe not. There's no breadth of search.
- No accumulated knowledge — attempt 3's failure doesn't inform attempt 7. You carry the context in your head, losing track of what was tried.
- No population diversity — you follow a single line of reasoning. If it leads to a local optimum, you're stuck.
What Evolutionary Optimization Gives You
The AlphaEvolve skill turns Claude Code from a one-shot generator into an iterative optimizer. Instead of you manually closing the feedback loop, the skill does it automatically:
- Real measurements drive selection — not LLM intuition, but actual scores from your benchmarks or a structured evaluation rubric
- Knowledge accumulates — each generation's prompt includes what worked, what didn't, and why
- Multiple candidates compete — a population of diverse solutions evolves in parallel across isolated "islands"
- The process improves itself — successful strategies inform future mutations
The result? You invoke one command and get back measurably better code — with a full audit trail of every attempt, score, and strategy.
When to Reach for This Skill
| Good Fit | Not Ideal |
|---|---|
| Functions with clear performance characteristics (loops, algorithms, data transforms) | Trivial getters/setters with nothing to optimize |
| Code with room to improve — unoptimized first drafts, known bottlenecks | Already near-optimal code that's been hand-tuned |
| Self-contained logic where you can measure success (latency, memory, quality score) | Pure I/O wrappers with no computational logic |
| Problems where multiple valid strategies exist | Cases where the correct answer is known and straightforward |
How to Use It
Installation
One command, depending on your coding agent:
```bash
# Claude Code
npx skills add https://github.com/yangwenz/alphaevolve-skill --skill alphaevolve-skill -a claude-code

# Codex
npx skills add https://github.com/yangwenz/alphaevolve-skill --skill alphaevolve-skill -a codex
```
No external dependencies — the database and CLI are plain ESM JavaScript that runs on Node.js 18+.
Invoking the Skill
Trigger it with natural language. The skill activates when you say "evolve", "optimize", or "run alphaevolve":
```text
# Basic: optimize a function for efficiency
evolve the function processItems in src/pipeline.py

# With a specific goal
optimize the render method in components/Chart.tsx to reduce memory allocations

# With a correctness gate
evolve parseJSON in src/parser.ts with test command "npm test"

# With an empirical benchmark
optimize batchQuery in src/db.py, eval with "python bench/query_bench.py", test with "pytest tests/"
```
The skill infers parameters from your request and asks clarifying questions if anything is missing.
What Happens During a Run
The skill validates inputs, establishes a baseline score, then runs the evolutionary loop:
```text
Target: processItems in src/pipeline.py (lines 45-89)
Goal: optimize code efficiency
Iterations: 10
Baseline score: {"efficiency-score": 0.5}

Iteration 1/10 | efficiency-score: 0.6 | best: 0.6
  Replaced repeated list.index() calls with a pre-built dictionary lookup
Iteration 2/10 | efficiency-score: 0.5 | best: 0.6
  Attempted generator-based streaming but increased complexity without gain
Iteration 3/10 | efficiency-score: 0.7 | best: 0.7
  Combined two sequential passes into a single loop with accumulator
Iteration 4/10 | DISCARDED (tests failed)
  Changed return type which broke downstream callers
Iteration 5/10 | efficiency-score: 0.8 | best: 0.8
  Used collections.Counter instead of manual counting loop, added early exit
```
Not every mutation improves. Iteration 2 explored a dead end. Iteration 4 was discarded by the correctness gate. This is expected — the power comes from selection across many attempts, not from any single generation being perfect.
After all iterations, the skill presents the best implementation and asks whether to apply it to your source file.
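The run above can be read as a generate, gate, score, select loop. The sketch below is a minimal Python illustration of that control flow, not the skill's actual orchestrator (which lives in SKILL.md and drives subagents). `mutate`, `passes_tests`, and `score` are hypothetical callables standing in for the mutation subagent, the correctness gate, and the evaluator, and the sketch always mutates the current best rather than sampling parents from islands.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    iteration: int
    code: str
    score: Optional[float]  # None means the candidate was discarded by the gate


def evolve(
    baseline_code: str,
    mutate: Callable[[str, List[Attempt]], str],  # hypothetical: (parent, history) -> candidate
    passes_tests: Callable[[str], bool],          # hypothetical correctness gate
    score: Callable[[str], float],                # hypothetical evaluator, higher is better
    iterations: int = 10,
) -> Tuple[str, float, List[Attempt]]:
    """Run a greedy single-lineage loop and return (best_code, best_score, history)."""
    best_code, best_score = baseline_code, score(baseline_code)
    history: List[Attempt] = []
    for i in range(1, iterations + 1):
        candidate = mutate(best_code, history)
        if not passes_tests(candidate):
            # Discarded candidates are still recorded so later prompts can see them.
            history.append(Attempt(i, candidate, None))
            continue
        candidate_score = score(candidate)
        history.append(Attempt(i, candidate, candidate_score))
        if candidate_score > best_score:
            best_code, best_score = candidate, candidate_score
    return best_code, best_score, history
```

Keeping failed and discarded attempts in `history` is what lets a later mutation prompt mention dead ends as well as wins.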
Tips for Better Results
- Be specific about your goal — "reduce memory allocations, the function creates too many intermediate lists" beats "make this faster"
- Always provide tests when available — they prevent incorrect mutations from surviving
- Use more iterations for hard problems — 5 for quick checks, 10-20 for standard optimization, 30+ for deep search
- Inspect failed iterations — sometimes a good idea was executed poorly and is worth salvaging manually
Example: Optimizing a Shortest-Path Algorithm
Here's a real example. We started with a naive shortest_paths implementation that uses a linear scan to find the minimum-distance unvisited node each iteration. That scan alone makes the algorithm O(V^2), and because visited is a plain list, each membership check can cost up to O(V) more. The command:
use alphaevolve-skill to optimize function shortest_paths in solver.py, 2 iterations, run unit tests via `python3 test_solver.py`
The original code:
```python
def shortest_paths(graph, source):
    import math
    nodes = list(graph.keys())
    dist = {node: math.inf for node in nodes}
    dist[source] = 0
    visited = []
    while len(visited) < len(nodes):
        # Linear scan for the minimum: O(V) nodes per iteration,
        # each with an O(V) membership check because visited is a list
        min_dist = math.inf
        min_node = None
        for node in nodes:
            if node not in visited and dist[node] < min_dist:
                min_dist = dist[node]
                min_node = node
        if min_node is None:
            break
        visited.append(min_node)
        for neighbor, weight in graph.get(min_node, []):
            new_dist = dist[min_node] + weight
            if new_dist < dist[neighbor]:
                dist[neighbor] = new_dist
    return dist
```
After running the skill, the evolutionary loop discovered the textbook optimization — replace the linear scan with a min-heap and use lazy deletion instead of a visited set:
```python
import heapq
import math


def shortest_paths(graph, source):
    inf = math.inf
    dist = {node: inf for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    heappush = heapq.heappush
    heappop = heapq.heappop
    graph_get = graph.get
    while heap:
        d, u = heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph_get(u, ()):
            new_dist = d + w
            if new_dist < dist[v]:
                dist[v] = new_dist
                heappush(heap, (new_dist, v))
    return dist
```
The benchmark results speak for themselves:
| Graph Size | Original | Optimized | Speedup |
|---|---|---|---|
| 100 nodes, 500 edges | 2.3 ms | 0.1 ms | 23x |
| 500 nodes, 5K edges | 95.8 ms | 0.5 ms | 192x |
| 1K nodes, 10K edges | 738.8 ms | 1.0 ms | 739x |
| 2K nodes, 20K edges | 6.18 s | 2.3 ms | 2,687x |
| 3K nodes, 50K edges | 21.03 s | 4.2 ms | 5,007x |
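The skill replaced the naive scan with an O((V + E) log V) heap-based search. The original's linear scan is O(V^2) on its own, and the `node not in visited` check on a list adds up to another factor of V, which matches the roughly 8x slowdown between 1K and 2K nodes in the table. This is a fundamental algorithmic improvement that no amount of micro-optimization on the original structure could achieve. It is the kind of leap that population-based search enables: the mutation subagent explored data structure swaps (heap), and the evaluator confirmed the improvement empirically.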
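To run this kind of comparison on your own machine, a small timing harness is enough. The sketch below is an assumption about how such a benchmark could look, not the script used for the table above. `make_random_graph` and `time_call` are hypothetical helpers, and the graph format (each node mapped to a list of (neighbor, weight) pairs) follows the code in this section.

```python
import random
import time
from typing import Callable, Dict, List, Tuple

Graph = Dict[int, List[Tuple[int, float]]]


def make_random_graph(num_nodes: int, num_edges: int, seed: int = 0) -> Graph:
    """Build a random directed graph with positive edge weights.

    Every node appears as a key, so dist[...] lookups in shortest_paths never miss.
    """
    rng = random.Random(seed)
    graph: Graph = {node: [] for node in range(num_nodes)}
    for _ in range(num_edges):
        u = rng.randrange(num_nodes)
        v = rng.randrange(num_nodes)
        graph[u].append((v, rng.uniform(1.0, 10.0)))
    return graph


def time_call(
    fn: Callable[[Graph, int], dict],
    graph: Graph,
    source: int = 0,
    repeats: int = 3,
) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(graph, source)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, `time_call(shortest_paths, make_random_graph(1000, 10_000))` times one implementation on a 1K-node, 10K-edge graph. Taking the minimum over several runs reduces noise from other processes.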
High-Level Design
The skill splits work across four components, each playing to its strengths:
Why This Split?
SKILL.md (orchestrator) handles the coordination logic — what to do next, how to assemble prompts, when to retry, how to report progress. These are judgment-heavy, context-dependent decisions that benefit from LLM reasoning.
scripts/database.mjs (deterministic computation) manages the population of candidates using an island model. Sampling a parent, pruning islands, and handling migration are algorithms that need precision and speed — not LLM interpretation. The CLI wrapper (db-cli.mjs) exposes this as shell commands with JSON output.
Subagents (focused creative work) handle mutation and evaluation in isolation. The mutation subagent gets a rich prompt (parent code + context + history) and edits a candidate file. The evaluator subagent scores code against a standardized rubric. Neither has memory of the other or of the broader loop state — this keeps them focused and objective.
The Island Model
Rather than a single flat population, the database maintains multiple sub-populations ("islands") that evolve semi-independently. This prevents premature convergence — if all candidates competed in one pool, early high-scorers would dominate and exploration would stop.
- 3 islands with 40 candidates each
- 70/30 exploitation/exploration — mostly build on the best, occasionally try random starts
- Periodic migration — top performers from each island are copied to others for cross-pollination
- Automatic pruning — lowest scorers are removed when islands fill up
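A minimal sketch of these mechanics, written in Python for readability. The skill's real implementation is scripts/database.mjs; the class and method names here (`IslandPopulation`, `add`, `sample_parent`, `migrate`) are hypothetical, and treating "exploration" as "pick a random member of the island" is an assumption about what the 30% branch does.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    code: str
    score: float


class IslandPopulation:
    def __init__(self, num_islands: int = 3, capacity: int = 40,
                 exploit_ratio: float = 0.7, seed: Optional[int] = None) -> None:
        self.islands: List[List[Candidate]] = [[] for _ in range(num_islands)]
        self.capacity = capacity
        self.exploit_ratio = exploit_ratio
        self.rng = random.Random(seed)

    def add(self, island: int, candidate: Candidate) -> None:
        """Insert a candidate, then prune the lowest scorers if the island is over capacity."""
        members = self.islands[island]
        members.append(candidate)
        members.sort(key=lambda c: c.score, reverse=True)
        del members[self.capacity:]

    def sample_parent(self, island: int) -> Candidate:
        """Exploit (take the island's best) with probability exploit_ratio, else explore."""
        members = self.islands[island]
        if not members:
            raise ValueError(f"island {island} is empty")
        if self.rng.random() < self.exploit_ratio:
            return members[0]  # members are kept sorted, best first
        return self.rng.choice(members)

    def migrate(self, top_k: int = 1) -> None:
        """Copy each island's current top_k candidates to every other island."""
        # Snapshot first so migrants arriving in this round are not re-exported.
        snapshots = [members[:top_k] for members in self.islands]
        for src, migrants in enumerate(snapshots):
            for dst in range(len(self.islands)):
                if dst == src:
                    continue
                for migrant in migrants:
                    self.add(dst, Candidate(migrant.code, migrant.score))
```

Because `add` prunes on every insert, a weak migrant arriving on a full island is dropped immediately, while a strong one displaces that island's lowest scorer.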
Evaluation Pipeline
Each candidate goes through a multi-stage evaluation:
- Correctness gate — run the user's test command; if it fails, discard immediately
- LLM-as-judge — a separate subagent scores code quality across six dimensions (algorithmic efficiency, idiomatic usage, anti-patterns, memory, I/O, concurrency)
- Benchmark score (optional) — if the user provides a benchmark command, its output is combined with the LLM score
The evaluator uses a standardized rubric defined in references/evaluator.md, so every candidate across all iterations is judged by the same criteria.
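The stages above can be chained so that cheap checks run first and a failed gate skips the costlier ones. The sketch below is illustrative only: `run_test_command` and `evaluate_candidate` are hypothetical names, both scores are assumed to be normalized to [0, 1], and the weighted average is an assumption, since this post does not specify how the benchmark output is combined with the LLM score.

```python
import shlex
import subprocess
from typing import Callable, Optional


def run_test_command(command: str, timeout_s: float = 300.0) -> bool:
    """Correctness gate: True only if the user's test command exits with status 0."""
    try:
        result = subprocess.run(shlex.split(command), capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError, ValueError):
        # Timeouts, missing executables, and unparsable commands all count as failure.
        return False
    return result.returncode == 0


def evaluate_candidate(
    tests_pass: Callable[[], bool],
    llm_score: Callable[[], float],
    benchmark_score: Optional[Callable[[], float]] = None,
    benchmark_weight: float = 0.5,  # assumption: the combination rule is not documented here
) -> Optional[float]:
    """Return a combined score, or None if the candidate fails the correctness gate."""
    if not tests_pass():
        return None  # discard immediately, without calling the judge or the benchmark
    judged = llm_score()
    if benchmark_score is None:
        return judged
    measured = benchmark_score()
    return (1.0 - benchmark_weight) * judged + benchmark_weight * measured
```

A call might look like `evaluate_candidate(lambda: run_test_command("pytest tests/"), judge)`, where `judge` is whatever function wraps the evaluator subagent. Passing the stages as callables keeps the ordering explicit and makes the pipeline easy to test without spawning processes.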
Context Assembly
Before the loop starts, the skill extracts the target function's dependencies — imports, called functions, type signatures — into a reusable context file. Every mutation prompt includes this context so the subagent knows what's available without reading the entire codebase each iteration.
The mutation prompt also evolves: each iteration includes evaluation history from top-performing candidates ("Attempt 3 scored 0.8 — used dict lookup instead of linear search"), so the next subagent can learn from what worked.
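As a rough illustration of how such a prompt could be assembled, the sketch below joins the goal, the shared context, the top-scoring history entries, and the parent code into one string. The function name `build_mutation_prompt`, the section labels, and the (attempt, score, summary) tuple shape are all hypothetical; the skill's actual prompt template may differ.

```python
from typing import List, Tuple


def build_mutation_prompt(
    goal: str,
    parent_code: str,
    context: str,
    history: List[Tuple[int, float, str]],  # (attempt number, score, change summary)
    top_n: int = 3,
) -> str:
    """Assemble a mutation prompt from the parent, shared context, and the best past attempts."""
    best = sorted(history, key=lambda item: item[1], reverse=True)[:top_n]
    lines = [f"Goal: {goal}", "", "Available context (imports, callees, types):", context, ""]
    if best:
        lines.append("What worked so far:")
        lines.extend(f"- Attempt {n} scored {s:.1f}: {summary}" for n, s, summary in best)
        lines.append("")
    lines.extend(["Current implementation:", parent_code])
    return "\n".join(lines)
```

Capping the history at `top_n` entries keeps the prompt size bounded as the number of iterations grows.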
Possible Extensions
The skill is intentionally minimal — a working foundation you can build on. Here are directions worth exploring:
Language-specific evaluators. The current rubric is generic. A Python-specific evaluator that knows about comprehensions, generators, __slots__, and GIL implications would produce sharper scores. Same for TypeScript (narrowing, const assertions), Rust (unnecessary clones, borrow checker patterns), or Java (StringBuilder, stream API). Implementation: just add a new .md file to references/ and select it by file extension.
Mutation strategy diversity. Currently every iteration gets the same prompt style: "improve this for efficiency." Rotating through explicit strategies — data structure swap, algorithm change, micro-optimization, idiom upgrade, parallelism — prevents the population from collapsing to a single optimization style. The orchestrator could track which strategies produce the best improvements and favor them in later iterations.
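One simple way to "track and favor" strategies is an epsilon-greedy picker: try each strategy once, then mostly choose the one with the best average score gain while still sampling the others occasionally. This is a swapped-in technique chosen for illustration, not something the skill currently does, and `StrategyPicker` and its methods are hypothetical names.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional

STRATEGIES = [
    "data structure swap",
    "algorithm change",
    "micro-optimization",
    "idiom upgrade",
    "parallelism",
]


class StrategyPicker:
    """Epsilon-greedy choice over mutation strategies, based on mean score gain."""

    def __init__(self, strategies: List[str], epsilon: float = 0.3,
                 seed: Optional[int] = None) -> None:
        self.strategies = list(strategies)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.gains: Dict[str, List[float]] = defaultdict(list)

    def record(self, strategy: str, score_gain: float) -> None:
        """Store the score change (child minus parent) observed for a strategy."""
        self.gains[strategy].append(score_gain)

    def mean_gain(self, strategy: str) -> float:
        observed = self.gains.get(strategy)
        return sum(observed) / len(observed) if observed else 0.0

    def pick(self) -> str:
        untried = [s for s in self.strategies if s not in self.gains]
        if untried:
            return untried[0]  # try every strategy once before exploiting
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.strategies)  # occasional exploration
        return max(self.strategies, key=self.mean_gain)
```

The orchestrator would call `pick()` before each mutation, name the chosen strategy in the prompt, and call `record()` once the candidate has been scored.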
Built-in benchmark templates. Many users skip empirical scoring because writing a good benchmark is non-trivial. Reusable templates (python-time.py, node-memory.mjs, generic-time.sh) that accept a module path and function name would lower the barrier significantly.
Multi-metric Pareto selection. Real optimization involves trade-offs: faster code might use more memory. Instead of a single score, track multiple independent metrics and maintain a Pareto front, meaning the set of candidates that no other candidate dominates. One candidate dominates another when it is at least as good on every metric and strictly better on at least one. Present the trade-off points to the user at the end.
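Computing a Pareto front takes only a few lines. The sketch below is a straightforward pairwise comparison, assuming every metric is "higher is better" (negate latency or memory first) and that all candidates report the same metric names; `dominates` and `pareto_front` are illustrative names, not part of the skill.

```python
from typing import Dict, List


def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one.

    Assumes higher is better for every metric and that both dicts share the same keys.
    """
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(candidates: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return the candidates that no other candidate dominates (pairwise, O(n^2))."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]
```

With island sizes in the tens of candidates, the quadratic comparison is negligible next to the cost of a single LLM call.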
Self-evolving meta-prompts. Track which mutation prompts produce the best improvements, then summarize effective strategies into a meta-prompts.md file that gets included in future iterations. The prompts that drive optimization are themselves optimized — a lightweight version of AlphaEvolve's recursive self-improvement.
Population visualization. A generated HTML report showing score progression over time, family trees (which candidates descend from which), and island diversity would help users decide whether to run more iterations or stop.
Key Takeaways
Skills turn patterns into reusable workflows. The manual optimize-benchmark-retry loop is something every developer does. Encoding it as a Claude Code skill means you run it with one command instead of managing the cycle yourself — and the skill does it more consistently than you would manually.
Let each component do what it's best at. The orchestrator (SKILL.md) handles judgment and coordination. Scripts handle deterministic computation and state. Subagents handle focused creative and analytical work in isolation. This split keeps the system reliable, fast, and creative where it needs to be.
Population diversity prevents local optima. A single line of reasoning gets stuck. The island model maintains multiple sub-populations that evolve different strategies semi-independently, and periodic migration cross-pollinates breakthroughs across islands. This is why 10 iterations with a population can outperform 10 sequential rewrites.
The evaluator determines the ceiling. The skill is only as good as its scoring function. A specific optimization goal, a correctness gate via tests, and an empirical benchmark give the evolutionary loop the selection pressure it needs. Without measurement, you're just generating — not optimizing.
Context assembly is what makes iteration productive. Each mutation prompt includes the target's dependencies, evaluation history from top candidates, and change summaries of what strategies worked. This accumulated context is why attempt 8 can build on attempt 3's insight — the loop has memory that a single prompt doesn't.