Design: How the Skill Is Structured

You've seen what AlphaEvolve is and used the skill to optimize code. Now let's look under the hood at how we actually build it as a Claude Code skill. This page covers the high-level architecture: which pieces go where, and why.

The central design question is: how do you split an evolutionary optimization system across a skill definition, helper scripts, and subagents?

The answer comes from understanding what Claude Code skills can and cannot do, and playing to each component's strengths.

The Big Picture

Here's the overall structure of our skill:

alphaevolve-skill/
├── SKILL.md                  # The optimization loop (orchestrator)
├── scripts/
│   ├── database.mjs          # Island-model population database
│   └── db-cli.mjs            # CLI wrapper for database operations
├── references/
│   ├── context.md            # Context extraction procedure
│   └── evaluator.md          # LLM-as-judge evaluation rubric
└── tests/
    └── database.test.mjs     # Database unit tests

Each file has a distinct role. Let's walk through the reasoning behind this split.

[Diagram: Architecture Overview]

Why the Optimization Loop Lives in SKILL.md

A Claude Code skill is a markdown file that tells Claude what to do step by step. When you invoke the skill (by saying "evolve this function" or "run alphaevolve"), Claude reads SKILL.md and follows its instructions like a recipe.

The optimization loop is the core orchestration logic — it controls the flow:

  1. Sample a parent from the population
  2. Spawn a subagent to mutate it
  3. Evaluate the result
  4. Add it to the database
  5. Repeat

This belongs in SKILL.md because the loop is a coordination problem, not a computation problem. Claude excels at following multi-step procedures, making decisions (should I retry? is this result valid?), and assembling prompts from dynamic context. You don't need a programming language for this — you need clear instructions that Claude can follow.

Think of it this way: SKILL.md is the "brain" that decides what to do next. It doesn't do the heavy computation itself — it delegates to scripts and subagents.

What SKILL.md Handles

  • Input validation — checking that the target file exists, the function name resolves, asking the user for clarification if needed
  • Loop control — iterating N times, handling failures gracefully (skip failed iterations, retry subagents once)
  • Prompt assembly — building rich context for the mutation subagent (parent code + evaluation history + mutation hints)
  • Decision making — should we resume or start fresh? Did the candidate pass tests? Is the score valid?
  • Progress reporting — telling the user what happened each iteration
  • Finalization — presenting the best result and asking whether to apply it

All of these are judgment-heavy, context-dependent tasks — exactly what the LLM is good at.

Why the Database Lives in a Script

The population database is the opposite kind of problem: it's pure computation with no judgment needed. It maintains a set of candidate programs, tracks their scores, decides which island to assign them to, handles migration between islands, and prunes weak candidates.

This logic lives in scripts/database.mjs (a plain JavaScript file) for three reasons:

1. Determinism

Sampling a parent, pruning islands, and migrating programs all involve specific algorithms (weighted random selection, sorting by score, round-robin island assignment). These need to produce consistent, repeatable results. If you asked Claude to "pick a good parent from the population" via natural language, you'd get inconsistent behavior — sometimes it picks the best, sometimes it explores, sometimes it gets confused by the data format.

The database script implements exact rules:

  • 70% of the time, pick from the top 3 candidates on the current island
  • 30% of the time, pick a random candidate (exploration)
  • Migrate top performers between islands every 10 generations
  • Prune each island to a maximum of 40 programs

These rules need to be executed precisely, not interpreted.
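To make the sampling and pruning rules concrete, here is a minimal sketch in JavaScript. It is not the actual contents of database.mjs: the function names, the `score` field, and the choice to pick uniformly among the top 3 are assumptions made for illustration. The random source is injectable so the same inputs always give the same parent.

```javascript
// Hypothetical sketch of the sampling and pruning rules; names are illustrative.
const EXPLOIT_PROBABILITY = 0.7; // 70% exploit, 30% explore
const ELITE_COUNT = 3;           // "top 3 candidates"
const MAX_ISLAND_SIZE = 40;      // prune limit per island

// Sort a copy of the island so the highest score comes first.
function byScoreDescending(island) {
  return [...island].sort((a, b) => b.score - a.score);
}

// Pick a parent from one island. `rng` returns a float in [0, 1) and is
// injectable so the behavior can be reproduced in tests.
function sampleParent(island, rng = Math.random) {
  if (island.length === 0) {
    throw new Error("cannot sample from an empty island");
  }
  const exploit = rng() < EXPLOIT_PROBABILITY;
  // Assumption: uniform choice within the elite pool or the whole island.
  const pool = exploit
    ? byScoreDescending(island).slice(0, ELITE_COUNT)
    : island;
  return pool[Math.floor(rng() * pool.length)];
}

// Keep only the best `maxSize` programs on an island.
function pruneIsland(island, maxSize = MAX_ISLAND_SIZE) {
  return byScoreDescending(island).slice(0, maxSize);
}
```

Because every threshold is a named constant, the rules stay exact and easy to audit, which is the point of keeping them out of natural-language instructions.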

2. State Persistence

The population needs to survive across iterations and even across sessions (for resumability). The database script serializes everything to evolve-output/database/database.json — a structured file that can be loaded instantly. If SKILL.md tried to manage this state through conversation context, it would be fragile (context windows have limits) and slow (re-reading all history every iteration).
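As a rough illustration of what "serializes everything" can look like, the sketch below converts an in-memory population to JSON text and back. The state shape (`generation`, `bestId`, `islands`, a `programs` Map) and the `version` tag are assumptions, not the real schema of database.json. The resulting string would then be written to and read from the database file.

```javascript
// Assumed in-memory shape: programs keyed by id in a Map, islands as
// arrays of program ids. A Map does not survive JSON.stringify on its
// own, so it is flattened to an array on the way out.
const SCHEMA_VERSION = 1; // hypothetical version tag for resumed runs

function serializeDatabase(db) {
  return JSON.stringify(
    {
      version: SCHEMA_VERSION,
      generation: db.generation,
      bestId: db.bestId,
      islands: db.islands,
      programs: [...db.programs.values()],
    },
    null,
    2
  );
}

// Rebuild the in-memory state from saved text, rejecting unknown versions
// so a resumed session never runs on a file it does not understand.
function deserializeDatabase(text) {
  const raw = JSON.parse(text);
  if (raw.version !== SCHEMA_VERSION) {
    throw new Error(`unsupported database version: ${raw.version}`);
  }
  return {
    generation: raw.generation,
    bestId: raw.bestId,
    islands: raw.islands,
    programs: new Map(raw.programs.map((p) => [p.id, p])),
  };
}
```

A round trip through these two functions is all that resuming a session needs: the orchestrator never has to carry the population in its conversation context.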

3. Speed

Claude doesn't need to "think" about database operations. Sampling a parent, adding a program, and pruning are all operations that should complete in milliseconds. By putting them in a script invoked via node db-cli.mjs sample, the orchestrator gets instant results without burning tokens on mechanical work.

The CLI Wrapper

The database logic (database.mjs) is a JavaScript module with classes and methods. But Claude Code interacts with the outside world through shell commands. The CLI wrapper (db-cli.mjs) bridges this gap — it exposes the database through a simple command-line interface:

# Check database status
node db-cli.mjs info evolve-output/database

# Sample a parent and inspirations
node db-cli.mjs sample evolve-output/database

# Add a new candidate
node db-cli.mjs add evolve-output/database \
  --codePath "/path/to/candidate.py" \
  --targetCode "..." \
  --parentId "abc-123" \
  --metrics '{"efficiency-score": 0.8}' \
  --changes "Replaced nested loops with hash map lookup"

Each command outputs JSON that Claude can parse. This is a clean contract: the skill knows what commands to run and how to interpret the output, without needing to understand the internal data structures.
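One way such a wrapper could be organized is sketched below: parse the flags, call into the database object, and return JSON text to print on stdout. This is an illustration only. The `db` methods (`size`, `sample`, `add`) and the response fields (`ok`, `id`, `parent`, `error`) are hypothetical, and the flag parser is deliberately simplified to `--name value` pairs.

```javascript
// Turn ["--parentId", "abc-123", ...] into { parentId: "abc-123", ... }.
function parseFlags(args) {
  const flags = {};
  for (let i = 0; i < args.length; i += 2) {
    const name = args[i];
    if (!name.startsWith("--") || i + 1 >= args.length) {
      throw new Error(`expected a --flag value pair at "${name}"`);
    }
    flags[name.slice(2)] = args[i + 1];
  }
  return flags;
}

// Dispatch one CLI invocation and return the JSON text to print.
// `db` is a stand-in object; the real module's interface may differ.
function runCommand([command, dbPath, ...rest], db) {
  switch (command) {
    case "info":
      return JSON.stringify({ ok: true, path: dbPath, size: db.size() });
    case "sample":
      return JSON.stringify({ ok: true, parent: db.sample() });
    case "add": {
      const flags = parseFlags(rest);
      const program = {
        codePath: flags.codePath,
        parentId: flags.parentId ?? null,
        metrics: JSON.parse(flags.metrics ?? "{}"),
        changes: flags.changes ?? "",
      };
      return JSON.stringify({ ok: true, id: db.add(program) });
    }
    default:
      // Errors are also JSON, so the caller always parses the same way.
      return JSON.stringify({ ok: false, error: `unknown command: ${command}` });
  }
}
```

Returning JSON for failures as well as successes keeps the contract uniform: the skill can always parse the output and branch on a single field.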

Why Evaluation Needs Subagents

Evaluation is where things get interesting. A candidate needs to be scored, and scoring has two parts:

  1. Correctness gate — does it still pass tests? (a shell command)
  2. Quality scoring — how efficient is the code? (requires judgment)

The correctness gate is simple: run the user's test command and check the exit code. SKILL.md handles this directly.

But quality scoring requires an LLM-as-judge — a separate Claude instance that reads the candidate code and produces a structured score. This is a subagent because:

Isolated Judgment

The evaluator needs to be objective. If the same Claude instance that generated the code also judged it, there's a conflict of interest — it might rate its own work more favorably. By spawning a separate subagent with a focused evaluation rubric (references/evaluator.md), the judge has no memory of how the code was created. It sees only the code and the scoring criteria.

Focused Context

The evaluator's prompt is narrow and specific: here's a piece of code, here are 6 dimensions to evaluate (algorithmic efficiency, idiomatic usage, anti-patterns, memory, I/O, concurrency), here's a 1-10 rubric. The subagent doesn't need to know about the population, the iteration count, or the optimization goal — just "how efficient is this code?"

This focus produces more reliable scores than asking the orchestrator to juggle evaluation alongside everything else.

Why Mutation Needs Subagents

The mutation step — generating a new code variant — also uses a subagent. This is the creative heart of the system: given a parent implementation and context about what's worked before, write something better.

The mutation subagent gets its own spawned instance because:

Clean Workspace

The subagent operates on a single file (the candidate copy at evolve-output/candidates/iteration_N.py). It reads the file, edits it in place, and reports what it changed. It doesn't need access to the database, the evaluation history, or the loop state — all of that context is pre-assembled into its prompt by the orchestrator.

Bounded Scope

If the mutation logic ran in the main orchestrator's context, every failed attempt would consume context window space. Over 10-20 iterations, this adds up. By spawning a fresh subagent each iteration, the orchestrator stays lean — it only keeps the summary ("changed X, scored Y") rather than the full mutation deliberation.

Rich Prompting

The orchestrator builds a carefully assembled prompt for each mutation attempt:

1. Task: "Improve the function `processItems` for efficiency"
2. Context: imports, dependencies, type signatures
3. History: "Attempt 3 scored 0.7 — reduced allocations but introduced cache miss.
            Attempt 5 scored 0.9 — used pre-allocated buffer."
4. Constraints: same signature, must remain correct, edit in place

This prompt evolves every iteration as the orchestrator incorporates new knowledge from the database. A subagent sees this as a fresh, self-contained task — maximizing the quality of each mutation attempt.
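In the skill itself, this assembly is done by Claude following prose instructions, not by code. Still, the structure is regular enough to express as a small function, which can help when reasoning about what goes into each section. The sketch below is a hypothetical illustration; `buildMutationPrompt` and its parameter names are not part of the skill.

```javascript
// Assemble the four prompt sections from structured inputs.
// All field names here are assumptions chosen for the example.
function buildMutationPrompt({ functionName, goal, context, history, constraints }) {
  // One line per earlier attempt, so the subagent sees what was tried
  // and how it scored.
  const historyLines = history.map(
    (h) => `Attempt ${h.iteration} scored ${h.score}: ${h.changes}`
  );
  return [
    `1. Task: Improve the function \`${functionName}\` for ${goal}`,
    `2. Context:\n${context}`,
    `3. History:\n${historyLines.join("\n")}`,
    `4. Constraints: ${constraints.join(", ")}`,
  ].join("\n\n");
}

// Example inputs mirroring the prompt outline above.
const examplePrompt = buildMutationPrompt({
  functionName: "processItems",
  goal: "efficiency",
  context: "imports, dependencies, type signatures",
  history: [
    { iteration: 3, score: 0.7, changes: "reduced allocations but introduced cache miss" },
    { iteration: 5, score: 0.9, changes: "used pre-allocated buffer" },
  ],
  constraints: ["same signature", "must remain correct", "edit in place"],
});
```

Only the history section changes substantially between iterations, which is why the orchestrator can keep short summaries and still produce a complete prompt each time.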

The Reference Documents

The references/ directory contains two instruction files that guide subagents:

context.md: Context Extraction Procedure

Before the loop starts, the orchestrator needs to understand the target function's dependencies: what it imports, what it calls, what types it uses. The context.md file defines a step-by-step procedure for extracting this information and saving it as a reusable markdown file.

This extracted context is included in every mutation prompt, so the subagent knows what surrounding code looks like without reading the entire codebase.

evaluator.md: Scoring Rubric

This is the LLM-as-judge prompt. It defines:

  • 6 evaluation dimensions (algorithmic efficiency, idiomatic usage, anti-patterns, memory, I/O, concurrency)
  • A scoring rubric (1-2 = severe issues, 9-10 = near-optimal)
  • The exact output format ({"efficiency-score": 7})

By standardizing the evaluation criteria in a reference document, every candidate is judged by the same rubric across all iterations.
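A fixed output format also makes the judge's reply easy to check mechanically. The sketch below shows one way to validate it, assuming the reply may wrap the JSON object in extra prose and that any score outside the 1-10 rubric should be rejected. The function name and the choice to return null on failure are illustrative, not taken from the skill.

```javascript
// Extract and validate the judge's score. Returns null for anything that
// does not match the expected {"efficiency-score": <1-10>} shape.
function parseEvaluatorOutput(text) {
  // Take the first flat JSON object in the reply (assumption: the judge
  // may add surrounding prose despite the instructions).
  const match = text.match(/\{[^{}]*\}/);
  if (!match) return null;

  let parsed;
  try {
    parsed = JSON.parse(match[0]);
  } catch {
    return null;
  }

  const score = parsed["efficiency-score"];
  if (typeof score !== "number" || score < 1 || score > 10) return null;
  return score;
}
```

Treating a malformed reply as "no score" rather than guessing lets the orchestrator decide whether to retry the evaluator or skip the iteration.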

How the Pieces Work Together

Let's trace one iteration of the loop to see how all components interact:

[Diagram: One Iteration]

  1. SKILL.md asks the database for a parent candidate
  2. db-cli.mjs samples from the current island (balancing exploitation and exploration) and returns the parent plus "inspiration" programs (top performers whose strategies might inform the next mutation)
  3. SKILL.md assembles a rich prompt: parent code + extracted context + evaluation history from inspirations
  4. A mutation subagent receives this prompt, edits the candidate file, and reports what it changed
  5. SKILL.md runs the correctness gate (test command) — if it fails, the candidate is discarded
  6. An evaluator subagent scores the candidate's code quality using the standardized rubric
  7. SKILL.md sends the result to the database via db-cli.mjs add
  8. The database assigns the candidate to an island, updates the best score, and handles migration/pruning
  9. SKILL.md reports progress to the user and moves to the next iteration
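The nine steps above can also be read as a single function whose collaborators are passed in. This is a thought aid, not how the skill runs: in practice the steps are prose instructions in SKILL.md, and the mutation and evaluation steps are subagent invocations, not function calls. Every name in `deps` is a hypothetical stand-in.

```javascript
// One loop iteration with every collaborator injected as a plain function.
// Comments map each line back to the numbered steps in the trace.
function runIteration(iteration, deps) {
  const { parent, inspirations } = deps.sample();          // steps 1-2
  const prompt = deps.buildPrompt(parent, inspirations);   // step 3
  const candidate = deps.mutate(prompt);                   // step 4

  if (!deps.passesTests(candidate)) {                      // step 5
    // A failing candidate never reaches the evaluator or the database.
    return { iteration, status: "discarded", reason: "tests failed" };
  }

  const score = deps.evaluate(candidate);                  // step 6
  const id = deps.add({                                    // steps 7-8
    ...candidate,
    parentId: parent.id,
    score,
  });
  return { iteration, status: "added", id, score };        // step 9
}
```

Written this way, the split of responsibilities is visible at a glance: `sample` and `add` belong to the database script, `mutate` and `evaluate` to subagents, and everything in between to the orchestrator.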

Design Principles

To summarize the key design decisions:

| Component | What It Does | Why It's Separate |
|---|---|---|
| SKILL.md | Orchestrates the loop, assembles prompts, makes decisions | Judgment-heavy coordination that benefits from LLM reasoning |
| database.mjs | Population management, sampling, island migration | Deterministic algorithms that need speed and precision |
| db-cli.mjs | Exposes database as shell commands with JSON output | Bridges the module-to-CLI gap so Claude can invoke it |
| Mutation subagent | Generates code variants by editing a candidate file | Isolated creative work with bounded scope |
| Evaluator subagent | Scores code quality using a standardized rubric | Objective judgment without knowledge of the generation process |
| references/ | Reusable instructions for subagents | Consistency across iterations: same rubric, same extraction procedure |

The general principle: let each component do what it's best at.

  • Claude (in SKILL.md) handles ambiguity, judgment, and coordination
  • Scripts handle deterministic computation and state persistence
  • Subagents handle focused creative or analytical tasks in isolation

This separation keeps the system reliable (scripts don't hallucinate), creative (subagents get full LLM power for mutations), and coherent (the orchestrator maintains the big picture across iterations).

What's Next

With the high-level design clear, we'll move to implementation — starting with the population database and its island model, then building the evaluation pipeline, and finally wiring everything together in the skill definition.