Implementation Details

On the previous page, we covered the high-level design: why the loop lives in SKILL.md, why the database is a script, and why subagents handle mutation and evaluation. Now let's look at how each component is actually implemented.

We'll walk through four pieces in order:

  1. The database script — how we manage the population of candidates
  2. Context extraction — how we gather the information a mutation subagent needs
  3. The evaluator — how we score candidate code consistently
  4. The optimization loop — how SKILL.md orchestrates everything

1. The Database Script (scripts/database.mjs)

The database is the backbone of the evolutionary loop — it stores every candidate program, tracks lineage, and decides which parents to sample next. The design page explained why this lives in a script (determinism, persistence, speed). Here we'll look at how it works.

The Island Model

Rather than maintaining a single flat population, the database uses an island model — multiple sub-populations that evolve semi-independently. This prevents premature convergence: if all candidates competed in one pool, early high-scorers would dominate and the system would stop exploring diverse strategies.

| Concept | What It Does | Default |
| --- | --- | --- |
| Island | An independent sub-population with its own candidates | 3 islands |
| Island capacity | Maximum programs per island before pruning | 40 programs |
| Migration | Top performers copied between islands to cross-pollinate | Every 10 generations |
| Current island | Round-robin assignment: each new candidate goes to the next island in sequence | Cycles 0 → 1 → 2 → 0 → ... |

Data Model

Each program in the database is stored as a JSON object:

{
  "id": "a1b2c3d4",
  "codePath": "/absolute/path/to/evolve-output/candidates/iteration_3.py",
  "targetCode": "def processItems(items, config): ...",
  "parentId": "e5f6g7h8",
  "metrics": { "efficiency-score": 0.8, "benchmark-score": 0.75 },
  "changes": "Replaced linear search with dict lookup in inner loop",
  "island": 1,
  "generation": 3
}

The full database state (database.json) contains:

  • programs — array of all stored candidates
  • islands — number of islands (default 3)
  • islandCapacity — max programs per island (default 40)
  • migrationInterval — how often to migrate (default 10)
  • generation — global generation counter
  • currentIsland — which island receives the next program
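
To make the round-robin bookkeeping concrete, here is a minimal Python sketch of the state and of island assignment. The real script is database.mjs and its internals are not shown on this page, so the `DatabaseState` class, the `assign_island` helper, and the choice to advance `generation` on every add are illustrative assumptions. Only the field names and defaults come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseState:
    """Mirrors the database.json fields listed above (class name is hypothetical)."""
    programs: list = field(default_factory=list)
    islands: int = 3
    islandCapacity: int = 40
    migrationInterval: int = 10
    generation: int = 0
    currentIsland: int = 0


def assign_island(state: DatabaseState, program: dict) -> dict:
    """Place a program on the current island, then advance round-robin."""
    stored = {**program, "island": state.currentIsland, "generation": state.generation}
    state.programs.append(stored)
    # Assumption: every add advances both counters; the real script may differ.
    state.currentIsland = (state.currentIsland + 1) % state.islands
    state.generation += 1
    return stored


state = DatabaseState()
placed = [assign_island(state, {"id": f"p{i}"})["island"] for i in range(4)]
print(placed)  # [0, 1, 2, 0]
```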

Sampling Algorithm

When the orchestrator calls node db-cli.mjs sample, the database returns a parent and a list of inspirations:

Parent selection (from the current island):

  • 70% of the time: pick from the top 3 candidates by score (exploitation)
  • 30% of the time: pick a random candidate (exploration)

This balance ensures the system mostly builds on what works, but occasionally tries a different starting point — avoiding local optima.

Inspirations are the top-performing programs across all islands. Their change summaries are included in the mutation prompt, so the subagent can learn from strategies that worked elsewhere without being constrained to a single lineage.

The sample output looks like:

{
  "parent": { "id": "a1b2c3d4", "codePath": "...", "metrics": {...}, "changes": "..." },
  "inspirations": [
    { "id": "x9y8z7", "metrics": {...}, "changes": "Used pre-allocated buffer..." },
    { "id": "m3n4o5", "metrics": {...}, "changes": "Batch-processed items in chunks of 64..." }
  ]
}
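
The sampling rules above could be sketched roughly as follows in Python. This is not the code in database.mjs: the `score` helper (mean of the metric values), the number of inspirations, and the decision to exclude the parent from the inspirations are all assumptions made for illustration. The 70/30 split and the top-3 cutoff come from the text.

```python
import random


def score(program: dict) -> float:
    """Assumption: rank programs by the mean of their metric values."""
    values = list(program["metrics"].values())
    return sum(values) / len(values)


def sample(programs, current_island, rng=random,
           exploit_prob=0.7, top_k=3, n_inspirations=2):
    """Return a parent from the current island plus cross-island inspirations."""
    island = [p for p in programs if p["island"] == current_island]
    if not island:
        raise ValueError("current island has no programs")
    ranked = sorted(island, key=score, reverse=True)
    if rng.random() < exploit_prob:
        parent = rng.choice(ranked[:top_k])  # exploitation: one of the top 3
    else:
        parent = rng.choice(island)          # exploration: any candidate
    # Inspirations: best programs across all islands (assumption: parent excluded).
    others = [p for p in programs if p["id"] != parent["id"]]
    best = sorted(others, key=score, reverse=True)[:n_inspirations]
    inspirations = [{k: p[k] for k in ("id", "metrics", "changes")} for p in best]
    return {"parent": parent, "inspirations": inspirations}
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is handy when debugging why a particular parent was chosen.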

Migration

Every migrationInterval generations, the database copies the top performer from each island to every other island. This cross-pollinates good strategies across sub-populations without homogenizing them — the copied program competes in its new island alongside unrelated candidates.
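
A rough Python sketch of this migration step is shown below. The helper names, the mean-of-metrics `score`, and giving each migrant a fresh id are assumptions; the page only states that each island's top performer is copied to every other island every `migrationInterval` generations.

```python
import uuid


def score(program: dict) -> float:
    """Assumption: rank programs by the mean of their metric values."""
    values = list(program["metrics"].values())
    return sum(values) / len(values)


def migrate(programs, n_islands):
    """Copy each island's top performer onto every other island."""
    copies = []
    for src in range(n_islands):
        island = [p for p in programs if p["island"] == src]
        if not island:
            continue
        best = max(island, key=score)
        for dst in range(n_islands):
            if dst != src:
                # Assumption: the copy gets a new id but is otherwise identical.
                copies.append({**best, "id": uuid.uuid4().hex[:8], "island": dst})
    # Copies are appended at the end so they cannot influence another
    # island's "best" within the same migration round.
    return programs + copies


def maybe_migrate(programs, n_islands, generation, migration_interval=10):
    """Run a migration only when the generation counter hits the interval."""
    if generation > 0 and generation % migration_interval == 0:
        return migrate(programs, n_islands)
    return programs
```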

Pruning

When an island exceeds its capacity (default 40), the database removes the lowest-scoring programs to make room. This keeps memory bounded and ensures sampling draws from a relevant pool rather than ancient, low-quality candidates.
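
Pruning can be pictured with a short Python sketch like the one below. The function name and the mean-of-metrics `score` are assumptions; the behavior (drop the lowest scorers once an island exceeds its capacity, leave other islands alone) follows the description above.

```python
def score(program: dict) -> float:
    """Assumption: rank programs by the mean of their metric values."""
    values = list(program["metrics"].values())
    return sum(values) / len(values)


def prune(programs, island, capacity=40):
    """Keep only the `capacity` best programs on one island."""
    members = [p for p in programs if p["island"] == island]
    if len(members) <= capacity:
        return programs  # nothing to do
    keep = sorted(members, key=score, reverse=True)[:capacity]
    keep_ids = {p["id"] for p in keep}
    # Programs on other islands pass through untouched.
    return [p for p in programs if p["island"] != island or p["id"] in keep_ids]
```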

The CLI Wrapper (db-cli.mjs)

The CLI wrapper exposes three commands:

# Check if a previous run exists
node db-cli.mjs info evolve-output/database
# → { "totalPrograms": 12, "bestMetrics": {...}, "generation": 12 }

# Sample a parent + inspirations
node db-cli.mjs sample evolve-output/database
# → { "parent": {...}, "inspirations": [...] }

# Add a scored candidate
node db-cli.mjs add evolve-output/database \
  --codePath "/path/to/candidate.py" \
  --targetCode "$(cat /tmp/code.txt)" \
  --parentId "a1b2c3d4" \
  --metrics '{"efficiency-score": 0.8}' \
  --changes "Replaced nested loops with hash map"
# → { "id": "new-id", "bestMetrics": {...}, "lastIteration": 13 }

Each command reads database.json, performs its operation, writes the updated state back, and prints JSON to stdout. The orchestrator parses this output and continues.

2. Context Extraction (references/context.md)

Before the optimization loop starts, we need to answer a question: what does the target function depend on? A mutation subagent can't improve code it doesn't understand. If it doesn't know what types a function accepts, what helpers it calls, or what libraries it imports, it'll produce mutations that don't compile.

The context extraction step runs once and produces a reusable markdown file that gets included in every mutation prompt.

What Gets Extracted

| Component | What It Captures | Why It Matters |
| --- | --- | --- |
| Imports | All import/require statements the target depends on | Mutations need to know what libraries are available |
| Goal | 1-2 sentence description of the function's purpose | Guides the subagent toward meaningful improvements |
| Dependencies | Signatures of functions/methods the target calls | Prevents mutations that break call contracts |

The Extraction Procedure

The procedure defined in context.md follows these steps:

Step 1 — Read the source file and identify the language from the file extension.

Step 2 — Locate the target by name. If multiple definitions share the same name (e.g., overloaded methods), the user can provide a disambiguator — either a line number (line:42) or a signature fragment ((int, str) -> bool).

Step 3 — Extract imports that the target depends on. This includes imports used inside the function body as well as imports used only in its signature (such as types imported for parameters).

Step 4 — Write a goal description. A concise summary of what the function does — its purpose, inputs, outputs, and key behavior. This goes into the mutation prompt so the subagent knows what the code should do, not just what it currently does.

Step 5 — Extract dependency signatures. For every function, method, or class that the target calls, record its name, full signature, and where it's defined. This gives the mutation subagent enough information to respect the API contracts without reading the entire codebase.
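
In the skill, this procedure is carried out by following context.md rather than by a dedicated script. For Python sources only, steps 2 and 5 could be approximated mechanically with the standard `ast` module, as in the sketch below. The function names are hypothetical, and the sketch only lists the names a target calls; resolving them to full signatures and file locations would still need extra work.

```python
import ast


def find_targets(source: str, name: str):
    """Step 2: every function or method definition with the given name."""
    tree = ast.parse(source)
    return [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name == name
    ]


def pick_target(candidates, line=None):
    """Apply a `line:N` style disambiguator when several definitions match."""
    if line is not None:
        candidates = [n for n in candidates if n.lineno <= line <= n.end_lineno]
    if len(candidates) != 1:
        raise LookupError(f"expected exactly one match, found {len(candidates)}")
    return candidates[0]


def called_names(func) -> set:
    """Step 5 (partial): names of the functions and methods the target calls."""
    names = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Call):
            callee = node.func
            if isinstance(callee, ast.Name):
                names.add(callee.id)      # plain call: helper(x)
            elif isinstance(callee, ast.Attribute):
                names.add(callee.attr)    # method call: obj.merge(x)
    return names
```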

Output Format

The extracted context is saved as a markdown file at evolve-output/context/<filename>_<target_name>.md:

# Context: processItems

**File**: `src/pipeline.py`
**Language**: Python
**Signature**: `def processItems(items: list[Item], config: Config) -> Result`
**Lines**: 45-89

## Goal

Processes a batch of items according to the given configuration,
applying transformations in sequence and aggregating results.

## Imports

```python
from typing import List
from .models import Item, Config, Result
from .transforms import apply_transform
```

## Dependencies

### apply_transform

- **Signature**: `def apply_transform(item: Item, transform_type: str) -> Item`
- **Location**: `src/transforms.py`

### Result.merge

- **Signature**: `def merge(self, other: Result) -> Result`
- **Location**: same file

This file is computed once and reused across all iterations. If it already exists when the skill starts (e.g., on a resumed run), the extraction step is skipped entirely.

Why This Matters

Without context extraction, the mutation subagent would either:

  • Need to read the entire codebase each iteration (slow, wastes tokens)
  • Operate blindly (produces mutations that don't compile or break APIs)

By pre-computing a focused context file, every iteration gets exactly the information it needs — no more, no less.

3. The Evaluator (references/evaluator.md)

The evaluator is the scoring function that determines whether a mutation is an improvement. In our skill, this is an LLM-as-judge: a separate subagent that reads candidate code and produces a structured score.

Why LLM-as-Judge?

Not every project has a benchmark script ready to go. Many optimization tasks — "make this code more efficient," "reduce anti-patterns," "improve memory usage" — don't have a single numeric metric you can run from the command line. An LLM evaluator handles these cases: it can assess code quality across multiple dimensions without requiring the user to write a custom benchmark.

When the user does provide a benchmark command (e.g., node bench.mjs), the skill combines both scores — LLM judgment and empirical measurement — for a more robust evaluation.

The Six Evaluation Dimensions

The evaluator rubric in evaluator.md assesses code across six dimensions:

1. Algorithmic Efficiency — Are algorithms appropriate for the problem? Are there redundant computations, unnecessary repeated iterations, or missing early exits?

2. Idiomatic Efficiency — Does the code leverage the language's built-in functions and standard library? Are language-native constructs used where they outperform manual implementations?

3. Anti-Pattern Detection — String concatenation in loops, unnecessary object creation per iteration, hidden quadratic behavior behind convenient abstractions, N+1 query patterns.

4. Memory Efficiency — Loading entire datasets vs. streaming, unnecessary data duplication, unreleased references, missing use of generators or lazy evaluation.

5. I/O and Resource Efficiency — Batched vs. one-at-a-time I/O, resource lifecycle management, caching, buffering.

6. Concurrency and Parallelism — Independent tasks run concurrently where possible, appropriate synchronization scope, correct async patterns.

Scoring Rules

Two important scoring rules keep evaluations fair:

  1. Score only on applicable dimensions. If the code has no I/O, dimension 5 is ignored. If it's inherently single-threaded, dimension 6 is skipped. The evaluator doesn't penalize code for things that don't apply.

  2. Weight by bottleneck impact. Among applicable dimensions, the evaluator weighs each by how much it affects the dominant performance bottleneck. Algorithmic inefficiency in a tight loop matters more than a missing buffer on a one-time file read.

The Scoring Rubric

The evaluator assigns an integer from 1 to 10:

| Score | Meaning |
| --- | --- |
| 1-2 | Severe inefficiency: obvious quadratic+ complexity where linear is possible, egregious anti-patterns |
| 3-4 | Poor: multiple anti-patterns, no optimization consideration, wasteful allocations |
| 5-6 | Acceptable: works correctly but has clear opportunities for improvement in at least two dimensions |
| 7-8 | Good: appropriate data structures and algorithms, leverages language idioms, minor improvements possible |
| 9-10 | Excellent: near-optimal, no anti-patterns, minimal memory footprint, well-chosen algorithms |

Output Format

The evaluator returns a single JSON object:

{"efficiency-score": 7}

There is no additional text, no markdown fence, and no explanation, just the score. This strict format makes parsing reliable: the orchestrator reads the JSON, extracts the integer, normalizes it to the 0.0-1.0 scale by dividing by 10, and moves on.

If parsing fails (invalid JSON, missing key, or score outside 1-10), the orchestrator retries the evaluator subagent once. If it fails again, the candidate is discarded and the loop continues to the next iteration.
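
The parse, validate, and retry-once behavior described above might look like this in Python. It is a sketch under assumptions: `ask_evaluator` stands in for whatever call spawns the evaluator subagent and returns its raw reply, and both function names are hypothetical.

```python
import json


def parse_efficiency_score(raw: str) -> float:
    """Parse the evaluator's reply and normalize the 1-10 integer by dividing by 10."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or "efficiency-score" not in data:
        raise ValueError("missing 'efficiency-score' key")
    value = data["efficiency-score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 10:
        raise ValueError(f"score must be an integer from 1 to 10, got {value!r}")
    return value / 10.0


def evaluate_with_retry(ask_evaluator, retries=1):
    """Call the evaluator, retrying once on a bad reply; None means 'discard'."""
    for _ in range(retries + 1):
        try:
            return parse_efficiency_score(ask_evaluator())
        except ValueError:
            continue
    return None


print(parse_efficiency_score('{"efficiency-score": 7}'))  # 0.7
```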

Combined Scoring

When the user provides a benchmark command alongside the LLM evaluator, the final score is the average of both:

final_score = (llm_score + benchmark_score) / 2

This gives you the best of both worlds: empirical measurement for properties you can benchmark, plus LLM judgment for properties that are hard to measure programmatically.
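
As a small illustration, the rule can be written as a Python helper that falls back to the LLM score when no benchmark is configured. The function name and the use of `None` to mean "no benchmark" are assumptions.

```python
from typing import Optional


def final_score(llm_score: float, benchmark_score: Optional[float] = None) -> float:
    """Average the two scores, or use the LLM score alone if there is no benchmark."""
    for value in (llm_score, benchmark_score):
        if value is not None and not 0.0 <= value <= 1.0:
            raise ValueError(f"scores must be in [0.0, 1.0], got {value}")
    if benchmark_score is None:
        return llm_score
    return (llm_score + benchmark_score) / 2
```

With the metrics used elsewhere on this page, an LLM score of 0.8 and a benchmark score of 0.75 average to roughly 0.775.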

4. The Optimization Loop (SKILL.md)

The SKILL.md file is the heart of the skill — it defines the complete workflow from input validation to final output. Let's walk through each step.

Step 0: Validate Inputs

The skill infers these parameters from the user's request:

| Input | Required? | Default |
| --- | --- | --- |
| Target file | Yes | none (ask user if unclear) |
| Target name | Yes | none (ask user if unclear) |
| Number of iterations | No | 10 |
| Optimization goal | No | "optimize code efficiency" |
| Test command | No | none (skip correctness gate) |
| Eval command | No | none (LLM evaluator only) |

If the target file doesn't exist, the skill stops. If the target name matches multiple definitions (overloaded methods), the skill lists all candidates and asks the user to pick one.

Step 1: Setup and Load Database

The skill creates a directory structure for all artifacts:

evolve-output/
├── database/       # Population state (database.json)
├── context/        # Extracted dependency context
├── candidates/     # Each iteration's mutated file
└── best/           # The winning implementation

It then checks if a previous run exists by calling node db-cli.mjs info. If the database isn't empty, the skill asks the user: resume or start fresh? This makes runs resumable — if you stop midway through 20 iterations, you can pick up where you left off without losing progress.

Step 2: Extract Context

If this is a fresh run, the skill follows the procedure in references/context.md to extract imports, dependencies, and goal description for the target. The result is saved to evolve-output/context/ for reuse across all iterations.

Step 3: Seed the Population

The initial implementation is the first member of the population. The skill:

  1. Extracts the target's source code from the file
  2. Evaluates it using the full evaluation pipeline (to establish a baseline score)
  3. Adds it to the database as the seed program (with parentId = "0")

If the evaluation fails on the seed, the skill stops — this means the evaluation setup itself is misconfigured (e.g., the test command doesn't work).

Step 4: The Evolution Loop

This is where the optimization happens. For each iteration i from 1 to N:

[Diagram: Evolution Loop Flow]

Let's look at each sub-step:

4a. Sample Parent and Inspirations

node db-cli.mjs sample evolve-output/database

The database returns a parent program (selected from the current island with a balance of exploitation and exploration) plus a list of "inspiration" programs — top performers whose change summaries will be included in the mutation prompt.

4b. Copy Parent to Candidate File

cp <parent.codePath> evolve-output/candidates/iteration_3.py

The subagent will edit this copy in place. The original source file is never edited by a subagent during the loop; it is only swapped out temporarily for test and benchmark runs, and changed permanently only in the final "apply" step.

4c. Spawn Mutation Subagent

The orchestrator assembles a prompt with four sections:

  1. Task — what to optimize and the goal
  2. Context — imports, dependencies, type signatures (from the context file)
  3. Evaluation history — what previous attempts scored and what they changed (from inspirations)
  4. Constraints — keep the same signature, maintain correctness, edit in place

The subagent reads the candidate file, makes improvements, and returns a summary of what it changed.

Here's what the prompt looks like in practice:

You are an expert code optimizer. Your goal: optimize code efficiency —
make the code faster and with fewer anti-patterns.

Modify the function `processItems` in the file
`evolve-output/candidates/iteration_3.py` to improve its efficiency.

# Context
[imports, dependencies, goal from context file]

# Previous Attempts
## Attempt 1 — Score: {"efficiency-score": 0.6}
Changes: Replaced list comprehension with generator to reduce memory.

## Attempt 2 — Score: {"efficiency-score": 0.8}
Changes: Used dict lookup instead of linear search in inner loop.

# Constraints
- Same function signature — do not rename or change parameters.
- Must remain correct — do not sacrifice correctness for performance.
- Edit the file in place.
- After editing, respond with a CHANGES summary of what you did and why.
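
Assembling a prompt of this shape is mostly string formatting. The Python sketch below shows one way to do it; the function name and parameters are hypothetical, and in the skill itself the orchestrator builds the prompt from SKILL.md instructions rather than from a script like this.

```python
import json


def build_mutation_prompt(target_name, candidate_path, goal, context_md, inspirations):
    """Assemble the four sections: task, context, evaluation history, constraints."""
    attempts = "\n\n".join(
        f"## Attempt {n} — Score: {json.dumps(insp['metrics'])}\n"
        f"Changes: {insp['changes']}"
        for n, insp in enumerate(inspirations, start=1)
    )
    return "\n".join([
        f"You are an expert code optimizer. Your goal: {goal}",
        "",
        f"Modify the function `{target_name}` in the file",
        f"`{candidate_path}` to improve its efficiency.",
        "",
        "# Context",
        context_md,
        "",
        "# Previous Attempts",
        attempts,
        "",
        "# Constraints",
        "- Same function signature — do not rename or change parameters.",
        "- Must remain correct — do not sacrifice correctness for performance.",
        "- Edit the file in place.",
        "- After editing, respond with a CHANGES summary of what you did and why.",
    ])
```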

4d. Read the Mutated Candidate

After the subagent finishes, the orchestrator:

  1. Reads the edited candidate file to extract the new target code
  2. Records the changes summary from the subagent's response
  3. Notes the parent's ID for lineage tracking

4e. Evaluate the Candidate

Evaluation is a multi-stage pipeline:

Correctness gate (only if a test command was provided):

  1. Back up the original file: cp src/pipeline.py src/pipeline.py.bak
  2. Swap in the candidate: cp evolve-output/candidates/iteration_3.py src/pipeline.py
  3. Run tests: npm test (or whatever command the user specified)
  4. Restore the original: mv src/pipeline.py.bak src/pipeline.py
  5. If tests fail → discard the candidate, move to next iteration

The backup/restore pattern keeps the project recoverable: the original is restored as soon as the tests finish, and if the skill is interrupted mid-evaluation, the leftover .bak file lets the next run put the original back.
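
The same backup, swap, and restore sequence can be expressed as a Python context manager, shown here as a sketch. The names `swapped_in` and `restore_leftover_backup` are hypothetical; the skill itself performs these steps with the shell commands listed above. The demo runs in a temporary directory so it touches no real project files.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def swapped_in(original: Path, candidate: Path):
    """Temporarily replace `original` with `candidate`, restoring it afterwards."""
    backup = original.with_name(original.name + ".bak")
    shutil.copy2(original, backup)         # 1. back up the original
    try:
        shutil.copy2(candidate, original)  # 2. swap in the candidate
        yield original                     # 3. caller runs tests or benchmark here
    finally:
        backup.replace(original)           # 4. restore, even if the caller raised


def restore_leftover_backup(original: Path) -> bool:
    """On resume: if a hard kill left a .bak file behind, put the original back."""
    backup = original.with_name(original.name + ".bak")
    if backup.exists():
        backup.replace(original)
        return True
    return False


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp, "pipeline.py")
    candidate = Path(tmp, "iteration_3.py")
    original.write_text("original")
    candidate.write_text("candidate")
    with swapped_in(original, candidate):
        during = original.read_text()
    after = original.read_text()
    backup_left = Path(tmp, "pipeline.py.bak").exists()
```

The `finally` clause covers exceptions raised while the tests run, but not a killed process, which is why a separate resume-time check for a leftover .bak file is still needed.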

LLM evaluation:

  1. Spawn an evaluator subagent with the rubric from references/evaluator.md plus the candidate's target code
  2. Parse the JSON response: {"efficiency-score": 8}
  3. Normalize: llm_score = 8 / 10.0 = 0.8

Benchmark evaluation (only if an eval command was provided):

  1. Same backup/swap/restore pattern as the correctness gate
  2. Run the benchmark command
  3. Parse the score from stdout (JSON or last numeric value)
  4. The score must be in [0.0, 1.0]
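
Parsing the benchmark output could be sketched in Python as follows. The page does not specify the JSON shape a benchmark should print, so accepting either a bare number or an object with a `benchmark-score` key (borrowed from the metrics example earlier) is an assumption, as is the function name.

```python
import json
import re

# Matches integers, decimals, and exponent notation such as 1e-3.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def parse_benchmark_score(stdout: str) -> float:
    """Read a score from benchmark stdout: JSON first, else the last number."""
    text = stdout.strip()
    score = None
    try:
        data = json.loads(text)
        if isinstance(data, dict):
            data = data.get("benchmark-score")  # assumed key name
        if isinstance(data, (int, float)) and not isinstance(data, bool):
            score = float(data)
    except json.JSONDecodeError:
        pass  # not JSON: fall through to the numeric scan
    if score is None:
        numbers = _NUMBER.findall(text)
        if not numbers:
            raise ValueError("no numeric score found in benchmark output")
        score = float(numbers[-1])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"benchmark score {score} is outside [0.0, 1.0]")
    return score
```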

Final score:

  • LLM only: final_score = llm_score
  • Both: final_score = (llm_score + benchmark_score) / 2

4f. Update Population

node db-cli.mjs add evolve-output/database \
  --codePath "/absolute/path/to/evolve-output/candidates/iteration_3.py" \
  --targetCode "$(cat /tmp/target_code.txt)" \
  --parentId "abc-123" \
  --metrics '{"efficiency-score": 0.8, "benchmark-score": 0.75}' \
  --changes "$(cat /tmp/changes.txt)"

The database handles the rest: assigning the program to an island, updating the best score, checking if migration is due, and pruning if the island is full.

4g. Report Progress

After each iteration, the user sees:

Iteration 3/10 | efficiency-score: 0.8 | benchmark-score: 0.75 | best: 0.85
  Δ Used dict lookup instead of linear search in inner loop

This keeps the user informed without overwhelming them — they can see the score trending and understand what strategies are working.

Step 5: Finalize

After all iterations complete:

  1. Retrieve the best program from the database
  2. Copy it to evolve-output/best/
  3. Report the improvement:
Evolution complete.
Baseline: {"efficiency-score": 0.5}
Best:     {"efficiency-score": 0.9} (iteration 7)
Improvement: +80%

Strategy that worked best: Replaced nested loops with hash map
lookup and introduced early-exit conditions.
  4. Show the best implementation's code
  5. Ask the user: "Apply this implementation to the original source file?"

If the user says yes, the best candidate is copied over the original. If not, it stays in evolve-output/best/ for manual review.
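
The improvement figure in the report above is a relative change against the baseline. A minimal Python sketch, with a hypothetical function name, reproduces the +80% from the example scores:

```python
def improvement_percent(baseline: float, best: float) -> float:
    """Relative improvement of `best` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (best - baseline) / baseline * 100


summary = f"Improvement: {improvement_percent(0.5, 0.9):+.0f}%"
print(summary)  # Improvement: +80%
```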

Error Handling and Resilience

The skill is designed to handle failures gracefully:

| Failure | How It's Handled |
| --- | --- |
| Subagent fails/times out | Retry once with the same prompt. If it fails again, discard the iteration and continue. |
| Tests fail on candidate | Discard the candidate (it's incorrect). Continue to next iteration. |
| Evaluator returns invalid JSON | Retry once. If still invalid, discard the iteration. |
| Benchmark command errors | Halt the entire run and report the error (likely a setup problem). |
| Skill is interrupted mid-evaluation | On resume, detect the .bak file and restore the original before continuing. |
| Seed evaluation fails | Stop immediately: the evaluation setup is misconfigured. |

The key principle: never leave the project in a broken state. Every file swap uses the backup/restore pattern, so the original is restored after each test or benchmark run, and a .bak file left behind by a crash is restored on resume.

Putting It All Together

The four components work in concert:

  1. The database script maintains a diverse population across islands, handles sampling with exploitation/exploration balance, and persists state so runs are resumable
  2. Context extraction runs once at the start, producing a reusable context file that keeps mutation prompts grounded in reality
  3. The evaluator provides consistent, dimension-aware scoring that drives selection pressure toward genuinely better code
  4. The optimization loop ties everything together — sampling diverse parents, assembling evolving prompts, delegating creative and analytical work to subagents, and accumulating improvements across generations

The result: code that improves measurably, iteration by iteration, driven by real evaluation rather than LLM intuition.