Usage: Running the AlphaEvolve Skill

Now that you understand what AlphaEvolve is and why it matters, let's see the skill in action. This page covers installation, invoking the skill, configuring evaluators, interpreting results, and tips for getting the most out of evolutionary optimization. Once you've seen how the skill works from a user's perspective, the following pages dive into its design and implementation.

Installation

The skill is installed via the skills CLI tool. Run one command depending on which coding agent you use:

Claude Code:

npx skills add https://github.com/yangwenz/alphaevolve-skill --skill alphaevolve-skill -a claude-code

Codex:

npx skills add https://github.com/yangwenz/alphaevolve-skill --skill alphaevolve-skill -a codex

That's it. The only requirement is Node.js 18+: the database and CLI are plain ESM JavaScript with no external package dependencies.

After installation, the skill is available in your coding agent. You can verify by checking that a SKILL.md file exists in your .claude/skills/ directory (for Claude Code) or the equivalent for Codex.
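The check described above can be scripted. The following is a minimal sketch, assuming the skill lands somewhere under `.claude/skills/` (the exact subdirectory is not stated here, so the search is recursive). The function name `find_skill_files` is hypothetical and not part of the skill.

```python
from pathlib import Path


def find_skill_files(skills_dir: str = ".claude/skills") -> list[Path]:
    """Return every SKILL.md found under skills_dir (empty list if none)."""
    root = Path(skills_dir)
    if not root.is_dir():
        return []
    # Search recursively because the exact subdirectory is an assumption.
    return sorted(root.rglob("SKILL.md"))


if __name__ == "__main__":
    matches = find_skill_files()
    if matches:
        for path in matches:
            print(f"found {path}")
    else:
        print("no SKILL.md found; the skill may not be installed for this project")
```

Run it from the project root. For Codex, pass whichever skills directory that agent uses.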

Invoking the Skill

Trigger the skill with natural language. The skill activates when you use words like "evolve", "optimize", or "run alphaevolve":

# Basic — optimize a function for efficiency
evolve the function processItems in src/pipeline.py

# Specify a goal
optimize the render method in components/Chart.tsx — reduce memory allocations

# Set iteration count
run alphaevolve on the sort function in utils/sort.py — 20 iterations

# With a test suite as correctness gate
evolve parseJSON in src/parser.ts with test command "npm test"

# With a benchmark for empirical scoring
optimize batchQuery in src/db.py — eval with "python bench/query_bench.py", test with "pytest tests/"

You don't need to remember exact syntax. The skill infers parameters from your natural language request and asks clarifying questions if something is missing.

Parameters

| Parameter | Required? | Default | Description |
| --- | --- | --- | --- |
| Target file | Yes | | Path to the file containing the code to optimize |
| Target name | Yes | | Function, method, or class name to evolve |
| Iterations | No | 10 | Number of evolutionary cycles to run |
| Optimization goal | No | "optimize code efficiency" | What the optimizer should aim for |
| Test command | No | skip | Shell command to validate correctness (exit 0 = pass) |
| Eval command | No | LLM-only | Benchmark script that outputs a score in [0.0, 1.0] |

When to Provide a Test Command

If you have tests that cover the target function, always provide them. The test command acts as a correctness gate — any mutation that breaks tests is automatically discarded. Without it, the skill relies solely on the LLM evaluator, which focuses on code quality but can't guarantee behavioral correctness.

# Good: tests prevent incorrect mutations from surviving
evolve processItems in src/pipeline.py with test command "pytest tests/test_pipeline.py"

# Also works with any command that returns exit code 0 on success
evolve sort in lib/sort.js with test command "node --test tests/sort.test.js"
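To make the "exit 0 = pass" rule concrete, here is a minimal sketch of such a gate. It is an illustration, not the skill's actual implementation: the function name `passes_gate` and the timeout are assumptions.

```python
import subprocess


def passes_gate(test_command: str, timeout_s: float = 300.0) -> bool:
    """Run test_command in a shell; return True only if it exits with code 0."""
    try:
        result = subprocess.run(
            test_command,
            shell=True,           # the command is given as a single shell string
            capture_output=True,  # keep test output out of the caller's console
            text=True,
            timeout=timeout_s,    # assumed safeguard against hung test runs
        )
    except subprocess.TimeoutExpired:
        return False  # a test run that never finishes counts as a failure
    return result.returncode == 0
```

A candidate is kept only when `passes_gate("pytest tests/test_pipeline.py")` returns `True`. Any non-zero exit code, including a crash in the test runner itself, discards it.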

When to Provide an Eval Command

Provide a benchmark when you have a measurable, quantifiable metric:

  • Execution time (latency, throughput)
  • Memory usage
  • Number of operations
  • Any numeric performance indicator

Your eval command should output a score between 0.0 and 1.0, where higher is better. The skill accepts two formats:

JSON output (preferred):

{"score": 0.85}

Plain numeric output (also works — the skill takes the last number in stdout):

Running benchmark...
Score: 0.85
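The two formats suggest a simple parsing order: look for a JSON object with a `score` key first, then fall back to the last number in stdout. The sketch below illustrates that order under those assumptions. `parse_score` is a hypothetical name, and the skill's real parser may differ in its details.

```python
import json
import re
from typing import Optional

# Matches integers, decimals, and exponent notation, with an optional sign.
_NUMBER = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def parse_score(stdout: str) -> Optional[float]:
    """Extract a score from benchmark output, or None if nothing usable is found."""
    # Preferred format: a line holding a JSON object with a numeric "score" key.
    for line in reversed(stdout.strip().splitlines()):
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            value = json.loads(line).get("score")
        except json.JSONDecodeError:
            continue
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
    # Fallback: the last number that appears anywhere in stdout.
    numbers = _NUMBER.findall(stdout)
    return float(numbers[-1]) if numbers else None
```

Because the fallback takes the last number, a benchmark that prints anything numeric after the score (a timestamp, a row count) would be misread. Emitting JSON avoids that ambiguity.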

Example benchmark script:

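# bench/query_bench.py
import json
import time

from src.db import batchQuery


def load_test_data():
    # Placeholder: return input that is representative of your real workload
    raise NotImplementedError("supply benchmark data for batchQuery")


test_data = load_test_data()

# Time 1000 calls so that per-call noise averages out
start = time.perf_counter()
for _ in range(1000):
    batchQuery(test_data)
elapsed = time.perf_counter() - start

# Normalize: faster = higher score
# 2.0 s (the baseline) or slower scores 0.0; 0.5 s or faster scores 1.0
score = max(0.0, min(1.0, 1.0 - (elapsed - 0.5) / 1.5))
print(json.dumps({"score": round(score, 3)}))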
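The same normalization pattern applies to the other metrics listed above, such as memory usage. The following is a sketch, not a recommended benchmark. `build_squares` is a hypothetical workload standing in for your own function, and the 1 MB and 16 MB bounds are made-up values you would calibrate against your baseline.

```python
import json
import tracemalloc


def normalized_score(measured: float, best: float, worst: float) -> float:
    """Map a lower-is-better measurement onto [0.0, 1.0] (best -> 1.0, worst -> 0.0)."""
    if worst <= best:
        raise ValueError("worst must be greater than best")
    return max(0.0, min(1.0, 1.0 - (measured - best) / (worst - best)))


def build_squares(n: int) -> list:
    """Hypothetical workload standing in for the function under test."""
    return [i * i for i in range(n)]


def peak_bytes(func, *args) -> int:
    """Return the peak traced allocation in bytes while func runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    peak = peak_bytes(build_squares, 100_000)
    # Assumed bounds: 1 MB or less is a perfect score, 16 MB or more scores 0.0.
    score = normalized_score(peak, best=1_000_000, worst=16_000_000)
    print(json.dumps({"score": round(score, 3)}))
```

With `best=0.5` and `worst=2.0`, `normalized_score` reproduces the timing formula used in the script above.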

What Happens When You Run It

Here's what you'll see during a typical run:

1. Input Validation

The skill confirms what it will optimize:

Target: processItems in src/pipeline.py (lines 45-89)
Goal: optimize code efficiency
Iterations: 10
Test command: pytest tests/test_pipeline.py
Eval command: none (LLM evaluator only)

If the target name matches multiple definitions, the skill lists them and asks you to pick:

Multiple matches for "process" in src/pipeline.py:
  1. processItems (line 45) — def processItems(items: list, config: Config) -> Result
  2. processQueue (line 112) — def processQueue(queue: Queue) -> None

Which one should I evolve?
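For a Python target, this kind of ambiguity can be detected by walking the file's syntax tree and collecting every definition whose name contains the requested string. This is a sketch under that assumption: `find_definitions` is a hypothetical helper, and the skill itself may resolve names differently.

```python
import ast


def find_definitions(source: str, query: str) -> list[tuple[str, int]]:
    """Return (name, line) for each function or class whose name contains query."""
    matches = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if query.lower() in node.name.lower():
                matches.append((node.name, node.lineno))
    return sorted(matches, key=lambda match: match[1])


# Hypothetical file contents, used only to exercise the helper.
SAMPLE = '''
def processItems(items, config):
    return items

def processQueue(queue):
    return None

def render(chart):
    return chart
'''

for name, line in find_definitions(SAMPLE, "process"):
    print(f"{name} (line {line})")
```

Querying for the full name `processItems` returns a single match, which is why a more specific target name avoids the prompt.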

2. Baseline Evaluation

The skill evaluates your current implementation to establish a starting score:

Baseline score: {"efficiency-score": 0.5}
Starting evolution.

3. Iteration Progress

Each iteration reports its score and what changed:

Iteration 1/10 | efficiency-score: 0.6 | best: 0.6
  Δ Replaced repeated list.index() calls with a pre-built dictionary lookup

Iteration 2/10 | efficiency-score: 0.5 | best: 0.6
  Δ Attempted generator-based streaming but increased complexity without gain

Iteration 3/10 | efficiency-score: 0.7 | best: 0.7
  Δ Combined two sequential passes into a single loop with accumulator

Iteration 4/10 | DISCARDED (tests failed)
  Δ Changed return type which broke downstream callers

Iteration 5/10 | efficiency-score: 0.8 | best: 0.8
  Δ Used collections.Counter instead of manual counting loop, added early exit

Notice that iteration 2 didn't improve the best score (exploration attempt), and iteration 4 was discarded because it broke tests. This is normal — not every mutation is an improvement, and the correctness gate prevents regressions.
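The bookkeeping behind the "best" column can be sketched in a few lines. This is a simplified illustration using the scores from the sample log. It tracks only the running best and says nothing about how the skill manages its population.

```python
# (iteration, score) pairs from the sample log; None marks a discarded candidate.
results = [(1, 0.6), (2, 0.5), (3, 0.7), (4, None), (5, 0.8)]


def track_best(results, baseline: float):
    """Yield (iteration, score, best_so_far), ignoring discarded candidates."""
    best = baseline
    for iteration, score in results:
        # A discarded candidate has no score and can never become the best.
        if score is not None and score > best:
            best = score
        yield iteration, score, best


for iteration, score, best in track_best(results, baseline=0.5):
    status = "DISCARDED" if score is None else f"score: {score}"
    print(f"Iteration {iteration} | {status} | best: {best}")
```

Iterations 2 and 4 leave the best score unchanged, matching the log: a lower score and a discarded candidate are both harmless to the running best.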

4. Final Report

Evolution complete.
Baseline: {"efficiency-score": 0.5}
Best:     {"efficiency-score": 0.8} (iteration 5)
Improvement: +60%

Strategy that worked best: Replaced manual iteration patterns with
standard library utilities (Counter, dict comprehension) and eliminated
a redundant second pass by combining operations into a single loop.
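The report does not spell out how the improvement figure is computed. The numbers shown are consistent with a relative gain over the baseline, which the sketch below assumes. `relative_improvement` is a hypothetical helper.

```python
def relative_improvement(baseline: float, best: float) -> float:
    """Fractional gain of best over baseline (0.6 means +60%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive for a relative comparison")
    return (best - baseline) / baseline


# Baseline 0.5 and best 0.8, as in the sample report.
print(f"Improvement: {relative_improvement(0.5, 0.8):+.0%}")  # prints "Improvement: +60%"
```

Under this reading, the score rose by 0.3 in absolute terms, which is 60% of the 0.5 baseline.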

The skill then shows you the best implementation and asks:

Apply this implementation to the original source file?

If you say yes, it replaces your original code. If not, the result is saved in evolve-output/best/ for manual review.

Resuming an Interrupted Run

If you stop the skill mid-run (Ctrl+C, close the terminal, etc.), your progress is preserved. Next time you invoke the skill on the same target, it detects the existing state:

An existing evolution run was found (5 programs, best score: {"efficiency-score": 0.8}).
Resume or start fresh?

Choose resume to continue from where you left off. The database, all candidate files, and the extracted context are intact. The skill picks up at the next iteration number.

Choose start fresh to delete everything and begin from scratch.

Output Artifacts

Everything the skill produces lives in evolve-output/ relative to your working directory:

evolve-output/
├── database/
│   └── database.json        # Full population state (programs, islands, scores)
├── context/
│   └── pipeline_processItems.md   # Extracted dependency context
├── candidates/
│   ├── iteration_1.py       # Each iteration's mutated file
│   ├── iteration_2.py
│   ├── iteration_3.py
│   └── ...
└── best/
    └── pipeline.py          # The winning implementation

You can inspect any of these:

  • candidates/ — see exactly what each iteration produced (useful for understanding what strategies were tried)
  • database/database.json — the full population with scores, lineage, and island assignments
  • best/ — the final result, ready to copy into your project
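A short script can help when browsing these artifacts. The sketch below relies only on the directory layout shown above and makes no assumptions about the contents of `database.json`. Both function names are hypothetical.

```python
import re
from pathlib import Path


def list_candidates(output_dir: str = "evolve-output") -> list[Path]:
    """Return candidate files in iteration order (iteration_2 before iteration_10)."""
    candidates_dir = Path(output_dir) / "candidates"
    if not candidates_dir.is_dir():
        return []

    def iteration_number(path: Path) -> int:
        match = re.fullmatch(r"iteration_(\d+)", path.stem)
        return int(match.group(1)) if match else -1

    # Skip anything that does not follow the iteration_N naming pattern.
    files = [p for p in candidates_dir.iterdir() if iteration_number(p) >= 0]
    return sorted(files, key=iteration_number)


def has_saved_run(output_dir: str = "evolve-output") -> bool:
    """True if a population database from an earlier run is present."""
    return (Path(output_dir) / "database" / "database.json").is_file()


if __name__ == "__main__":
    print("saved run found" if has_saved_run() else "no saved run")
    for path in list_candidates():
        print(path)
```

Sorting by the parsed number rather than by file name keeps `iteration_10.py` after `iteration_2.py`, which matters once a run goes past nine iterations.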

Tips for Getting Good Results

Choose the Right Target

The skill works best on:

  • Functions with clear performance characteristics — loops, data transformations, algorithms
  • Code with room to improve — if your function is already near-optimal, 10 iterations won't find much
  • Self-contained logic — functions that do real work rather than thin wrappers

Less ideal targets:

  • Trivial getters/setters (nothing to optimize)
  • Functions that are mostly I/O-bound with no computational logic
  • Code where correctness is extremely subtle (the LLM evaluator focuses on efficiency, not correctness edge cases)
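As an illustration of "code with room to improve", here is a hypothetical target and one possible optimized form. It mirrors the kind of change reported in the sample run (replacing repeated `list.index()` calls with a dictionary lookup). Both functions are invented for this example.

```python
def rank_items_slow(items: list[str], order: list[str]) -> list[int]:
    """O(n * m): list.index() rescans `order` for every item."""
    return [order.index(item) for item in items]


def rank_items_fast(items: list[str], order: list[str]) -> list[int]:
    """O(n + m): build the position lookup once, then use dict access."""
    position: dict[str, int] = {}
    for index, value in enumerate(order):
        # Keep the first occurrence so duplicates behave like list.index().
        position.setdefault(value, index)
    # Note: a missing item raises KeyError here, but ValueError in the slow version.
    return [position[item] for item in items]
```

The two versions agree on valid input but raise different exceptions for a missing item. That is the kind of subtle behavioral difference an efficiency-focused evaluator can miss and a test suite can catch.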

Write a Specific Goal

The default goal ("optimize code efficiency") works, but a specific goal produces better results:

# Generic (OK)
evolve processItems in src/pipeline.py

# Specific (better)
evolve processItems in src/pipeline.py — reduce memory allocations,
the function currently creates too many intermediate lists

# Very specific (best)
evolve processItems in src/pipeline.py — this function processes
100k items in a hot loop, optimize for CPU time, memory usage is not
a concern since items are small

The more context you give about what matters, the better the mutations will be.

Use More Iterations for Hard Problems

The default of 10 iterations is reasonable for straightforward optimization. Adjust the count to the difficulty of the problem:

  • 5 iterations — quick check, "is there low-hanging fruit?"
  • 10 iterations — standard optimization pass
  • 20-30 iterations — deep optimization for performance-critical code
  • 50+ iterations — exhaustive search (e.g., algorithmic breakthroughs)

More iterations give the population more chances to discover diverse strategies and combine them.

Always Provide Tests When Available

Tests are your safety net. Without them, a mutation might produce code that's "efficient" but subtly broken. Even a basic smoke test is better than nothing:

# Even a simple test is valuable
evolve sort in lib/sort.js with test command "node -e \"require('./lib/sort').sort([3,1,2])\""
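A smoke test that only calls the function catches crashes but not wrong results. A slightly stronger version also checks a few known outputs and signals failure through its exit code. The sketch below shows that shape in Python. `sort_items` is a hypothetical stand-in that you would replace with an import of the real function under test.

```python
import sys


def sort_items(values):
    """Hypothetical stand-in; in practice, import the real function under test."""
    return sorted(values)


def smoke_test() -> int:
    """Return 0 if the basic behavior holds, 1 otherwise."""
    cases = [([3, 1, 2], [1, 2, 3]), ([], []), ([2, 2, 1], [1, 2, 2])]
    for given, expected in cases:
        actual = sort_items(list(given))
        if actual != expected:
            print(f"FAIL: sort_items({given}) returned {actual}, expected {expected}")
            return 1
    return 0


if __name__ == "__main__":
    exit_code = smoke_test()
    if exit_code != 0:
        sys.exit(exit_code)  # non-zero exit tells the correctness gate to discard
```

Saved as a script, it can be passed as the test command in the same way as `pytest` or `node --test`.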

Inspect Failed Iterations

When iterations are discarded (test failures), look at what the mutation tried. Sometimes a good idea was executed poorly — you might want to manually incorporate the concept even if the automated version broke something.

The candidate files for discarded iterations are still saved in evolve-output/candidates/. You can read them to understand what was attempted.

Troubleshooting

"Seed evaluation fails": The evaluation setup is broken. Check that your test command passes on the unmodified code, and that Node.js 18+ is available for the database scripts.

Every iteration is discarded: Your tests might be too strict (failing on semantically equivalent but syntactically different code), or the target is too complex for single-iteration mutations. Try relaxing the test command or breaking the target into smaller functions.

Scores aren't improving: The code may already be near-optimal as far as the LLM evaluator can judge. Try adding an eval command for empirical measurement, or increase the iteration count to give the population more time to explore.

"Multiple matches" keeps appearing: Your target name is ambiguous. Use a more specific name or provide additional context: "evolve the processItems method in the Pipeline class in src/pipeline.py".