What's Next: Extending the Skill

The alphaevolve-skill as built is a working evolutionary optimizer — but it's intentionally minimal. The LLM evaluator uses a single generic rubric, the mutation strategy is uniform, and there's no language-specific knowledge baked in.

This page explores concrete directions for extending the skill. Each section is a self-contained improvement you could implement.

Language-Specific Evaluators

The current evaluator (references/evaluator.md) uses a single rubric for all languages. It works — but it misses language-specific performance patterns that a tailored evaluator would catch.

The Problem

A Python function that builds a list with repeated .append() calls works, but a list comprehension is usually faster and more idiomatic. A Java method that concatenates String objects in a loop instead of using StringBuilder is a textbook anti-pattern. A Rust function that calls .clone() unnecessarily pays for copies and allocations that borrowing would avoid.

The generic evaluator can identify some of these (it checks for "idiomatic efficiency"), but it lacks the specificity to score them consistently. A Python evaluator that explicitly knows about comprehensions, generators, __slots__, GIL implications, and NumPy vectorization will produce more actionable scores.
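To make the pattern concrete, here is a small sketch of the kind of code a Python-specific rubric would flag, next to the form it would reward. The function names are hypothetical, and the timing loop is only there so you can measure the difference on your own machine; the relative speed depends on the interpreter version and the workload.

```python
import timeit


def squares_append(n):
    """Builds the list with repeated .append() calls (the flagged pattern)."""
    out = []
    for i in range(n):
        out.append(i * i)
    return out


def squares_comprehension(n):
    """Builds the same list with a list comprehension."""
    return [i * i for i in range(n)]


if __name__ == "__main__":
    n = 100_000
    # Both variants must return identical results before speed matters.
    assert squares_append(n) == squares_comprehension(n)
    for fn in (squares_append, squares_comprehension):
        seconds = timeit.timeit(lambda: fn(n), number=20)
        print(f"{fn.__name__}: {seconds:.3f}s for 20 runs")
```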

How to Implement

Create language-specific evaluator files alongside the generic one:

references/
├── evaluator.md              # Generic (fallback)
├── evaluator-python.md       # Python-specific rubric
├── evaluator-typescript.md   # TypeScript-specific rubric
├── evaluator-java.md         # Java-specific rubric
└── evaluator-rust.md         # Rust-specific rubric

Each file follows the same structure as the current evaluator.md but adds language-specific dimensions. For example, a Python evaluator might include:

### Python-Specific Efficiency

- Are list comprehensions or generator expressions used instead of manual loops with append?
- Are built-in functions (map, filter, zip, enumerate, any, all) used where they outperform explicit iteration?
- Is NumPy/Pandas vectorization used instead of row-by-row iteration for numerical data?
- Are __slots__ used on frequently instantiated classes to reduce memory overhead?
- Are f-strings or join() used instead of repeated string concatenation?
- Is functools.lru_cache or manual memoization applied to pure functions with repeated calls?
- Are context managers used for resource management?
- Is the GIL considered — using multiprocessing for CPU-bound work vs. asyncio for I/O-bound?

The SKILL.md would then select the appropriate evaluator based on the file extension:

Evaluator selection:
- .py → references/evaluator-python.md
- .ts, .tsx, .js, .jsx → references/evaluator-typescript.md
- .java → references/evaluator-java.md
- .rs → references/evaluator-rust.md
- anything else → references/evaluator.md (generic)
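In the skill itself this rule would live as instructions in SKILL.md. If you prefer to resolve the path in a helper script, the same mapping could be sketched as below; `select_evaluator` and the `EVALUATORS` table are hypothetical names, and the paths are the ones proposed above.

```python
from pathlib import Path

# Extension-to-rubric mapping from the selection rule above.
EVALUATORS = {
    ".py": "references/evaluator-python.md",
    ".ts": "references/evaluator-typescript.md",
    ".tsx": "references/evaluator-typescript.md",
    ".js": "references/evaluator-typescript.md",
    ".jsx": "references/evaluator-typescript.md",
    ".java": "references/evaluator-java.md",
    ".rs": "references/evaluator-rust.md",
}
GENERIC_EVALUATOR = "references/evaluator.md"


def select_evaluator(target_file: str) -> str:
    """Return the rubric path for a target file, falling back to the generic one."""
    suffix = Path(target_file).suffix.lower()
    return EVALUATORS.get(suffix, GENERIC_EVALUATOR)
```

Lower-casing the suffix is a design choice in this sketch so that `App.TSX` and `app.tsx` resolve to the same rubric.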

Built-In Benchmark Templates

Right now, the user has to write their own benchmark script if they want empirical scoring. Many users skip this because writing a good benchmark is non-trivial. We can lower the barrier by providing reusable benchmark templates.

Template Approach

Create a benchmarks/ directory with ready-to-use templates:

benchmarks/
├── python-time.py       # Wall-clock timing for Python functions
├── python-memory.py     # Peak memory measurement for Python
├── node-time.mjs        # Wall-clock timing for JS/TS functions
├── node-memory.mjs      # Heap measurement for JS/TS
└── generic-time.sh      # Language-agnostic timing via /usr/bin/time

Example — python-time.py:

#!/usr/bin/env python3
"""
Benchmark template: measures wall-clock time of a target function.
Usage: python benchmarks/python-time.py <module_path> <function_name> <input_file>

Outputs a score in [0.0, 1.0] where higher = faster.
Configure BASELINE_MS and TARGET_MS below.
"""
import sys, time, importlib.util, json

BASELINE_MS = 1000   # Expected time for unoptimized code
TARGET_MS = 100      # Target time for "perfect" score
ITERATIONS = 100     # Number of runs to average

module_path, func_name, input_file = sys.argv[1], sys.argv[2], sys.argv[3]

# Load module dynamically
spec = importlib.util.spec_from_file_location("target", module_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
func = getattr(mod, func_name)

# Load test input
with open(input_file) as f:
    test_input = json.load(f)

# Measure
times = []
for _ in range(ITERATIONS):
    start = time.perf_counter()
    func(test_input)
    times.append((time.perf_counter() - start) * 1000)

avg_ms = sum(times) / len(times)
score = max(0.0, min(1.0, (BASELINE_MS - avg_ms) / (BASELINE_MS - TARGET_MS)))
print(json.dumps({"score": round(score, 4), "avg_ms": round(avg_ms, 2)}))

The user would then invoke:

evolve processItems in src/pipeline.py \
  — eval with "python benchmarks/python-time.py src/pipeline.py processItems data/test_input.json"

This is far easier than writing a benchmark from scratch.
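The memory template listed above could follow the same shape. The sketch below is one possible python-memory.py, assuming the standard-library `tracemalloc` module is an acceptable way to measure peak usage; `BASELINE_KB`, `TARGET_KB`, and the helper names are placeholders rather than part of the existing skill. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native extensions may be undercounted.

```python
#!/usr/bin/env python3
"""Hypothetical benchmarks/python-memory.py: scores peak memory of one call.

Usage: python benchmarks/python-memory.py <module_path> <function_name> <input_file>
"""
import importlib.util
import json
import sys
import tracemalloc

BASELINE_KB = 10_000  # Assumed peak for unoptimized code
TARGET_KB = 1_000     # Assumed peak for a "perfect" score


def peak_kb(func, arg):
    """Return the peak traced allocation, in KiB, while running func(arg)."""
    tracemalloc.start()
    try:
        func(arg)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 1024


def score_from_peak(kb, baseline=BASELINE_KB, target=TARGET_KB):
    """Linear score clamped to [0.0, 1.0]; a lower peak gives a higher score."""
    return max(0.0, min(1.0, (baseline - kb) / (baseline - target)))


def load_function(module_path, func_name):
    """Load func_name from the file at module_path."""
    spec = importlib.util.spec_from_file_location("target", module_path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return getattr(mod, func_name)


if __name__ == "__main__" and len(sys.argv) == 4:
    func = load_function(sys.argv[1], sys.argv[2])
    with open(sys.argv[3]) as f:
        test_input = json.load(f)
    kb = peak_kb(func, test_input)
    print(json.dumps({"score": round(score_from_peak(kb), 4), "peak_kb": round(kb, 1)}))
```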

Mutation Strategy Diversity

Currently, the mutation subagent gets a single prompt: "improve this code for efficiency." Every iteration uses the same approach — the subagent decides what to try. We can improve exploration by explicitly varying the mutation strategy.

Strategy Pool

Define a set of mutation strategies that the orchestrator rotates through:

| Strategy | Prompt Directive | What It Encourages |
|---|---|---|
| Data structure swap | "Try a completely different data structure for the main collection" | HashMap → TreeMap, List → Set, Array → Ring Buffer |
| Algorithm change | "Replace the core algorithm with a fundamentally different approach" | Sort-then-scan → Hash-based, BFS → DFS, iterative → divide-and-conquer |
| Micro-optimization | "Keep the same algorithm but optimize the hot path for fewer allocations and branches" | Pre-allocation, loop unrolling, branch elimination |
| Idiom upgrade | "Rewrite using the language's most efficient built-in patterns" | Comprehensions, stream APIs, vectorized operations |
| Parallelism | "Introduce concurrency or parallelism where independent work exists" | Async I/O, worker pools, SIMD-style batching |
| Free-form | "Improve efficiency in whatever way you think is most impactful" | Unconstrained exploration |

The orchestrator would cycle through strategies or select based on what's been working:

iteration 1: free-form
iteration 2: data structure swap
iteration 3: algorithm change
iteration 4: micro-optimization
iteration 5: idiom upgrade
iteration 6: (repeat strategy with best average score)
...

This prevents the population from collapsing to a single style of optimization (e.g., all candidates are minor variations of the same approach).
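One way to express "cycle first, then favor what works" is sketched below. It assumes the orchestrator records a `(strategy, score_delta)` pair per iteration; that record format, the function name, and the choice to cycle through the full pool before exploiting are all assumptions for illustration.

```python
from collections import defaultdict

STRATEGIES = [
    "free-form",
    "data structure swap",
    "algorithm change",
    "micro-optimization",
    "idiom upgrade",
    "parallelism",
]


def pick_strategy(iteration, history):
    """Choose the mutation strategy for a 1-based iteration number.

    history is a list of (strategy, score_delta) pairs from earlier iterations.
    """
    # First pass: try every strategy once, in pool order.
    if iteration <= len(STRATEGIES):
        return STRATEGIES[iteration - 1]

    # Afterwards: repeat the strategy with the best average improvement.
    deltas = defaultdict(list)
    for strategy, delta in history:
        deltas[strategy].append(delta)
    if not deltas:
        return "free-form"  # Nothing recorded yet, so stay unconstrained.
    return max(deltas, key=lambda s: sum(deltas[s]) / len(deltas[s]))
```

Always repeating the best performer can narrow exploration again, so a real orchestrator might occasionally pick a random strategy instead.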

Multi-Metric Evaluation

The current skill computes a single final score. Real optimization often involves trade-offs between competing objectives — making code faster might use more memory, or reducing allocations might increase code complexity.

Pareto-Based Selection

Instead of a single score, track multiple independent metrics:

{
  "efficiency-score": 0.8,
  "memory-score": 0.6,
  "readability-score": 0.9
}

A candidate is "better" than another only if it is at least as good on every metric and strictly better on at least one (Pareto dominance). Candidates that are better on some metrics but worse on others are kept for diversity: they represent different trade-off points.

This changes the database's comparison function. Instead of sorting by average score, you'd maintain the Pareto front: the set of candidates that no other candidate dominates.

Implementation Sketch

Add a --multi-metric mode to the database:

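// Returns true if candidate a Pareto-dominates candidate b:
// a is at least as good on every metric and strictly better on at least one.
function paretoDominates(a, b) {
  let strictlyBetter = false;
  // Look metrics up by name so key order in the two objects does not matter.
  for (const key of Object.keys(a.metrics)) {
    if (a.metrics[key] < b.metrics[key]) return false;
    if (a.metrics[key] > b.metrics[key]) strictlyBetter = true;
  }
  return strictlyBetter;
}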
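The dominance test is only half of the change; the database also needs to filter the population down to the front. A minimal Python sketch of that filtering step follows, assuming each candidate is a dict with an `id` and a `metrics` mapping that shares the same keys across candidates. The function names are hypothetical, and the quadratic scan is fine for the small populations this skill keeps.

```python
def dominates(a, b):
    """True if metrics a are at least as good as b everywhere and better somewhere."""
    at_least_as_good = all(a[key] >= b[key] for key in a)
    strictly_better = any(a[key] > b[key] for key in a)
    return at_least_as_good and strictly_better


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates, in input order."""
    return [
        c
        for c in candidates
        if not any(
            dominates(other["metrics"], c["metrics"])
            for other in candidates
            if other is not c
        )
    ]


if __name__ == "__main__":
    population = [
        {"id": "A", "metrics": {"efficiency": 0.9, "memory": 0.5, "readability": 0.7}},
        {"id": "B", "metrics": {"efficiency": 0.7, "memory": 0.9, "readability": 0.8}},
        {"id": "D", "metrics": {"efficiency": 0.6, "memory": 0.4, "readability": 0.6}},
    ]
    # D is worse than A on every metric, so only A and B remain.
    print([c["id"] for c in pareto_front(population)])
```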

The final report would then present the Pareto front:

Evolution complete. Pareto front (3 non-dominated solutions):

  A: efficiency=0.9, memory=0.5, readability=0.7 (fastest)
  B: efficiency=0.7, memory=0.9, readability=0.8 (most memory-efficient)
  C: efficiency=0.8, memory=0.7, readability=0.9 (most readable)

Which trade-off point would you like to apply?

Self-Evolving Meta-Prompts

One of AlphaEvolve's most powerful features is recursive self-improvement: the system evolves its own mutation prompts alongside the code. We can approximate this.

The Idea

Track which mutation prompts produce the best improvements. After every few iterations, adjust the prompt template based on what worked:

Iteration 3: "Try a different data structure" → score improved by +0.2
Iteration 5: "Reduce branching" → score improved by +0.15
Iteration 7: "Batch I/O operations" → no improvement

After 10 iterations, the orchestrator could summarize: "Data structure changes and branch reduction have been the most effective strategies. Emphasize these in future prompts."

Implementation

Add a meta-prompts.md file that accumulates strategy performance:

# Effective Strategies (updated after iteration 10)

## High Impact
- Data structure swaps: average improvement +0.18
- Eliminating redundant passes: average improvement +0.15

## Low Impact
- I/O batching: average improvement +0.02 (target has minimal I/O)
- Parallelism: no improvement (single-threaded workload)

## Recommendation for Next Iterations
Focus on data structure alternatives and loop fusion.
Deprioritize I/O and concurrency strategies.

This file would be included in the mutation prompt, giving the subagent accumulated wisdom about what works for this specific target.
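The bookkeeping behind such a file can be small. The sketch below groups recorded improvements by strategy and renders the two impact sections; the `(strategy, score_delta)` record format, the `summarize` name, and the 0.1 cutoff between high and low impact are assumptions for illustration. The recommendation paragraph is left out because the orchestrator would more naturally write it in prose.

```python
from collections import defaultdict


def summarize(history, iteration, high_impact=0.1):
    """Render meta-prompt markdown from (strategy, score_delta) records."""
    deltas = defaultdict(list)
    for strategy, delta in history:
        deltas[strategy].append(delta)

    # Rank strategies by average improvement, best first.
    ranked = sorted(
        ((name, sum(d) / len(d)) for name, d in deltas.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    high = [f"- {n}: average improvement {avg:+.2f}" for n, avg in ranked if avg >= high_impact]
    low = [f"- {n}: average improvement {avg:+.2f}" for n, avg in ranked if avg < high_impact]

    lines = [
        f"# Effective Strategies (updated after iteration {iteration})",
        "",
        "## High Impact",
        *high,
        "",
        "## Low Impact",
        *low,
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned text to meta-prompts.md after every few iterations keeps the mutation prompt in step with what has actually worked.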

Custom Evaluator Plugins

Beyond language-specific evaluators, users might want domain-specific scoring. A machine learning pipeline should be judged differently from a web server handler or a CLI tool.

Plugin Interface

Define a simple convention: any executable that accepts code on stdin and outputs a JSON score is a valid evaluator plugin:

# Custom evaluator for ML code
cat candidate.py | python evaluators/ml-efficiency.py
# Output: {"score": 0.7, "dimensions": {"vectorization": 0.9, "memory-locality": 0.5}}

The skill would support a --evaluator flag:

evolve train_step in model.py — evaluator "python evaluators/ml-efficiency.py"

Example domain-specific evaluators:

  • ML/Data Science: vectorization usage, memory locality, gradient computation efficiency
  • Web handlers: response time estimation, middleware overhead, connection pooling
  • Database queries: index usage, join strategy, result set size estimation
  • CLI tools: startup time, streaming vs. buffering, exit code handling
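On the skill's side, honoring the stdin-in, JSON-out convention takes only a few lines. The following is a sketch rather than the skill's actual code: `run_evaluator` is a hypothetical helper, and the timeout and the check that `score` lies in [0.0, 1.0] are assumptions (the range mirrors the benchmark template's output).

```python
import json
import shlex
import subprocess
import sys


def run_evaluator(command, code, timeout=60):
    """Pipe candidate code to an evaluator command and return its parsed JSON."""
    argv = shlex.split(command) if isinstance(command, str) else list(command)
    result = subprocess.run(
        argv,
        input=code,           # Candidate source goes to the plugin's stdin.
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,           # A non-zero exit raises CalledProcessError.
    )
    payload = json.loads(result.stdout)
    score = float(payload["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return payload


if __name__ == "__main__":
    # Stand-in plugin: scores every candidate 0.7 and reports its length.
    demo = [
        sys.executable,
        "-c",
        "import json, sys; print(json.dumps({'score': 0.7, 'chars': len(sys.stdin.read())}))",
    ]
    print(run_evaluator(demo, "def f(): pass\n"))
```

Treating a crash, a timeout, or malformed JSON as a hard error keeps a broken plugin from silently handing every candidate the same score.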

Population Visualization

The current skill reports progress as text. A visual representation of the population would help users understand the optimization landscape.

What to Visualize

  • Score over time — line chart showing best score, average score, and worst score per iteration
  • Family tree — which candidates are derived from which parents, highlighting the lineage of the best solution
  • Island diversity — how different the solutions on each island are (measured by edit distance or structural difference)

Implementation Approach

Generate a simple HTML report at evolve-output/report.html after finalization:

node scripts/visualize.mjs evolve-output/database > evolve-output/report.html

Even a basic chart showing score progression gives users intuition about whether to run more iterations (score still climbing) or stop (plateau reached).
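The proposed visualize script would be written for Node; to show how little is needed for a first version, here is the core idea sketched in Python instead. It assumes the best score per iteration has already been read from the database as a list of floats in [0.0, 1.0]; the `render_report` name and the page layout are illustrative only.

```python
def render_report(best_scores, width=600, height=200):
    """Return a standalone HTML page plotting the best score per iteration."""
    n = len(best_scores)
    # Spread points evenly along the x axis; guard against a single point.
    step = width / max(n - 1, 1)
    # SVG's y axis grows downward, so a score of 1.0 maps to y = 0.
    points = " ".join(
        f"{i * step:.1f},{height - score * height:.1f}"
        for i, score in enumerate(best_scores)
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        "<title>Evolution report</title></head><body>"
        f"<h1>Best score per iteration ({n} iterations)</h1>"
        f"<svg width='{width}' height='{height}' viewBox='0 0 {width} {height}'>"
        f"<polyline fill='none' stroke='steelblue' stroke-width='2' points='{points}'/>"
        "</svg></body></html>"
    )


if __name__ == "__main__":
    print(render_report([0.42, 0.55, 0.61, 0.61, 0.62]))
```

A line that flattens toward the right edge is the plateau signal described above.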

Summary of Extension Ideas

| Extension | Difficulty | Impact |
|---|---|---|
| Language-specific evaluators | Low (write new .md files) | Better scoring for language idioms |
| Benchmark templates | Low (reusable scripts) | More users get empirical measurement |
| Mutation strategy diversity | Medium (modify prompt assembly) | Broader exploration, fewer local optima |
| Multi-metric evaluation | Medium (change DB comparison logic) | Handle trade-offs between competing objectives |
| Self-evolving meta-prompts | Medium (track strategy performance) | Prompts improve over iterations |
| Custom evaluator plugins | Low (convention-based interface) | Domain-specific scoring |
| Population visualization | Medium (generate HTML report) | Better user understanding of the optimization |

Each of these builds on the existing architecture without changing the fundamental design. The skill's separation of concerns — orchestrator in SKILL.md, computation in scripts, creativity in subagents — makes it straightforward to add new evaluators, strategies, or reporting without restructuring the whole system.

The most impactful first step? Language-specific evaluators. They're easy to write (just a markdown file with a tailored rubric), require no changes to the skill's scripts (only a selection rule in SKILL.md), and immediately improve scoring quality for the languages you use most.