What's Next: Extending the Skill
The alphaevolve-skill as built is a working evolutionary optimizer — but it's intentionally minimal. The LLM evaluator uses a single generic rubric, the mutation strategy is uniform, and there's no language-specific knowledge baked in.
This page explores concrete directions for extending the skill. Each section is a self-contained improvement you could implement.
Language-Specific Evaluators
The current evaluator (references/evaluator.md) uses a single rubric for all languages. It works — but it misses language-specific performance patterns that a tailored evaluator would catch.
The Problem
A Python function that builds a list with repeated .append() calls works, but is usually slower than an equivalent list comprehension. A Java method that concatenates String objects in a loop instead of using StringBuilder is a textbook anti-pattern. A Rust function that calls .clone() unnecessarily pays for copies and allocations that borrowing would have avoided.
The generic evaluator can identify some of these (it checks for "idiomatic efficiency"), but it lacks the specificity to score them consistently. A Python evaluator that explicitly knows about comprehensions, generators, __slots__, GIL implications, and NumPy vectorization will produce more actionable scores.
How to Implement
Create language-specific evaluator files alongside the generic one:
references/
├── evaluator.md # Generic (fallback)
├── evaluator-python.md # Python-specific rubric
├── evaluator-typescript.md # TypeScript-specific rubric
├── evaluator-java.md # Java-specific rubric
└── evaluator-rust.md # Rust-specific rubric
Each file follows the same structure as the current evaluator.md but adds language-specific dimensions. For example, a Python evaluator might include:
### Python-Specific Efficiency
- Are list comprehensions or generator expressions used instead of manual loops with append?
- Are built-in functions (map, filter, zip, enumerate, any, all) used where they outperform explicit iteration?
- Is NumPy/Pandas vectorization used instead of row-by-row iteration for numerical data?
- Are __slots__ used on frequently instantiated classes to reduce memory overhead?
- Are f-strings or join() used instead of repeated string concatenation?
- Is functools.lru_cache or manual memoization applied to pure functions with repeated calls?
- Are context managers used for resource management?
- Is the GIL considered — using multiprocessing for CPU-bound work vs. asyncio for I/O-bound?
The SKILL.md would then select the appropriate evaluator based on the file extension:
Evaluator selection:
- .py → references/evaluator-python.md
- .ts, .tsx, .js, .jsx → references/evaluator-typescript.md
- .java → references/evaluator-java.md
- .rs → references/evaluator-rust.md
- anything else → references/evaluator.md (generic)
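The selection above is described as something SKILL.md does in prose. If it were moved into a script instead, a minimal sketch might look like the following; `select_evaluator` and the mapping constant are hypothetical names, not part of the existing skill.

```python
from pathlib import Path

# Hypothetical mapping that mirrors the selection list above.
EVALUATOR_BY_EXTENSION = {
    ".py": "references/evaluator-python.md",
    ".ts": "references/evaluator-typescript.md",
    ".tsx": "references/evaluator-typescript.md",
    ".js": "references/evaluator-typescript.md",
    ".jsx": "references/evaluator-typescript.md",
    ".java": "references/evaluator-java.md",
    ".rs": "references/evaluator-rust.md",
}
GENERIC_EVALUATOR = "references/evaluator.md"


def select_evaluator(target_file: str) -> str:
    """Return the rubric path for a target file, falling back to the generic one."""
    extension = Path(target_file).suffix.lower()
    return EVALUATOR_BY_EXTENSION.get(extension, GENERIC_EVALUATOR)
```

Keeping the generic rubric as the default means an unrecognized or missing extension never leaves a run without an evaluator.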
Built-In Benchmark Templates
Right now, the user has to write their own benchmark script if they want empirical scoring. Many users skip this because writing a good benchmark is non-trivial. We can lower the barrier by providing reusable benchmark templates.
Template Approach
Create a benchmarks/ directory with ready-to-use templates:
benchmarks/
├── python-time.py # Wall-clock timing for Python functions
├── python-memory.py # Peak memory measurement for Python
├── node-time.mjs # Wall-clock timing for JS/TS functions
├── node-memory.mjs # Heap measurement for JS/TS
└── generic-time.sh # Language-agnostic timing via /usr/bin/time
Example — python-time.py:
#!/usr/bin/env python3
"""
Benchmark template: measures wall-clock time of a target function.
Usage: python benchmarks/python-time.py <module_path> <function_name> <input_file>
Outputs a score in [0.0, 1.0] where higher = faster.
Configure BASELINE_MS and TARGET_MS below.
"""
import sys, time, importlib.util, json

BASELINE_MS = 1000  # Expected time for unoptimized code
TARGET_MS = 100     # Target time for "perfect" score
ITERATIONS = 100    # Number of runs to average

module_path, func_name, input_file = sys.argv[1], sys.argv[2], sys.argv[3]

# Load module dynamically
spec = importlib.util.spec_from_file_location("target", module_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
func = getattr(mod, func_name)

# Load test input
with open(input_file) as f:
    test_input = json.load(f)

# Measure
times = []
for _ in range(ITERATIONS):
    start = time.perf_counter()
    func(test_input)
    times.append((time.perf_counter() - start) * 1000)

avg_ms = sum(times) / len(times)
score = max(0.0, min(1.0, (BASELINE_MS - avg_ms) / (BASELINE_MS - TARGET_MS)))
print(json.dumps({"score": round(score, 4), "avg_ms": round(avg_ms, 2)}))
The user would then invoke:
evolve processItems in src/pipeline.py \
— eval with "python benchmarks/python-time.py src/pipeline.py processItems data/test_input.json"
This is far easier than writing a benchmark from scratch.
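The python-memory.py template listed above is not shown in this page. As a rough sketch of what its core could look like, the following uses the standard-library tracemalloc module and the same linear scoring idea as the timing template. The byte thresholds and function names are assumptions for illustration, and tracemalloc only sees allocations made through Python's memory allocators.

```python
import tracemalloc

BASELINE_BYTES = 50_000_000  # Assumed peak for unoptimized code
TARGET_BYTES = 5_000_000     # Assumed peak for a "perfect" score


def peak_memory_bytes(func, *args):
    """Run func once and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def memory_score(peak, baseline=BASELINE_BYTES, target=TARGET_BYTES):
    """Map a peak-memory reading to [0.0, 1.0], where higher = less memory."""
    return max(0.0, min(1.0, (baseline - peak) / (baseline - target)))
```

With these assumed thresholds, a peak halfway between baseline and target (27.5 MB) scores 0.5, and anything at or above the baseline scores 0.0.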
Mutation Strategy Diversity
Currently, the mutation subagent gets a single prompt: "improve this code for efficiency." Every iteration uses the same approach — the subagent decides what to try. We can improve exploration by explicitly varying the mutation strategy.
Strategy Pool
Define a set of mutation strategies that the orchestrator rotates through:
| Strategy | Prompt Directive | What It Encourages |
|---|---|---|
| Data structure swap | "Try a completely different data structure for the main collection" | HashMap → TreeMap, List → Set, Array → Ring Buffer |
| Algorithm change | "Replace the core algorithm with a fundamentally different approach" | Sort-then-scan → Hash-based, BFS → DFS, iterative → divide-and-conquer |
| Micro-optimization | "Keep the same algorithm but optimize the hot path for fewer allocations and branches" | Pre-allocation, loop unrolling, branch elimination |
| Idiom upgrade | "Rewrite using the language's most efficient built-in patterns" | Comprehensions, stream APIs, vectorized operations |
| Parallelism | "Introduce concurrency or parallelism where independent work exists" | Async I/O, worker pools, SIMD-style batching |
| Free-form | "Improve efficiency in whatever way you think is most impactful" | Unconstrained exploration |
The orchestrator would cycle through strategies or select based on what's been working:
iteration 1: free-form
iteration 2: data structure swap
iteration 3: algorithm change
iteration 4: micro-optimization
iteration 5: idiom upgrade
iteration 6: (repeat strategy with best average score)
...
This prevents the population from collapsing to a single style of optimization (e.g., all candidates are minor variations of the same approach).
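The schedule above (one pass through the rotation, then repeat whichever strategy has the best average score) could be sketched as follows. The function name and the shape of `history` are assumptions; the rotation list copies the five strategies from the example schedule.

```python
from collections import defaultdict

# Rotation order taken from the example schedule above.
ROTATION = [
    "free-form",
    "data structure swap",
    "algorithm change",
    "micro-optimization",
    "idiom upgrade",
]


def pick_strategy(iteration, history):
    """Choose a strategy for a 1-based iteration number.

    history is a list of (strategy, score) pairs from earlier iterations
    (a hypothetical record format).
    """
    # First pass: try each strategy once, in order.
    if iteration <= len(ROTATION):
        return ROTATION[iteration - 1]

    # Afterwards: repeat the strategy with the best average score.
    scores = defaultdict(list)
    for strategy, score in history:
        scores[strategy].append(score)
    if not scores:
        return "free-form"
    return max(scores, key=lambda s: sum(scores[s]) / len(scores[s]))
```

Always exploiting the best average can starve the other strategies, so a real orchestrator might occasionally fall back to the rotation to keep exploring.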
Multi-Metric Evaluation
The current skill computes a single final score. Real optimization often involves trade-offs between competing objectives — making code faster might use more memory, or reducing allocations might increase code complexity.
Pareto-Based Selection
Instead of a single score, track multiple independent metrics:
{
"efficiency-score": 0.8,
"memory-score": 0.6,
"readability-score": 0.9
}
A candidate is "better" than another only if it is at least as good on every metric and strictly better on at least one (Pareto dominance). Candidates that are better on some metrics but worse on others are kept for diversity, because they represent different trade-off points.
This changes the database's comparison function. Instead of sorting by average score, you'd maintain the Pareto front: the set of candidates that no other candidate dominates.
Implementation Sketch
Add a --multi-metric mode to the database:
function paretoDominates(a, b) {
  // a dominates b if it is at least as good on every metric
  // and strictly better on at least one.
  // Metrics are compared by name so key order cannot cause a mismatch.
  let strictlyBetter = false;
  for (const key of Object.keys(a.metrics)) {
    if (a.metrics[key] < b.metrics[key]) return false;
    if (a.metrics[key] > b.metrics[key]) strictlyBetter = true;
  }
  return strictlyBetter;
}
The final report would then present the Pareto front:
Evolution complete. Pareto front (3 non-dominated solutions):
A: efficiency=0.9, memory=0.5, readability=0.7 (fastest)
B: efficiency=0.7, memory=0.9, readability=0.8 (most memory-efficient)
C: efficiency=0.8, memory=0.7, readability=0.9 (most readable)
Which trade-off point would you like to apply?
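To make the front itself concrete, here is a small Python sketch that applies the same dominance rule as paretoDominates to the three candidates from the report above, plus a hypothetical dominated candidate D. The function names and the D entry are illustrative assumptions.

```python
def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on one."""
    at_least_as_good = all(a[m] >= b[m] for m in a)
    strictly_better = any(a[m] > b[m] for m in a)
    return at_least_as_good and strictly_better


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, metrics in candidates.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


candidates = {
    "A": {"efficiency": 0.9, "memory": 0.5, "readability": 0.7},
    "B": {"efficiency": 0.7, "memory": 0.9, "readability": 0.8},
    "C": {"efficiency": 0.8, "memory": 0.7, "readability": 0.9},
    # Hypothetical extra candidate: B matches or beats it on every metric.
    "D": {"efficiency": 0.7, "memory": 0.6, "readability": 0.8},
}
front = pareto_front(candidates)  # D drops out; A, B and C remain
```

This pairwise check is quadratic in the population size, which should be acceptable for the small populations an evolution run typically keeps.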
Self-Evolving Meta-Prompts
One of AlphaEvolve's most powerful features is recursive self-improvement: the system evolves its own mutation prompts alongside the code. We can approximate this.
The Idea
Track which mutation prompts produce the best improvements. After every few iterations, adjust the prompt template based on what worked:
Iteration 3: "Try a different data structure" → score improved by +0.2
Iteration 5: "Reduce branching" → score improved by +0.15
Iteration 7: "Batch I/O operations" → no improvement
After 10 iterations, the orchestrator could summarize: "Data structure changes and branch reduction have been the most effective strategies. Emphasize these in future prompts."
Implementation
Add a meta-prompts.md file that accumulates strategy performance:
# Effective Strategies (updated after iteration 10)
## High Impact
- Data structure swaps: average improvement +0.18
- Eliminating redundant passes: average improvement +0.15
## Low Impact
- I/O batching: average improvement +0.02 (target has minimal I/O)
- Parallelism: no improvement (single-threaded workload)
## Recommendation for Next Iterations
Focus on data structure alternatives and loop fusion.
Deprioritize I/O and concurrency strategies.
This file would be included in the mutation prompt, giving the subagent accumulated wisdom about what works for this specific target.
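One way the file could be regenerated is sketched below, assuming the orchestrator keeps a list of (strategy, score change) records. The record format, the 0.1 cutoff between "high" and "low" impact, and the function names are all assumptions; the recommendation section is left out because it would more naturally be written by the orchestrator itself.

```python
from collections import defaultdict

HIGH_IMPACT_THRESHOLD = 0.1  # Assumed cutoff between high and low impact


def summarize_strategies(records):
    """Average the score change per strategy from (strategy, delta) records."""
    grouped = defaultdict(list)
    for strategy, delta in records:
        grouped[strategy].append(delta)
    return {s: sum(deltas) / len(deltas) for s, deltas in grouped.items()}


def render_meta_prompt(records, iteration):
    """Render the accumulated results in the meta-prompts.md layout shown above."""
    averages = summarize_strategies(records)
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    high = [(s, a) for s, a in ranked if a >= HIGH_IMPACT_THRESHOLD]
    low = [(s, a) for s, a in ranked if a < HIGH_IMPACT_THRESHOLD]

    lines = [f"# Effective Strategies (updated after iteration {iteration})"]
    lines.append("## High Impact")
    lines += [f"- {s}: average improvement {a:+.2f}" for s, a in high]
    lines.append("## Low Impact")
    lines += [f"- {s}: average improvement {a:+.2f}" for s, a in low]
    return "\n".join(lines) + "\n"
```

Calling `render_meta_prompt(records, 10)` returns the markdown text, which the orchestrator could write to meta-prompts.md before assembling the next mutation prompt.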
Custom Evaluator Plugins
Beyond language-specific evaluators, users might want domain-specific scoring. A machine learning pipeline should be judged differently from a web server handler or a CLI tool.
Plugin Interface
Define a simple convention: any executable that reads code on stdin and prints a JSON score to stdout is a valid evaluator plugin:
# Custom evaluator for ML code
cat candidate.py | python evaluators/ml-efficiency.py
# Output: {"score": 0.7, "dimensions": {"vectorization": 0.9, "memory-locality": 0.5}}
The skill would support a --evaluator flag:
evolve train_step in model.py --evaluator "python evaluators/ml-efficiency.py"
Example domain-specific evaluators:
- ML/Data Science: vectorization usage, memory locality, gradient computation efficiency
- Web handlers: response time estimation, middleware overhead, connection pooling
- Database queries: index usage, join strategy, result set size estimation
- CLI tools: startup time, streaming vs. buffering, exit code handling
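On the skill's side, the stdin/stdout convention could be consumed with a small wrapper like the one below. This is a sketch under assumptions: the function names, the 60-second timeout, and the requirement that the score fall in [0.0, 1.0] are not specified anywhere in the existing skill.

```python
import json
import shlex
import subprocess


def parse_evaluator_output(stdout_text):
    """Validate a plugin's JSON reply and return (score, dimensions)."""
    result = json.loads(stdout_text)
    score = float(result["score"])
    if not 0.0 <= score <= 1.0:  # Assumed valid range for plugin scores
        raise ValueError(f"score out of range: {score}")
    return score, result.get("dimensions", {})


def run_evaluator(command, source, timeout_s=60):
    """Pipe candidate source to the plugin's stdin and parse what it prints."""
    completed = subprocess.run(
        shlex.split(command),  # e.g. "python evaluators/ml-efficiency.py"
        input=source,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # A non-zero exit code raises CalledProcessError
    )
    return parse_evaluator_output(completed.stdout)
```

Validating the reply before it reaches the database keeps a misbehaving plugin from silently skewing selection with an out-of-range or missing score.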
Population Visualization
The current skill reports progress as text. A visual representation of the population would help users understand the optimization landscape.
What to Visualize
- Score over time — line chart showing best score, average score, and worst score per iteration
- Family tree — which candidates are derived from which parents, highlighting the lineage of the best solution
- Island diversity — how different the solutions on each island are (measured by edit distance or structural difference)
Implementation Approach
Generate a simple HTML report at evolve-output/report.html after finalization:
node scripts/visualize.mjs evolve-output/database > evolve-output/report.html
Even a basic chart showing score progression gives users intuition about whether to run more iterations (score still climbing) or stop (plateau reached).
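The score-over-time chart needs very little logic. The sketch below is written in Python for consistency with the other examples here, although the page proposes a Node script; the candidate record shape (`iteration` and `score` keys) and both function names are assumptions about how the database might be read.

```python
from statistics import mean


def score_progression(candidates):
    """Return a list of (iteration, best, average, worst), sorted by iteration."""
    by_iteration = {}
    for candidate in candidates:
        by_iteration.setdefault(candidate["iteration"], []).append(candidate["score"])
    return [
        (iteration, max(scores), mean(scores), min(scores))
        for iteration, scores in sorted(by_iteration.items())
    ]


def render_svg(progression, width=400, height=120):
    """Draw the best score per iteration as an SVG polyline (scores in [0, 1])."""
    last = max(len(progression) - 1, 1)  # Avoid dividing by zero for one point
    points = " ".join(
        f"{index * width / last:.1f},{height - best * height:.1f}"
        for index, (_, best, _, _) in enumerate(progression)
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<polyline fill="none" stroke="black" points="{points}"/></svg>'
    )
```

The returned SVG string can be embedded directly in report.html, and the average and worst series could be drawn the same way as extra polylines.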
Summary of Extension Ideas
| Extension | Difficulty | Impact |
|---|---|---|
| Language-specific evaluators | Low — write new .md files | Better scoring for language idioms |
| Benchmark templates | Low — reusable scripts | More users get empirical measurement |
| Mutation strategy diversity | Medium — modify prompt assembly | Broader exploration, fewer local optima |
| Multi-metric evaluation | Medium — change DB comparison logic | Handle trade-offs between competing objectives |
| Self-evolving meta-prompts | Medium — track strategy performance | Prompts improve over iterations |
| Custom evaluator plugins | Low — convention-based interface | Domain-specific scoring |
| Population visualization | Medium — generate HTML report | Better user understanding of the optimization |
Each of these builds on the existing architecture without changing the fundamental design. The skill's separation of concerns — orchestrator in SKILL.md, computation in scripts, creativity in subagents — makes it straightforward to add new evaluators, strategies, or reporting without restructuring the whole system.
The most impactful first step? Language-specific evaluators. They're easy to write (just a markdown file with a tailored rubric), need only a selection rule in SKILL.md rather than changes to the skill's scripts, and immediately improve scoring quality for the languages you use most.