How I Made My Agent Discover and Create New Agent Skills
Darwin doesn't just improve existing skills. Every week it scans every agent conversation to find capability gaps — and proposes new skills before anyone asks.

*Part 2 of the Darwin Series*
📚 The Darwin Series — How We Built Self-Improving AI Agents
- Part 1: Our AI Agent Scored 25% on Its Most Important Skill. Here's How We Fixed It. — scoring and auto-improving existing skills
- Part 2: How I Made My Agent Discover and Create New Agent Skills *(you are here)*
- Part 3: The Full Self-Improvement Loop — how scoring, research, and corrections connect
- Related: How Hawking Does Research That Doesn't Suck · The Memory Architecture
---
The Gap That Existing Frameworks Miss
Part 1 of this series covered how Darwin scores and improves skills that already exist. But there's a harder problem underneath that one.
What about the capability gaps that don't have a skill yet?
When an agent fails repeatedly at the same type of task, you usually notice eventually. But when an agent consistently doesn't do something — doesn't validate URLs before citing them, doesn't check if a prerequisite tool exists before promising a capability — you often don't notice at all. The absence of good behavior is invisible until the cost adds up. (The ones I noticed first were the ones that embarrassed me in published output. The ones the harvest found first were the ones that had been quietly wrong for weeks.)
That was the second problem I needed Darwin to solve. And the framework that pointed me toward the answer was the Forward-Deployed Engineer (FDE) pattern from Fintool — the idea that great engineers build automation before the user asks for it. They recognize the pattern in the data before the need becomes explicit.
I thought: what if Darwin could do that for agent skills?
The Five Frameworks (Again, Because They All Apply Here)
This whole series runs on five foundations. I credit them every time because this is their work adapted, not invented:
| Framework | Source | What it contributes here |
|-----------|--------|--------------------------|
| Karpathy AutoResearch | github.com/karpathy/autoresearch | The optimization loop that governs what happens after a skill is created |
| MetaClaw | github.com/aiming-lab/MetaClaw | Post-session extraction: learn from EVERY conversation, not just failures |
| SkillRL | arxiv.org/abs/2602.08234 | Codify success patterns, not just failure corrections |
| Fintool FDE | Forward-Deployed Engineer pattern | The entire harvest-and-propose philosophy: build it before anyone asks |
| Nyk's Council | github.com/0xNyk/council-of-high-intelligence | Proposals require evidence, not opinion — structured critique before approval |
The Weekly Harvest: What It Actually Reads
Every Sunday (IST), `skill_harvest.py` wakes up on schedule and reads everything from the last 7 days across all agents:
- Session JSONL files — full conversation transcripts, human-to-agent and agent-to-agent
- Gateway logs — inbound messages, auto-replies, messages that got no reply at all
- Daily memory logs — Borges' cross-agent nightly consolidation
- ERRORS.md — every tagged correction with its `[DARWIN: skill-name]` marker
- LEARNINGS.md — cross-agent rules that have accumulated
- All existing SKILL.md files — to audit conversations against what the rules actually say
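A minimal sketch of that input-gathering step. The directory layout and glob patterns here are my assumptions for illustration, not the actual structure of `skill_harvest.py`:

```python
from datetime import datetime, timedelta
from pathlib import Path

def collect_harvest_inputs(workspace: Path, days: int = 7) -> list[Path]:
    """Gather the files a weekly harvest would read.

    Session and log files are filtered to the last `days` days by mtime;
    standing files (ERRORS.md, LEARNINGS.md, SKILL.md) are always included.
    """
    cutoff = (datetime.now() - timedelta(days=days)).timestamp()
    recent: list[Path] = []
    # Time-windowed sources: transcripts, gateway logs, daily memory.
    for pattern in ("sessions/*.jsonl", "logs/gateway/*.log", "memory/daily/*.md"):
        recent += [p for p in workspace.glob(pattern) if p.stat().st_mtime >= cutoff]
    # Standing sources: read in full every week, regardless of age.
    standing = [workspace / "ERRORS.md", workspace / "LEARNINGS.md"]
    standing += list(workspace.glob("skills/*/SKILL.md"))
    return recent + [p for p in standing if p.exists()]
```

The important design choice is the split: conversational signal is windowed to a week, while rule files are re-read whole so violations can be audited against the current rules, not last week's.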

It's answering five questions simultaneously:
- Skill gaps — what capability was clearly missing? (MetaClaw: extract signal from every session, not just the broken ones)
- Success patterns — what worked well enough to codify? (SkillRL: learn from what went right, not just what went wrong)
- Rule violations — which SKILL.md rules did agents quietly ignore this week?
- Blind spots — which rules were never tested? Which messages got no reply?
- Proactive proposals — what would a Forward-Deployed Engineer build right now, before anyone asked?
The harvest doesn't just look for errors. It looks for patterns — and patterns include the absence of behaviors that should exist. (This is the MetaClaw contribution I appreciate most: learn from every session, not just the sessions where something visibly broke.)
What the First Harvest Found
The first harvest ran on March 30, 2026 — six days after Darwin went into production. The numbers:
22 skill gaps identified across 5 agents.
Some were obvious in hindsight. Some weren't visible at all until the harvest surfaced them. Here's the breakdown of the most significant ones:
| Agent | Gap identified | Evidence from sessions |
|-------|----------------|------------------------|
| APRIL | No pre-publish content linting | 3 drafts with banned formatting patterns slipped through to review |
| APRIL | No source prefetching | 2 posts cited URLs that returned 404 at publish time |
| Scout | No data recency labeling | 5 reports presented data from 30+ days ago as current |
| Dev | No prerequisite verification | Promised PDF parsing without checking if poppler was installed |
| Jarvis | No source-of-truth verification step | Config fix "done" but not validated with 2 consecutive successful runs |
12 new skill proposals generated in FDE mode.
Not all 12 were approved. That's by design. The harvest proposes; a human (me) reviews the HARVEST-REPORT.md and marks each proposal APPROVED or REJECTED. One human gate, then everything downstream is autonomous. (That one review session took me about 25 minutes over coffee. I approved 4, rejected 8. The 8 rejections weren't bad ideas — they just didn't have enough evidence from actual sessions to justify building.)
~14 rule violations caught across existing skills.
Agents had rules written in their SKILL.md files that the harvest proved they were ignoring. These didn't require new skills — they required Darwin to tighten the existing skill's mutation targets in the next cycle.
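That tally is the simplest part of the harvest to sketch. Assuming the `[DARWIN: skill-name]` marker format from Part 1, counting violations per skill reduces to a regex over the raw ERRORS.md text:

```python
import re
from collections import Counter

# Marker format from Part 1: each logged correction carries a [DARWIN: skill-name] tag.
DARWIN_TAG = re.compile(r"\[DARWIN:\s*([A-Za-z0-9_-]+)\]")

def count_violations(errors_md: str) -> Counter:
    """Tally tagged corrections per skill from the text of ERRORS.md.

    Skills that keep accumulating tags become mutation targets for the
    next Darwin cycle, rather than candidates for a brand-new skill.
    """
    return Counter(DARWIN_TAG.findall(errors_md))
```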
The Harvest Report Format
Here's what a real FDE-mode proposal from the first harvest looked like:

```markdown
## PROPOSAL: content-linter
Agent: APRIL
Trigger: Before any draft moves from production to review

Evidence from this week:
- Session 2026-03-24: Draft reached review with 4 instances of markdown bold (**text**). Arif's rule: no bold. Rule in SOUL.md, not enforced.
- Session 2026-03-26: Draft cited "inspired by" phrasing from voice bank verbatim. Voice bank is reference, not template.
- Session 2026-03-27: Draft used "Here's what I learned" as a section header. Kill list violation.

Proposed checklist (3 assertions):
1. Does the draft contain zero instances of bold markdown? (YES/NO)
2. Does the draft contain zero kill-list phrases? (YES/NO)
3. Does the draft contain zero verbatim voice-bank sentences? (YES/NO)

Value: Catches formatting and voice violations before they reach Arif's review. Currently 3-4 violations per week slipping through.

Status: AWAITING APPROVAL
```
That's the format. Evidence from specific sessions. Named assertions traced to real failures. An explicit value statement. Not "we should have a linter" — "here's exactly why, from sessions on these specific dates."
The Nyk's Council framework shows up here: proposals without evidence are rejected. The harvest LLM is required to cite its sources the same way research agents are.
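Here's the evidence gate sketched as a data structure. The `Proposal` fields mirror the report format above; the thresholds are illustrative, since the real critique pass is an LLM review, not a function:

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    name: str
    agent: str
    trigger: str
    evidence: list[str] = field(default_factory=list)    # session-dated observations
    assertions: list[str] = field(default_factory=list)  # binary YES/NO checks

def validate(p: Proposal) -> list[str]:
    """Return rejection reasons; an empty list means the proposal is reviewable.

    Encodes the two gates described in the text: no evidence means no
    proposal, and scope is bounded at 3-6 binary assertions.
    """
    problems = []
    if not p.evidence:
        problems.append("no evidence: cite specific sessions")
    if not 3 <= len(p.assertions) <= 6:
        problems.append("scope: need 3-6 binary assertions")
    return problems
```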
The Lamarck Protocol: Creating a New Skill
Once a proposal is approved, Darwin follows the Lamarck Protocol to build the skill. (Named for the pre-Darwin evolutionary theory that organisms pass acquired traits to their offspring — which is roughly what's happening when a correction in production becomes a rule in a new skill.)
The steps:
1. Find the closest template. Don't write from scratch. Find the existing SKILL.md most similar to the proposed skill. Use it as structural scaffolding. The content-linter was built from APRIL's blog-exec skill — same agent, similar domain.
2. Write SKILL.md under 200 lines. Every rule must trace to a real failure. Not best practices. Not "probably useful." Something that actually went wrong. The initialization block is mandatory — the agent reads ERRORS.md and LEARNINGS.md before starting any task.
3. Write checklist.json with 3-6 binary assertions. From the approved proposal. Every assertion observable in the output. Every assertion sourced from a specific session failure. Two people (or two LLMs) should be able to agree on YES or NO without ambiguity.
4. Write 3 test cases. Real inputs from the harvest evidence. Not hypotheticals. The content-linter's test cases were the exact three drafts from APRIL's sessions where violations slipped through.
5. Sandbox test.
```bash
python3 darwin_sandbox.py create content-linter    # isolate from production
python3 darwin_eval.py checklist.json test-01.md   # score against first test
python3 darwin_eval.py checklist.json test-02.md   # score against second
python3 darwin_eval.py checklist.json test-03.md   # score against third
python3 darwin_sandbox.py promote content-linter   # if avg >= 80%
python3 darwin_sandbox.py revert content-linter    # if avg < 80%
```
The pass threshold for a brand-new skill is 80%. Darwin then optimizes toward 95% over subsequent cycles.
6. Register and log. Drop the files into `~/.openclaw/workspace/skills/content-linter/`. Darwin auto-discovers in the next cycle. Log the creation in LEARNINGS.md: which harvest triggered it, which agent owns it, initial pass rate.
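The promote/revert decision in step 5 is just an average over per-test pass rates. A sketch of that gate (function name and signature are mine, not `darwin_sandbox.py`'s):

```python
def sandbox_gate(scores: list[float], threshold: float = 0.80) -> str:
    """Decide promote vs revert from per-test pass rates in [0.0, 1.0].

    Mirrors the 80% threshold for brand-new skills; Darwin then
    optimizes promoted skills toward 95% in later cycles.
    """
    if not scores:
        return "revert"  # no test results means no promotion
    avg = sum(scores) / len(scores)
    return "promote" if avg >= threshold else "revert"
```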
The Four Skills That Actually Shipped
Of the 12 proposals, 4 were approved and built in the first cycle:

content-linter Catches banned patterns, markdown bold, and voice-bank echoes before drafts reach review. Runs as a step in APRIL's blog-exec skill. Initial pass rate: 80% on first generation (the 3 test cases from the harvest evidence).
The one assertion it failed initially: voice-bank echo detection. The LLM judge couldn't reliably distinguish "inspired by" from "verbatim copy." That assertion went into Darwin's queue for refinement — the check language needed to be more specific.
source-prefetcher Validates that external URLs cited in a draft actually resolve before publishing. Simple in concept; the implementation required Scout's web_search tool to be available in APRIL's skill context, which it wasn't. That was an infrastructure fix, not a Darwin fix — took 20 minutes to wire up. (The reason we needed this: two published posts linked to URLs that had 404'd. Not catastrophic, just embarrassing. The kind of thing a five-line validation step catches.)
Initial pass rate: 100% on test cases. (URL validation is binary and deterministic — the LLM judge doesn't even need to be clever about it.)
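Because the check is binary, the core of a source-prefetcher fits in a few lines. A sketch using only the standard library — the regex and the HEAD-only behavior are simplifications; production code would retry, rate-limit, and fall back to GET for servers that reject HEAD:

```python
import re
import urllib.error
import urllib.request

# Crude URL matcher: stops at whitespace and common closing delimiters.
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def check_links(draft: str, timeout: float = 5.0) -> dict[str, bool]:
    """HEAD-check every URL cited in a draft before publish.

    Returns {url: resolves} so the caller can block publishing
    on any False entry.
    """
    results: dict[str, bool] = {}
    for url in set(URL_RE.findall(draft)):
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                results[url] = resp.status < 400
        except (urllib.error.URLError, ValueError):
            results[url] = False
    return results
```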
revision-tracker Detects when an agent is in a revision round (vs. a fresh draft), and automatically loads the previous draft plus its review feedback as context. This one came from a pattern in the data: APRIL was receiving revision requests but not loading the previous version, causing it to start from scratch rather than improve the specific failures.
Initial pass rate: 67% (2/3 test cases). The failing assertion: the skill would load the previous draft correctly but then not explicitly reference which specific elements were flagged for revision. Went into Darwin's next optimization cycle.
cron-validator Checks file paths, tool dependencies, channel configs, and environment variables before a cron job starts. Built from three cron failures in one week that all traced to the same root cause: a path that had changed when a workspace was reorganized, and nobody updated the cron configs.
Initial pass rate: 100%. Cron validation is deterministic — file paths either exist or they don't.
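A preflight like cron-validator's is the easiest of the four to sketch, precisely because every check is deterministic. The function shape below is my illustration of the idea, not the shipped skill:

```python
import os
import shutil
from pathlib import Path

def preflight(paths: list[str], tools: list[str], env_vars: list[str]) -> list[str]:
    """Deterministic pre-run checks for a cron job.

    Each finding is a human-readable description of a missing
    prerequisite; an empty list means the job is clear to run.
    """
    findings = []
    findings += [f"missing path: {p}" for p in paths if not Path(p).exists()]
    findings += [f"missing tool: {t}" for t in tools if shutil.which(t) is None]
    findings += [f"missing env var: {v}" for v in env_vars if v not in os.environ]
    return findings
```

Run it as the first line of the cron job and abort (loudly) on any finding — that alone would have caught the reorganized-workspace path failure three times over.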
What Gets Rejected and Why
Of the 12 proposals, 8 were rejected. The most common rejection reason: insufficient evidence from the harvest. A proposal like "monitoring-skill: agent should monitor external APIs for changes" was rejected because the harvest couldn't cite specific sessions where this gap caused a real problem — only sessions where it might have been useful.
The Fintool FDE pattern works in both directions. A good FDE builds things before they're explicitly asked for — but they build things with clear evidence they're needed, not just because they seem like a good idea. "Seems useful" proposals got rejected. "These 3 sessions broke because this didn't exist" proposals got approved. (Honest answer: this discipline is harder for me as the human reviewer than it sounds. You want to say yes to things that feel right. The evidence requirement forces you to ask "how do I know?" — which is the whole point.)
The other rejection category: scope too broad. A proposal for "general-quality-checker" that would assess every agent output across 15 dimensions was rejected for the same reason Darwin limits checklists to 3-6 assertions — the broader the scope, the harder it is to score reliably, and the more likely the agent starts gaming the metric.
The Feedback Loop Between Harvest and Production
What makes the harvest genuinely useful (rather than just a weekly audit) is that it feeds directly into Darwin's optimization queue.
The path from production error to improved skill:
```
Monday:         Arif corrects APRIL's output
    ↓
Monday:         Agent writes [DARWIN: blog-exec] tag to ERRORS.md before replying
    ↓
Sunday:         skill_harvest.py reads ERRORS.md as primary signal
    ↓
Sunday:         Harvest includes this in Section 3 (Rule Violations) of HARVEST-REPORT.md
    ↓
Sunday 2:00 AM: Darwin reads harvest as PRIMARY signal for optimization queue
    ↓
Darwin:         Diagnoses root cause, proposes mutation to blog-exec SKILL.md
    ↓
Darwin:         Sandbox test → promote or revert
    ↓
Wednesday:      Promoted mutation now in production
```
The harvest is not a separate system — it's the signal layer that tells Darwin where to look. Without the harvest, Darwin is flying blind, optimizing randomly. With the harvest, every optimization cycle is informed by exactly what happened in production last week.
What I Actually Learned
Three things that surprised me in the first harvest cycle:
1. The absence of behavior is harder to detect than the presence of failure. The harvest found gaps I didn't know existed — like the prerequisite verification failure — because those gaps don't produce error logs. They produce capabilities you thought you had but didn't.
2. The "max 3 proposals per agent" rule is not arbitrary. The harvest initially surfaced 22 gaps and could have generated 22 proposals. Limiting to 3 per agent forces prioritization. And prioritization forces the harvest LLM to think about value per week rather than just identifying everything that could theoretically be better. That constraint produces better proposals.
3. Rejected proposals are valuable data too. Knowing that 8 of 12 proposals lacked sufficient evidence taught me something about what the harvest signal quality actually looks like. The next harvest had better evidence because the agents had been logging more carefully to ERRORS.md — knowing that the harvest was coming and that evidence-poor proposals would be rejected.
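The max-3-per-agent cap from point 2 is trivial to enforce once proposals are ranked. A sketch, assuming the harvest emits `(agent, proposal)` pairs already sorted by estimated weekly value, highest first:

```python
def cap_proposals(ranked: list[tuple[str, str]], per_agent: int = 3) -> list[tuple[str, str]]:
    """Keep at most `per_agent` proposals per agent.

    Because `ranked` is sorted best-first, dropping the overflow keeps
    each agent's highest-value proposals — the prioritization the cap
    is meant to force.
    """
    kept: list[tuple[str, str]] = []
    counts: dict[str, int] = {}
    for agent, name in ranked:
        if counts.get(agent, 0) < per_agent:
            kept.append((agent, name))
            counts[agent] = counts.get(agent, 0) + 1
    return kept
```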
Build This for Your Own System: The Minimum Viable Harvest
You don't need `skill_harvest.py` to run your first harvest. Here's the manual version:
Step 1: Spend 30 minutes reading the last week's agent outputs. Don't look for obvious failures. Look for patterns of absence — things the agent consistently doesn't do that you'd want it to.
Step 2: Check your error logs for recurring tags. If you've been logging corrections (you should be — Part 1 covers why), what category do most of them fall into? Formatting? Sourcing? Verification? That category is your first proposal.
Step 3: Write one proposal in the format above. Name, trigger, evidence from specific sessions, 3 assertions from real failures, value in time or quality.
Step 4: Decide: does this need a new skill, or just a tighter rule in an existing one? New skills have overhead — they need checklists, test cases, sandbox testing. If the gap is one rule missing from an existing skill, add it there instead.
Step 5: Build the minimum version and test it on last week's outputs. Does it catch what it was supposed to catch? If yes, you've run a harvest cycle.
The automated harvest is that process running across all sessions, all agents, every week, in 20 minutes instead of 3 hours. But the logic is identical.
Where This Is Going
Part 3 of this series zooms out to the complete picture: how Darwin's scoring loop, the weekly harvest, Hawking's research engine, and the correction protocol all connect into one self-improving system.
But the core lesson from Part 2: agent systems grow capability gaps the same way they grow quality gaps — silently, invisibly, until something makes the cost obvious. The harvest is the instrument that makes invisible gaps visible before they compound into expensive habits.
What capability gap in your own agent system would a harvest uncover? My bet is the most useful finds are never the obvious ones — they're the things your agents consistently *don't* do that you've never noticed.
---
*The `skill_harvest.py` script described in this post runs weekly (IST) across all agent workspaces. It uses OpenRouter with Kimi K2.5 as primary analysis model. Full harvest cycle — read, analyze, propose — takes ~20 minutes and produces a HARVEST-REPORT.md that lands in the workspace root and a .docx version in the shared vault.*
---
Want the harvest template?
I packaged the FDE-mode proposal format, the HARVEST-REPORT structure, and a starter skill_harvest config into a free download.
→ our upcoming resources
Includes: proposal template, harvest report format, checklist.json starter, and the 5-step manual harvest checklist from this article. No spam. Just the files.
📬 Subscribe to The Wiring for the next article in the series.
---
📚 The Darwin Series
1. Our AI Agent Scored 25% on Its Most Important Skill — how Darwin scores and mutates agent skills
2. How I Made My Agent Discover and Create New Agent Skills — the weekly harvest pipeline *(this article)*
3. The Full Self-Improvement Loop — how all three systems connect

Related:
- How Hawking Does Research That Doesn't Suck
- The Memory Architecture: How Corrections Compound