Darwin Part 3: How Part 1's Scoring + Part 2's Discovery Became a Compounding Loop
*If you've read Parts 1 and 2, this is the payoff. If you're new, quick recap: Part 1 scored broken skills, Part 2 discovered missing skills. This part shows how those systems connect with Hawking and Borges so one correction becomes a permanent upgrade instead of another reminder.*
*Part 3 of the Darwin Series*
<div class="series-nav-top"> <strong>The Darwin Series</strong><br> <a href="/blog/ai-agent-scored-25-percent-how-we-fixed-it">Part 1: How Darwin scores and improves skills</a><br> <a href="/blog/darwin-discovers-creates-new-agent-skills">Part 2: How I Made My Agent Discover and Create New Agent Skills</a><br> <strong>Part 3: The full compounding loop (You are here)</strong> </div>
---
Last Tuesday, my research agent made the exact same sourcing mistake for the 14th time. I had corrected it 14 times in chat. And 14 times, the session ended, the context window cleared, and the correction evaporated.

It's the most frustrating part of working with AI: every session starts fresh, which means every mistake is new again. You don't have a compounding assistant; you have a brilliant intern with amnesia.
If you read the earlier parts of this series, you might think my solution was just about writing better prompts. It's not. Stopping the amnesia required building a larger architecture—a three-part system that connects memory, research, and optimization.
The honest version: I didn't design this top-down. I got tired of repeating myself, watched how human corrections kept evaporating at session end, and started connecting the pieces I already had. The architecture emerged from the pain, not from a master plan.
The Three Systems
The full loop runs on three separate engines doing three different jobs:

- **Darwin** — the optimization engine. It scores existing skills and writes updates when things break.
- **Hawking** — the deep research engine. It handles complex searches across multiple sources.
- **Borges** — the memory engine. It catches your chat corrections and turns them into permanent rules.
Each does a different job. All three connect. The connection is where the compounding happens.
Darwin: The Optimization Engine
Darwin scores, diagnoses, and updates. Fifteen skill targets, 64 binary assertions (pass/fail checks), sweeping every Wednesday and Sunday. When the weekly review identifies a new gap, Darwin builds a new capability, tests it in a safe sandbox, and deploys it.
What Darwin does not do: it doesn't improve itself. That's a boundary I haven't crossed. Darwin's own optimization logic, its judgment about what constitutes a good mutation, the decision rules about when to promote vs. revert — those are human decisions, reviewed periodically. The recursive loop has a ceiling by design. (Allowing Darwin to optimize Darwin is a fascinating idea that I'm going to think about for a long time before I do anything about it.)
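In code, the scoring step reduces to running every binary assertion against an output and taking the pass fraction. Here's a minimal sketch in Python; the two assertion implementations are illustrative string checks I made up, not Darwin's actual logic:

```python
def score_skill(output: str, assertions: dict) -> float:
    """Run each pass/fail check against an output; score = fraction passed."""
    results = {name: check(output) for name, check in assertions.items()}
    return sum(results.values()) / len(assertions)

# Illustrative checks only -- real assertions would be stricter.
assertions = {
    "all_claims_sourced": lambda text: "[Source:" in text,
    "data_recency_labeled": lambda text: "fetched" in text.lower(),
}

brief = "180K+ GitHub stars"           # no attribution, no recency label
print(score_skill(brief, assertions))  # → 0.0
```

The useful property is that the score is decomposable: a 33% isn't a vibe, it's a named list of failed checks you can hand to the next stage.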
Hawking: The Deep Research Engine
When an agent gets a question that needs depth—market analysis, competitive intel, regulatory checks—it hands off to Hawking instead of doing a simple web search.
Hawking breaks the question down into 3-5 sub-questions, searches multiple times for each, and synthesizes the findings. It specifically looks for missing information, scores its own coverage, and searches again if the result isn't thorough enough.
Every brief must also include at least one substantive counter-argument to its main point, forcing the agent out of "yes-man" mode.
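The decompose-search-score-repeat cycle can be sketched as a loop. This is a hedged outline, not Hawking's real pipeline: `search` is a stand-in for whatever search call you use, and the 0.8 threshold and round cap are my assumptions:

```python
def deep_research(question, sub_questions, search, threshold=0.8, max_rounds=6):
    """Search per sub-question, score coverage, and repeat until thorough."""
    findings = {q: [] for q in sub_questions}
    coverage = 0.0
    for _ in range(max_rounds):
        for q in sub_questions:
            if not findings[q]:                # only re-search unanswered gaps
                findings[q].extend(search(q))
        # Coverage = fraction of sub-questions with at least one finding.
        coverage = sum(bool(v) for v in findings.values()) / len(sub_questions)
        if coverage >= threshold:
            break
    return findings, coverage

# Toy usage with canned results standing in for a real search backend.
notes = {
    "What is the market size?": ["analyst report"],
    "Who are the main competitors?": ["vendor comparison"],
}
found, cov = deep_research("Is X viable?", list(notes), lambda q: notes.get(q, []))
print(cov)  # → 1.0
```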
Borges: The Memory Engine
Every AI agent session starts fresh. Corrections made on Monday evaporate by Tuesday. Borges fixes this by closing the loop.
The protocol: when a human corrects an agent in chat, the agent writes the mistake to a central `ERRORS.md` file *before* replying. Every night, Borges sweeps those errors into permanent cross-agent rules. By Sunday, Darwin reads them and permanently updates the agent's core instructions.
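The write-before-reply step is just an append to a shared file in a fixed format. A minimal sketch, following the correction format from the starter kit at the end of this post; `log_correction` and its signature are my own names, not part of any framework:

```python
from datetime import date
from pathlib import Path

def log_correction(skill, error, rule, agent, errors_path="shared/ERRORS.md"):
    """Append a tagged correction to ERRORS.md before replying in chat."""
    entry = (
        f"\n## {date.today():%Y-%m-%d}: {error} ({agent})\n"
        f"[DARWIN: {skill}]\n"
        f"- Error: {error}\n"
        f"- Fix/Rule: {rule}\n"
    )
    path = Path(errors_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:       # append, never overwrite history
        f.write(entry)
    return entry
```

The `[DARWIN: skill-name]` tag is what makes the nightly sweep and the Sunday harvest cheap: everything downstream is a grep.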
The Full Loop in Motion
Here's how the three systems connect in a real example:

Monday, March 31: APRIL produces a Scout brief with engagement stats — "180K+ GitHub stars" — with no source attribution and no recency label. Darwin's post-hoc cron fires at 09:00, scores the output: `all_claims_sourced: FAIL`, `data_recency_labeled: FAIL`. Score: 33%.
Monday, 09:30: APRIL writes to ERRORS.md: `[DARWIN: scout-research] — 2 assertions failed: all_claims_sourced and data_recency_labeled.`
Monday night: Borges consolidation runs. The error crosses to LEARNINGS.md as a cross-agent rule: "Every numerical claim must include explicit source attribution plus fetch/verification date."
Sunday, 1:30 AM: Harvest reads ERRORS.md. The entry appears in Section 3 of the harvest report.
Sunday, 2:00 AM: Darwin diagnoses the failure and drafts the mutation: "Add rule: every numerical claim must include [Source: name, URL, YYYY-MM-DD] immediately following the claim." Sandboxes. Scores: 100%. Promotes.
Following Monday: APRIL's first Scout brief of the week: `all_claims_sourced: PASS`. Score: 100%.
The same mistake doesn't happen again. Not because I remembered to re-brief the agent. Because the loop closed.
What "Convergence" Actually Looks Like

Darwin's convergence criterion is holding the target pass rate across 3 consecutive scoring runs. From the March 29 baseline to today:
- idea-triage: 25% → 75% (still in optimization)
- jarvis-main: 40% → 80% (at threshold, in optimization)
- scout-research: 60% → 80% (at threshold, in optimization)
- dev-agent: 75% → 100% (converged — 3 consecutive 100% runs)
- cron-health: 50% → 100% (converged)
Broad, judgment-heavy skills (idea-triage, jarvis-main) move slower than narrow, deterministic ones (cron-health, dev-agent). Narrower scope correlates with faster convergence.
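The convergence rule above is easy to make mechanical. A sketch, assuming "converged" means three consecutive runs at 100%, as in the dev-agent example; `is_converged` is an illustrative name:

```python
def is_converged(scores, runs=3, target=1.0):
    """True once the last `runs` scoring runs all hit the target pass rate."""
    return len(scores) >= runs and all(s >= target for s in scores[-runs:])

print(is_converged([0.75, 1.0, 1.0, 1.0]))  # → True  (dev-agent's trajectory)
print(is_converged([0.5, 1.0]))             # → False (not enough runs yet)
```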
The Boundaries I Haven't Crossed
Three things the full loop deliberately does not do:
Darwin doesn't touch identity files. SOUL.md, IDENTITY.md, BAN.md — these define what the agents fundamentally are. Darwin optimizes *how* an agent works, not *what kind of agent* it is.
Darwin doesn't remove safety rules. Any mutation that touches a safety rule gets flagged as high-risk and requires explicit human approval.
The loop doesn't run without a backup. Before every sweep, `backup-workspace.sh` runs. Circuit breaker: 3+ errors after a promotion triggers auto-revert.
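The circuit breaker reduces to counting post-promotion error tags. A sketch under the assumption that errors arrive as tagged lines; `should_revert` is a hypothetical helper, not a real tool:

```python
def should_revert(error_log, skill, threshold=3):
    """True when threshold+ post-promotion errors are tagged against the skill."""
    tag = f"[DARWIN: {skill}]"
    return sum(tag in line for line in error_log) >= threshold

log = ["[DARWIN: scout-research] missing source attribution"] * 3
print(should_revert(log, "scout-research"))  # → True, restore from backup
```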
What This Taught Me
1. The feedback loop is more valuable than the optimization logic. What makes the system work is the reliability of the signal: ERRORS.md tagged correctly, the harvest reading it, Darwin consuming it. The discipline of the tagging protocol matters more than the cleverness of the optimizer.
2. Human corrections compound when you close the loop. The ROI on a well-tagged correction is not one-time — it's permanent. That changes how I give feedback — I'm more specific now because I know the specificity goes into the system.
3. "Simpler at equal score wins" applies at system level too. The version that runs reliably is the simple one.
Build the Full Loop Yourself: The Starter Kit
You can set up a basic version of this loop in 30 minutes.
Add these three rules to your main agent instructions (`AGENTS.md`):
```
1. DO → WRITE → REPLY
   When corrected by a human, write the correction to shared/ERRORS.md
   BEFORE replying in chat.

2. CORRECTION FORMAT
   ## YYYY-MM-DD: Description (AgentName)
   [DARWIN: skill-name]
   - Error: what went wrong
   - Fix/Rule: what should have happened

3. BOOT SEQUENCE
   Before any task, read:
   - shared/ERRORS.md (avoid known mistakes)
   - shared/LEARNINGS.md (apply cross-agent rules)
```
That's the loop. Repeat weekly. Automate as you go.
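A first automation step for the weekly review: count ERRORS.md entries per `[DARWIN: skill]` tag so the noisiest skill surfaces first. A sketch assuming the tag format above; `harvest` is my name for it, not an existing tool:

```python
import re
from collections import Counter
from pathlib import Path

def harvest(errors_path="shared/ERRORS.md"):
    """Count ERRORS.md entries per [DARWIN: skill] tag."""
    text = Path(errors_path).read_text()
    return Counter(re.findall(r"\[DARWIN:\s*([\w-]+)\]", text))

# harvest().most_common(1) → the skill to optimize first this Sunday
```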
If you want the exact templates I use for Darwin's scoring, Borges' memory consolidation, and Hawking's deep research prompts, I've packaged them up:
👉 [Download the Full-Loop Starter Kit](https://arifkhan.net/resources/full-loop-starter-kit)
---
*Darwin, Hawking, and Borges all run on OpenClaw. The full system has been in production since late March 2026.*
If you set up this loop and run a review next Sunday, what mistake do you think will show up most often? (My hunch: it's almost never the one you'd predict.) Hit reply or find me on X and tell me what you catch.
---
Keep Reading
The Darwin Series
- Part 1: How Darwin scores and improves skills
- Part 2: How I Made My Agent Discover and Create New Agent Skills
- Part 3: The full compounding loop (You are here)
- *Next up in Part 4: The 5 biggest failures I hit while trying to make agents score themselves.*
Related Systems
- The Hawking Protocol: Agentic Deep Research
- Borges: Giving Agents Persistent Memory
Get the next essay
Subscribe to the Wiring Newsletter to get Part 4 delivered to your inbox next week.