Darwin Part 3: How Part 1's Scoring + Part 2's Discovery Became a Compounding Loop
*If you've read Parts 1 and 2, this is the payoff. If you're new, quick recap: Part 1 scored broken skills, Part 2 discovered missing skills. This part shows how those systems connect with Hawking and Borges so one correction becomes a permanent upgrade instead of another reminder.*
*Part 3 of the Darwin Series*
<div class="series-nav-top"> <strong>The Darwin Series</strong><br> <a href="/blog/ai-agent-scored-25-percent-how-we-fixed-it">Part 1: How Darwin scores and improves skills</a><br> <a href="/blog/darwin-discovers-creates-new-agent-skills">Part 2: How I Made My Agent Discover and Create New Agent Skills</a><br> <strong>Part 3: The full compounding loop (You are here)</strong> </div>
---
Last Tuesday, my research agent made the exact same sourcing mistake for the 14th time. I had corrected it 14 times in chat. And 14 times, the session ended, the context window cleared, and the correction evaporated.

It's the most frustrating part of working with AI: every session starts fresh, which means every mistake is new again. You don't have a compounding assistant; you have a brilliant intern with amnesia.
If you read the earlier parts of this series, you might think my solution was just about writing better prompts. It's not. Stopping the amnesia required building a larger architecture—a three-part system that connects memory, research, and optimization.
The honest version: I didn't design this top-down. I got tired of repeating myself, watched how human corrections kept evaporating at session end, and started connecting the pieces I already had. The architecture emerged from the pain, not from a master plan.
The Three Systems
The full loop runs on three separate engines doing three different jobs:

- **Darwin** — the optimization engine. It scores existing skills and writes updates when things break.
- **Hawking** — the deep research engine. It handles complex searches across multiple sources.
- **Borges** — the memory engine. It catches your chat corrections and turns them into permanent rules.
Each does a different job. All three connect. The connection is where the compounding happens.
Darwin: The Optimization Engine
Darwin scores, diagnoses, and updates. Fifteen skill targets, 64 binary assertions (pass/fail checks), sweeping every Wednesday and Sunday. When the weekly review identifies a new gap, Darwin builds a new capability, tests it in a safe sandbox, and deploys it.
What Darwin does not do: it doesn't improve itself. That's a boundary I haven't crossed. Darwin's own optimization logic, its judgment about what constitutes a good mutation, the decision rules about when to promote vs. revert — those are human decisions, reviewed periodically. The recursive loop has a ceiling by design. (Allowing Darwin to optimize Darwin is a fascinating idea that I'm going to think about for a long time before I do anything about it.)
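In code, the scoring step reduces to running every binary assertion against an output and taking the pass fraction. Here's a minimal sketch in Python; the two assertion implementations are illustrative string checks I made up, not Darwin's actual logic:

```python
def score_skill(output: str, assertions: dict) -> float:
    """Run each pass/fail check against an output; score = fraction passed."""
    results = {name: check(output) for name, check in assertions.items()}
    return sum(results.values()) / len(assertions)

# Illustrative checks only -- real assertions would be stricter.
assertions = {
    "all_claims_sourced": lambda text: "[Source:" in text,
    "data_recency_labeled": lambda text: "fetched" in text.lower(),
}

brief = "180K+ GitHub stars"           # no attribution, no recency label
print(score_skill(brief, assertions))  # → 0.0
```

The useful property is that the score is decomposable: a 33% isn't a vibe, it's a named list of failed checks you can hand to the next stage.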
Hawking: The Deep Research Engine
When an agent gets a question that needs depth—market analysis, competitive intel, regulatory checks—it hands off to Hawking instead of doing a simple web search.
Hawking breaks the question down into 3-5 sub-questions, searches multiple times for each, and synthesizes the findings. It specifically looks for missing information, scores its own coverage, and searches again if the result isn't thorough enough.
Every brief must also include at least one substantive counter-argument to its main point, forcing the agent out of "yes-man" mode.
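The decompose-search-score-repeat cycle can be sketched as a loop. This is a hedged outline, not Hawking's real pipeline: `search` is a stand-in for whatever search call you use, and the 0.8 threshold and round cap are my assumptions:

```python
def deep_research(question, sub_questions, search, threshold=0.8, max_rounds=6):
    """Search per sub-question, score coverage, and repeat until thorough."""
    findings = {q: [] for q in sub_questions}
    coverage = 0.0
    for _ in range(max_rounds):
        for q in sub_questions:
            if not findings[q]:                # only re-search unanswered gaps
                findings[q].extend(search(q))
        # Coverage = fraction of sub-questions with at least one finding.
        coverage = sum(bool(v) for v in findings.values()) / len(sub_questions)
        if coverage >= threshold:
            break
    return findings, coverage

# Toy usage with canned results standing in for a real search backend.
notes = {
    "What is the market size?": ["analyst report"],
    "Who are the main competitors?": ["vendor comparison"],
}
found, cov = deep_research("Is X viable?", list(notes), lambda q: notes.get(q, []))
print(cov)  # → 1.0
```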
Borges: The Memory Engine
Every AI agent session starts fresh. Corrections made on Monday evaporate by Tuesday. Borges fixes this by closing the loop.
The protocol: when a human corrects an agent in chat, the agent writes the mistake to a central `ERRORS.md` file *before* replying. Every night, Borges sweeps those errors into permanent cross-agent rules. By Sunday, Darwin reads them and permanently updates the agent's core instructions.
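The write-before-reply step is just an append to a shared file in a fixed format. A minimal sketch, following the correction format from the starter kit at the end of this post; `log_correction` and its signature are my own names, not part of any framework:

```python
from datetime import date
from pathlib import Path

def log_correction(skill, error, rule, agent, errors_path="shared/ERRORS.md"):
    """Append a tagged correction to ERRORS.md before replying in chat."""
    entry = (
        f"\n## {date.today():%Y-%m-%d}: {error} ({agent})\n"
        f"[DARWIN: {skill}]\n"
        f"- Error: {error}\n"
        f"- Fix/Rule: {rule}\n"
    )
    path = Path(errors_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:       # append, never overwrite history
        f.write(entry)
    return entry
```

The `[DARWIN: skill-name]` tag is what makes the nightly sweep and the Sunday harvest cheap: everything downstream is a grep.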
The Full Loop in Motion
Here's how the three systems connect in a real example:

Monday, March 31: APRIL produces a Scout brief with engagement stats — "180K+ GitHub stars" — with no source attribution and no recency label. Darwin's post-hoc cron fires at 09:00, scores the output: `all_claims_sourced: FAIL`, `data_recency_labeled: FAIL`. Score: 33%.
Monday, 09:30: APRIL writes to ERRORS.md: `[DARWIN: scout-research] — 2 assertions failed: all_claims_sourced and data_recency_labeled.`
Monday night: Borges consolidation runs. The error crosses to LEARNINGS.md as a cross-agent rule: "Every numerical claim must include explicit source attribution plus fetch/verification date."
Sunday, 1:30 AM: Harvest reads ERRORS.md. The entry appears in Section 3 of the harvest report.
Sunday, 2:00 AM: Darwin diagnoses the failure and drafts the mutation: "Add rule: every numerical claim must include [Source: name, URL, YYYY-MM-DD] immediately following the claim." Sandboxes. Scores: 100%. Promotes.
Following Monday: APRIL's first Scout brief of the week: `all_claims_sourced: PASS`. Score: 100%.
The same mistake doesn't happen again. Not because I remembered to re-brief the agent. Because the loop closed.
What "Convergence" Actually Looks Like

Darwin's convergence criterion is holding the target pass rate across 3 consecutive scoring runs. From the March 29 baseline to today:
- idea-triage: 25% → 75% (still in optimization)
- jarvis-main: 40% → 80% (at threshold, in optimization)
- scout-research: 60% → 80% (at threshold, in optimization)
- dev-agent: 75% → 100% (converged — 3 consecutive 100% runs)
- cron-health: 50% → 100% (converged)
Broad, judgment-heavy skills (idea-triage, jarvis-main) move slower than narrow, deterministic ones (cron-health, dev-agent). Narrower scope correlates with faster convergence.
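The convergence rule above is easy to make mechanical. A sketch, assuming "converged" means three consecutive runs at 100%, as in the dev-agent example; `is_converged` is an illustrative name:

```python
def is_converged(scores, runs=3, target=1.0):
    """True once the last `runs` scoring runs all hit the target pass rate."""
    return len(scores) >= runs and all(s >= target for s in scores[-runs:])

print(is_converged([0.75, 1.0, 1.0, 1.0]))  # → True  (dev-agent's trajectory)
print(is_converged([0.5, 1.0]))             # → False (not enough runs yet)
```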
The Boundaries I Haven't Crossed
Three things the full loop deliberately does not do:
Darwin doesn't touch identity files. SOUL.md, IDENTITY.md, BAN.md — these define what the agents fundamentally are. Darwin optimizes *how* an agent works, not *what kind of agent* it is.
Darwin doesn't remove safety rules. Any mutation that touches a safety rule gets flagged as high-risk and requires explicit human approval.
The loop doesn't run without a backup. Before every sweep, `backup-workspace.sh` runs. Circuit breaker: 3+ errors after a promotion triggers auto-revert.
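The circuit breaker reduces to counting post-promotion error tags. A sketch under the assumption that errors arrive as tagged lines; `should_revert` is a hypothetical helper, not a real tool:

```python
def should_revert(error_log, skill, threshold=3):
    """True when threshold+ post-promotion errors are tagged against the skill."""
    tag = f"[DARWIN: {skill}]"
    return sum(tag in line for line in error_log) >= threshold

log = ["[DARWIN: scout-research] missing source attribution"] * 3
print(should_revert(log, "scout-research"))  # → True, restore from backup
```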
What This Taught Me
1. The feedback loop is more valuable than the optimization logic. What makes the system work is the reliability of the signal: ERRORS.md tagged correctly, the harvest reading it, Darwin consuming it. The discipline of the tagging protocol matters more than the cleverness of the optimizer.
2. Human corrections compound when you close the loop. The ROI on a well-tagged correction is not one-time — it's permanent. That changes how I give feedback — I'm more specific now because I know the specificity goes into the system.
3. "Simpler at equal score wins" applies at system level too. The version that runs reliably is the simple one.
Build the Full Loop Yourself: The Starter Kit
You can set up a basic version of this loop in 30 minutes.
Add these three rules to your main agent instructions (`AGENTS.md`):
```
1. DO → WRITE → REPLY
   When corrected by a human, write the correction to shared/ERRORS.md
   BEFORE replying in chat.

2. CORRECTION FORMAT
   ## YYYY-MM-DD: Description (AgentName)
   [DARWIN: skill-name]
   - Error: what went wrong
   - Fix/Rule: what should have happened

3. BOOT SEQUENCE
   Before any task, read:
   - shared/ERRORS.md (avoid known mistakes)
   - shared/LEARNINGS.md (apply cross-agent rules)
```
That's the loop. Repeat weekly. Automate as you go.
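A first automation step for the weekly review: count ERRORS.md entries per `[DARWIN: skill]` tag so the noisiest skill surfaces first. A sketch assuming the tag format above; `harvest` is my name for it, not an existing tool:

```python
import re
from collections import Counter
from pathlib import Path

def harvest(errors_path="shared/ERRORS.md"):
    """Count ERRORS.md entries per [DARWIN: skill] tag."""
    text = Path(errors_path).read_text()
    return Counter(re.findall(r"\[DARWIN:\s*([\w-]+)\]", text))

# harvest().most_common(1) → the skill to optimize first this Sunday
```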
If you want the exact templates I use for Darwin's scoring, Borges' memory consolidation, and Hawking's deep research prompts, I've packaged them up:
👉 [Download the Full-Loop Starter Kit](https://arifkhan.net/resources/full-loop-starter-kit)
---
*Darwin, Hawking, and Borges all run on OpenClaw. The full system has been in production since late March 2026.*
If you set up this loop and run a review next Sunday, what mistake do you think will show up most often? (My hunch: it's almost never the one you'd predict.) Hit reply or find me on X and tell me what you catch.
---
Keep Reading
The Darwin Series
- Part 1: How Darwin scores and improves skills
- Part 2: How I Made My Agent Discover and Create New Agent Skills
- Part 3: The full compounding loop (You are here)
- *Next up in Part 4: The 5 biggest failures I hit while trying to make agents score themselves.*
Related Systems
- The Hawking Protocol: Agentic Deep Research
- Borges: Giving Agents Persistent Memory
Get the next essay
Subscribe to the Wiring Newsletter to get Part 4 delivered to your inbox next week.