Darwin Starter Kit - Checklist Template for Self-Improving AI Agents | Arif Khan

Checklist.json Template

Binary checklists, not rubrics. Every assertion must be answerable YES or NO, so that two people (or two LLMs) would agree on the answer without ambiguity. Use 3-6 assertions per skill, each traced to a real failure.

{
  "skill_name": "your-skill-name",
  "assertions": [
    {
      "id": "assertion_1",
      "check": "Does the output do X? (YES/NO)"
    },
    {
      "id": "assertion_2",
      "check": "Does the output avoid Y? (YES/NO)"
    },
    {
      "id": "assertion_3",
      "check": "Does the output include Z? (YES/NO)"
    }
  ]
}
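Before you use a checklist.json, you can check it against the rules above: 3-6 assertions, unique ids, and every check phrased as a YES/NO question. Below is a minimal Python sketch of that check. The function name `validate_checklist` is hypothetical. So is the convention that each check ends with "(YES/NO)", which is copied from the template and not taken from any published tool.

```python
import json

# Hypothetical validator for a checklist.json shaped like the template.
# The 3-6 limit and the "(YES/NO)" suffix mirror this kit's guidance;
# the function name and messages are illustrative assumptions.
def validate_checklist(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the checklist passes."""
    data = json.loads(raw)
    problems = []
    if not data.get("skill_name"):
        problems.append("missing skill_name")
    assertions = data.get("assertions", [])
    if not 3 <= len(assertions) <= 6:
        problems.append(f"expected 3-6 assertions, found {len(assertions)}")
    seen_ids = set()
    for a in assertions:
        aid = a.get("id", "")
        if aid in seen_ids:
            problems.append(f"duplicate id: {aid}")
        seen_ids.add(aid)
        # Binary checks only: each question must be answerable YES or NO.
        if not a.get("check", "").rstrip().endswith("(YES/NO)"):
            problems.append(f"{aid}: check is not phrased as YES/NO")
    return problems

example = json.dumps({
    "skill_name": "your-skill-name",
    "assertions": [
        {"id": "assertion_1", "check": "Does the output do X? (YES/NO)"},
        {"id": "assertion_2", "check": "Does the output avoid Y? (YES/NO)"},
        {"id": "assertion_3", "check": "Does the output include Z? (YES/NO)"},
    ],
})
print(validate_checklist(example))  # → []
```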

FDE-Mode Proposal Format

Every new skill proposal must include evidence from specific sessions. "Seems useful" gets rejected. "These 3 sessions broke because this didn't exist" gets approved.

## PROPOSAL: [skill-name]
**Agent:** [which agent]
**Trigger:** [when this runs]

**Evidence from this week:**
- Session [date]: [what went wrong]
- Session [date]: [what went wrong]
- Session [date]: [what went wrong]

**Proposed checklist (3-6 assertions):**
1. Does the output do X? (YES/NO)
2. Does the output avoid Y? (YES/NO)
3. Does the output include Z? (YES/NO)

**Value:** [measurable impact in time or quality]

**Status:** AWAITING APPROVAL
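To keep proposals consistent, you can generate them from structured data and flag any proposal that lacks session evidence. Here is a hypothetical Python sketch of that approach. The `Proposal` class, its field names, and the gate in `problems()` are illustrative assumptions modeled on the template. The gate only requires that some evidence exists; it does not enforce a particular number of sessions.

```python
from dataclasses import dataclass

# Hypothetical data model for the proposal template; field names are
# assumptions chosen to match the template's headings.
@dataclass
class Proposal:
    skill_name: str
    agent: str
    trigger: str
    evidence: list[tuple[str, str]]  # (session date, what went wrong)
    assertions: list[str]            # binary YES/NO questions
    value: str
    status: str = "AWAITING APPROVAL"

    def problems(self) -> list[str]:
        """Apply the kit's gate: session evidence present, 3-6 assertions."""
        issues = []
        if not self.evidence:
            issues.append("no session evidence ('seems useful' gets rejected)")
        if not 3 <= len(self.assertions) <= 6:
            issues.append(f"expected 3-6 assertions, found {len(self.assertions)}")
        return issues

    def to_markdown(self) -> str:
        """Render the proposal in the template's markdown layout."""
        lines = [
            f"## PROPOSAL: {self.skill_name}",
            f"**Agent:** {self.agent}",
            f"**Trigger:** {self.trigger}",
            "",
            "**Evidence from this week:**",
        ]
        lines += [f"- Session {date}: {what}" for date, what in self.evidence]
        lines += ["", "**Proposed checklist (3-6 assertions):**"]
        lines += [f"{i}. {a}" for i, a in enumerate(self.assertions, start=1)]
        lines += ["", f"**Value:** {self.value}", "", f"**Status:** {self.status}"]
        return "\n".join(lines)
```

A proposal whose `problems()` list is non-empty would go back to its author before anyone reviews the rendered markdown.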

5-Step Manual Harvest

You don't need automation to run your first harvest; this is the manual version. The automated harvest runs this same process across all sessions in 20 minutes instead of 3 hours.

1. Spend 30 minutes reading last week's agent outputs

Don't look for obvious failures. Look for patterns of absence: things the agent consistently doesn't do that you'd want it to.
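If last week's outputs are saved as text, a rough first pass at finding patterns of absence is to count how often each element you expect is missing. This Python sketch assumes each expected element can be detected by a simple marker string. The function `absence_rates`, the feature names, and the sample outputs are hypothetical. A real check would probably need something smarter than substring matching.

```python
# Hypothetical helper: measure how often each expected element is *absent*
# across a week of agent outputs. Feature names and marker strings are
# illustrative assumptions; swap in whatever you expect your agent to do.
def absence_rates(outputs: list[str], expected: dict[str, str]) -> dict[str, float]:
    """Map each expected feature to the fraction of outputs lacking its marker."""
    if not outputs:
        return {name: 0.0 for name in expected}
    return {
        name: sum(marker not in text for text in outputs) / len(outputs)
        for name, marker in expected.items()
    }

week = [
    "Summary ... Sources: http://example.com",
    "Summary ... no sources here",
    "Summary ... also unsourced",
]
expected = {"cites_sources": "Sources:", "has_next_steps": "Next steps:"}
rates = absence_rates(week, expected)
# cites_sources is missing in 2 of 3 outputs; has_next_steps in all 3.
print(rates)
```

Features with high absence rates are candidates to investigate. A missing marker is a prompt to read those outputs closely, not proof of a failure.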

2. Check error logs for recurring tags

Which category do most corrections fall into? Formatting? Sourcing? Verification? The most frequent category is your first proposal.
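If corrections are logged with a category tag, counting the tags is a quick way to find the dominant category. This Python sketch assumes a hypothetical log format where each line starts with a bracketed tag such as `[sourcing]`. The function `top_correction_tags` and the sample lines are illustrative.

```python
from collections import Counter

# Hypothetical log format: one correction per line, tagged like "[sourcing] ...".
# The bracketed-tag convention is an assumption; adapt parsing to your logs.
def top_correction_tags(log_lines: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most common correction tags with their counts."""
    tags = Counter()
    for line in log_lines:
        line = line.strip()
        if line.startswith("[") and "]" in line:
            # Normalize case so "[Sourcing]" and "[sourcing]" count together.
            tags[line[1:line.index("]")].lower()] += 1
    return tags.most_common(n)

log = [
    "[sourcing] claim had no citation",
    "[formatting] table columns misaligned",
    "[sourcing] dead link in summary",
    "[verification] number not checked against source",
    "[Sourcing] quote not attributed",
]
print(top_correction_tags(log, n=1))  # → [('sourcing', 3)]
```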

3. Write one evidence-based proposal

Include a name, a trigger, evidence from specific sessions, 3 assertions drawn from real failures, and the expected value in time or quality.

4. Decide: new skill or tighter rule?

New skills add overhead. If the gap is a single rule missing from an existing skill, add the rule there instead.

5. Build the minimum version and test it

Run it against last week's outputs: does it catch what it was supposed to catch? If yes, you've completed a harvest cycle.
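You can sketch a minimal backtest in Python. Treat each assertion as a function that answers YES (True) or NO (False), then run it over outputs you already know were bad or fine. The check functions, `failed_assertions`, `catches_known_failures`, and the sample outputs are all hypothetical stand-ins. In practice, a human or an LLM judge would answer the questions. The requirement that known-good outputs pass cleanly is an extra condition added here to catch over-eager assertions.

```python
from typing import Callable

# Hypothetical backtest. Each assertion is modeled as a function returning
# True (YES) or False (NO); real runs would use a human or LLM judge.
Check = Callable[[str], bool]

def failed_assertions(output: str, checklist: dict[str, Check]) -> list[str]:
    """Return the ids of assertions the output fails (answered NO)."""
    return [aid for aid, check in checklist.items() if not check(output)]

def catches_known_failures(known_bad: list[str], known_good: list[str],
                           checklist: dict[str, Check]) -> bool:
    """True if every known-bad output fails at least one assertion
    and every known-good output passes all of them."""
    caught = all(failed_assertions(o, checklist) for o in known_bad)
    clean = not any(failed_assertions(o, checklist) for o in known_good)
    return caught and clean

# Illustrative checks and outputs, not real session data.
checklist = {
    "cites_sources": lambda o: "Sources:" in o,
    "no_placeholder_text": lambda o: "TODO" not in o,
}
last_week_bad = ["Report draft. TODO fill in", "Report with no citations"]
last_week_good = ["Final report. Sources: internal wiki"]
print(catches_known_failures(last_week_bad, last_week_good, checklist))  # → True
```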

Read the full series

The Darwin Series covers the complete architecture: scoring existing skills, discovering new ones, and closing the self-improvement loop.
