Build note · Apr 6, 2026 · 7 min read

How Hawking Does Research That Doesn't Suck

A single-pass web search produces a confident-sounding summary with six gaps and three outdated facts. Hawking runs 3-6 progressive rounds until coverage hits 80%, and it's not allowed to stop early.


---

What Single-Pass Research Actually Produces

My research agent Scout runs twice daily — 07:30 and 13:30 IST. Its job: surface what matters across AI news, market data, and competitive intelligence. Feed that signal to APRIL for content. Feed it to me for judgment.

For months, Scout's research was fine. Readable. Reasonably sourced. Delivered on time.

Then Darwin scored it. (This is becoming a theme in this series. "It seemed fine" turns out to mean "I had no instrument to tell otherwise.")

60% pass rate on the research quality checklist. The failing assertions:

  • Claims without source attribution (presenting data without saying where it came from)
  • Data recency not labeled (presenting 30-day-old stats as current market data)
  • Not all keyword groups covered (systematic gaps in coverage — some categories always got skipped)
  • No evaluation of competing viewpoints (confirming whatever the first credible source said)

60% means it was failing on 2 of 5 quality dimensions, reliably, every single research cycle. Not occasionally. Every time.

The root cause was obvious once I saw it: single-pass research produces single-pass quality. You search, you find the first credible sources, you synthesize. You don't know what you missed.

So I built Hawking. And I borrowed Karpathy's AutoResearch loop again — the same mutate-score-iterate pattern that Darwin applies to skill optimization, applied instead to research quality.

The Decomposition Step: Where Most Research Goes Wrong

Most research agents skip this. You give them a question, they search the question. That produces answers to one version of the question, not the question itself. (And the version they answer is usually the most optimistic one, because that's what the most available sources confirm.)

Hawking's first move on any research request is decomposition. Break the question into 3-5 sub-questions across different domains:

  • Factual: what are the verified facts?
  • Current state: what is happening right now?
  • Forward-looking: what are the trends and projections?
  • Contrarian: what's the strongest argument against the main thesis? (mandatory)
  • Domain-specific: what specialized knowledge applies?
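The decomposition step can be sketched as a small function. This is an illustrative sketch, not Hawking's actual code; the `SubQuestion` type and the template phrasings are my own assumptions.

```python
# Hypothetical sketch of the decomposition step; names and templates are illustrative.
from dataclasses import dataclass, field

DOMAINS = ["factual", "current_state", "forward_looking", "contrarian", "domain_specific"]

@dataclass
class SubQuestion:
    domain: str
    text: str
    sources: list = field(default_factory=list)  # cited sources found this session

def decompose(question: str) -> list[SubQuestion]:
    """Break one research question into one sub-question per domain.
    The contrarian sub-question is mandatory; the others can be pruned to 3-5."""
    templates = {
        "factual": f"What are the verified facts about: {question}?",
        "current_state": f"What is happening right now with: {question}?",
        "forward_looking": f"What are the trends and projections for: {question}?",
        "contrarian": f"What is the strongest argument against the main thesis of: {question}?",
        "domain_specific": f"What specialized knowledge applies to: {question}?",
    }
    return [SubQuestion(d, templates[d]) for d in DOMAINS]
```

Each sub-question carries its own source list, which is what makes the coverage scoring later in the loop mechanical rather than a judgment call.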

The Loop: 3-6 Rounds of Progressive Deepening

Hawking Research Loop — 3-6 progressive rounds until coverage hits 80%

Once decomposition is done, Hawking runs its research loop. Each round:

Search: 2-3 web searches per uncovered sub-question with varied phrasing. Not the same query three ways — different angles. (If web_search fails, it falls back to DuckDuckGo, then Grok xAI API. Never stalls.)

Synthesize: Build a working draft. Mark every gap explicitly: `[GAP: DTCM Q1 2026 regulatory update — not found]`. Mark every unverified claim: `[model knowledge — verify]`. Every factual claim gets a source tag.

Score coverage:

```
coverage = (sub-questions with ≥1 cited source) / (total sub-questions)
```

A sub-question only counts as covered if it has a specific, dated source from this session.

Iterate if needed: Coverage < 80% and fewer than 6 rounds → refine the queries for uncovered sub-questions and run again. Coverage ≥ 80% after at least 3 rounds → compose.
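The four steps above reduce to a short loop. A minimal sketch, assuming `search` and `synthesize` are stubs you would wire to your own tooling; the thresholds match the post (80% coverage, 3 rounds minimum, 6 maximum):

```python
# Illustrative sketch of the research loop; search() and synthesize() are stubs.
COVERAGE_THRESHOLD = 0.8
MIN_ROUNDS, MAX_ROUNDS = 3, 6

def coverage(sub_questions: list[dict]) -> float:
    """Fraction of sub-questions with at least one cited source this session."""
    covered = sum(1 for sq in sub_questions if sq["sources"])
    return covered / len(sub_questions)

def research_loop(sub_questions, search, synthesize):
    for round_no in range(1, MAX_ROUNDS + 1):
        # Only re-search sub-questions that are still uncovered.
        for sq in (q for q in sub_questions if not q["sources"]):
            sq["sources"].extend(search(sq["text"]))  # 2-3 varied queries per call
        draft = synthesize(sub_questions)  # working draft with explicit [GAP] markers
        # Compose only after the minimum rounds AND the coverage threshold are met.
        if round_no >= MIN_ROUNDS and coverage(sub_questions) >= COVERAGE_THRESHOLD:
            return draft, round_no
    return draft, MAX_ROUNDS  # hit the round cap; brief ships with gaps marked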

Progressive Summarization: Why Round 6 Doesn't Explode Context

Progressive Summarization Across Research Rounds — 12,000 words becomes ~2,500

Six rounds of research at 2000 words each would be 12,000 words of context. That breaks most LLM workflows. Hawking solves this with progressive summarization:

```
Round 1: Full findings (~2000 words)
Round 2: [Round 1 compressed to 500 words] + Round 2 full findings
Round 3: [Rounds 1-2 compressed to 500 words] + Round 3 full findings
...
```

Only the current round stays in full detail. Prior rounds become a 500-word summary preserving key findings, source citations, and explicit gaps. Six rounds of research stays at ~2,500 words of context, not 12,000.

Every round's state is persisted to JSON. If Hawking crashes at Round 4, it resumes from Round 4 state — no restart.
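The summarization and checkpointing can be sketched together. Assumptions here: `compress()` stands in for an LLM summarization call (the sketch just truncates), and the state filename is made up:

```python
# Hedged sketch of progressive summarization with JSON checkpointing.
import json
import pathlib

STATE_FILE = pathlib.Path("hawking_state.json")  # illustrative filename

def compress(text: str, limit_words: int = 500) -> str:
    """Stub: in the real system this would be an LLM summarization call."""
    return " ".join(text.split()[:limit_words])

def build_context(prior_summary: str, current_findings: str) -> str:
    """Only the current round stays in full detail; priors stay at ~500 words."""
    return (prior_summary + "\n\n" if prior_summary else "") + current_findings

def checkpoint(round_no: int, summary: str, findings: str) -> None:
    """Persist round state so a crash at Round N resumes from Round N."""
    STATE_FILE.write_text(json.dumps(
        {"round": round_no, "summary": summary, "findings": findings}))

def resume() -> dict:
    """Pick up from the saved round if a state file exists; else start fresh."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"round": 0, "summary": "", "findings": ""}
```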

The Contrarian Obligation

This one's non-negotiable, from Nyk's Council.

Every Hawking brief must include at least one counter-argument to its main thesis. Not as a disclaimer. As a substantive finding.

Why force this? Because research agents have the same confirmation bias problem as human researchers — they find what they're looking for. (I named this the contrarian obligation and it's probably the highest-leverage single addition to Hawking.)

What's not allowed: no counter-arguments section at all, or a token "risks include general market uncertainty" that says nothing.
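Both failure modes can be linted mechanically. A minimal sketch, assuming briefs are markdown with a `## Counter-arguments` heading; the heading name, word-count floor, and token-phrase list are all my own illustrative choices:

```python
# Hypothetical lint for the contrarian obligation; thresholds are illustrative.
TOKEN_PHRASES = {"general market uncertainty"}  # phrases that say nothing

def has_substantive_contrarian(brief: str, min_words: int = 40) -> bool:
    """Reject briefs with no counter-argument section, or a token one."""
    marker = "## Counter-arguments"
    if marker not in brief:
        return False  # failure mode 1: no section at all
    section = brief.split(marker, 1)[1].split("\n## ", 1)[0]
    if any(p in section.lower() for p in TOKEN_PHRASES):
        return False  # failure mode 2: token disclaimer
    return len(section.split()) >= min_words
```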

Real Coverage Dashboard

Hawking Coverage Scoring Dashboard — Dubai STR query after Round 2

That's a real research state from a Dubai STR market query after Round 2. Coverage: 60%. Two sub-questions covered, three pending. The dashboard shows exactly what's covered, what's pending, and what Round 3 needs to target.

How Hawking Improves Its Own Research

Hawking's research skill gets Darwin-scored like every other skill. The current checklist:

```json
[
  {"id": "all_claims_sourced", "check": "Does every factual claim include a source with URL and date?"},
  {"id": "data_recency_labeled", "check": "Are all data points older than 7 days marked with recency disclaimer?"},
  {"id": "contrarian_present", "check": "Does the brief include at least one substantive counter-argument?"},
  {"id": "gaps_explicit", "check": "Are all unanswered sub-questions marked as explicit gaps?"},
  {"id": "coverage_at_threshold", "check": "Does the brief achieve >= 80% coverage score?"}
]
```
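The pass rate that Darwin reports falls out of this checklist directly. A sketch of the arithmetic, assuming each check resolves to a simple pass/fail (how the real system evaluates each check against a brief is out of scope here):

```python
# Sketch of a Darwin-style pass rate over the five checklist assertions.
CHECKLIST = ["all_claims_sourced", "data_recency_labeled",
             "contrarian_present", "gaps_explicit", "coverage_at_threshold"]

def pass_rate(results: dict[str, bool]) -> float:
    """results maps check id -> pass/fail; rate is passes over total checks."""
    return sum(results[c] for c in CHECKLIST) / len(CHECKLIST)
```

Failing 2 of 5 checks gives exactly the 60% the post opens with.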

Scout's research skill progression under Darwin:

  • Week 1 (pre-Hawking): 60% → failures on sourcing, recency, coverage
  • Week 2 (Hawking deployed): 67% → coverage improved, sourcing still failing
  • Week 3 (multi-round + sourcing rule): 80% → at threshold
  • Week 4 (contrarian added): 83%
  • Current: 80-83%, targeting 95%

The biggest single improvement: requiring 3 rounds minimum before composing. Before that, Hawking was hitting 80% coverage after 1 round on easy questions and stopping. The "3 rounds minimum" rule stabilized quality across both easy and hard research.

Build This for Your Own Research Workflow

Three principles that improve quality most:

1. Decompose before you search. Take any research question and write 4 sub-questions: factual, current state, forward-looking, and contrarian. Search each separately. Your coverage will double.

2. Score your coverage, don't just read your output. After your research, ask: which of my 4 sub-questions actually got answered with a source from this session? That number is your coverage score. If it's below 80%, run another round targeting the uncovered ones.

3. Mark what you don't know. `[GAP: X — not found]` and `[model knowledge — verify]` are the most useful annotations you can add. They prevent "I couldn't find this" from quietly becoming "this probably doesn't apply."

---

*Hawking runs as Scout's deep research module on OpenClaw. Weekly cron: Monday 05:00 IST for portfolio and market sweeps. On-demand: triggered by any agent that needs depth over breadth.*

*One question I'm still figuring out: how do you calibrate "good enough" for research quality without making everything a 6-round deep dive? If you've solved this, I'd genuinely like to know your approach.*


Arif Khan

Founder building companies where humans and AI agents have real jobs. Writing about what actually works.

