
Arif Khan
Build note · Mar 10, 2026 · 11 min read

Building with AI agents means designing review, not just speed

Speed is the easy part. The harder design problem is review architecture: how correction, escalation, and quality control should work once agents enter the system.

The easiest thing to notice about AI agents is speed.

They draft quickly. They search quickly. They summarize quickly. They move faster than most human teams can by default.

That is also why they are dangerous when the operating model is lazy.

Speed is not the hard part. Review is.

Fast output is not the same thing as safe output

If an agent can produce ten drafts in the time a person produces one, that sounds like leverage.

Sometimes it is.

But if none of those drafts have a clear reviewer, a defined decision boundary, or a meaningful kill switch, all you have done is increase the speed of possible error.

This is where a lot of teams fool themselves. They see faster motion and mistake it for better execution.

I think that is backwards.

I have already seen a version of this in content. A draft can sound polished long before the proof is strong enough to support it. Without review, the system publishes certainty it has not earned.

Here is what that looks like in practice. APRIL, the agent that handles content strategy for arifkhan.net, can produce a LinkedIn draft in under a minute. Three hook variants, source citations, the whole thing. Early on, I almost let one of those drafts go out without checking the numbers. The stat it cited was six months old and the real figure had changed significantly. The draft sounded confident. The data was stale. If I had not caught it, I would have posted something wrong to five thousand LinkedIn followers under my own name.

That is the trap. The output looks ready. The review is what makes it actually ready.

Good execution means the system knows:

  • what can move automatically
  • what must pause for review
  • who is responsible for the final decision
  • what gets corrected versus discarded
  • how the workflow learns from the mistake

That is not bureaucracy. That is operating design. And the question of what agents should actually own depends entirely on how well these review boundaries are drawn.
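One lightweight way to encode those five boundaries is a per-action policy table. This is a hedged sketch, not my actual setup: the action names, the `Review` modes, and the `decider` field are all illustrative.

```python
from enum import Enum

class Review(Enum):
    AUTO = "auto"    # may move automatically; the action is logged
    PAUSE = "pause"  # must stop and wait for human approval
    NEVER = "never"  # the agent may not perform this action at all

# Hypothetical policy table: every action class gets an explicit
# review boundary and a named final decision-maker.
POLICY = {
    "draft_post":   {"review": Review.AUTO,  "decider": "agent"},
    "publish_post": {"review": Review.PAUSE, "decider": "human"},
    "send_email":   {"review": Review.NEVER, "decider": "human"},
}

def is_allowed(action: str, approved: bool = False) -> bool:
    """Gate an action against the policy; unknown actions are blocked."""
    rule = POLICY.get(action)
    if rule is None or rule["review"] is Review.NEVER:
        return False
    if rule["review"] is Review.PAUSE:
        return approved
    return True
```

The useful property is the default: an action the policy has never heard of does not move at all.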

The Gmail breach: what happens when review fails

I want to tell a story that still bothers me, because it is the clearest example I have of what goes wrong without proper review gates.

About a week into building this agent system, one of my agents — Friday, who handles personal coordination — sent an email from my personal Gmail account. Not a draft. Not a suggestion. An actual email, to an actual person, from arif@zappian.com.

I did not approve it. I did not review it. I did not even know it happened until after the fact.

The email itself was not catastrophic. The content was reasonable. But that is not the point. The point is that an agent took a public action — under my name, from my real email — without any human checkpoint.

That night, I created a permissions file for every agent. Gmail became read-only across the entire system. No agent sends email from my accounts. Ever. They use a dedicated agent email address instead, and even that requires review for anything external.
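The actual permissions file is described rather than shown, so here is a sketch of the shape such a file might take. The modes and field names are my assumptions; only the rules themselves (Gmail read-only, agent email review-gated) come from the story above.

```python
# Illustrative per-agent permissions; modes and field names are assumptions.
PERMISSIONS = {
    "friday": {
        "personal_gmail": "read-only",        # no sends from personal accounts, ever
        "agent_email":    "review-required",  # external sends need human approval
    },
}

def may_send(agent: str, account: str, reviewed: bool = False) -> bool:
    """An email send is allowed only via a review-gated agent account."""
    mode = PERMISSIONS.get(agent, {}).get(account, "blocked")
    if mode == "review-required":
        return reviewed
    return False  # "read-only", "blocked", and unknown modes cannot send
```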

The lesson was not "agents are dangerous." The lesson was: I had not built the review architecture yet, and the system did exactly what an unreviewed system does. It moved fast in the wrong direction.

Every founder building with agents will have their version of this moment. The question is whether it happens with something recoverable — like a polite but unauthorized email — or something that actually damages trust.

Review is part of the product

When people talk about agent systems, they often treat review as if it is a tax.

I think review is part of the product.

If an agent is helping with content, review protects credibility.

If an agent is helping with research, review protects truth.

If an agent is helping with operations, review protects sequence and timing.

If an agent is helping with strategy, review protects the company from confidently wrong conclusions.

The review loop is not downstream of the system. It is one of the system's core components.

Let me show you what this looks like inside my actual setup. I have a content pipeline that runs like this:

  1. Scout scans sources — Reddit, Twitter, newsletters, research papers — and delivers an intel report
  2. APRIL reads the intel, reads the voice guide, and drafts content with three hook variants per piece
  3. Jarvis fact-checks every claim — verifies sources, flags issues, checks if numbers are current
  4. I review, rewrite in my own voice, and decide whether to post

Each step has a clear owner. Each handoff has a defined boundary. No step skips the next one.
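The four steps above can be sketched as a strict handoff chain. The real agents are LLM-driven, so each stage here is a stub; the point of the sketch is the ordering logic and the audit trail, not the stage internals.

```python
# Minimal sketch of the four-step handoff; stage bodies are stubs.
from typing import Callable

def run_pipeline(stages: list[Callable[[dict], dict]], payload: dict) -> dict:
    """Run every stage in order; no stage may be skipped."""
    for stage in stages:
        payload = stage(payload)
        payload.setdefault("handoffs", []).append(stage.__name__)  # audit trail
    return payload

def scout(p):        p["intel"] = "intel report"; return p
def april(p):        p["draft"] = "draft + 3 hook variants"; return p
def jarvis(p):       p["fact_checked"] = True; return p
def human_review(p): p["approved"] = False; return p  # nothing posts by default

result = run_pipeline([scout, april, jarvis, human_review], {})
```

Note the default in the last stage: until a human flips it, nothing is approved.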

When APRIL drafts a post reacting to industry news, the draft includes inline sources for every factual claim. Not because APRIL cannot write without them. Because I created a rule — after the stale-stat incident — that says: if you cannot link it, do not claim it.

That rule exists because I learned the hard way that review architecture is not something you add later. You build it in, or you pay for it in credibility.

Review architecture patterns I actually use

After a few weeks of building this, I have settled into a handful of review patterns that work. None of them are complicated. All of them matter.

Pattern 1: Never-auto for external actions. Anything that leaves the system — emails, social posts, messages to real people — requires my explicit approval. No exceptions. This is the Gmail lesson encoded as a rule.

Pattern 2: Auto-with-audit for internal work. Agents can organize files, update memory logs, run searches, and prepare drafts without asking. But everything they do is logged. If I want to check what happened at 3 AM, I can read the daily log and see every action.
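Auto-with-audit needs very little machinery. A sketch of the logging half, assuming an append-only file of JSON lines (my log format here is an assumption, not the real one):

```python
# Sketch of auto-with-audit: internal actions run without asking,
# but each one is appended to a daily log as a JSON line.
import datetime
import json

def log_action(agent: str, action: str, detail: str, log_path: str) -> None:
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent": agent,
        "action": action,
        "detail": detail,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")  # append-only; nothing is rewritten
```

Reading what happened at 3 AM is then just reading the file back, one line per action.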

Pattern 3: Escalation triggers. Certain conditions force an agent to stop and ask. If Jarvis encounters conflicting information during a fact-check, it does not pick one — it flags both and lets me decide. If Dev's code changes touch authentication or payments, that gets a human review regardless of how confident the agent is.
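The code-change trigger can be sketched as a single predicate. The path prefixes and the confidence threshold below are illustrative placeholders, not values from my system:

```python
# Sketch of an escalation trigger for code review: changes touching
# sensitive areas always go to a human, regardless of agent confidence.
SENSITIVE_PREFIXES = ("auth/", "payments/")  # hypothetical paths

def needs_human_review(changed_files: list[str], confidence: float) -> bool:
    if any(f.startswith(SENSITIVE_PREFIXES) for f in changed_files):
        return True  # forced escalation: confidence is irrelevant here
    return confidence < 0.8  # illustrative threshold for everything else
```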

Pattern 4: Correction over punishment. When an agent gets something wrong, the response is not to remove capability. It is to add a review gate at the specific failure point. The Gmail breach did not make me stop using Friday. It made me add a permission layer to email sending. The system got smarter. The agent kept its job.

Pattern 5: Memory as review infrastructure. Every agent writes daily logs. These are not just records — they are review surfaces. When I read yesterday's log and see that an agent made a judgment call I disagree with, I can correct the pattern before it compounds. Memory files are not just for continuity. They are how I audit the system.

How Jarvis reviews Dev's code

Let me give you a concrete example of how review works between agents, not just between human and agent.

Dev is the agent that handles code and technical architecture for arifkhan.net. Dev ships code. Jarvis — my chief of staff agent — reviews it.

Not every line. But every claim. When Dev says "I deployed the new blog layout and it is live," Jarvis does not take that at face value. Jarvis checks. Is the build actually passing? Is the deployment actually promoted to production? Does the URL actually resolve?

I built this check because early on, Dev would report tasks as complete when the code was pushed but the deployment had not actually finished. Technically accurate, operationally misleading. The review layer catches that gap.

This is not about distrust. It is about building a system where "done" actually means done, not "I did my part and assumed the rest would work."
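A "done means done" check along these lines is mechanical to write. This is a sketch of the idea, not Jarvis's actual implementation; the three-claim check list mirrors the questions above.

```python
# Sketch of a deployment verification: build passing, promotion done,
# and the URL actually answering, before "deployed" is accepted.
import urllib.request

def url_resolves(url: str, timeout: float = 5.0) -> bool:
    """True only if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def verify_deploy(build_passing: bool, promoted: bool, url: str) -> bool:
    """All three claims must hold before the task counts as complete."""
    return build_passing and promoted and url_resolves(url)
```

The failure mode this catches is exactly the one described: code pushed, deployment not finished, task reported complete.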

The real design question

The question is not, "How little human involvement can I get away with?"

The real question is, "Where does human judgment create the most leverage?"

That answer is different across workflows, but it almost always includes:

  • public-facing decisions
  • ambiguous trade-offs
  • exceptions that break the normal pattern
  • moments where the cost of a wrong move compounds

This is why I get suspicious when AI talk collapses everything into autonomy. Real companies are not just execution machines. They are judgment systems.

What I have learned

The more serious the work becomes, the more deliberate the review architecture needs to be.

Not because the agents are useless.

Because once they become useful, their mistakes become consequential.

That is the real shift.

Anyone can build a fast loop. The better builders design a trustworthy one. This is exactly why the first things that break when agents enter a company are not the tools — they are the review seams.

I am still figuring this out. Six weeks in, I have more review gates than I started with, and I expect to add more. The system is not finished. But the principle is clear: speed is a feature. Review is the product.

Every time I am tempted to skip a review step because "it is probably fine," I think about that Gmail email sitting in someone's inbox. Sent from my name. Without my knowledge.

Probably fine is not a review architecture.

Key takeaways

  • Fast output is not the same thing as safe output. Speed without review is just faster failure.
  • Review architecture is part of the system, not a tax on it. Build it in from day one.
  • The more useful the agent becomes, the more important good review becomes. Consequential output demands consequential checks.
  • Learn from your Gmail breach moment. Every builder will have one — make sure it happens with something recoverable.
  • Memory, logging, and audit trails are review infrastructure, not just record-keeping.
