AI agents require to-do lists to stay on track


If you’ve worked with AI agents, you’ve probably hit this wall:

Your agent starts strong. It analyzes requirements, writes some code, maybe runs a few tests. Then... it loses the thread. Starts repeating itself. Forgets what it was building. Or worse—claims it’s done when half the work is still incomplete.

This isn’t a limitation of the underlying LLM. It’s an architecture problem.

After deploying AI agents that build full-stack applications at justcopy.ai, we discovered something counterintuitive: the solution to making AI agents reliable isn’t more intelligence—it’s more structure.

Specifically, task-driven architecture: treating AI agents like software engineers with mandatory todo lists.

Before diving into the solution, let’s examine why most AI agent implementations fail:

You prompt: “Build a user authentication system”

The agent:

  • Creates a login component

  • Starts working on password validation

  • Decides to refactor the entire project structure

  • Begins implementing OAuth

  • Goes back to fix the login component

  • Forgets to actually create the signup flow

Symptom: Circular behavior with no clear progress

You have multiple agents working in sequence:

Agent A (Requirements): “Here’s what we need to build...”

Agent B (Implementation), after reading the requirements: “I’ll start from scratch!”

Symptom: Each agent reinvents the wheel instead of building on previous work

Agent: “I’ve completed the authentication system!”

You: “But there’s no password reset functionality...”

Agent: “Oh, I thought that was optional.”

Symptom: Agents finish without completing all necessary work

If these scenarios sound familiar, you’re not alone. These are fundamental issues with how we structure agent workflows.

The breakthrough came from observing how successful engineering teams work.

Good engineers don’t operate from vague directives. They have:

  • A clear list of tasks

  • Acceptance criteria for each task

  • A definition of “done”

  • Validation that work is complete

Why should AI agents be any different?

Task-driven architecture applies this same structure:

1. Agent receives explicit todo list with validation criteria
2. Agent gets next incomplete task
3. Agent executes task
4. Agent marks task complete with evidence
5. System validates completion
6. Repeat until all tasks complete
7. Agent transitions to next phase

No ambiguity. No guessing. No premature exits.

Let’s break down the architecture:

Agents maintain a structured todo list:

```js
{
  todos: [
    {
      id: "todo-1",
      description: "Initialize project sandbox environment",
      validation: "Sandbox returns 200 status code",
      completed: false,
      validationResult: null
    },
    {
      id: "todo-2",
      description: "Install project dependencies",
      validation: "node_modules directory exists",
      completed: false,
      validationResult: null
    }
  ]
}
```

Each todo includes:

  • Unique ID for tracking

  • Description of the task

  • Validation criteria (how to verify it’s done)

  • Completion status and timestamp

  • Validation results (evidence of completion)
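
For reference, here is a minimal sketch of what one of these records might look like after the work is done, using the fields listed above; the `completedAt` timestamp field and the evidence string are illustrative, not the exact shape used in production:

```js
// Hypothetical shape of a completed todo record; completedAt and the
// evidence string are illustrative placeholders
const completedTodo = {
  id: "todo-1",                                   // unique ID for tracking
  description: "Initialize project sandbox environment",
  validation: "Sandbox returns 200 status code",  // how to verify it's done
  completed: true,                                // completion status
  completedAt: "2025-01-01T12:00:00.000Z",        // completion timestamp
  validationResult: "GET /sandbox/health -> HTTP 200", // evidence of completion
};
```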

Every agent follows this mandatory loop:

```js
// Step 1: Initialize todos
agent.initializeTodos([
  { id: "todo-1", description: "...", validation: "..." },
  { id: "todo-2", description: "...", validation: "..." },
]);

// Step 2: Execute loop
while (true) {
  const nextTask = agent.getNextTodo();
  if (nextTask === "ALL_COMPLETE") {
    break;
  }

  // Execute the task
  const result = await agent.executeTask(nextTask);

  // Mark complete with evidence
  agent.markComplete(nextTask.id, result.evidence);
}

// Step 3: Complete phase
agent.completePhase();
```

The loop is self-documenting: you can look at the todo list and know exactly where the agent is in its workflow.
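
The `getNextTodo` and `markComplete` calls above are left abstract; a minimal sketch of how they could work over a plain todo array (assuming the field names from the earlier example) looks like this:

```js
// Sketch: return the first incomplete todo, or the ALL_COMPLETE sentinel
function getNextTodo(todos) {
  return todos.find(t => !t.completed) ?? "ALL_COMPLETE";
}

// Sketch: refuse to mark a todo complete unless evidence is supplied
function markComplete(todos, id, evidence) {
  const todo = todos.find(t => t.id === id);
  if (!todo) throw new Error(`Unknown todo: ${id}`);
  if (!evidence) throw new Error(`No evidence provided for "${todo.description}"`);
  todo.completed = true;
  todo.completedAt = new Date().toISOString();
  todo.validationResult = evidence;
}
```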

Here’s the critical piece: agents cannot proceed until all todos are verified complete.

```js
function canCompletePhase(todos) {
  const incomplete = todos.filter(t => !t.completed);

  if (incomplete.length > 0) {
    throw new Error(
      `Cannot complete phase - ${incomplete.length} todos incomplete:\n` +
      incomplete.map(t => `- ${t.description}`).join('\n')
    );
  }

  return true;
}
```

No exceptions. No shortcuts. If a single todo is incomplete, the agent cannot finish.

This simple gate prevents 80% of production failures.
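
For illustration, a phase-completion step can call this gate before doing anything else; a minimal sketch, assuming the agent object exposes its todo list directly:

```js
// Sketch: the gate runs before any phase transition is allowed
function completePhase(agent) {
  canCompletePhase(agent.todos); // throws if any todo is incomplete
  agent.phaseStatus = "complete";
  // hand the completed todos (with their evidence) to the next phase
  return agent.todos.map(t => ({ task: t.description, evidence: t.validationResult }));
}
```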

Here’s how task-driven architecture works in practice for a project setup agent:

Agent: Setup Manager

Todos:

  • Initialize cloud sandbox

  • Clone project template

  • Install dependencies (frontend + backend)

  • Start dev servers on ports 3000/3001

  • Verify both servers respond

Critical detail: This agent runs with temperature 0.0 (pure determinism). Infrastructure needs reliability, not creativity.

Each todo has specific validation:

  • “Sandbox returns 200 status code”

  • “node_modules directory exists with 500+ packages”

  • “curl localhost:3000 returns HTTP 200”

The agent cannot mark the phase complete until every validation passes.
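
The article doesn’t spell out how these checks run, but one plausible sketch is to execute each validation as a shell command inside the sandbox and keep the command output as evidence; the helper below and the exact commands are assumptions, not the production implementation:

```js
import { execSync } from "node:child_process";

// Hypothetical: run a validation command and capture its output as evidence
function runValidation(command) {
  try {
    const output = execSync(command, { encoding: "utf8", timeout: 30_000 }).trim();
    return { ok: true, output, evidence: `$ ${command}\n${output}` };
  } catch (err) {
    return { ok: false, output: "", evidence: `$ ${command}\n${err.message}` };
  }
}

// e.g. checking "curl localhost:3000 returns HTTP 200"
const check = runValidation("curl -s -o /dev/null -w '%{http_code}' http://localhost:3000");
if (!check.ok || check.output !== "200") {
  throw new Error(`Dev server check failed:\n${check.evidence}`);
}
```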

Task-driven architecture succeeds because it aligns with how LLMs actually work:

LLMs have context windows, not infinite memory. Without structure, they lose track of what they’ve done.

A todo list is external memory—the agent can always check: “What have I completed? What’s next?”

When you ask “Did you complete the task?”, LLMs might hallucinate success.

When you ask “Show me evidence the task is complete”, you force concrete verification.

Example:

  • ❌ “Did you install dependencies?”

  • ✅ “Show me that node_modules exists with ‘ls node_modules | wc -l’”

Evidence-based validation prevents false confidence.

Humans struggle with vague instructions. So do LLMs.

“Build authentication” is vague.

“Build authentication with: signup endpoint, login endpoint, JWT generation, session middleware, and curl test showing successful login” is verifiable.

Specificity prevents ambiguity.

Instead of one giant todo list, break work into milestones:

```markdown
## Milestone 1: User Authentication ⏳
**Goal:** Login and signup functionality
**Testing:** curl commands to verify endpoints work

- [ ] Create POST /api/auth/signup endpoint
- [ ] Create POST /api/auth/login endpoint
- [ ] Add JWT generation middleware
- [ ] Test signup with curl
- [ ] Test login with curl

## Milestone 2: User Dashboard 🚧
**Goal:** Main interface after login
**Testing:** Visit localhost:3000/dashboard in browser

- [x] Create Dashboard component
- [ ] Add navigation header
- [ ] Fetch user data from API
- [ ] Display user profile
```

Status indicators:

  • ⏳ Pending (not started)

  • 🚧 In Progress (currently working)

  • ✅ Complete (done and verified)

Users can watch progress in real-time. Agents can reference the plan to stay on track.

Bad approach:

Todo: Create Next.js app from scratch

Good approach:

Todos:

1. Copy battle-tested Next.js template from storage
2. Customize template with project-specific configs
3. Verify template builds successfully
4. Add project-specific features

Why? Templates include:

  • Optimized build configurations

  • Security best practices

  • Dependency compatibility

  • Testing infrastructure

Starting from templates prevents 80% of common setup issues.
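
As a rough sketch of the “good approach” above: copy a stored template, then verify it builds before any project-specific work starts. The paths and build command here are placeholders:

```js
import { cpSync } from "node:fs";
import { execSync } from "node:child_process";

// Hypothetical: start from a known-good template instead of scaffolding from scratch
function setupFromTemplate(templateDir, projectDir) {
  // Todo 1: copy the battle-tested template into the new project
  cpSync(templateDir, projectDir, { recursive: true });

  // Todo 3: verify the template builds before adding project-specific features
  execSync("npm install && npm run build", { cwd: projectDir, stdio: "inherit" });
}

setupFromTemplate("/templates/nextjs-base", "/workspace/my-project");
```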

After deploying task-driven agents in production, here’s what we learned:

Match temperature to task type:

  • Infrastructure/Setup: 0.0 (maximum determinism)

  • Requirements/Research: 0.4 (balanced)

  • Creative work (UI): 0.5 (some creativity)

Lower temperature = more reliable infrastructure.
Higher temperature = more creative solutions.

Don’t use the same temperature for all agents.
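
In practice this can be a simple per-agent configuration table; a hypothetical sketch using the values listed above:

```js
// Hypothetical temperature settings per agent type, matched to task type
const AGENT_TEMPERATURES = {
  setupManager: 0.0,        // infrastructure/setup: maximum determinism
  requirementsAnalyst: 0.4, // requirements/research: balanced
  uxDesigner: 0.5,          // creative work (UI): some creativity
};

function getTemperature(agentType) {
  const temperature = AGENT_TEMPERATURES[agentType];
  if (temperature === undefined) {
    throw new Error(`No temperature configured for agent type: ${agentType}`);
  }
  return temperature;
}
```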

Every todo needs concrete validation criteria:

❌ Bad: “Create login page”

✅ Good: “Create login page with username/password fields, submit button, and fetch to /api/auth/login that returns JWT in console.log”

Specificity prevents “it’s done” when it’s not actually done.

Track these metrics:

  • Agent completion rates by type

  • Average todos per successful phase

  • Token usage per agent

  • Error frequencies

Data reveals where agents struggle. Optimize based on evidence, not intuition.
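
A minimal sketch of what that tracking could look like, assuming one record is logged per agent run (field names and numbers are purely illustrative):

```js
// Hypothetical per-run metrics records
const runs = [
  { agentType: "setupManager", completed: true, todosDone: 5, tokensUsed: 12_000 },
  { agentType: "frontendEngineer", completed: false, todosDone: 3, tokensUsed: 40_000 },
];

// Roll up completion rate per agent type
function completionRateByAgent(runs) {
  const byAgent = {};
  for (const run of runs) {
    const stats = (byAgent[run.agentType] ??= { total: 0, completed: 0 });
    stats.total += 1;
    if (run.completed) stats.completed += 1;
  }
  return Object.fromEntries(
    Object.entries(byAgent).map(([agent, s]) => [agent, s.completed / s.total])
  );
}
```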

“Build authentication” is too vague.

Break it down:

  • Create signup endpoint

  • Create login endpoint

  • Add password hashing

  • Generate JWT tokens

  • Add session middleware

  • Test with curl

Granular todos = clear progress.

Don’t just trust agents to mark tasks complete.

Require evidence:

  • Screenshot showing UI works

  • curl output showing API responds

  • File exists check

  • Test passes

Evidence prevents hallucinated completion.

One agent that does everything = unmaintainable.

Instead: Specialized agents with clear boundaries

  • Requirements Analyst (research only)

  • UX Designer (flows only)

  • Frontend Engineer (UI only)

  • Backend Engineer (API only)

Separation of concerns applies to agents too.
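
Wiring those boundaries together can be as simple as a sequential pipeline in which each specialized agent plans its own todos, runs the standard loop, and hands its output to the next agent; a hedged sketch, where the agent interface (`planTodos`, `executeTask`, `completePhase`) is an assumption:

```js
// Hypothetical sequential pipeline: each phase builds on the previous phase's output
async function runPipeline(agents, initialBrief) {
  let handoff = initialBrief;
  for (const agent of agents) {
    agent.initializeTodos(agent.planTodos(handoff)); // agent plans todos from the handoff
    while (true) {
      const next = agent.getNextTodo();
      if (next === "ALL_COMPLETE") break;
      const result = await agent.executeTask(next);
      agent.markComplete(next.id, result.evidence);
    }
    handoff = agent.completePhase(); // validated output becomes the next agent's input
  }
  return handoff;
}
```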

If you’re building AI agents for production:

1. Structure > Intelligence. A mediocre LLM with good task structure beats GPT-5 with vague instructions.

2. Validation is Mandatory. Every task needs verifiable completion criteria. “Show me evidence” prevents hallucination.

3. Specialized Agents. One agent per phase. Clear handoffs. No overlap.

4. Monitor Everything. Track completion rates, token usage, costs. Optimize based on data.

5. Temperature by Task Type. Infrastructure: 0.0. Research: 0.4. Creative: 0.5.

Building production AI agents isn’t about the fanciest LLM or the most tokens.

It’s about architecture.

Task-driven design gives you:

  • ✅ Reliability (validation gates prevent incomplete work)

  • ✅ Debuggability (todo audit trail shows exactly where failures occur)

  • ✅ User trust (transparent progress builds confidence)

  • ✅ Scalability (clear structure enables agent coordination)

The pattern is simple: Todo list → Sequential execution → Validation → Transition.

The impact is profound: AI agents that actually work in production.

You can see task-driven architecture in action at justcopy.ai, where AI agents build full-stack applications using this exact pattern.

All insights shared here are from real production experience running multi-agent systems at scale.
