Current LLM interfaces hide the choices that determine both quality and cost. A single request triggers internal reasoning, external actions such as retrieval or small program runs, and some amount of evaluation and editing. Because all of this is fused into one opaque process, users cannot say where to invest effort. They cannot ask the system to search widely but draft briefly, to run a few experiments before writing, or to spend the bulk of the budget on checking and improving what was written. The opportunity is to expose a compact control surface that lets people express computational intent, see how it maps to resources, and get behavior that matches the importance of the task.
One way to design the system is to treat the user’s request as a policy and the runtime’s response as a plan. A policy is a short statement of objectives, limits, and preferences. For example: produce one best answer, favor exploration over polish, and cap both cost and latency. A plan is the concrete execution derived from that policy: how many candidates to try, which tools to call, how to divide budget between internal reasoning and external actions, how many evaluation cycles to run, and when to stop. The value of this separation is practical. People get a stable language for intent, while the system retains freedom to choose tactics that fit the moment.
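As a concrete illustration, here is a minimal sketch of the two objects in Python. The field names are hypothetical, not a fixed schema; this is only one way the separation could look.

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    """What the user asks for: objectives, limits, preferences."""
    objective: str = "one_best_answer"       # or "diverse_options"
    max_cost_usd: float = 0.50               # macro budget
    soft_deadline_s: float = 120.0           # latency cap
    # Preferred share of spend across Think / Act / Evaluate.
    split: dict = field(default_factory=lambda: {"think": 0.4, "act": 0.3, "evaluate": 0.3})

@dataclass
class Plan:
    """How the runtime intends to execute under that policy."""
    n_candidates: int        # how many drafts to try
    tools: list              # which external tools to call
    token_budget: int        # internal reasoning plus any judge-model passes
    cpu_budget_s: float      # external actions: retrieval, scripts, experiments
    max_eval_cycles: int     # bounded number of evaluate-and-edit rounds
```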
A simple decomposition keeps the interface small and maps cleanly to hardware. Think is internal model work such as planning, drafting, and synthesis. It mostly consumes accelerator time, and the natural unit is model tokens, including hidden deliberation. Act is everything outside the model: retrieval, browsing, API calls, script execution, and small experiments. It mostly consumes CPU, I/O, and network, and the natural units are CPU seconds, request counts, egress, and memory. Reading external sources fits under Act, since it interacts with the world rather than synthesizing text. Evaluate is analysis of outputs: self‑critique, static checks, test runs, citation verification, reranking, and edits guided by those checks. Evaluate draws on both sides. Some checks use a judge model and therefore spend tokens. Others run tools or tests and therefore spend CPU. The mapping is an approximation rather than a rule, but it is accurate enough to make budgets legible and to separate GPU‑style spend from CPU‑style spend.
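To make that split concrete, the sketch below assigns each phase the units it is metered in and totals a phase’s cost from measured usage. The unit names and any price table passed in are assumptions for illustration, not real rates.

```python
# Which resource units each phase is metered in (an approximation, as noted above).
PHASE_UNITS = {
    "think":    {"tokens"},                          # accelerator-style spend
    "act":      {"cpu_s", "requests", "egress_mb"},  # CPU, I/O, and network
    "evaluate": {"tokens", "cpu_s"},                 # judge tokens plus tool and test runs
}

def phase_cost(phase: str, usage: dict, prices: dict) -> float:
    """Dollar cost of one phase from measured usage and assumed per-unit prices."""
    return sum(qty * prices[unit]
               for unit, qty in usage.items()
               if unit in PHASE_UNITS[phase])

# Example (illustrative prices):
# phase_cost("act", {"cpu_s": 12, "requests": 30}, {"cpu_s": 1e-4, "requests": 5e-5})
```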
With those pieces in place, a request can carry only a few fields. The user sets a macro budget in dollars or credits and a soft deadline. The user states an output intent, such as one answer or a few diverse options. The user also provides a split preference across Think, Act, and Evaluate, either as a ratio or through presets such as Explore‑heavy, Evaluate‑heavy, or Balanced. From that policy the runtime produces a plan. It allocates a token budget to reasoning and to any judge‑model passes, a CPU and network budget to external actions, and a bounded number of evaluation cycles. It then executes in multiple turns rather than assuming a single final pass. Draft lightly to frame options, act to gather evidence or run a quick experiment, evaluate and edit, and repeat while the expected value of another cycle justifies the remaining spend. The output can be refined across turns and does not need to be final on the first attempt.
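A sketch of that multi-turn loop follows, assuming the Policy fields above and hypothetical callables draft, act, evaluate, and worth_another_cycle; the phase accounting is deliberately simplified.

```python
def run_turns(policy, draft, act, evaluate, worth_another_cycle):
    """Turn a policy into per-phase budgets, then loop: draft, act, evaluate."""
    spend = {"think": 0.0, "act": 0.0, "evaluate": 0.0}
    budget = {phase: share * policy.max_cost_usd for phase, share in policy.split.items()}
    answer, evidence = None, []
    while True:
        # Think: frame or refine the answer within the remaining token budget.
        answer, cost = draft(answer, evidence, budget["think"] - spend["think"])
        spend["think"] += cost
        # Act: gather evidence or run a quick experiment.
        new_evidence, cost = act(answer, budget["act"] - spend["act"])
        spend["act"] += cost
        evidence.extend(new_evidence)
        # Evaluate: check the draft and edit it in light of the evidence.
        answer, cost = evaluate(answer, evidence, budget["evaluate"] - spend["evaluate"])
        spend["evaluate"] += cost
        remaining = policy.max_cost_usd - sum(spend.values())
        # Stop once another cycle is unlikely to be worth the money left.
        if remaining <= 0 or not worth_another_cycle(answer, remaining):
            return answer, spend
```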
Two mechanics make this predictable without adding complexity. First, the runtime uses checkpoints because token totals and fetch sizes are uncertain in advance. Before each new phase it estimates remaining spend and either continues, rebalances, or stops if projections would breach the budget or the deadline. Second, the runtime returns clear telemetry. It reports tokens used for reasoning and for judging, counts of external requests, CPU seconds consumed, egress, latency, and the number of evaluation cycles. That record lets teams price workflows, compare presets, and see which levers moved quality.
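A sketch of both mechanics, with assumed field names and a deliberately crude projection heuristic:

```python
from dataclasses import dataclass

@dataclass
class Telemetry:
    """What the runtime reports back after a run."""
    reasoning_tokens: int = 0
    judge_tokens: int = 0
    external_requests: int = 0
    cpu_seconds: float = 0.0
    egress_mb: float = 0.0
    latency_s: float = 0.0
    eval_cycles: int = 0

def checkpoint(spent_usd, projected_phase_usd, budget_usd,
               elapsed_s, projected_phase_s, deadline_s):
    """Before each phase: continue, rebalance, or stop based on projections."""
    if spent_usd + projected_phase_usd > budget_usd or elapsed_s + projected_phase_s > deadline_s:
        return "stop"
    if spent_usd + 2 * projected_phase_usd > budget_usd:
        return "rebalance"   # running hot: shift share between phases before continuing
    return "continue"
```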
This control surface is not just for steering one run. It is also an instrument for learning how much compute different problems warrant. Each request yields a small dataset: the task description or tag, the Think‑Act‑Evaluate allocation, the realized spend, and an outcome signal. Outcome can be explicit user feedback, agreement among independent runs, test pass rates, or correction volume on a second pass. From these records the system can fit simple spend‑versus‑quality curves by task family and by phase. Code synthesis often benefits when a larger share goes to evaluation and tests. Literature reviews often benefit when a larger share goes to action and breadth. The curves will be noisy, but even coarse patterns help.
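The sketch below shows one shape the per-request records and the curve fit could take. The log-linear quality model and the use of numpy are assumptions, chosen only because they are simple and legible.

```python
import numpy as np
from collections import defaultdict

# One record per finished request:
# (task_family, {"think": ..., "act": ..., "evaluate": ...}, realized_spend_usd, outcome_signal)
records = [
    # ("code_synthesis", {"think": 0.3, "act": 0.2, "evaluate": 0.5}, 0.42, 0.81),
]

def fit_quality_curves(records):
    """Fit quality ~ a + b * log(spend) per task family; coarse but informative."""
    by_family = defaultdict(list)
    for family, _alloc, spend, outcome in records:
        by_family[family].append((spend, outcome))
    curves = {}
    for family, points in by_family.items():
        if len(points) < 2:
            continue  # not enough data for even a crude fit
        log_spend = np.log([s for s, _ in points])
        quality = np.array([q for _, q in points])
        slope, intercept = np.polyfit(log_spend, quality, 1)
        curves[family] = (intercept, slope)
    return curves
```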
Once those patterns exist, the planner can offer grounded defaults. Before a request runs, it can propose an allocation that usually meets a stated quality and cost target for similar tasks, together with an honest range. The user can accept or adjust the split. The plan remains accountable to the budget and deadline and still stops when the expected gain from an additional cycle is low. There is no need to decree that the learning loop be aggressive or conservative; the right choice depends on the environment. If errors are costly, a plan that emphasizes evaluation and early stopping may dominate. If exploration is cheap and surprises are valuable, a plan that shifts more budget to breadth can be justified. The point is to make the trade‑offs explicit and to base defaults on observed outcomes rather than on guesswork.
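One way the planner could propose such a default is to look at past runs of the same task family that met the quality target and report the median split with an interquartile range. The similarity filter and the percentile-based range are assumptions of this sketch, not a prescribed method.

```python
import numpy as np

def propose_allocation(records, task_family, quality_target):
    """Median Think/Act/Evaluate split among past runs that met the target, plus a range."""
    hits = [alloc for family, alloc, _spend, outcome in records
            if family == task_family and outcome >= quality_target]
    if not hits:
        return None  # fall back to a preset such as Balanced
    phases = ("think", "act", "evaluate")
    split = {p: float(np.median([h[p] for h in hits])) for p in phases}
    total = sum(split.values())
    split = {p: v / total for p, v in split.items()}   # renormalize so shares sum to 1
    spread = {p: (float(np.percentile([h[p] for h in hits], 25)),
                  float(np.percentile([h[p] for h in hits], 75))) for p in phases}
    return split, spread
```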
This arrangement improves day‑to‑day decisions without expanding the interface. When reasoning and action share one throttle, forecasting cost and latency is guesswork. When the phases have distinct budgets, you can see whether dollars went to deep reasoning, to breadth of evidence and experiments, or to checking and revision. Over a few runs the compute shape of common jobs becomes clear. Research‑heavy synthesis tends to spend more on Act and Evaluate. Concept work often spends more on Think. Those patterns travel across teams and tasks, and they make budgeting less of a gamble and more of a choice.