Information retrieval is a constrained decision problem. Given a corpus $C$, a question $q$, and a budget $B$ (latency, money, context window), the system's job is to select an evidence set $E \subset C$ that is sufficient for the user's decision while minimizing cost and error. Classical IR formalized this as ranking by relevance; practical systems add a second objective: coverage without redundancy, so that each additional item in $E$ contributes new information. Everything else—chunking, embeddings, reranking—is a mechanism for solving that selection problem.
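That objective can be written down directly. The sketch below assumes relevance and pairwise redundancy can be scored numerically; both scoring functions are placeholders for whatever ranking and similarity signals a system actually has, and the budget is a simple passage count standing in for latency or context limits.

```python
from typing import Callable, List

def select_evidence(
    candidates: List[str],
    relevance: Callable[[str], float],        # how well a passage addresses q
    redundancy: Callable[[str, str], float],  # overlap between two passages, in [0, 1]
    budget: int,                              # max passages we can afford
) -> List[str]:
    """Greedily pick passages whose marginal gain (relevance minus overlap
    with what is already selected) stays positive, up to the budget."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        def marginal_gain(passage: str) -> float:
            overlap = max((redundancy(passage, s) for s in selected), default=0.0)
            return relevance(passage) - overlap
        best = max(remaining, key=marginal_gain)
        if marginal_gain(best) <= 0:
            break  # nothing left contributes new information
        selected.append(best)
        remaining.remove(best)
    return selected
```

Chunking, embeddings, and reranking appear here only as ways of implementing `relevance` and `redundancy`; the selection objective itself does not change.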
A first‑principles factorization of the problem space follows from that definition. The key variables are (a descriptor sketch in code follows the list):
- Footprint: how small a set of passages could suffice for a correct answer (one span vs. many pages).
- Dispersion: whether those passages live together or are scattered across documents.
- Signal alignment: how well surface signals match the question (lexical identifiers, numerics) versus how much paraphrase and synonymy must be bridged.
- Authority and time: whether correctness depends on provenance, "valid‑as‑of" dates, or version lineage—not merely on semantic closeness.
- Structure: whether truth resides in tables, parameter blocks, code signatures, or section headers rather than undifferentiated prose.
- Redundancy: the degree of near‑duplication; the marginal gain from each additional item in $E$.
- Dynamics: how quickly the underlying facts change relative to indexing and inference.
- Economics: latency and per‑query cost limits, plus the repeat rate of questions (which determines the value of caching).
- Explanation demand: whether the user needs a justified conclusion (with reasoning and citations) or merely a pointer/value.
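One way to keep this factorization operational is to carry it as an explicit descriptor alongside each class of question. The field names and levels below are illustrative, not a standard schema.

```python
from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass
class RetrievalProblem:
    """Illustrative descriptor for one class of question against one corpus."""
    footprint: int            # rough number of passages a correct answer needs
    dispersion: Level         # co-located passages vs. scattered across documents
    lexical_alignment: Level  # identifiers/numerics present vs. paraphrase to bridge
    authority_time: Level     # weight of provenance and valid-as-of dates
    structure: Level          # truth in tables/code/parameters vs. prose
    redundancy: Level         # degree of near-duplication in the corpus
    dynamics: Level           # how fast facts change relative to the index
    repeat_rate: float        # fraction of queries seen before (value of caching)
    needs_explanation: bool   # justified conclusion vs. pointer/value
```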
With the problem space laid out, traditional retrieval‑augmented generation can be defined precisely. The pipeline chunks documents, represents those chunks in a semantic space, retrieves the nearest neighbors to the query (optionally after a light rerank), and has a language model compose a brief answer grounded in those passages. Its implicit commitments are now clear in terms of the variables above: a small footprint and low dispersion (answers live in a few local spans), acceptable signal alignment via semantics (paraphrase is the main hurdle), weak demands on authority/time (sources are roughly interchangeable, recency is not decisive), modest structure needs (text dominates tables), manageable redundancy (diversification can suppress clones), stable enough dynamics to keep an index fresh, and economics that tolerate a short generation pass. Under these conditions, traditional RAG is an efficient solver of the evidence‑selection problem.
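Under those commitments the pipeline itself is short. A minimal sketch, assuming an `embed()` function, an `index` with a nearest-neighbor `search()` method, and a `generate()` call on some language model; all three names are placeholders for whatever a given deployment actually uses.

```python
from typing import List

def answer(query: str, index, embed, generate, k: int = 5) -> str:
    """Traditional RAG: embed the query, pull the k nearest chunks,
    and have the model compose a brief, grounded answer."""
    query_vec = embed(query)                          # same space the chunks were indexed in
    chunks: List[str] = index.search(query_vec, k=k)  # nearest-neighbor retrieval
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Answer the question using only the passages below. "
        "Cite passages by number.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return generate(prompt)                           # short generation pass
```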
The remaining question is when to pay for the model's final step. A language model in RAG is best understood as an expensive summarizer with light reasoning. Its value is not in formatting; it lies in compressing distributed, inconsistent, or conditional evidence into a defensible conclusion. Formally, the generation step is justified when
$$\Delta = \text{Utility}(\text{LLM}(E)) - \text{Utility}(\text{extractive } E) - \text{Cost}(\text{LLM})$$
is positive—i.e., when the improvement in decision quality or user effort exceeds the token and latency cost. In practice, three circumstances consistently make $\Delta > 0$:
- Semantic integration: relevant passages describe the same fact with incompatible language, and the user needs one coherent statement with citations.
- Small conditional joins: the decision depends on conditions distributed across a few passages (version ranges, exceptions, feature flags).
- Justified conclusions: the product is an explanation the user can act on, not just a pointer—"affected/not affected, because…", with the lines that support it.
When none of these hold—single‑span answers, table cells, live numeric values—generation usually has negative value. Extraction, not prose, is the right end state.
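The threshold is a literal check, not a metaphor. In the sketch below the utility and cost figures are stand-ins; in practice they would come from offline evaluation of answer quality and from measured token and latency budgets.

```python
def should_generate(
    utility_llm: float,         # estimated decision quality / effort saved with generation
    utility_extractive: float,  # the same estimate for returning passages directly
    generation_cost: float,     # token + latency cost, expressed in the same units
) -> bool:
    """Pay for the generation step only when the delta defined above is positive."""
    delta = utility_llm - utility_extractive - generation_cost
    return delta > 0

# A single-span lookup: rephrasing adds almost nothing, so skip the model.
assert should_generate(0.62, 0.60, 0.05) is False
# A small conditional join across passages: integration is worth the cost.
assert should_generate(0.90, 0.55, 0.05) is True
```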
A short example makes the point concrete without lapsing into method‑shopping. A security engineer asks whether a current build is affected by a newly publicized OpenSSL vulnerability. In the factorization above: the footprint is small; dispersion is low (advisory, distro bulletin, internal changelog); signal alignment requires both identifiers and paraphrase; authority/time are mild (official sources, recent documents); structure is light (version strings, not complex tables); redundancy exists (near‑duplicate advisory quotes) but is manageable; dynamics are moderate; and explanation demand is real (someone must sign off on a decision). Retrieval selects exactly those passages and diversifies out duplicates. Generation then normalizes version notations, evaluates a simple inequality with a backport caveat, and returns a short, cited judgment. Attempting to replace that with deterministic parsers across vendors tends to be costlier and more brittle; returning raw snippets simply pushes the reasoning back onto a human.
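The judgment the model returns in that example rests on logic small enough to write out. A sketch with hypothetical version strings and a backport flag; real advisories, distro bulletins, and changelog formats vary, and parsing them across vendors is exactly the brittle part the example warns about.

```python
import re

def parse_version(v: str) -> tuple:
    """Normalize notations like '3.0.7' or '1.1.1k' into a comparable tuple:
    numeric components first, then any letter suffix."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)([a-z]*)", v.strip())
    if not m:
        raise ValueError(f"unrecognized version string: {v}")
    major, minor, patch, suffix = m.groups()
    return (int(major), int(minor), int(patch), suffix)

def is_affected(build: str, fixed_in: str, backport_applied: bool) -> bool:
    """Affected if the build predates the fixed release and the distro has not
    backported the patch (the caveat the internal changelog would supply)."""
    if backport_applied:
        return False
    return parse_version(build) < parse_version(fixed_in)

# Hypothetical inputs drawn from the three retrieved sources:
print(is_affected("3.0.6", fixed_in="3.0.7", backport_applied=False))  # True
print(is_affected("3.0.6", fixed_in="3.0.7", backport_applied=True))   # False
```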
This same first‑principles lens clarifies when traditional RAG is the wrong tool even if retrieval is still required. If authority/time dominate (compliance, policy), evidence selection hinges on provenance and effective dates rather than pure semantics; the retrieval objective is to return trustworthy and current items, not merely similar ones. If structure carries truth (tables, code), the evidence units are rows, cells, and signatures; arithmetic and joins should be deterministic while the model narrates. If dispersion is high or footprint is large (narratives, cross‑file code flows), chunk‑level snippets lose coherence; either permit long‑context reading or perform slow “read‑to‑extract” before any summary. If dynamics are high (live prices, operational status), the relevant evidence set isn’t in the index at all; the sufficient set is the API response.
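In the authority/time case, for instance, the change lands in the selection predicate rather than in the generator. A sketch of retrieval constrained by provenance and effective dates, assuming each indexed item carries source and validity metadata (the field names are illustrative):

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Set

@dataclass
class Passage:
    text: str
    source: str                # e.g. "official_policy", "vendor_advisory", "wiki"
    valid_from: date
    valid_to: Optional[date]   # None means still in force
    similarity: float          # semantic score from the base retriever

def select_authoritative(
    candidates: List[Passage],
    trusted_sources: Set[str],
    as_of: date,
    k: int,
) -> List[Passage]:
    """Filter on provenance and effective dates first; rank by similarity second."""
    eligible = [
        p for p in candidates
        if p.source in trusted_sources
        and p.valid_from <= as_of
        and (p.valid_to is None or as_of <= p.valid_to)
    ]
    return sorted(eligible, key=lambda p: p.similarity, reverse=True)[:k]
```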
A compact decision frame falls naturally out of the variables, without naming tools; a routing sketch in code follows the list.
- What answer is actually owed? Pointer, local join with a condition, table value, time‑aware comparison, or policy judgment with caveats.
- What is the smallest sufficient footprint and how dispersed is it? If it’s a handful of spans, selection by similarity with deduplication is plausible; if breadth is inherent, read more or shift to extraction before synthesis.
- Which signals determine correctness here—authority/time, structure, identifiers, paraphrase? Ensure evidence selection exposes those signals; do not ask a generator to fix a ranking objective it never saw.
- Is Δ positive? If the model is merely rephrasing, remove it. If it is integrating, checking a condition, or justifying a conclusion, pay the generation cost.
- Do the economics and dynamics support this loop? Cache repeated answers; call live sources when truth moves faster than your index.
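The frame reduces to a small routing function. The thresholds and strategy labels below are illustrative defaults, not calibrated recommendations.

```python
def route(
    footprint: int,             # smallest number of passages a correct answer needs
    dispersion_high: bool,      # evidence scattered across documents?
    authority_time_high: bool,  # provenance / valid-as-of dates decisive?
    structure_high: bool,       # truth in tables, code, or parameter blocks?
    dynamics_high: bool,        # facts move faster than the index?
    needs_explanation: bool,    # justified conclusion vs. pointer/value?
) -> str:
    """Map decision-frame answers onto a retrieval strategy label."""
    if dynamics_high:
        return "call the live source; the index cannot be sufficient"
    if authority_time_high:
        return "provenance- and time-filtered retrieval"
    if structure_high:
        return "structured lookup with deterministic joins; the model narrates"
    if footprint > 5 or dispersion_high:
        return "long-context reading or read-to-extract before any summary"
    if not needs_explanation:
        return "extractive answer; skip generation"
    return "traditional RAG with a short generation pass"
```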
Described this way, the “IR landscape” is a space of problems parameterized by evidence sufficiency, dispersion, signal mix, trust and time, structure, redundancy, dynamics, and economics. Traditional RAG occupies the region where sufficiency is small, dispersion is low, semantics dominate, trust/time are secondary, and a brief, reasoned explanation is the product. Outside that region the problem changes. The system should change with it—not because another tool is fashionable, but because the evidence‑selection objective and the value of generation have shifted.