Information Bandwidth in Reinforcement Learning


Understanding Sample Efficiency Through Signal Density

When I first read the “LoRA Without Regret” blog post, one claim caught my attention: policy gradient algorithms learn roughly 1 bit of information per episode. This insight elegantly explains why LoRA—with its mere thousands of trainable parameters—works so remarkably well for RL fine-tuning of large language models.

But what does this actually mean? And if policy gradients learn so little per episode, how much do other RL algorithms learn? In this post, I’ll work through an information-theoretic framework to answer these questions rigorously.


TL;DR: The Main Results

Policy gradient’s hard limit: Compressing 1000+ tokens into one scalar reward creates an information ceiling of $\leq \log_2(B)$ bits per episode. For binary feedback, this is $\leq 1$ bit/episode—explaining why training needs thousands of episodes and why LoRA’s modest capacity (300-500× excess) suffices.

Actor-critic’s theoretical potential: By bootstrapping historical knowledge through a learned critic, actor-critic methods generate dense per-token feedback. Under independence assumptions, this gives an upper bound of $\leq T \log_2(B_\delta)$ bits/episode. For $T=1000$ tokens and 8-bit TD errors, this ceiling is $\leq 8000$ bits/episode—potentially 8000× higher than policy gradient.

The practical implication: LoRA already provides 300-500× excess capacity relative to policy gradient’s information ceiling. Even with substantial improvements in actor-critic methods, LoRA’s capacity appears sufficient for foreseeable applications.

| Algorithm | Signal Density | Information Upper Bound |
| --- | --- | --- |
| Policy Gradient | 1 scalar/episode | $\leq 1$ bit/episode (binary) |
| Actor-Critic | $T$ scalars/episode | $\leq 8000$ bits/episode (assumes independent signals) |

Note: The 8000 bits/episode ceiling assumes independent TD errors—an assumption violated in practice by bootstrap methods. The actual achievable information bandwidth remains an open empirical question.


Part 1: The Mathematical Framework

Setup: Language Model Fine-Tuning as an MDP

When fine-tuning an LLM with RL, we work with a specific type of MDP:

  • States $s$: Token sequences $(x_1, \ldots, x_t)$
  • Actions $a$: Next token $x_{t+1}$ from vocabulary
  • Transitions: Deterministic (append token: $s' = s \circ a$)
  • Rewards $R_\xi$: Determined by unknown parameter $\xi$ (preferences, objectives)

Key property: Transitions are known and deterministic. All uncertainty is in the reward function $\xi$.
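
To make this concrete, here is a minimal sketch of the token-level MDP in Python. The class and field names are my own illustration (not from any library), and the reward function stands in for the unknown $R_\xi$:

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Minimal sketch of the token-level MDP described above.
# All names are illustrative, not taken from any particular library.
@dataclass(frozen=True)
class TokenMDP:
    vocab: Sequence[str]                           # action space: the next token
    reward_fn: Callable[[Tuple[str, ...]], float]  # stand-in for the unknown R_xi
    max_len: int = 1000                            # episode horizon T

    def step(self, state: Tuple[str, ...], action: str) -> Tuple[str, ...]:
        # Transitions are known and deterministic: append the chosen token.
        return state + (action,)

    def is_terminal(self, state: Tuple[str, ...]) -> bool:
        return len(state) >= self.max_len

    def terminal_reward(self, state: Tuple[str, ...]) -> float:
        # All uncertainty lives in reward_fn (the unknown parameter xi).
        return self.reward_fn(state)
```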

Information-Theoretic Lens

To enable rigorous analysis, we use a Bayesian framework as a mathematical modeling tool:

  1. Put a prior $p(\xi)$ over reward parameters
  2. This induces a distribution $p(\pi^*)$ over optimal policies
  3. Each $\xi$ determines a unique optimal policy $\pi^*_\xi$

This doesn’t claim algorithms maintain explicit posteriors—it’s an analytical device that makes the learning signal $S$ and optimal policy $\pi^*$ well-defined random variables, enabling computation of mutual information $I(S; \pi^*)$.

Definition (Information Bandwidth):

$$\mathcal{B} = I(S; \pi^*)$$

This measures how many bits of uncertainty about the optimal policy $\pi^*$ are resolved per episode by the learning signal $S$.
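
As a sanity check on this definition, here is a toy computation of $I(S; \pi^*)$ with two candidate reward parameters and a binary signal. The prior and likelihood numbers are made up purely for illustration:

```python
import numpy as np

# Toy illustration of B = I(S; pi*). Two possible reward parameters xi, each
# determining its own optimal policy pi*_xi (A1), and a binary signal S whose
# distribution depends on xi. All probabilities below are invented for illustration.
p_xi = np.array([0.5, 0.5])                    # prior p(xi)
p_s_given_xi = np.array([[0.9, 0.1],           # p(S | xi = 0)
                         [0.2, 0.8]])          # p(S | xi = 1)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Because pi* = f(xi) is one-to-one here, I(S; pi*) = I(S; xi) = H(S) - H(S | xi).
p_s = (p_xi[:, None] * p_s_given_xi).sum(axis=0)
h_s_given_xi = sum(p_xi[i] * entropy(p_s_given_xi[i]) for i in range(len(p_xi)))
print(f"I(S; pi*) = {entropy(p_s) - h_s_given_xi:.3f} bits (<= log2(B) = 1 bit)")
```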

Two Minimal Assumptions

Assumption A1 (Unique Optimum): Each $\xi$ determines a unique optimal policy $\pi^*_\xi$.

Justification: Generic for neural networks with many parameters. Floating-point precision breaks ties; exact degeneracy is measure-zero.

Assumption A2 (Finite Resolution): The learning signal has finite effective resolution—it can take at most $B$ distinguishable values.

Justification: Holds exactly for binary preferences ($B=2$) or Likert scales ($B=4$ to $7$). Approximately true for continuous signals with noise, finite precision, or practical distinguishability limits.


Part 2: Policy Gradient’s 1-Bit Ceiling

The Algorithm

Policy gradient (REINFORCE) works as follows:

  1. Sample trajectory $\tau = (s_0, a_0, \ldots, s_T)$
  2. Observe scalar return $G = R_\xi(s_T)$
  3. Update: $\theta \leftarrow \theta + \alpha \nabla_\theta \log p_\theta(\tau) \cdot G$

Learning signal: $S = G$ (one scalar per episode)
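
Below is a minimal REINFORCE sketch on a toy sequence task. The tabular softmax policy and the all-zeros-is-rewarded objective are invented for illustration; only the three-step update rule above is the point:

```python
import numpy as np

# Minimal REINFORCE sketch on a toy sequence task (illustrative; the real setting
# uses an LLM policy and human/reward-model feedback).
rng = np.random.default_rng(0)
V, T, lr = 4, 3, 0.5                 # vocab size, sequence length, step size
theta = np.zeros((T, V))             # one softmax distribution per position

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(tokens):
    # Stand-in for the unknown R_xi: +1 if every token is 0, else 0 (binary, B = 2).
    return float(all(t == 0 for t in tokens))

for episode in range(2000):
    # 1. Sample a trajectory from the current policy.
    tokens, grads = [], []
    for t in range(T):
        p = softmax(theta[t])
        a = rng.choice(V, p=p)
        tokens.append(a)
        g = -p                      # gradient of log softmax w.r.t. the logits ...
        g[a] += 1.0                 # ... is e_a - p for the chosen action a
        grads.append(g)
    # 2. Observe a single scalar return G for the whole episode.
    G = reward(tokens)
    # 3. Update: theta <- theta + lr * grad(log p(tau)) * G.
    for t in range(T):
        theta[t] += lr * grads[t] * G

print("P(token 0) per position:", [round(softmax(theta[t])[0], 2) for t in range(T)])
```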

The Information Ceiling

Theorem 1 (Policy Gradient Information Ceiling):

Under assumptions A1 and A2, policy gradient’s information bandwidth satisfies:

$$I(G; \pi^*) \leq \log_2(B) \text{ bits per episode}$$

Intuition: Information about $\pi^*$ must flow through $\xi$ (by data processing inequality), and the scalar $G$ has entropy bounded by its resolution $\log_2(B)$.

Detailed Proof

We prove this in two steps: first showing information flow must go through $\xi$, then bounding the entropy.

Step 1: Information Flow via Data Processing Inequality

Since $\pi^*_\xi$ is a deterministic function of $\xi$ alone (by Assumption A1), the optimal policy $\pi^*$ contains no information beyond what $\xi$ specifies. Therefore, once we condition on $\xi$, observing $G$ provides no additional information about $\pi^*$. Formally: $\pi^* \perp G | \xi$.

This conditional independence gives us the Markov chain:

$$G \to \xi \to \pi^*$$

Reasoning: The optimal policy $\pi^*$ is a deterministic function of $\xi$ alone (by A1). Therefore, once $\xi$ is given, $\pi^*$ is fixed. Any other variable, including $G$, becomes conditionally independent of $\pi^*$ given $\xi$.

By the data processing inequality, post-processing cannot increase information:

$$I(G; \pi^*) \leq I(G; \xi)$$

Step 2: Entropy Upper Bound

By definition of mutual information: $$I(G; \xi) = H(G) - H(G|\xi)$$

Since conditional entropy is non-negative $H(G|\xi) \geq 0$: $$I(G; \xi) \leq H(G)$$

By Assumption A2, $G$ takes at most $B$ distinct values. For any discrete random variable $X$ with support size $|X| \leq B$:

$$H(X) = -\sum_x p(x) \log_2 p(x) \leq \log_2(|X|) \leq \log_2(B)$$

Equality holds when $X$ is uniform over its support.

Combining both steps: $$I(G; \pi^*) \leq I(G; \xi) \leq H(G) \leq \log_2(B)$$ ∎

This is a hard ceiling regardless of sequence length $T$, model complexity, or computational resources.

Concrete Examples

  • Binary preferences ($B=2$): $\leq 1$ bit/episode
  • Likert scale ($B=5$): $\leq 2.3$ bits/episode
  • 8-bit resolution ($B=256$): $\leq 8$ bits/episode
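
A few lines reproduce these ceilings:

```python
import math

# Information ceilings log2(B) for the reward resolutions listed above.
for label, B in [("binary preferences", 2), ("5-point Likert", 5), ("8-bit resolution", 256)]:
    print(f"{label:20s} B = {B:3d}  ->  ceiling = {math.log2(B):.2f} bits/episode")
```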

Why This Matters

The compression bottleneck: A typical LLM generation has $T \sim 1000$ tokens, each chosen from hundreds of possibilities. Policy gradient compresses all this rich structure—which words worked well, where the response went wrong, which reasoning steps succeeded—into one number.

This structural compression is why:

  • Training needs thousands of episodes: With 1 bit/episode and binary feedback, 1000 episodes gives $\leq 1000$ bits total
  • LoRA works well: As we’ll see, LoRA provides 300-500× more capacity than this ceiling
  • Adding parameters doesn’t help: The bottleneck is signal sparsity, not model capacity

Part 3: Actor-Critic’s Dense Signal Upper Bound

The Algorithm

Actor-critic methods (A3C, PPO with value function) maintain two components:

  • Actor $\pi_\theta(a|s)$: The policy with parameters $\theta$
  • Critic $V_\phi(s)$: Value function estimating expected return from state $s$, with parameters $\phi$

Training loop: For each episode, perform updates at each timestep:

  1. Rollout: Generate trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$ using current policy $\pi_\theta$

  2. Compute TD errors at each timestep $t$: $$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

    This measures how much better/worse the observed outcome was compared to the critic’s expectation.

  3. Update critic toward the TD target: $$\phi \leftarrow \phi + \alpha_\phi \cdot \delta_t \cdot \nabla_\phi V_\phi(s_t)$$

    This semi-gradient update moves $V_\phi(s_t)$ toward the bootstrap target $r_t + \gamma V_\phi(s_{t+1})$, treating $V_\phi(s_{t+1})$ as fixed.

  4. Update actor (policy gradient with advantage): $$\theta \leftarrow \theta + \alpha_\theta \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \delta_t$$

    Move probability toward actions with positive $\delta_t$ (better than expected), away from negative $\delta_t$.

Key difference from policy gradient: Instead of one scalar return $G$ per episode, we get $T$ TD errors $\{\delta_t\}_{t=0}^{ T - 1 }$—one feedback signal per timestep.

Learning signal: $S = \{\delta_t\}_{t=0}^{ T - 1 }$ (one signal per token)

Instead of waiting until the end for one scalar, we get feedback at every step.
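
Below is a minimal one-step actor-critic sketch on a toy chain environment. The environment, tabular actor/critic, and hyperparameters are invented for illustration; what matters is the per-timestep TD error and the two updates described above:

```python
import numpy as np

# Minimal one-step actor-critic on a toy chain (illustrative only, not the LLM setting).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.99
theta = np.zeros((n_states, n_actions))   # actor: softmax logits per state
v = np.zeros(n_states)                    # critic: tabular value function
alpha_theta, alpha_phi = 0.1, 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def env_step(s, a):
    # Toy dynamics: action 1 moves right, action 0 stays; reward only at the final state.
    s_next = min(s + a, n_states - 1)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

for episode in range(500):
    s, done = 0, False
    while not done:
        p = softmax(theta[s])
        a = rng.choice(n_actions, p=p)
        s_next, r, done = env_step(s, a)
        # TD error: one feedback signal at every timestep.
        target = r if done else r + gamma * v[s_next]
        delta = target - v[s]
        # Critic: semi-gradient update toward the bootstrap target.
        v[s] += alpha_phi * delta
        # Actor: policy gradient weighted by the TD error (advantage estimate).
        grad_log = -p
        grad_log[a] += 1.0
        theta[s] += alpha_theta * delta * grad_log
        s = s_next

print("learned values:", np.round(v, 2))
print("P(move right):", [round(softmax(theta[s])[1], 2) for s in range(n_states)])
```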

Where Does the Dense Signal Come From?

A natural question: “The environment only provides rewards at certain steps. How can actor-critic get more information per episode?”

The key insight: The extra information doesn’t come from the environment in the current episode. It comes from the critic’s accumulated knowledge from all past episodes.

The critic $V_\phi(s)$ acts as compressed memory of all historical data. It learns to predict expected future returns from any state $s$ based on thousands of previous rollouts.

Let’s re-examine the TD error with this perspective:

$$\delta_t = \underbrace{(r_t + \gamma V_\phi(s_{t+1}))}_{\text{Observed outcome}} - \underbrace{V_\phi(s_t)}_{\text{Historical expectation}}$$

  • $V_\phi(s_t)$: The critic’s prediction, representing the historical average of what should happen from state $s_t$
  • $r_t + \gamma V_\phi(s_{t+1})$: The observed outcome of taking action $a_t$, incorporating one step of real environmental feedback $r_t$

The TD error $\delta_t$ is a “surprise” signal—how much better or worse reality was compared to historical expectation. This signal is information-rich precisely because it compares against a learned model of the reward structure. The critic bootstraps knowledge from the past to provide dense, step-by-step feedback in the present.

In short: Actor-critic achieves higher information bandwidth by efficiently reusing historical data via the critic. Instead of treating each episode as independent (like policy gradient), the critic allows every step in the current episode to be evaluated against distilled knowledge of all previous trials.

The tradeoff: This bootstrapping comes with an inherent cost—because all TD errors depend on the same learned value function $V_\phi$, they become structurally correlated. The very mechanism that enables dense signals also introduces the correlation barrier that prevents us from achieving the full $T \log_2(B_\delta)$ bound.

Let’s make the correlation concrete with an example:

Scenario: Suppose your critic systematically overestimates values by 20% everywhere (a common early-training phenomenon), where $V_{\text{true}}$ denotes the true value function.

Then for a sequence of states:

  • $\delta_0 = r_0 + \gamma(1.2 \cdot V_{\text{true}}(s_1)) - 1.2 \cdot V_{\text{true}}(s_0)$
  • $\delta_1 = r_1 + \gamma(1.2 \cdot V_{\text{true}}(s_2)) - 1.2 \cdot V_{\text{true}}(s_1)$
  • $\delta_2 = r_2 + \gamma(1.2 \cdot V_{\text{true}}(s_3)) - 1.2 \cdot V_{\text{true}}(s_2)$

Notice that all TD errors share the same 1.2× bias. If $\delta_0$ is surprisingly negative (indicating overestimation), then $\delta_1$ and $\delta_2$ are likely also negative—they’re positively correlated.

This correlation isn’t a bug or implementation flaw. It’s inherent to bootstrap methods: using the same approximate value function $V_\phi$ across all states creates shared biases that induce correlation. Even as the critic improves, bootstrap methods maintain correlation structure because consecutive TD errors depend on overlapping value estimates.

This structural correlation is why the independence-based bound of $T \log_2(B_\delta)$ is likely not achievable in practice—the actual achievable information bandwidth remains an open question.
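
A quick numerical illustration of this shared-bias effect: assume a hypothetical family of critics whose only error is a random multiplicative bias on the true values. Across such critics, $\delta_0$ and $\delta_1$ move together, precisely because they inherit the same bias:

```python
import numpy as np

# Shared-bias illustration (toy numbers). Each hypothetical critic is V_hat = bias * V_true;
# the resulting TD errors delta_0 and delta_1 are strongly positively correlated across
# critics because both inherit the same bias.
rng = np.random.default_rng(0)
gamma, n_critics = 0.99, 10_000

rewards = np.array([1.0, 1.0, 1.0, 0.0])       # toy reward sequence
# True values from the Bellman recursion v[t] = r[t] + gamma * v[t+1], with v[T] = 0.
v_true = np.zeros(len(rewards) + 1)
for t in reversed(range(len(rewards))):
    v_true[t] = rewards[t] + gamma * v_true[t + 1]

biases = rng.normal(1.0, 0.2, size=n_critics)  # e.g. 1.2 means 20% overestimation
deltas = (rewards[None, :]
          + gamma * biases[:, None] * v_true[None, 1:]
          - biases[:, None] * v_true[None, :-1])

print("corr(delta_0, delta_1) across critics:",
      round(float(np.corrcoef(deltas[:, 0], deltas[:, 1])[0, 1]), 3))
```

With an unbiased critic every $\delta_t$ would be zero here; all the variation comes from the shared bias, so the measured correlation is essentially 1.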

Extended Assumption

Assumption A2’ (Effective TD Resolution): For the purpose of upper bound analysis, we model each TD error $\delta_t$ as having effective resolution $B_\delta$ distinguishable values.

This is a modeling assumption for theoretical analysis. TD errors are continuous in practice, but we model their effective information content as finite due to:

  • Finite precision arithmetic (float32: ~7 significant digits)
  • Neural network optimization noise
  • Gradient update granularity

We use $B_\delta = 256$ (8 bits) as an illustrative example, representing roughly 2-3 significant digits of precision. The actual effective resolution is task-dependent and difficult to measure empirically. The resulting bound should be interpreted as demonstrating the potential order-of-magnitude advantage (hundreds to thousands of times higher than policy gradient) rather than a precise quantitative prediction.

For other reasonable choices:

  • $B_\delta = 16$ (4 bits): Upper bound becomes 4000 bits/episode
  • $B_\delta = 1024$ (10 bits): Upper bound becomes 10,000 bits/episode

The qualitative conclusion—that actor-critic has dramatically higher theoretical potential—holds across this range.
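
For reference, the corresponding ceilings at $T = 1000$:

```python
import math

# Upper bounds T * log2(B_delta) for the illustrative TD-error resolutions above.
T = 1000
for B_delta in (16, 256, 1024):
    print(f"B_delta = {B_delta:5d}  ->  ceiling = {T * math.log2(B_delta):,.0f} bits/episode")
```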

Important caveat: Unlike A2 (which applies to actual observations), A2’ applies to derived learning signals. The resulting bound $I(S; \pi^*) \leq T \log_2(B_\delta)$ is a mathematical upper limit on potential information flow, not a tight characterization of actual information.

The Information Ceiling

Theorem 2 (Actor-Critic Information Upper Bound):

Under assumptions A1 and A2’, actor-critic’s information bandwidth satisfies:

$$I(\{\delta_t\}; \pi^*) \leq T \log_2(B_\delta) \text{ bits per episode}$$

This upper bound is mathematically sound but likely not achievable in practice due to correlation between TD errors.

For $T=1000$ and $B_\delta=256$: This theoretical ceiling is $8000$ bits/episode—8000× higher than policy gradient’s 1 bit.

Intuition: With $T$ independent signals each carrying $\log_2(B_\delta)$ bits, we get $T \log_2(B_\delta)$ total bits. However, this assumes independence—an assumption violated by the bootstrap structure of TD learning.

Detailed Proof

We bound the entropy of the TD error sequence.

Step 1: Chain Rule Decomposition

By the chain rule for entropy: $$H(\delta_0, \delta_1, \ldots, \delta_{ T - 1 }) = \sum_{t=0}^{ T - 1 } H(\delta_t | \delta_0, \ldots, \delta_{ t - 1 })$$

Using $\delta_{ < t } = (\delta_0, \ldots, \delta_{ t - 1 })$: $$H(\{\delta_t\}) = \sum_{t=0}^{ T - 1 } H(\delta_t | \delta_{ < t })$$

Step 2: Bounding Conditional Entropy

By Assumption A2’, each $\delta_t$ takes at most $B_\delta$ values. For any random variable $X$ with $|X| \leq B_\delta$:

$$H(X | Y) \leq \log_2(B_\delta)$$

for any conditioning variable $Y$. This holds because:

  • Conditional entropy is maximized when $X$ is uniform over its support
  • Even with conditioning, $X$ still has at most $B_\delta$ values

Therefore: $$H(\delta_t | \delta_{ < t }) \leq \log_2(B_\delta)$$

Critical observation: This bound is tight only when $\delta_t$ is nearly independent of $\delta_{ < t }$: $$H(\delta_t | \delta_{ < t }) \approx H(\delta_t)$$

If perfectly correlated: $H(\delta_t | \delta_{ < t }) = 0$. Reality falls between these extremes.

Step 3: Summing Over Time

$$H(\{\delta_t\}) = \sum_{t=0}^{ T - 1 } H(\delta_t | \delta_{ < t }) \leq T \log_2(B_\delta)$$

Step 4: Information Flow Bound via Data Processing Inequality

Since $\pi^*$ is a deterministic function of $\xi$ alone (by Assumption A1), knowing $\xi$ completely determines $\pi^*$. Conditioning on $\xi$ therefore leaves the TD error sequence with no additional information about $\pi^*$, giving the Markov chain: $$\{\delta_t\} \to \xi \to \pi^*$$

By the data processing inequality, post-processing cannot increase information: $$I(\{\delta_t\}; \pi^*) \leq I(\{\delta_t\}; \xi)$$

Mutual information is always bounded by the entropy of either variable: $$I(\{\delta_t\}; \xi) \leq H(\{\delta_t\})$$

Combining with Step 3: $$I(\{\delta_t\}; \pi^*) \leq I(\{\delta_t\}; \xi) \leq H(\{\delta_t\}) \leq T \log_2(B_\delta)$$ ∎
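
As a toy sanity check of the two extremes in Step 2, compare the empirical joint entropy of $T$ independent discrete signals against $T$ perfectly correlated copies (parameters chosen only so the exact answer is easy to see):

```python
import numpy as np

# Joint entropy of T signals with B_delta levels: independent vs. perfectly correlated.
rng = np.random.default_rng(0)
T, B_delta, n_samples = 5, 4, 200_000

def joint_entropy_bits(samples):
    # Empirical entropy of the joint sequence (rows are samples), in bits.
    _, counts = np.unique(samples, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

independent = rng.integers(0, B_delta, size=(n_samples, T))
correlated = np.repeat(rng.integers(0, B_delta, size=(n_samples, 1)), T, axis=1)

print(f"ceiling T*log2(B_delta)   = {T * np.log2(B_delta):.1f} bits")
print(f"independent signals:  H ~= {joint_entropy_bits(independent):.2f} bits")
print(f"perfectly correlated: H ~= {joint_entropy_bits(correlated):.2f} bits")
```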

Critical note on the proof: This bound treats TD errors as information-carrying signals about the reward parameter $\xi$. However, in the actual actor-critic algorithm, $\delta_t$ depends on both $\xi$ (through rewards) and the current training state ( $\theta, \phi$). The Markov chain $\{\delta_t\} \to \xi \to \pi^*$ is valid when we view the TD error sequence as the raw observational data from which we want to infer $\xi$, treating the influence of $(\theta, \phi)$ as part of the observation process.

The bound represents an idealized scenario modeling what information about $\xi$ (and thus $\pi^*$) could theoretically be extracted from observing the TD error sequence. This theoretical upper limit assumes: (1) the value function perfectly captures all relevant information from $\xi$, and (2) TD errors are independent. In practice, both assumptions are violated—value functions are approximate, and TD errors share the same learned $V_\phi$, creating correlation. Thus this is an upper limit on potential information flow rather than a characterization of what actual algorithms achieve.

⚠️ Critical Caveat: This bound assumes independent TD errors—an assumption fundamentally violated by bootstrap methods. Successive TD errors are structurally correlated: both $\delta_t$ and $\delta_{t+1}$ depend on $V(s_{t+1})$, sharing value function biases. This correlation is not merely an implementation issue but inherent to TD learning, preventing the $T \log_2(B_\delta)$ bound from being achievable. The actual achievable information bandwidth remains an open empirical question.

Understanding the Gap Between Theory and Practice

For $T=1000$, $B_\delta=256$:

  • Policy gradient (binary): $\leq 1$ bit/episode
  • Actor-critic (theoretical ceiling): $\leq 8000$ bits/episode

The theoretical ceiling represents what could be achieved with independent TD errors. In practice, the achievable information bandwidth is likely much lower due to inherent correlation in bootstrap methods.

Multiple factors contribute to this gap:

  1. TD error correlation (our main focus): Bootstrap methods create structural dependencies between consecutive TD errors

  2. Value function approximation error: Neural networks can’t perfectly represent value functions, introducing systematic biases beyond correlation

  3. Finite sample effects: Each state is visited finitely often, creating statistical uncertainty in value estimates

  4. Optimization challenges: Critic training is unstable, sensitive to hyperparameters, and may converge slowly

  5. Information utilization efficiency: Even with perfect TD signals, gradient descent may not efficiently extract all available information

Our hypothesis: Correlation (factor 1) is a substantial barrier that accounts for a significant portion of the gap. However, disentangling these factors empirically is an important direction for future work.


Part 4: Why LoRA Works—The Capacity-Information Match

The Capacity Argument

Consider typical RLHF setup:

  • Episodes: $N = 1000$
  • LoRA: rank $r=8$, dimension $d=4096$
  • Binary preferences: $B=2$

Information available: $$N \times \log_2(B) = 1000 \times 1 = 1000 \text{ bits}$$

LoRA capacity (rough estimate):

  • Parameters: $2rd = 65{,}536$
  • Effective bits per parameter: 5-8 (rough estimate)
    • Justification: After training, parameters likely take on tens to hundreds of distinguishable values (roughly $2^5$ to $2^8$) that meaningfully affect behavior. This is much less than float32 precision (23 mantissa bits) but more than crude quantization (2-3 bits). The 5-8 bit range represents an educated guess rather than a measured quantity.
  • Total capacity: $65{,}000 \times 5$ to $65{,}000 \times 8$ = ~300,000-500,000 bits

The ratio: LoRA provides ~300-500× more capacity than policy gradient’s 1000-bit ceiling.

Interpretation: Even if our “bits per parameter” estimate is off by 2-3×, the qualitative conclusion holds: LoRA has substantial excess capacity relative to policy gradient’s information ceiling. This explains why LoRA works well despite its parameter efficiency.
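
The same back-of-the-envelope arithmetic in code (the bits-per-parameter range is the same guess as above, not a measured quantity):

```python
# Rough capacity-vs-information ratio for the setup above.
r, d = 8, 4096                                   # LoRA rank and hidden dimension
n_episodes, bits_per_episode = 1000, 1           # binary preferences: log2(2) = 1 bit
lora_params = 2 * r * d                          # = 65,536
info_bits = n_episodes * bits_per_episode        # = 1,000 bits

for bits_per_param in (5, 8):                    # guessed effective bits per parameter
    capacity = lora_params * bits_per_param
    print(f"{bits_per_param} bits/param: capacity = {capacity:,} bits, "
          f"ratio = {capacity / info_bits:.0f}x the information ceiling")
```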

The Key Insight

LoRA works because the parameter bottleneck isn’t binding. With policy gradient’s sparse signals, you have far more capacity than information to store. The bottleneck is signal density (1 bit/episode), not model capacity.

Why full fine-tuning is overkill: A 7B model has ~7 billion parameters to absorb ~1000 bits of information, roughly seven million parameters per bit even before counting multiple effective bits per parameter. LoRA's modest parameter count is a far closer match to policy gradient's information ceiling.

Empirical consistency: LLM-RL needs 1,000-10,000 episodes to converge, consistent with accumulating 1,000-10,000 bits at 1-3 bits/episode (depending on reward granularity).

Implications for Actor-Critic

If actor-critic could achieve substantially better correlation management (e.g., 100 bits/episode):

  • With 1000 episodes: 100,000 bits of information
  • LoRA capacity: Still 3-5× excess capacity
  • Conclusion: LoRA remains sufficient even for efficient actor-critic

Only with dramatic improvements approaching the theoretical ceiling would LoRA capacity become limiting—but this may be unachievable due to fundamental correlation in bootstrap methods.


Part 5: Implications and Future Directions

Why Policy Gradient + LoRA Dominates

The current state of LLM fine-tuning is dominated by policy gradient + LoRA because this combination achieves a practical equilibrium:

  • Stable: Single optimization target, no critic training instability
  • Parameter-efficient: LoRA provides 300-500× excess capacity relative to policy gradient’s information ceiling
  • Sample-inefficient: $\leq 1$ bit/episode with binary preferences requires thousands of episodes

This explains why practitioners default to this approach despite its sample inefficiency—the stability-efficiency tradeoff favors it over alternatives.

Understanding the Limitations

Our analysis reveals two distinct types of barriers:

1. Information-theoretic ceilings (provably unavoidable):

  • Policy gradient: $\leq \log_2(B)$ bits/episode—cannot be exceeded by any algorithm that learns from scalar episode returns
  • Actor-critic: $\leq T \log_2(B_\delta)$ bits/episode—cannot be exceeded by any algorithm that learns from $T$ signals with resolution $B_\delta$

2. Bootstrap correlation barrier (specific to TD methods):

  • The $T \log_2(B_\delta)$ ceiling assumes independent TD errors
  • Bootstrap methods inherently violate this: all $\delta_t$ share the same learned $V_\phi$
  • The actual achievable information bandwidth remains unmeasured

Key open question: How much of the gap between 1 bit/episode (policy gradient) and 8000 bits/episode (theoretical ceiling) is bridgeable? Is there a practical middle ground, or does bootstrap correlation prevent substantial improvements?

Research Directions

Priority 1: Empirically measure the achievable information bandwidth

  • Develop methods to quantify $I(S; \pi^*)$ in trained models
  • Measure TD error correlation across architectures and tasks
  • Establish empirical baselines for what’s actually achievable vs theoretical ceilings

Priority 2: Improve actor-critic stability at LLM scale

  • Low-rank value function architectures (matching LoRA structure)
  • Ensemble critics to reduce bias
  • Better optimization techniques for joint actor-critic training

Priority 3: Explore decorrelation techniques

  • Eligibility traces ( $\lambda$-returns) to diversify bootstrap targets
  • Multi-step returns with varying horizons
  • Note: These may have fundamental limits due to bootstrap structure

Priority 4: Circumvent bootstrap correlation entirely

  • Monte Carlo methods (no bootstrapping, but high variance)
  • Model-based RL (learn environment dynamics, plan without bootstrap)
  • Hybrid approaches that blend MC and TD

Priority 5: Engineer denser ground-truth signals

  • Process rewards provide intermediate feedback beyond just outcomes
  • Per-token human annotations increase signal granularity
  • Both approaches increase $B$ directly, sidestepping the bootstrap issue entirely

Not needed: More parameters. As shown in Part 4, LoRA already provides 300-500× excess capacity relative to policy gradient’s information ceiling. Even with 100× improved actor-critic (100 bits/episode × 1000 episodes = 100,000 bits), LoRA would still have 3-5× excess capacity.

Terminology: Two Senses of “Fundamental”

We use “fundamental” to describe barriers at different levels:

  1. Information-theoretic fundamentals: Theorems 1 and 2 establish ceilings that cannot be exceeded by any algorithm using those signal types, regardless of computational resources or algorithmic sophistication

  2. Fundamental to bootstrap methods: TD error correlation appears inherent to methods that use $V(s')$ to estimate targets for $V(s)$—this creates structural dependencies. However, this may not be fundamental to RL in general (Monte Carlo methods avoid it)

The distinction matters: information-theoretic barriers are absolute, while bootstrap correlation might be circumventable with alternative RL paradigms.


Limitations and Future Work

Theoretical limitations:

  1. Our bounds assume deterministic optimal policies (A1), which may not hold exactly in stochastic or degenerate settings
  2. Assumption A2’ (effective TD resolution) is not empirically validated—the choice of $B_\delta = 256$ is illustrative
  3. We model algorithms using idealized Bayesian inference, which doesn’t capture actual optimization dynamics

Empirical gaps:

  1. The achievable information bandwidth for actor-critic methods remains unmeasured
  2. We don’t empirically validate the correlation hypothesis or quantify its contribution
  3. The practical gap between theoretical ceilings and actual performance needs experimental investigation

Future work could:

  • Develop methods to directly measure information bandwidth in trained models
  • Empirically quantify TD error correlation across different architectures and tasks
  • Test whether decorrelation techniques (e.g., eligibility traces) improve information utilization
  • Explore whether the theory extends to other RL settings (model-based, offline, multi-agent)
  • Investigate the relationship between value function approximation quality and achievable information bandwidth

Conclusion

This information-theoretic analysis reveals why policy gradient + LoRA dominates current LLM fine-tuning and what barriers limit potential improvements:

The 1-bit bottleneck: Policy gradient’s compression of rich token-level dynamics into scalar returns creates a $\leq \log_2(B)$ bits/episode ceiling. For binary feedback, this is $\leq 1$ bit/episode—explaining both why 1000s of episodes are needed and why LoRA’s modest capacity (300-500× excess) suffices.

Actor-critic’s theoretical potential: Bootstrap methods can theoretically achieve $\leq T \log_2(B_\delta)$ bits/episode under independence assumptions—orders of magnitude higher than policy gradient. However, the structural correlation inherent to TD learning (all $\delta_t$ share the same $V_\phi$) creates a gap between theoretical ceiling and achievable performance. How much of this gap is bridgeable remains an open empirical question.

The path forward: As detailed in Part 5, progress requires first empirically measuring what’s achievable, then pursuing improvements through better critic training and decorrelation, non-bootstrap alternatives like Monte Carlo or model-based methods, or engineering denser supervision signals. Understanding these tradeoffs is essential for next-generation LLM fine-tuning methods.


References

LLM Fine-Tuning:

  • Ouyang, L., et al. (2022). “Training language models to follow instructions with human feedback.” NeurIPS. (InstructGPT)
  • Hu, E. J., et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR.
  • Schulman, J., et al. (2017). “Proximal Policy Optimization Algorithms.” arXiv.

Information Theory:

  • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience.
  • Russo, D., & Van Roy, B. (2014). “Learning to Optimize via Information-Directed Sampling.” Operations Research, 66(1), 230-252.

Inspiration: ThinkingMachines.ai (2025). “LoRA Without Regret.”


Citation

@article{li2025information,
  title   = {Information Bandwidth in Reinforcement Learning},
  author  = {Li, Yingru},
  journal = {Richard Li's Blog},
  year    = {2025},
  url     = {https://richardli.xyz/post/information-bandwidth-rl/}
}