Document Version: 3.0
Core Concept: A cognitive learning framework that turns fixed hyperparameters (such as learning rate and model capacity) into dynamic policies driven in real time by the intrinsic "Surprise" of the data. It is, in essence, an adaptive hyperparameter scheduling algorithm that lets a model decide autonomously "how much to learn" and "with how much capacity to learn" according to the value of what it is learning. The framework originates from the Integrated Predictive Workspace Theory (IPWT); further details are available in the paper at https://github.com/dmf-archive/IPWT.
Traditional training paradigms rely on manually set hyperparameters that are typically fixed or decay according to a predetermined schedule throughout the training process. This "one-size-fits-all" approach ignores the vast differences in learning value contained in different data batches.
PILF's design philosophy is: to replace static, human-set rules with dynamic, data-driven policies.
It no longer blindly uses a fixed learning rate or model capacity. Instead, it dynamically and proportionally adjusts its learning behavior by assessing the Surprise from each data batch:
- Dynamic Learning Rate: When Surprise is moderate, it signals valuable "learnable zone" information, and the system assigns a higher learning rate. When Surprise is too low (redundant information) or too high (anomalous information), it assigns a learning rate close to zero, naturally achieving "ignore" and "reject" effects. This directly replaces manually set learning rate schedulers.
- Dynamic Capacity: In a Mixture-of-Experts (MoE) architecture, Surprise not only adjusts the learning rate but also determines the number of "experts" k to activate. Simple tasks (low Surprise) require only a few experts, while complex tasks (high Surprise) dynamically engage more experts. This replaces fixed Top-K routing (both policies are sketched after this list).
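As a concrete illustration, here is a minimal Python sketch of both policies. The learning-rate modulation follows the Gaussian formula PILR-S uses below; the capacity mapping, its function names, and its thresholds (k_min, k_max) are illustrative assumptions, not the reference implementation:

```python
import math

def lr_policy(surprise: float, mu: float, sigma: float) -> float:
    """f(Surprise): bell-shaped learning-rate modifier in (0, 1].

    Surprise near the running mean mu -> factor near 1 (learn);
    far below or above mu -> factor near 0 (ignore / reject).
    """
    sigma = max(sigma, 1e-8)  # guard against a degenerate std early in training
    return math.exp(-0.5 * ((surprise - mu) / sigma) ** 2)

def capacity_policy(surprise: float, mu: float, sigma: float,
                    k_min: int = 1, k_max: int = 8) -> int:
    """g(Surprise): number of experts to activate (illustrative mapping).

    Low Surprise keeps k small; higher Surprise engages more experts,
    capped at k_max.
    """
    excess = max(0.0, (surprise - mu) / max(sigma, 1e-8))
    return min(k_max, k_min + int(excess))
```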
PILR-S is the direct application of the PILF idea to any standard neural network. It focuses on one question: how can the learning rate be dynamically adjusted based on Surprise? A preliminary model zoo and an implementation of the predictive integrity calculation toolkit can be found at SigmaPI.
It replaces the traditional "gating" logic of whether to execute optimizer.step() with a smooth, continuous learning rate modulator.
Mechanism Explained:
- Surprise Calculation: After a cheap feedforward pass, the ΣPI monitor calculates the Surprise for the current batch. This step does not need to wait for expensive backpropagation, enabling a rapid assessment of learning value.
- Dynamic Modulation: The PILR-S module receives the Surprise and computes a smooth modulation factor lr_modifier (ranging from 0 to 1) via the Gaussian function exp(-0.5 * ((surprise - mu) / sigma)^2), where mu and sigma are the Exponential Moving Average (EMA) and standard deviation (std) of recent Surprise values.
- Weight Update: The standard loss.backward() runs only after lr_modifier has been calculated. The optimizer then uses effective_lr = base_lr * lr_modifier to perform the weight update. optimizer.step() is always executed, but its update magnitude has been preemptively and dynamically scaled by Surprise (see the training-step sketch below).
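Taken together, one possible PyTorch-style training step looks like the following sketch. The running EMA/std tracking and the use of the batch loss as a Surprise proxy are simplifying assumptions for illustration; in practice the Surprise comes from the ΣPI monitor in SigmaPI:

```python
import math
import torch

class PILRScheduler:
    """Tracks an EMA and std of Surprise and maps each new value to a
    smooth lr_modifier in (0, 1] via the Gaussian described above."""

    def __init__(self, momentum: float = 0.99):
        self.momentum = momentum
        self.mu = None    # EMA of Surprise
        self.var = 0.0    # EMA of squared deviation from the mean

    def __call__(self, surprise: float) -> float:
        if self.mu is None:          # first batch: no history yet, learn normally
            self.mu = surprise
            return 1.0
        m = self.momentum
        self.mu = m * self.mu + (1 - m) * surprise
        self.var = m * self.var + (1 - m) * (surprise - self.mu) ** 2
        sigma = max(math.sqrt(self.var), 1e-8)
        return math.exp(-0.5 * ((surprise - self.mu) / sigma) ** 2)

def pilr_s_step(model, batch, targets, optimizer, criterion, scheduler, base_lr):
    # 1. Surprise calculation: cheap forward pass, before any backprop.
    outputs = model(batch)
    loss = criterion(outputs, targets)
    surprise = loss.item()                     # assumed proxy for Surprise

    # 2. Dynamic modulation: smooth factor in (0, 1].
    lr_modifier = scheduler(surprise)

    # 3. Weight update: step always runs, but its magnitude is scaled.
    for group in optimizer.param_groups:
        group["lr"] = base_lr * lr_modifier
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), lr_modifier
```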
PILF is the full implementation on an MoE architecture, extending the dynamic scheduling concept to model capacity allocation.
Training Loop Explained:
- Dual Dynamic Decision: The model receives data and calculates an initial Surprise. Based on this Surprise, PILF makes two decisions in parallel:
  - Capacity Decision: k = g(Surprise), determining how many experts to activate.
  - Learning Rate Decision: lr_modifier = f(Surprise), determining the learning intensity.
- Dynamic Routing and Computation: The gating network routes the task to the most appropriate experts based on the k value.
- Dynamic Weight Update: After the loss and gradients are computed, the optimizer uses the effective learning rate modulated by lr_modifier to update only the activated experts and the gating network (a schematic of this loop follows).
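A schematic of this loop, assuming an MoE model whose forward pass accepts the number of active experts k, and reusing the capacity_policy and PILRScheduler sketched above. The gradient-free probe pass used to obtain the initial Surprise is likewise an assumption for illustration:

```python
import math
import torch

def pilf_step(moe_model, batch, targets, optimizer, criterion,
              scheduler, base_lr, k_max: int = 8):
    # Initial Surprise from a cheap, gradient-free probe pass (assumed proxy).
    with torch.no_grad():
        probe_loss = criterion(moe_model(batch, k=1), targets).item()

    # Dual dynamic decision, made in parallel from the same Surprise.
    mu = scheduler.mu if scheduler.mu is not None else probe_loss
    sigma = math.sqrt(scheduler.var) if scheduler.var > 0 else 1.0
    k = capacity_policy(probe_loss, mu, sigma, k_max=k_max)   # k = g(Surprise)
    lr_modifier = scheduler(probe_loss)                       # f(Surprise)

    # Dynamic routing and computation: the gating network uses only k experts.
    outputs = moe_model(batch, k=k)
    loss = criterion(outputs, targets)

    # Dynamic weight update: gradients exist only for the activated experts
    # and the gate, so only those parameters are changed by the step.
    for group in optimizer.param_groups:
        group["lr"] = base_lr * lr_modifier
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), k, lr_modifier
```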
- Transforms Hyperparameters into Policies: Converts learning rate and model capacity from developer-set "static hyperparameters" into "dynamic policies" that the model adjusts autonomously based on data value.
- Unifies "Learning" and "Forgetting": By linking the learning rate to Surprise, PILF provides a unified framework to handle learning, ignoring (low Surprise leads to low lr), and rejecting (high Surprise leads to low lr), thereby intrinsically mitigating catastrophic forgetting.
- On-Demand Resource Allocation: PILF achieves true on-demand computation, where simple tasks consume minimal resources and complex tasks dynamically call upon more, significantly improving efficiency.
Stage 1: PILR-S (Dynamic Learning Rate)
- Goal: Replace fixed learning rate schedulers with a Surprise-driven dynamic learning rate on any standard model.
- Core Mechanism: effective_lr = base_lr * f(Surprise).
- Advantage: No need to modify the model architecture; it can be dropped into existing training workflows to quickly validate the effectiveness of the dynamic policy (see the usage sketch below).
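For instance, assuming the PILRScheduler and pilr_s_step sketched earlier, the only change to an existing loop is the per-batch learning-rate rescaling:

```python
scheduler = PILRScheduler()
base_lr = 1e-3

for batch, targets in dataloader:   # existing data pipeline and model are unchanged
    loss, lr_modifier = pilr_s_step(model, batch, targets,
                                    optimizer, criterion, scheduler, base_lr)
```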
Stage 2: PILF (Dynamic Learning Rate + Dynamic Capacity)
- Goal: Implement a fully adaptive cognitive system on an MoE architecture.
- Core Mechanism: k = g(Surprise) and effective_lr = base_lr * f(Surprise) operate in parallel.
- Advantage: Maximizes computational efficiency and model capacity scalability.
This document is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


