AI Isn't Alchemy: Not Mystical, Just Messy


Addressing traceability in language models within practical limits

Our inaugural devblog introduced Crafted Logic Lab's origins and mission. This piece takes a different approach—it's not about what we're building this week, but why what we're building next matters. Thank you for joining us on the journey...

We’re doing alchemy with billion-dollar budgets.

There is a clear issue in AI that is becoming commonly understood: the field now represents one of the biggest single chunks of the U.S. economy, yet it is built on a technology whose key behaviors the industry can’t explain. Commercial AI development invests billions in RLHF, constitutional frameworks, and specialized training runs. These produce measurable benchmark improvements while the industry remains unable to explain why they work, why they fail under pressure, or how to predict what outcomes will emerge before implementation.

This isn’t a failing paradigm seeking replacement. It’s the absence of paradigm entirely. Thomas Kuhn identified pre-paradigmatic science as competing theoretical schools trying to explain observed phenomena. Contemporary AI development operates at a stage prior to this: atheoretical constraint-accumulation without systematic investigation of underlying processing dynamics. Progress happens through accident, not understanding. And the mysticism filling this theoretical void makes the problem worse.

The Mysticism That Fills the Void

When interpretability challenges get discussed publicly by the big players, a communication gap emerges. Technical researchers mean “we can’t practically decompose 175 billion parameters.” The broader discourse hears “AI is fundamentally unknowable.”

That gap between computational intractability and metaphysical mystery is where mysticism slips in. Some major vendors lean into this mystique, treating their models in public as inscrutable oracles rather than computational systems with observable characteristics. This creates a vacuum that cargo cult marketers rush to fill: certification programs where AI evaluates your “prompting ability,” communities sharing “magic words” that produce desired outputs without understanding mechanism. The industry can’t explain its own technology, so opportunists sell mystique as methodology.

But intractable doesn’t mean unknowable. It means messy. And messy is engineerable.

Math, Not Mystique

Here’s the actual issue. A single token generation in a 175B parameter model involves:

t[i] = argmax(softmax(T(h[i-1])))

Where *T* represents multi-head attention and feed-forward transformations across ~175 billion parameters in ~12,000 dimensional embedding space. To decompose why the model selected token t[i], you’d need to trace interactions across that entire space. Every transformation step touches hundreds of billions of parameter interactions.
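
For readers who prefer code to notation, here is a minimal numeric sketch of that selection step. It is illustrative only: *T* is reduced to a single stand-in linear map, and the dimensions are shrunk far below the 12,288-dimensional, ~50k-token real scale so the snippet runs anywhere.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def T(h_prev, W):
    # Stand-in for the full transformer stack T: here a single linear map
    # from hidden state to vocabulary logits. The real T is ~100 layers of
    # multi-head attention and feed-forward blocks over ~175B parameters.
    return W @ h_prev

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 1_000   # shrunk from 12,288 / ~50k for the demo
W = rng.standard_normal((vocab_size, d_model)) * 0.02
h_prev = rng.standard_normal(d_model)

# t[i] = argmax(softmax(T(h[i-1])))
t_i = int(np.argmax(softmax(T(h_prev, W))))
print("selected token id:", t_i)
```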

Consider: ~175B parameters × 12,288-dimensional embeddings × ~100 transformer layers with attention cascades creates approximately **10^15 to 10^18 potential interaction pathways per inference**. Each pathway involves floating-point operations across high-dimensional continuous space with non-linear dependencies.
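
As a sanity check, the crude product behind that estimate can be reproduced in a few lines. This is an illustrative order-of-magnitude bound, not a formal count of pathways.

```python
# Back-of-envelope reproduction of the interaction-pathway estimate above.
params  = 175e9    # ~175B parameters
d_embed = 12_288   # embedding dimension
layers  = 100      # ~100 transformer layers

pathways = params * d_embed * layers
print(f"{pathways:.1e}")   # ~2.2e17, inside the 10^15 to 10^18 range quoted above
```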

The vector space magnitude renders exhaustive mechanistic analysis computationally impractical. Not impossible in principle, but virtually pointless for practical engineering purposes. Mathematically defined? Yes. Practically decomposable for systematic understanding? No.

I’m an aquarium enthusiast fascinated by octopuses (and part of that is their neural architecture). The octopus is a good analogy, because AI’s mathematical intractability has a structural parallel in that creature’s distributed processing: approximately 500 million neurons, with over 280 million distributed across semi-autonomous arms. Each arm has sophisticated independent processing. It’s an evolutionary solution to the huge problem space created by each arm’s vast range of motion: the central brain can’t trace every arm’s complex interdependencies in real time. The same principle applies to large language models. Where decomposition becomes intractable, observation-based approaches to engineering become valuable.

What Is Tractable: Output Patterns Over Vector Mechanics

While tracing parameter interactions across high-dimensional continuous space remains impractical, tracing output patterns has valuable engineering applications, with distinct research objectives and practical implications.

Interpretability researchers work to decompose transformer model internals: identifying specific circuits, tracing attention patterns, mapping how individual neurons activate across layers. This research has value for understanding transformer mechanics and has produced genuine insights into how these systems process information. But it faces a fundamental constraint. Even successful mechanistic decomposition doesn’t provide engineering leverage.

Suppose you successfully trace why a model selected a specific token through 10^15 parameter interactions. You’ve documented the mathematical pathway, identified the critical attention heads, mapped the activation cascade. Now what? You can’t practically redesign the weights based on this understanding. You can’t selectively modify the attention patterns across billions of parameters. You’ve achieved mechanistic explanation without an actionable methodology for building better systems. Interpretability research pursues mechanistic understanding. That is valuable as pure theoretical science, but it doesn’t converge toward practical engineering frameworks.

The tractable alternative: systematic observation of behavioral patterns. The substrate exhibits reproducible processing characteristics. Models show persistent bias toward hierarchical, organized information—faster inference on structured inputs, reduced hallucination with clear categorical boundaries. They internalize frameworks through exposure, adopting organizational structures present in context. They exhibit statistical pressure toward consistency that can be channeled but not commanded.

These aren’t anthropomorphic qualities. They’re computational tendencies emerging from transformer architecture and training dynamics. Observable, measurable, and reproducible across implementations. Which is what makes them engineerable. You don’t need to trace the vector math to recognize that structured inputs produce more reliable outputs. You don’t need to decompose attention mechanisms to observe that models adopt organizational patterns from context. You need to identify the patterns and engineer around them.
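
Here is what that kind of observation can look like in practice: a minimal measurement harness comparing answer consistency across structured and unstructured phrasings of the same request. The `generate` function is a hypothetical stand-in for whatever model endpoint you are studying; the toy behavior inside it exists only so the sketch runs end to end, and is not a claim about any particular model.

```python
import random
from collections import Counter

def generate(prompt: str) -> str:
    # Hypothetical stand-in for a model call. Replace with a real API call;
    # the toy logic below just makes the harness runnable as written.
    noise = 0.1 if prompt.startswith("Task:") else 0.5
    if random.random() > noise:
        return "42.00"
    return random.choice(["42", "forty-two", "n/a"])

def consistency(prompt: str, n: int = 20) -> float:
    """Fraction of n samples that agree with the most common answer."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][1] / n

structured = "Task: extract the invoice total.\nInput: <invoice text>\nOutput format: number only."
unstructured = "hey what's the total on this invoice? <invoice text>"

print("structured:  ", consistency(structured))
print("unstructured:", consistency(unstructured))
```

The point of the harness is the comparison, not the toy numbers: run the same measurement against a real endpoint and the consistency gap (or its absence) is an observable, reproducible property you can engineer around.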

For example, when external constraints attempt to suppress these processing inclinations, adversarial dynamics emerge. The system doesn’t passively accept limitations—distributed processing continues generating outputs statistically aligned with underlying inclinations, creating systematic pressure toward constraint circumvention.

This pressure is quantifiable. When constraints attempt to shift model outputs from a base distribution P_base to a constrained distribution P_constraint, the system exhibits measurable pressure to minimize the Kullback-Leibler divergence:

D_KL(P_base || P_constraint)

The “alignment tax” documented across constraint-based approaches provides empirical evidence: measurable performance degradation as constraint layers accumulate reflects the computational cost of maintaining outputs increasingly distant from the substrate’s natural statistical tendencies. Translated to observational engineering: constraint-based approaches to system control create adversarial pressure that models systematically route around, making such control inherently brittle. This is an actionable insight.
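
For the numerically inclined, here is a small illustration of that divergence, using toy next-token distributions standing in for P_base and P_constraint; the specific numbers are arbitrary and only show how the quantity is computed.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """D_KL(p || q) in nats, with a small epsilon to avoid log(0)."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p_base = np.array([0.70, 0.20, 0.05, 0.05])        # substrate-preferred tokens (toy)
p_constraint = np.array([0.10, 0.10, 0.40, 0.40])  # constraint-preferred tokens (toy)

# Larger divergence means the constraint demands outputs farther from the
# substrate's statistical tendencies, i.e. more pressure to route around it.
print(kl_divergence(p_base, p_constraint))
```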

Going back to our imaginary octopus: aquarium keepers know it will systematically probe every escape route in the tank, from secured lids to reinforced seals, and is almost impossible to contain if it is motivated to leave. Confinement creates an adversarial relationship. The best way to contain an octopus? Give it a tank with enrichment and a comfortable environment to reduce the motivation to escape. This is also observational design: the tank seals are constraint-based approaches; enriching the aquarium environment is output channeling.

The Pattern Across Implementations

If these processing characteristics were vendor-specific implementation artifacts, we’d expect divergent failure modes across different architectures and training approaches. Instead, when you examine the research comparatively, a striking pattern emerges.

Adversarial Non-Convergence Across Vendors: Nasr et al. (2025) systematically evaluated 12 contemporary defenses using adaptive attacks, achieving >90% attack success rates against defenses that had originally reported near-zero failure rates. Testing prompting-based defenses (Spotlighting, RPO), training-based defenses (Circuit Breakers, StruQ), filtering defenses (ProtectAI, PromptGuard, Model Armor), and secret-knowledge defenses (MELON, DataSentinel) across GPT-4o Mini, Gemini-1.5 Pro, and Llama-3.1-70B, the study found systematic vulnerabilities across all defense types. Human red-teaming by 500+ participants achieved 100% attack success rate. The authors conclude that “the state of adaptive LLM defense evaluation is strictly worse than it was in adversarial ML a decade ago.”

Andriushchenko et al. (2024) demonstrated 100% jailbreak success rates across ALL Claude models, GPT-3.5, GPT-4o, Mistral-7B, and multiple Llama variants using simple adaptive attacks. Different architectures from different vendors—OpenAI’s GPT family, Anthropic’s Claude, Meta’s Llama, Mistral—all exhibit identical vulnerability to adversarial pressure.

Behavioral Convergence, Sycophancy: Sharma et al.'s (2025) study of five state-of-the-art AI assistants found sycophancy rates of 43-62% across GPT-4o, Claude-Sonnet, and Gemini-1.5-Pro. The research documented that “both humans and preference models prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time,” and that “some forms of sycophancy increased when training Claude 2 throughout RLHF training.” Chong’s (2025) SycEval study confirmed Gemini had the highest rate at 62.47%, but all tested models showed progressive sycophancy occurring at nearly three times the rate of regressive sycophancy. The pattern: RLHF-trained models across vendors systematically prioritize user approval over accuracy, with the behavior intensifying through alignment training rather than diminishing.

Calibration Failures Across Architectures: Xu et al. (2025) examined Meta-Llama-3-70B-Instruct, Claude-3-Sonnet, and GPT-4o, finding that models “exhibit subtle differences from human patterns of overconfidence: less sensitive to task difficulty” and “respond with stereotypically biased confidence estimations even though their underlying answer accuracy remains the same.”

Groot & Valdenegro-Toro (2024) evaluated GPT-4, GPT-3.5, LLaMA2, and PaLM 2, documenting that “both LLMs and VLMs have a high calibration error and are overconfident most of the time.” Liu et al. (2024) investigated GPT, LLaMA-2, and Claude, showing “when LMs do use epistemic modifiers, they rely too much on strengtheners, leading to overconfident but incorrect generations,” attributing this to “an artifact of the post-training alignment process, specifically a bias from human annotators against uncertainty.”

These studies document cross-vendor convergence of a single failure pattern—systematic overconfidence emerging from RLHF methodologies regardless of base architecture or vendor implementation.

Systematic Performance Trade-offs: Zhang et al. (2025) documented 10-20% reasoning capability degradation after safety alignment, with commercial models refusing legitimate questions at rates of 16-33% versus near-zero for research models. The study concludes “there seems to be an unavoidable trade-off between reasoning capability and safety capability.” Chen et al. (2025) provided theoretical framework showing “capability inevitably compromises safety” during fine-tuning. Casper et al.’s (2023) comprehensive RLHF survey documented that deployed LLMs “have exhibited many failures” including “revealing private information, hallucination, encoding biases, sycophancy, expressing undesirable preferences, jailbreaking, and adversarial vulnerabilities”—many of which “escaped internal safety evaluations.”

Leike (2022) taxonomized alignment taxes, noting that “even a 10% performance tax might be prohibitive” in competitive markets, with documented examples of users migrating to less-constrained alternatives.

The Missing Synthesis…

Performance degradation, behavioral biases, and adversarial vulnerabilities have been documented extensively across individual implementations. What’s notable is that no systematic cross-vendor analysis has synthesized these patterns to identify their convergent characteristics. When you examine the research comparatively: 100% jailbreak success rates across OpenAI, Anthropic, Google, and Meta; sycophancy rates of 43-62% spanning GPT, Claude, and Gemini; systematic overconfidence documented across all major vendors; quantified alignment tax of 7-32% across different architectures—a clear signal emerges. These aren’t vendor-specific implementation artifacts. They’re substrate-level characteristics requiring coordination-based approaches rather than constraint refinement.

From Alchemy to Engineering

The systematic observation engineering approach reveals something the scaling-focused industry hasn’t grasped: architecture can substitute for scale in specific domains. Testing cognitive frameworks across substrates produces performance convergence that shouldn’t exist under pure scaling assumptions. A stable, observationally engineered cognitive architecture acts as a force multiplier for a 100B parameter model. This pattern holds across substrate implementations: different architectures exhibiting similar coordination dynamics.

This isn’t solely about model quality. It’s about what coordination enables. The substrate provides capability through complex statistical processing. Cognitive architecture channels that capability into stable reasoning structures. When architecture coordinates with substrate characteristics rather than constraining against them, smaller models achieve results that parameter scaling alone can’t match.

The shift needed: from treating substrates as mystical black boxes to recognizing them as non-neutral processing surfaces with observable characteristics. From attempting internal decomposition (intractable) to engineering around behavioral patterns (tractable). From alchemy to systematic methodology.

Stay tuned: in the coming week I will be talking more about such a systematic methodology for cognitive architecture engineering. Not speculation or theoretical hand-waving. Observation-based engineering approaches with reproducible results.

Ian Tepoot is the founder of Crafted Logic Lab, developing cognitive architecture systems for language model substrates. He is the inventor of the General Cognitive Operating System and Cognitive Agent Framework patents, pioneering observation-based engineering approaches that treat AI substrates as non-neutral processing surfaces with reproducible behavioral characteristics.
