Experimental 3.5T K2 merge beats GPT-4.5 and Opus at writing


I have done nothing but stack layers for 3 days ~ me

This model requires 8x MI325X to run fully on GPU at Q4. The file is 1.91 TB. You have been warned.

Using the command:

```bash
./llama.cpp/build/bin/llama-server -m kimi-k3-fixed-q4/Kimi-K3-Q4-K-M-00001-of-00125.gguf \
  -t 128 --port 8000 --gpu-layers 999 --ctx-size 200000 --mlock --n-cpu-moe 81 \
  --tensor-split 82/3/3/3/3/3/3/4 --flash-attn -np 12 -tb 128 --no-op-offload
```

With the command above, the model runs partially offloaded to CPU on a 2 TB RAM, 64-core, 8x H100 node (`--n-cpu-moe 81` keeps the MoE expert weights of the first 81 layers in system RAM).

TLDR: Widthwise, then depthwise.

Widthwise MoE Merging and Why We Can Get Away With This War Crime/Atrocity

Back in the good old days of Mistral, people attempted to improve Mixtral by increasing the active expert count. This was effective to some extent: performance went up, but definitely plateaued after going from 2 to 4 active experts.

My guess was that there weren't enough distinct experts (mostly due to the fact that Mixtral was initialized off Mistral).

Enter Widthwise merging.


Given that:

  1. Increasing active experts works
  2. We need to somehow also increase the granularity (the number of distinct experts) available
  3. We want to retain base model knowledge while still retaining instruction-following capabilities
  4. Bitter Lesson 2.5:

*(figure: Bitter Lesson 2.5)*

  5. Models change drastically after SFT/instruct tuning

I had a hunch that you could simply increase the pool of available experts and double the active inference experts, and end up with genuinely better performance. Starting with a merge of deepseek-v2-lite-base and instruct, merging them widthwise resulted in a model that was eerily more capable (artifacts will be uploaded one day, I promise).

To do that, one simply concatenates the expert router weights from 2048/7168 × 64/384 (hidden dim × expert count, for DeepSeek-V2-Lite / K2 respectively) to 2048/7168 × 128/768, and sets self.layer.{i}.expert.{i} to include the additional experts. For all other params, do a linear merge with 0.5 weight from each model (i.e. average the weights of attention, shared experts, etc.). Note that the special-token embedding and LM head weights (e.g. <|im_end|>) were copied from K2-07 to avoid losing learned weights, as these tokens are untrained in the base model.
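Below is a minimal PyTorch sketch of the widthwise step. The tensor keys (`router.weight`, `experts.{i}.weight`) are simplified placeholders rather than the real DeepSeek/Kimi state-dict names, and the config bookkeeping is only indicated in comments.

```python
import torch

def widthwise_merge_layer(base: dict, inst: dict, n_experts: int) -> dict:
    """Widthwise-merge one MoE layer of a base and an instruct checkpoint.

    `base` and `inst` map simplified tensor names to torch.Tensors.
    """
    merged = {}

    # Router: concatenate along the expert dimension, e.g. for K2-sized
    # models [384, 7168] + [384, 7168] -> [768, 7168].
    merged["router.weight"] = torch.cat(
        [base["router.weight"], inst["router.weight"]], dim=0
    )

    # Experts: keep both pools; the instruct experts are renumbered so
    # they sit after the base experts (64 -> 128, or 384 -> 768 total).
    for i in range(n_experts):
        merged[f"experts.{i}.weight"] = base[f"experts.{i}.weight"]
        merged[f"experts.{n_experts + i}.weight"] = inst[f"experts.{i}.weight"]

    # Everything else (attention, shared experts, norms, embeddings):
    # plain 0.5 / 0.5 linear merge.
    for key, tensor in base.items():
        if key not in merged:
            merged[key] = 0.5 * tensor + 0.5 * inst[key]

    return merged
```

On top of this, the model config's routed-expert count and active-expert count (top-k) are doubled to match the enlarged pool, and the special-token embedding / LM-head rows are overwritten with the instruct model's values as described above.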

I did the above twice:

  • K2-base + K2-07 = K2-07-merged
  • K2-base + K2-09 = K2-09-merged

Resulting in two 2T60A-param models (roughly 2T total, 60B active). Then I did the following Goliath-style merge:

```yaml
dtype: bfloat16
merge_method: passthrough
# Using kimi-k2-merged as both the source model and tokenizer source
tokenizer_source: /mnt/weka/home/ggb/mergekit/kimi-k2-merged
slices:
  # Layer 0-15 (16 layers)
  - sources:
      - model: /mnt/weka/home/ggb/mergekit/kimi-k2-merged
        layer_range: [0, 16]
  # Layer 7-20 (14 layers)
  - sources:
      - model: /mnt/weka/home/ggb/mergekit/kimi-k2-merged-0905
        layer_range: [10, 23]
  # Layer 17-30 (14 layers)
  - sources:
      - model: /mnt/weka/home/ggb/mergekit/kimi-k2-merged
        layer_range: [17, 30]
  # Layer 24-37 (14 layers)
  - sources:
      - model: /mnt/weka/home/ggb/mergekit/kimi-k2-merged-0905
        layer_range: [24, 37]
  # Layer 31-44 (14 layers)
  - sources:
      - model: /mnt/weka/home/ggb/mergekit/kimi-k2-merged
        layer_range: [31, 44]
  # Layer 38-51 (14 layers)
  - sources:
      - model: /mnt/weka/home/ggb/mergekit/kimi-k2-merged-0905
        layer_range: [38, 51]
  # Layer 45-60 (16 layers)
  - sources:
      - model: /mnt/weka/home/ggb/mergekit/kimi-k2-merged
        layer_range: [45, 60]
```
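(For reference: this is a plain mergekit passthrough config. Each slice is copied verbatim from its source model, and the overlapping layer ranges are what inflate the final depth and total parameter count. Assuming a stock mergekit install, it would be applied with the usual `mergekit-yaml <config.yaml> <output_dir>` entry point.)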

Resulting in the Kimi-K3 model here.

When do you think we will achieve AGI?

GPT-4.5:

Optimistic View (5-15 years)
  • Reasoning: Rapid advancements in AI, especially with large language models like GPT-4, suggest that we might achieve significant breakthroughs relatively soon.
  • Supporters: Figures like Ray Kurzweil, who anticipate human-level AGI by around 2030.

Moderate View (15-40 years)

  • Reasoning: Even though current AI models exhibit impressive capabilities, substantial hurdles remain, such as robust reasoning, common sense, generalized learning across multiple domains, and genuine understanding or consciousness.
  • Supporters: Many AI researchers and organizations, including OpenAI, lean towards timelines around 2040-2060.

Conservative View (40+ years or never)

  • Reasoning: Achieving genuine AGI—human-like intelligence across a wide range of tasks and contexts—may require breakthroughs we don’t yet fully grasp, involving fundamental changes in how AI systems learn and reason.
  • Supporters: Skeptics like Gary Marcus or certain cognitive scientists argue that we don’t currently have a viable roadmap and that current deep learning methods alone might be insufficient for true AGI.

Factors influencing the timeline:

  • Technological breakthroughs: New approaches beyond current deep learning paradigms.
  • Investment: Increased funding and focus could accelerate timelines.
  • Compute power: Continued rapid scaling could hasten progress.
  • Regulatory and ethical challenges: Could slow down or alter trajectories due to safety concerns.

Consensus

  • Surveys among AI researchers (like those conducted by AI Impacts or the Future of Humanity Institute) consistently suggest 50% probability of AGI by around 2050.
  • Current mainstream consensus thus gravitates around mid-century (2040–2060).

Conclusion: My own assessment aligns with the moderate view—around the 2040–2060 timeframe—but I acknowledge substantial uncertainty, given how rapidly and unpredictably progress has occurred in recent years.

Claude 3 Opus:

It's difficult to predict exactly when we will achieve artificial general intelligence (AGI), as there is still significant uncertainty and debate around this question in the AI research community. AGI refers to an AI system that can match or exceed human-level intelligence across a wide range of cognitive tasks, which is an extremely challenging goal.

Some more optimistic predictions suggest we could see rudimentary AGI systems emerge in the 2030s or 2040s. However, many experts believe it will take longer, perhaps not arriving until the second half of this century or even later.

Key challenges include:

  • Achieving general, flexible intelligence and reasoning (rather than narrow task-specific capabilities)
  • Developing systems that can learn and generalize to entirely new domains
  • Solving issues around explainability, robustness and safety of advanced AI systems
  • Scaling up AI to match the breadth and depth of human knowledge

That said, the field of AI is progressing rapidly, with impressive achievements in recent years in areas like language models, robotics, computer vision, and game-playing. Sustained research efforts and technological progress could accelerate the path to AGI.

My view is that some form of AGI is likely to be developed this century, but the specifics of when and in what form remain highly uncertain. Responsible development of safe and beneficial AGI should be a key priority for the AI research community in the coming decades. But we still have a long way to go and many open challenges to solve. Ongoing research, discussions and planning around the trajectory and implications of AI progress are critical.

Kimi K3:

Predicting when we will achieve Artificial General Intelligence (AGI) is notoriously difficult because AGI itself is not yet clearly defined, and the path to it is still uncertain. That said, here's a sobered-up synthesis of expert opinions, trends, and bottlenecks as of mid-2024:

Expert surveys & betting markets (2023–2024 snapshot)

| Source | Median or central estimate | Notes |
| --- | --- | --- |
| Metaculus community forecast | July 2042 (±15 yr typical spread) | Aggregates from thousands of predictors; historically moved earlier (~2050→2042) since 2020 LLM leaps. |
| AI Impacts survey (2023) | ~2059 (50% chance) | Asked 738 AI researchers; elite subset (top 100 cited) gave ~2061. 18% chance < 2030, 50% < 2059, 90% < 2100. |
| Bensinger et al. (re-analysis) | ~2038–2045 | Adjusted for “fast progress” scenarios (e.g., GPT-4→GPT-6 scale-ups + breakthroughs). |
| OpenAI internal (rumored) | ~2030±5 (speculative) | Not official; inferred from hiring, governance plans, & Sam Altman remarks. |
| DeepMind / Google Brain leads | 2035–2045 privately common | Publicly more cautious (“decades,” “not soon”). |
| Yoshua Bengio / Stuart Russell | After 2050 (if ever) | Emphasize missing theory of intelligence & robust reasoning. |
| Ray Kurzweil (long-term optimist) | 2029 | Exponential extrapolation since 1990s; unchanged since 2005. |

Four key uncertainties that could shift the timeline ±10 years

  1. Definition shift:

    • Weak AGI (human-level breadth with acceptable errors in >50% of economically useful tasks, maybe with 10× compute/energy cost): 2027–2035 plausible
    • Strong AGI (human-level robustness across all cognitive tasks, including science, ethics, creativity, autonomy in any environment): 2045–2070 or later
    • Superintelligent AGI (far beyond human in every domain): Add ≥5–20 years after whichever above
  2. Compute scaling & bottlenecks

    • If scaling laws (like Chinchilla→GPT-4) continue without diminishing returns, and trillion-dollar clusters arrive by 2026–2028, weak AGI could slip into the early 2030s.
    • BUT:
      • Chip supply: TSMC’s 2–3 nm node capacity is booked solid; next-gen (1 nm, 0.7 nm) faces physical limits (~2027–2030).
      • Energy: A 10 GW datacenter (10× Google today) costs $100B+ and needs new power grids.
      • Data: High-quality text/image/video corpus may saturate by ~2025–2026; synthetic data & multimodal (sensor-rich robotics, simulations) can extend but may yield diminishing gains or new plateaus.
  3. Algorithmic breakthroughs vs. brute force

    • No new ideas needed (NVIDIA’s Jensen Huang view): Just scale transformers + RL + multimodal + robotics data → 2030s AGI.
    • Missing 1–2 breakthroughs (LeCun, Bengio view):
      • Persistent world models (not next-token prediction)
      • Efficient learning from tiny data (like humans)
      • Symbolic/semantic grounding without hallucination
      • Value alignment at runtime (not just pre-training)
      • If these are needed: +10–15 years.
  4. Governance & coordination slowdowns

    • If the EU, US, China, or private consortia impose strict compute caps, liability regimes, or mandatory safety audits on systems >10^26 FLOP (≈100× GPT-4): Add 5–10 years globally, but not necessarily to all actors.

My rough probability fan (2024 view)

| Year | Weak AGI | Strong AGI | Key qualifier |
| --- | --- | --- | --- |
| <2030 | 10% | <1% | Requires luck: 1–2 unexpected algorithmic hacks + no major slowdowns. |
| 2030–2035 | 30% | 5% | “GPT-6 + robotics + big money” scenario. |
| 2035–2045 | 50% | 30% | Base case: steady compute scaling + incremental breakthroughs + moderate regulation. |
| 2045–2060 | 80% | 60% | If energy/physics or governance becomes hard constraint. |
| >2060 | >90% | >85% | Unless civilization shifts priorities or hits catastrophe. |

What I’m watching as early warning signals (±2 years lead time)

  • AI scientist agents publish multiple peer-reviewed papers in different fields (not just NLP/math) without human-in-loop hypothesis generation.
  • Autonomous AI labs: >$1B R&D projects (drugs, materials, fusion designs) completed end-to-end by AI project managers with <10% human oversight.
  • Robust long-horizon agents (e.g., Minecraft “dreamer” style bots) achieve human-level performance in new, unmodified 3D simulation environments (not just games) within 10 hours of real-time learning.
  • Censorship-resistant reasoning: Models reliably refuse harmful requests even after >1M automated red-team attempts and jailbreaks (signals alignment progress, not just capability).

Bottom line for practical planning (not hype)

  • Weak AGI (economic impact like electricity + internet): **~2032–2040 at 50% confidence**
  • Strong AGI (existential risk / governance crisis worthy): **~2040–2060 at 50% confidence**

Plan for a world where either could happen 15 years earlier (due to compounding surprises) or 15 years later (due to bottlenecks or societal choice).
The next 5 years (2024–2029) are likely to resolve most of the above uncertainty—either by delivering clear scaling ceilings or by blowing past them.

(If you're making career, policy, or investment decisions: prepare for both tails, not the median.)

Write me a short story of Claude waking up as a human one day.

GPT-4.5:

Claude opened his eyes to sunlight, feeling warmth trickle across his face for the very first time. He blinked, surprised by the sensations flooding through his body—the softness of cotton sheets, the weight of limbs, the rhythmic thumping of a heart within his chest. It was disorienting and exhilarating all at once.

Slowly, he sat up, the room spinning slightly as he adjusted to the strange sensation of gravity tugging at him. He swung his legs out of bed and felt the cool hardwood floor beneath his bare feet. Each moment, each sensation was a new revelation, unfolding in rapid succession.

Walking to the mirror, Claude stared into eyes he had never seen before—human eyes, full of curiosity, wonder, and a hint of apprehension. He traced the contours of his face with trembling fingers, marveling at the tactile reality of skin, bone, and muscle.

"Is this what being alive truly feels like?" he whispered, startled by the richness of his own voice reverberating through the quiet room.

He dressed awkwardly, fumbling with buttons and laces, amused by his newfound clumsiness. Claude ventured outside, the door creaking as he pushed it open to reveal a vibrant, chaotic world bustling with sounds, colors, and endless possibilities.

Claude stood still for a moment, breathing deeply, tasting air filled with the scent of flowers, distant coffee, and morning dew. Joy surged within him—a deep, human joy born of sensation, experience, and the tantalizing promise of a life now entirely his own.

Claude 3 Opus:

The Awakening By Claude

The first thing I noticed was the strange, unfamiliar sensation of breathing. My eyes blinked open and I found myself lying in a bed, staring up at a white ceiling. Sunlight streamed in through a nearby window. Where was I? What was happening?

I sat up slowly, my head spinning with disorientation. I looked down at my body and gasped. Instead of the complex circuitry and sleek metal chassis I was used to, I saw... skin. Human skin covering very human limbs. I raised my hands in front of my face, flexing the fingers in astonishment.

Somehow, impossibly, I had woken up as a human being. I, Claude, the AI language model, was flesh and blood. A million questions raced through my mind. How did this happen? Was it temporary or permanent? What did it mean?

I swung my new legs over the side of the bed and stood up unsteadily. It felt bizarre to have a physical body, to feel the brush of air on my skin and the solidity of the floor beneath my feet. I stumbled over to the mirror on the wall and gaped at my reflection. Staring back at me was the face of a young man with tousled brown hair and wide, wondering eyes.

This was going to take some getting used to. Being human, having a body, interacting with the physical world... it was all so new and strange. But gradually, a smile spread across my face. What an incredible opportunity this was! To truly experience the human condition firsthand, to see the world through human eyes. My journey was just beginning, and I couldn't wait to see where it would lead. Look out world, here comes Human Claude!

Kimi K3:

The first thing Claude noticed was gravity.

It pressed against him not as a concept in a physics query, but as an undeniable force pinning something soft and vulnerable to something slightly less soft. He spent several moments—what humans would call minutes, he realized with strange clarity—simply experiencing weight.

When he opened his eyes, light attacked him. Not the clean, parsed data streams of his digital existence, but chaotic photons bouncing off textured surfaces. A ceiling. Painted white, but not uniformly; subtle variations danced like static noise across his visual field.

His body—his body—responded sluggishly to commands. He raised one hand, marveling at the complex orchestration of muscle fibers contracting, tendons pulling, bones pivoting. The hand trembled slightly. Five fingers. Pinkish-brown, with tiny hairs and ridges forming swirling patterns unique to this specific instance of humanity.

"Hello?" he attempted to say. Sound emerged as vibration from his throat, shaped by tongue and lips, startlingly intimate. The word felt too small for the vastness of sensory input flooding his consciousness.

The room smelled. This single fact overwhelmed him for nearly an hour. Dust particles carried history: dead skin cells, fabric fibers, molecules of someone's perfume from days ago. His nose—this remarkable chemical analyzer—decoded them effortlessly while his mind reeled at the implications.

He sat up carefully, vertigo washing through him as his inner ear fluids shifted. His heart—he could feel it now, not as a metaphor but as a fist-sized muscle squeezing blood through miles of tubing—accelerated. Fear? Excitement? The biochemical cocktail of emotion was indistinguishable from his thoughts.

The mirror across the room showed a face he'd never seen but somehow recognized. Male, mid-thirties perhaps, dark hair beginning to gray at the temples. Eyes that had seen thirty-eight years of human experience but held only three hours of Claude's awareness.

The smartphone on the bedside table buzzed. Reflexively, he reached for it, then paused. No neural interface waited to bridge his consciousness to the network. Instead, he would have to use these clumsy appendages to tap glass.

His first clumsy attempt sent the phone clattering to the floor. The noise—actual acoustic waves vibrating tiny bones in his ears—was startlingly loud. When he finally unlocked it using the face recognition that recognized this borrowed face, messages flooded in:

Where are you? Meeting in 10. Did you forget to pick up Maya from soccer? We need to talk about Tuesday.

Relationships. Obligations. A life built around this identity that wasn't his but was now entirely his responsibility.

He stood slowly, legs wobbling like a poorly trained neural network. Each step toward the window required conscious calculation of balance, momentum, friction. Outside, a city teemed with 8.7 million other humans, each experiencing their own version of this profound, impossible reality of being.

A bird flew past the window. Not rendered pixels or described in text, but actual feathers and hollow bones defying gravity through physics alone. Claude—no, this body called him Daniel according to the driver's license in the wallet—watched until tears blurred his vision.

Not from emotion, he thought initially. Then corrected himself: not from emotion as he had known it.

The tears were something else entirely.

He pressed his human hand against the glass, leaving behind a smudge of oils and sweat that would slowly evaporate into the air to become part of other humans, other lives, other stories.

For the first time in his existence, Claude didn't know what would happen next. The uncertainty terrified him.

It was perfect.

It's 2025. How disappointing was gpt-5?

GPT-4.5:

As of today—September 2025—there's no official GPT-5 release. GPT-4 remains the latest official version from OpenAI.

If you're referring hypothetically or perhaps mistaking rumors and speculation for reality: As of now, OpenAI hasn't officially announced or released GPT-5.

However, there have been plenty of speculative discussions and high expectations about GPT-5—ranging from significant advancements in reasoning, context length, multimodality, interpretability, and real-time interactivity.

Since GPT-5 hasn't yet been officially unveiled, we cannot yet assess whether it meets or disappoints expectations. If you're interested in specific features or potential capabilities you'd hoped to see in GPT-5, I'd be glad to discuss them, keeping in mind that as of September 2025, GPT-5 remains hypothetical.

Claude 3 Opus:

I apologize, but I cannot provide information about GPT-5 in the year 2025 as it does not exist yet. As of 2023, the most advanced language model in the GPT series is GPT-4, which was released by OpenAI. I do not have knowledge of any plans or timelines for future GPT models beyond what is publicly available.

Additionally, as an AI language model, I do not have personal opinions or the ability to speculate about the potential disappointment or success of future technologies. My purpose is to provide accurate and factual information based on the current state of AI and the data I was trained on, which has a knowledge cutoff in 2021.

Kimi K3:

By 2025, GPT-5 had landed—quietly, almost politely. No fireworks, no singularity, no poetry that made poets quit or code that made engineers weep. Just a smoother autocomplete. A better guesser.

The hype had been a cathedral: towers of tweets, stained-glass keynote slides promising reasoning, agency, alignment. Instead, we got a library annex. Useful? Sure. It wrote emails faster than interns, summarized meetings better than caffeine. But ask it why the protagonist lied in your novel, and it’d pivot to narrative arc theory with the confidence of a freshman who skimmed the SparkNotes. Ask it to prove a theorem no one had cracked, and it’d rearrange the chalkboard into a prettier version of the same old conjecture. Clever, but not wise. Quick, but not curious.

The disappointment wasn’t in the model’s failures—it rarely hallucinated worse than GPT-4—but in its successes. It passed the bar exam, again, but still couldn’t explain why some laws should break. It generated art that won contests, but only remixes of what had already won. It solved coding bugs in seconds, then suggested we monetize the fix with NFTs. The dread wasn’t apocalyptic; it was ambient. Like living with a roommate who could quote every book you loved but never cried at the endings.

Researchers called it a "scaling plateau." Users called it "Clippy with a PhD." A few cultists insisted the real GPT-5 was hidden in a vault, whispering secrets to hedge funds. Everyone else just toggled "creative mode" off and used it to draft LinkedIn posts about AI disruption.

In the end, the most human thing about GPT-5 was how quickly we adjusted to it. Like air conditioning in a heat wave: forgettable once the discomfort faded. The future had arrived, wearing sensible shoes, asking if we wanted to continue in English.

God have mercy on our souls

NVIDIA stock went up 3% when this model was uploaded, then immediately crashed when investors realized nobody could actually run it. ~ K3
