OpenAI Open Source Model Leaked on HF

[–]Affectionate-Cap-600 14 points 4 hours ago*

what is this 'swiglu limit'? I haven't seen it in many configs. (maybe some kind of activation clipping?)
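if it really is activation clipping (pure speculation on my part), a minimal sketch could look like the following; the module name, the clamp placement and the way the limit is applied are all my own guesses, not anything confirmed by the config:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClampedSwiGLU(nn.Module):
    """Hypothetical SwiGLU MLP where pre-activations are clipped to a configurable limit.
    Just a guess at what a 'swiglu_limit' config key might control."""
    def __init__(self, hidden_size: int, intermediate_size: int, limit: float):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.limit = limit  # would presumably come straight from the config

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # speculative part: clip both branches so the activations can't blow up
        gate = self.gate_proj(x).clamp(min=-self.limit, max=self.limit)
        up = self.up_proj(x).clamp(min=-self.limit, max=self.limit)
        return self.down_proj(F.silu(gate) * up)
```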

Also, an initial context length of 4096 is quite bad, even llama 3 started with 8k. And it even has a sliding window (still, I assume only in some of the layers or heads) of 128, so we are at the level of ModernBERT.
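to put that window in perspective, here is a small sketch (my own code, nothing from the config) of the boolean mask a 128-token sliding-window causal attention would use; which layers or heads actually use it is unknown:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int = 128) -> torch.Tensor:
    """True = attention allowed: each position sees itself and the previous window-1 tokens."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
    return (k <= q) & (k > q - window)

mask = sliding_window_causal_mask(4096, window=128)
print(mask.sum(dim=1).max().item())  # every position attends to at most 128 tokens
```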

if this ends up being 'open source SotA', it means they really have some secret sauce in the training pipeline

edit: let's do some fast math... (a tiny script re-checking these numbers follows the list)

  • active MoE MLP parameters: 2,880 × 2,880 × 3 × 4 × 36 = 3,583,180,800 (same range as llama 4 MoE) [edit: I should specify, same range as llama 4's *routed* active MoE MLP parameters, since llama 4 has a lot (relatively speaking) of always-active parameters (it uses a dense layer in every other layer and 2 experts per token, of which one is 'shared', i.e. always active)]

  • total MoE MLP parameters: 2,880 × 2,880 × 3 × 128 × 36 = 114,661,785,600

  • attention parameters: (2,880 × 64 × (64 + 8 + 8) + 2,880 × 64 × 64) × 36 = 955,514,880 (less than 1B?!)

  • embedding layer / LM head: 2,880 × 201,088 = 579,133,440 (×2 if tie embeddings == False)
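the same arithmetic as a tiny Python snippet, for anyone who wants to re-check it (variable names are mine, the values are just the ones quoted above from the leaked config):

```python
hidden, inter, layers = 2880, 2880, 36        # hidden size, per-expert intermediate size, layers
experts_total, experts_active = 128, 4
q_heads, kv_heads, head_dim = 64, 8, 64
vocab = 201_088

mlp_per_expert = hidden * inter * 3                        # gate + up + down projections
active_mlp = mlp_per_expert * experts_active * layers      # 3,583,180,800
total_mlp = mlp_per_expert * experts_total * layers        # 114,661,785,600
attention = (hidden * head_dim * (q_heads + 2 * kv_heads)
             + hidden * head_dim * q_heads) * layers       # 955,514,880
embedding = hidden * vocab                                 # 579,133,440 (×2 if untied)

print(f"{active_mlp:,} | {total_mlp:,} | {attention:,} | {embedding:,}")
```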

Imo there are some possibilities:

  • 0) those configs are wrong;
  • 1) this model has some initial dense layers (like deepseek) or interleaved dense layers (like llama 4), and strangely this is not mentioned in any way in this config; or
  • 2) this is the sparsest MoE I've ever seen, with less modeling capability per forward pass than an 8B model: for context, llama 3.1 8B has 4096 hidden size (vs 2880), 14K intermediate size (vs 2880×4) and 32 layers (vs 36 for this model) (quick side-by-side below)
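to make that 'per forward pass' point concrete, a rough side-by-side of MLP parameters used per token (my own back-of-the-envelope numbers, counting SwiGLU as 3 matrices and taking llama 3.1 8B's public config values; attention and embeddings left out):

```python
llama_mlp_per_token = 4096 * 14336 * 3 * 32     # dense: ~5.64B MLP params used for every token
this_mlp_per_token = 2880 * 2880 * 3 * 4 * 36   # MoE: ~3.58B active MLP params per token
print(f"llama 3.1 8B: {llama_mlp_per_token:,}  vs  this model: {this_mlp_per_token:,}")
```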

I'm aware that those numbers do not tell the whole story, but it is a starting point, and it is everything we have right now.

still, if this model does turn out to be SotA, it will be an incredible achievement for openai, meaning they have something others don't (be it some incredible training pipeline, optimization algorithms, or 'just' incredibly valuable data)

obviously I may be totally wrong here!! I'm just speculating based on those configs.

edit 2: formatting (as a sloppy bullet list) and a clarification
