- October 28, 2025
Recently, DeepSeek released a new model and paper, titled DeepSeek-OCR. I've seen a bunch of people talking about it, and it's pretty cool stuff.
I love the idea of using images to compress text. As Andrej points out, the tokenizer is kind of ugly, and it would be cool to have a more natural system. Intuitively, it makes sense that a more optimal way to parse written letters/characters might involve some amount of visual processing, since that's sort of how humans do it, with many people pattern-matching several words (or more) at a time when reading quickly, instead of parsing text as individual letters/tokens.
What I've been trying to figure out is how surprising this compression rate should be. From their abstract:
Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%.

This sounds pretty good, but there are other ways to compress transformers. DeepSeek-OCR was released using BF16 weights, and since they didn't mention anything about their data precision in the paper, I'm assuming that's what they evaluated on.
On the other hand, low-precision inference via weight and activation quantization has been making substantial progress. I can't find much data on NVFP4, but it seems to promise 4-ish bits per weight and activation (a bit more due to scaling factors) for only slightly reduced accuracy. This means that the 10x-20x advantage claimed by the DeepSeek-OCR paper may just be a 2.5x-5x advantage over a 4-bit quantized model. I would be very curious to see whether the image-token-to-text-token compression would remain as high as it is when this model is quantized to such low precision. Perhaps image tokens are just able to use this "extra space" better than text tokens.
Taking this a step further: if this advantage really does decrease from 10x-20x to 2.5x-5x when the model is quantized to 4 bits, then this might imply that text tokens should really be quantized to between 0.8 and 1.6 bits. It seems like a somewhat pleasing result that this line of reasoning should land in the vicinity of 1-bit quantization.
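To make that arithmetic concrete, here's a minimal sketch. The 16-bit and 4-bit figures and the tokens-times-bits cost model are my own assumptions, not anything from the paper:

```python
# Back-of-the-envelope version of the reasoning above.
# Assumptions (not from the paper): text tokens stored at BF16 (16 bits),
# a quantized baseline at ~4 bits per value (NVFP4-ish, ignoring scaling
# factors), and cost modeled as simply tokens * bits per token.

BF16_BITS = 16
QUANT_BITS = 4

for compression in (10, 20):
    # Vision-token advantage once the text-token baseline is quantized to 4 bits.
    adjusted_advantage = compression * QUANT_BITS / BF16_BITS
    # Bits per text token at which plain text would break even with vision tokens.
    breakeven_bits = BF16_BITS / compression
    print(f"{compression}x at BF16 -> ~{adjusted_advantage:.1f}x vs 4-bit, "
          f"break-even at ~{breakeven_bits:.1f} bits per text token")
```

Running this gives 2.5x and 1.6 bits for the 10x case, and 5x and 0.8 bits for the 20x case, which is where the 0.8-1.6 bit range above comes from.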
On the other hand, it is also interesting to ask which would be better for performance: many 1-bit text tokens or 10-20x fewer 16-bit tokens. Or maybe something in the middle? It seems clear to me that fewer high-precision tokens would be desirable for low-latency decode, given its autoregressive nature, especially given that current GPU hardware targets 16-bit computations. Maybe the tradeoff would be different in other circumstances though.
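As a rough illustration of that decode-time tradeoff, here's a sketch that compares KV-cache footprints under the two (hypothetical) regimes. The layer count, model width, and token counts are made-up placeholders, and it only looks at cache size, not attention compute or what precisions the hardware can actually multiply:

```python
# Compare KV-cache footprints for "many low-precision text tokens" vs
# "10x fewer high-precision vision tokens". All dimensions are placeholders.

def kv_cache_bytes(num_tokens: int, bits_per_value: float,
                   layers: int = 32, d_model: int = 4096) -> float:
    # Keys and values: 2 cached vectors of size d_model per layer per token.
    return 2 * layers * num_tokens * d_model * bits_per_value / 8

text_tokens = 10_000               # hypothetical long context as text tokens
vision_tokens = text_tokens // 10  # same content at 10x image compression

print(f"{kv_cache_bytes(text_tokens, 1) / 1e6:.0f} MB  (1-bit text tokens)")
print(f"{kv_cache_bytes(vision_tokens, 16) / 1e6:.0f} MB  (16-bit vision tokens)")
```

Under these made-up numbers the two caches land within a factor of two of each other at 10x compression, which is the same break-even point the bit arithmetic above suggests; the vision tokens only pull ahead on this metric at the 20x end.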
Anyways, one more disclaimer: this has all been very speculative. I have not carried out any experiments of my own here.