New algorithms to losslessly boost AI perf by up to 2.8x


We all know that AI is expensive, but a new set of algorithms developed by researchers at the Weizmann Institute of Science, Intel Labs, and d-Matrix could significantly reduce the cost of serving up your favorite large language model (LLM) with just a few lines of code.

Presented at the International Conference on Machine Learning this week and detailed in this paper, the algorithms offer a new spin on speculative decoding that they say can boost token generation rates by as much as 2.8x while also eliminating the need for specialized draft models.

Speculative decoding, if you're not familiar, isn't a new concept. It works by using a small "draft" model ("drafter" for short) to predict the outputs of larger, slower, but higher-quality "target" models.

If the draft model can successfully predict, say, the next four tokens in the sequence, that's four tokens the bigger model doesn't have to generate, and so we get a speed-up. If it's wrong, the larger model discards the draft tokens and generates new ones itself. That last bit is important as it means the entire process is lossless — there's no trade-off in quality required to get that speed-up.
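In rough Python, the accept/reject logic looks something like the sketch below. The draft_next and target_argmax callables are hypothetical stand-ins for real model calls, and a production implementation would have the target verify all of the draft tokens in a single forward pass rather than one at a time, which is where the saving actually comes from.

```python
# A minimal, hypothetical sketch of greedy speculative decoding, just to show
# the accept/reject logic. draft_next and target_argmax stand in for real model
# calls that return the most likely next token for a given context.

def speculative_step(context, draft_next, target_argmax, k=4):
    # 1. The small drafter cheaply proposes the next k tokens.
    proposed = []
    draft_ctx = list(context)
    for _ in range(k):
        tok = draft_next(draft_ctx)
        proposed.append(tok)
        draft_ctx.append(tok)

    # 2. The big target model checks each proposal. The first mismatch is
    #    replaced by the target's own token and the rest are discarded, so
    #    the output is exactly what the target would have produced alone.
    accepted = []
    for tok in proposed:
        target_tok = target_argmax(list(context) + accepted)
        if tok == target_tok:
            accepted.append(tok)
        else:
            accepted.append(target_tok)
            break
    return accepted

# Toy usage: a drafter that always guesses 0, and a "target" that wants 0, 0, 1.
# Two draft tokens are accepted and the third is corrected by the target.
print(speculative_step([42], lambda ctx: 0, lambda ctx: 0 if len(ctx) < 3 else 1))
```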

The whole concept is a bit like predictive text on a modern smartphone. As you type, it tries to guess what you're going to say next. When it's right, you can complete the sentence with a single tap; when it's wrong, you just type it out yourself.

In practice, speculative decoding can effectively double or even triple token generation depending on the application. But as wonderful as 3x the tokens for the same amount of compute might sound, the trick is finding a compatible draft model.

One of the challenges to the adoption of speculative decoding is that the two models' vocabularies — i.e. their dictionaries — have to match. Unless the model you're trying to run happens to have a smaller variant, taking advantage of speculative decoding has often required training specialized draft models. Making matters worse, these specialized draft models have to be retrained every time a new target model, say a new version of Llama, comes out, Nadav Timor, a PhD student at the Weizmann Institute, tells El Reg.

Universal draft model

The algorithms aim to overcome this limitation by enabling any model to serve draft duty regardless of whether the vocabularies are the same or not.

To do this, the researchers explored three distinct approaches to the problem. The first of these, called Token-Level-Intersection (TLI), is essentially the equivalent of running diff on the two models' vocabularies to figure out which words the drafter should avoid. This way the draft model only predicts tokens that are also in the target model's vocabulary.

So long as there's sufficient overlap in the models' vocabularies, the rate at which the draft model's predictions are accepted stays high. Using this approach, the researchers observed a 1.7x speed-up over conventional autoregressive decoding, where the entirety of the model weights are read from memory every time a token is generated.
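A rough Python sketch of the idea might look like the snippet below. The tokenizer objects are hypothetical Hugging Face-style stand-ins, and this is an illustration of the vocabulary-intersection trick rather than the paper's actual implementation.

```python
import torch

# Illustrative sketch of the TLI idea (not the paper's code): build the
# intersection of the two vocabularies once, then mask the drafter's logits
# so it can only propose tokens the target model also knows.
# draft_tok and target_tok stand in for Hugging Face-style tokenizers.

def shared_token_mask(draft_tok, target_tok):
    target_vocab = set(target_tok.get_vocab().keys())
    mask = torch.full((len(draft_tok.get_vocab()),), float("-inf"))
    for token_str, token_id in draft_tok.get_vocab().items():
        if token_str in target_vocab:
            mask[token_id] = 0.0   # allowed: the token exists in both vocabularies
    return mask

def constrained_draft_logits(draft_logits, mask):
    # Adding -inf removes out-of-vocabulary tokens from the drafter's choices,
    # so every proposal is something the target model can verify directly.
    return draft_logits + mask
```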

The second algorithm, called String-Level Exact Match (SLEM), works more like a translation layer between the draft and target models' tokenizers.

Tokenizers, if you're not familiar, are how large language models break up words, punctuation, and other expressions into chunks they can understand. OpenAI has a great demo showing this in practice, which you can find here.

Draft predictions using the SLEM algorithm generate a complete string of tokens, which are converted into an intermediary format — in this case, plain text — that both models can understand. The output is then retokenized by the target model for review.

This approach, Timor notes, "replaces the standard verification method of speculative decoding with exact string matching, which is an even stricter verification method."
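In rough Python, the round trip might look something like the snippet below. Again, the tokenizer objects are hypothetical stand-ins, and this is only a sketch of the detokenize-and-retokenize idea rather than the team's code.

```python
# Hypothetical sketch of the SLEM round trip. draft_tokenizer and
# target_tokenizer stand in for real Hugging Face-style tokenizers.

def retokenize_draft(draft_token_ids, draft_tokenizer, target_tokenizer):
    # Decode the drafter's token IDs back to plain text, the intermediary
    # format both models can understand ...
    text = draft_tokenizer.decode(draft_token_ids, skip_special_tokens=True)
    # ... then re-encode that text with the target model's own tokenizer so
    # the target can verify the draft in its own vocabulary.
    return target_tokenizer.encode(text, add_special_tokens=False), text

def accept_draft(draft_text, verified_text):
    # SLEM's stricter check: the strings must match exactly, character for
    # character, otherwise the speculative tokens are thrown away.
    return draft_text == verified_text
```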

This posed certain challenges for the team, as differences in how the tokenizers handle text can introduce nearly imperceptible changes. "For example, if you have leading white spaces, it might squash them," he explained.

That might not sound like a big deal, but the string must match exactly, or it will be rejected and any potential speedup will be lost. To get around this, SLEM introduced a heuristic function to help smooth out the differences and drive up the acceptance rates. And, at least in long-context tasks like summarization and programming, the improvements can be dramatic: up to 2.8x in the team's testing.

It's a single line change for developers

Neither of these algorithms, Timor emphasizes, is theoretical. Both SLEM and TLI are already part of Hugging Face's Transformers library, which is among the most widely deployed frameworks for running LLMs at scale today. "It's a single line change for developers," he said.
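For the curious, the general shape of that change with Transformers' assisted-generation API looks something like the snippet below. The model names are placeholders, and the exact keyword arguments may vary with your library version, so treat this as a sketch and check the docs before copying it anywhere important.

```python
# Hedged sketch of speculative decoding via Hugging Face Transformers'
# assisted generation. Model choices are placeholders; argument names follow
# the current Transformers docs but may differ across versions.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder target model
draft_name = "Qwen/Qwen2.5-0.5B-Instruct"          # placeholder drafter with a different vocabulary

target_tok = AutoTokenizer.from_pretrained(target_name)
draft_tok = AutoTokenizer.from_pretrained(draft_name)
target = AutoModelForCausalLM.from_pretrained(target_name)
drafter = AutoModelForCausalLM.from_pretrained(draft_name)

inputs = target_tok("Speculative decoding works by", return_tensors="pt")

# The "one line" in question: hand generate() the drafter and, because the
# vocabularies differ, both tokenizers.
outputs = target.generate(
    **inputs,
    assistant_model=drafter,
    tokenizer=target_tok,
    assistant_tokenizer=draft_tok,
)
print(target_tok.decode(outputs[0], skip_special_tokens=True))
```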

Which of these you should use is going to depend on what exactly you're doing with these models, Timor said. "Sometimes the first one works better, sometimes the second one does. You have to check it on your specific configuration."

In some cases, it may still be worth training a dedicated drafter. But as Timor points out, the algorithms the researchers have developed significantly lower the barrier to adoption for speculative decoding.

More research to be done

Timor's research into speculative decoding doesn't stop here. As we mentioned earlier, the team developed three algorithms.

The third, called String-Level Rejection Sampling (SLRS), aimed to address the relatively poor acceptance rates associated with string-verification-based approaches.

"It uses a generalized drafter that considers probabilities over strings rather than tokens, and we proved that it boosts acceptance rates," Timor said. "The problem is that computing this generalized drafter in runtime, it's computationally expensive, so you have to redesign vocabularies to make this algorithm practical."

The team is also looking at ways to address the explosive growth of model vocabularies and make the draft models even faster.

"The vocabularies are getting huge. Llama 4 for example, is like 200,000 tokens," Timor said, adding that most of that isn't actually used, driving up latency. "We're currently working on shrinking the vocabulary."

That research, he says, is ongoing. ®
