ROSA+: RWKV's ROSA implementation with fallback statistical predictor
ROSA+ is an extension of ROSA, the statistical next-token predictor proposed by BlinkDL as part of his work on extending the RWKV language model. It provides an intuitive Python interface as well as a fallback Witten–Bell predictor for unknown sequences.
The implementation is self-contained in rosaplus.py. You can download the repository and use it from there.
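As a rough sketch of the intended workflow (the class and method names below -- RosaPlus, train, generate -- are illustrative assumptions, not the documented interface; consult rosaplus.py for the real API):

```python
# Hypothetical usage sketch -- RosaPlus, train() and generate() are assumed
# names; check rosaplus.py for the actual interface.
from rosaplus import RosaPlus

with open("data.txt", encoding="utf-8") as f:
    text = f.read()

model = RosaPlus()   # character-level by default
model.train(text)    # build the ROSA statistics from the training text

# With enough matching context, ROSA reproduces the training data verbatim.
print(model.generate("The quick brown", max_tokens=200))
```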
Output: (verbatim)
ROSA+ can also be used to generate novel sequences that do not show up in the training dataset. You can enable this by always using the fallback predictor. It often leads to coherent, surprising results.
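Roughly (again, the names are assumptions; always_fallback is the option referenced later in this README, though whether it is passed to the constructor or the generate call is a guess):

```python
# Forcing the Witten-Bell fallback for every prediction yields novel text.
# RosaPlus/generate are assumed names; always_fallback is described below.
from rosaplus import RosaPlus

model = RosaPlus()
model.train(open("data.txt", encoding="utf-8").read())

print(model.generate("The quick brown", max_tokens=200, always_fallback=True))
```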
Output: (novel)
As you can see, this arrangement of sentences does not show up in the dataset (try CTRL+F). Rather, ROSA+ intelligently splices together features pulled from ROSA to perform next-character prediction.
For any given prefix, you can also get the probability distribution for the next token:
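A sketch of what that might look like (next_distribution is an assumed method name and the prefix is illustrative):

```python
# Hypothetical sketch -- next_distribution() is an assumed method name.
from rosaplus import RosaPlus

model = RosaPlus()
model.train(open("data.txt", encoding="utf-8").read())

# Illustrative prefix: after a 'q', 'u' should dominate the distribution.
dist = model.next_distribution("The q")   # dict mapping next character -> probability
for char, p in sorted(dist.items(), key=lambda kv: -kv[1])[:5]:
    print(repr(char), round(p, 4))
```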
Output:
As you can see, ROSA+ is extremely confident that 'u' is the next token (and it is correct!)
This is a standalone implementation of ROSA and does not provide RWKV integration. You will have to ask in the RWKV Discord or contact the main maintainer (BlinkDL) for assistance with that.
ROSA+ extends ROSA by:
- Allowing training and sampling on individual sequences, similar to an LLM
- Utilizing a (coherent) fallback Witten–Bell-based predictor for when ROSA is unsure of the next token (a sketch of Witten–Bell smoothing follows below).
This makes it extremely fast, since ROSA is used for 99% of the predictions and the fallback only occurs for novel sequences.
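For context, Witten–Bell smoothing interpolates the maximum-likelihood estimate for a history with the estimate for the shortened history, weighted by how many distinct continuations that history has been seen with. Below is a minimal, self-contained sketch of the estimator over character n-grams -- an illustration of the technique, not the actual code in rosaplus.py:

```python
# Minimal Witten-Bell smoothing over character n-grams (illustrative only).
from collections import Counter, defaultdict

class WittenBell:
    def __init__(self, order=4):
        self.order = order
        self.counts = defaultdict(Counter)   # history -> Counter of next chars
        self.vocab = set()

    def train(self, text):
        self.vocab.update(text)
        for n in range(self.order + 1):
            for i in range(n, len(text)):
                self.counts[text[i - n:i]][text[i]] += 1

    def prob(self, history, char):
        history = history[-self.order:]
        p = 1.0 / max(len(self.vocab), 1)    # base case: uniform over the vocabulary
        for k in range(len(history) + 1):    # interpolate from shortest history up
            h = history[len(history) - k:]
            c = self.counts.get(h)
            if not c:
                continue
            total = sum(c.values())
            types = len(c)                    # distinct continuations T(h)
            lam = total / (total + types)     # Witten-Bell interpolation weight
            p = lam * (c[char] / total) + (1 - lam) * p
        return p

wb = WittenBell(order=4)
wb.train("the quick brown fox jumps over the lazy dog")
print(wb.prob("the quic", "k"))   # high: 'quic' has only ever been followed by 'k'
```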
Tokenization: The default tokenization is character-based (support for other tokenizers is coming soon).
If you install orjson, ROSA+ will use it automatically, leading to far faster import/export speeds. Docs coming soon.
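The typical optional-dependency pattern looks something like this (a generic sketch of preferring orjson when present, not necessarily the exact code in rosaplus.py):

```python
# Generic optional-acceleration pattern: prefer orjson when installed,
# otherwise fall back to the standard-library json module.
try:
    import orjson

    def dumps(obj) -> bytes:
        return orjson.dumps(obj)

    def loads(data):
        return orjson.loads(data)

except ImportError:
    import json

    def dumps(obj) -> bytes:
        return json.dumps(obj).encode("utf-8")

    def loads(data):
        return json.loads(data)
```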
ROSA+ is entirely statistics-based -- it extends the ROSA predictor proposed by BlinkDL and adds a probability predictor as a fallback. However, this means it only has a database-like understanding of text -- it can stitch together multiple sentences and demonstrate grammar, but it lacks the contextual understanding of an NN (RWKV, a Transformer, etc.).
For instance, when trained on Shakespeare with always_fallback=True (forcing novel predictions), it generates text that "looks right" but switches between characters every stanza.
A ChatGPT analysis of ROSA+'s lines uncovers some insight:
A true NN-based model would outperform a standalone ROSA+ implementation because of its understanding of actual context. While ROSA+ has an impressive surface-level understanding, it lacks the deeper, underlying meaning expressed by NNs.
You can view all the samples in the samples directory -- interestingly, in sample_default_output.txt, the model falls into an attractor state halfway through, repeating itself every ~3k lines. However, in sample_novel_output.txt, you can spot some very novel sentences:
The phrases "Well, well, peace be with you" and "I would wish it gone" never show up in the training data.
Potential use cases:
- Autocorrect / word prediction
- Translation (possibly)
- Features for a lower-level model
- Generating surface-level text that fools detectors
One may be able to create a coherent language model simply by feeding ROSA+ embeddings into a GRU. Since ROSA+ captures the immediate surface-level features of text, a sufficiently capable neural network may be able to operate on these embeddings and alter the distribution for a more fine-grained understanding.
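As a hypothetical sketch of that experiment (the shapes, the idea of exposing per-position ROSA+ distributions as dense features, and all names here are assumptions -- none of this is part of ROSA+):

```python
# Treat ROSA+'s per-position next-character distributions as input features
# for a small GRU that re-weights them. Dummy tensors stand in for real data.
import torch
import torch.nn as nn

VOCAB = 256    # byte/character vocabulary size (assumption)
HIDDEN = 512

class RosaGRU(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(input_size=VOCAB, hidden_size=HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, rosa_dists):
        # rosa_dists: (batch, seq_len, VOCAB) -- ROSA+ probability vectors
        h, _ = self.gru(rosa_dists)
        return self.head(h)    # refined next-character logits

model = RosaGRU()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

feats = torch.rand(4, 128, VOCAB)             # placeholder for ROSA+ feature vectors
targets = torch.randint(0, VOCAB, (4, 128))   # placeholder next-character targets

opt.zero_grad()
logits = model(feats)
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
loss.backward()
opt.step()
```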
Unless statistical LMs incorporate some kind of statistical attention mechanism (which is possible!), they will never be able to grasp a high-level understanding of text the way humans and neural LMs do. A statistical LM is unable to copy data / tokens from one place to another, operate on a continuous state, blend together tokens across different spaces, perform few-shot learning (that needs a neural mechanism!) or transfer learning (no state vectors!). Therefore, their purpose remains limited to grasping surface-level features of text, like syntax or general structure.
Google pushed to make its translation software (which, in the 2010s, was n-gram based) the best at the time, but even LSTMs (which were invented long before Transformers) managed to outperform it.
Do not let this discourage you, though. It may be practical to incorporate some kind of continuous state vector / representation within a statistical model, making it drastically more efficient than LLMs while preserving the benefits of NN-based models. This is an active field of research at Bellevue College ML (BCML) -- and if pioneered, it could result in language models thousands of times more efficient. Don't let an article discourage you.


