A couple of months ago I gave a talk about running on-device SLMs in apps. It was well received, and it was refreshing to give a talk about how mobile apps can gain an advantage from the rise in language model usage.
After the talk, I had a few pieces of feedback from attendees along the lines of “It would be good to see an example of an app using an SLM powered by RAG”.
These are fair comments. The tricky thing is the lead time it takes to set up a model and show something meaningful to developers in a limited slot. Fortunately, it does make for a great blog post!
So here it is! If you’re looking for steps to add a RAG powered language model to your Android app, this is the post for you!
If you’re unfamiliar with the term, RAG stands for Retrieval Augmented Generation. It’s a technique that lets language models access external information that wasn’t available in their training data. This allows models to draw on up-to-date information and provide more accurate answers to prompts.
Let’s look at a quick example. Imagine you have two language models, one model is using RAG to retrieve external information about capital cities in Europe whilst the other is only relying on its own knowledge. You decide to give both models the following prompt:
The result could look something like this:
As the diagram illustrates, you’re more likely to get a useful answer from the language model using RAG than from the model relying solely on its own knowledge. You may also notice fewer hallucinations from the RAG powered model, as it doesn’t need to make up information.
That’s really all you need to know about RAG at a high level. Next, let’s go deeper and write some code to enable your own RAG powered language model inside your app!
Let’s say you want to create an app that uses a language model to play the game of Simon Says. You want the model to be Simon, and to use RAG to access a data source of tasks to help decide what to ask the user. How would you do that?
The most straightforward way to do that on Android is with MediaPipe, a collection of tools to help your app use AI and machine learning techniques. To begin, add the MediaPipe dependency to your app’s build.gradle:
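Something along these lines should work. The RAG-specific classes used later in this post (ChainConfig, SqliteVectorStore, the Gecko embedder wrapper) ship in Google’s AI Edge RAG library, so the sketch below pulls that in alongside the MediaPipe LLM Inference dependency. Treat the version numbers as examples and check for the latest releases:

```kotlin
// app/build.gradle.kts — version numbers are examples, check for the latest releases
dependencies {
    // MediaPipe LLM Inference API, used to run Gemma on device
    implementation("com.google.mediapipe:tasks-genai:0.10.22")
    // Google AI Edge RAG library, which provides ChainConfig, SqliteVectorStore,
    // the Gecko embedder wrapper and the retrieval chain used later in this post
    implementation("com.google.ai.edge.localagents:localagents-rag:0.1.0")
}
```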
Next, you need to add a language model to your test device via your computer. For this example we’ll use Google’s Gemma 3 1B, a lightweight language model with one billion parameters.
Side note: before downloading the model, you may have to sign up to Kaggle and agree to Google’s AI Terms and Conditions.
Once the model is downloaded, it’s time to add it to your device. You can do this via adb:
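The exact file name depends on the variant you downloaded, but the commands look something like this:

```bash
# Create a folder for the model on the device, then push the downloaded file.
# The file name below is an example — use the name of the .task file you downloaded.
adb shell mkdir -p /data/local/tmp/llm/
adb push gemma3-1b-it-int4.task /data/local/tmp/llm/gemma3-1b-it-int4.task
```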
Alternatively, you can use the Device Explorer in Android Studio to create the folders yourself and drag the model onto your device.
With the model added, you can continue building the RAG pipeline to feed Gemma with information. Next, let’s look at adding the information Gemma will rely on to answer prompts.
The information language models rely on to perform RAG is not stored in the same form as the text you pass in. Models require it to be in a specific format called embeddings.
These are numerical vectors that capture the text’s semantic meaning. When the model receives a prompt, it searches for the most relevant stored information matching it, and uses that alongside its own knowledge to provide an answer. These vectors are created by a tool called an embedder.
Embeddings are a whole subject in their own right, and you’re encouraged to read more about them. For this post you only need to know how to create them, which you can do on device using the Gecko embedder.
First, download the sentencepiece.model tokenizer and the Gecko_256_f32.tflite embedder model files to your computer. Then push them to your device:
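For example — the destination paths below are just a convention, as long as they match what you pass to the embedder in the next step:

```bash
# The destination paths are a convention — they only need to match
# the paths passed to the embedder later on.
adb push Gecko_256_f32.tflite /data/local/tmp/Gecko_256_f32.tflite
adb push sentencepiece.model /data/local/tmp/sentencepiece.model
```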
With the embedder files on your device, it’s time to provide a sample file to create the embeddings from. In Android Studio, create a file called simon_says_responses.txt in your app module’s assets folder, then add the responses you want Simon to draw from.
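For illustration, the contents could look something like this (the exact responses are up to you):

```text
Simon says touch your nose
<chunk_splitter>
Simon says hop on one leg
<chunk_splitter>
Simon says clap your hands three times
<chunk_splitter>
Simon says pretend to be a robot
```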
The file contains a handful of different responses Simon could give in a game of Simon Says, each separated by a <chunk_splitter> tag. This tag signals where to split the text into separate chunks when creating the embeddings.
This process is called chunking, and can have a large effect on how well RAG performs via the language model. Experimentation with different sized chunks and responses is encouraged to find the right combination for your needs!
One thing to consider is app storage. Remember you’ve already installed a language model and an embedder onto your device. Together these take up a significant amount of space, potentially gigabytes, so make sure not to bloat the device further with a text file that is too large!
You may want to consider storing the sample file in the cloud and downloading it via the network to reduce storage issues.
With the text file in place, it’s time to initialise the embedder:
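A minimal sketch, assuming the file paths used with adb earlier and the GeckoEmbeddingModel class from the AI Edge RAG library:

```kotlin
import com.google.ai.edge.localagents.rag.models.Embedder
import com.google.ai.edge.localagents.rag.models.GeckoEmbeddingModel
import java.util.Optional

// Paths must match wherever you pushed the files with adb
private const val GECKO_MODEL_PATH = "/data/local/tmp/Gecko_256_f32.tflite"
private const val TOKENIZER_MODEL_PATH = "/data/local/tmp/sentencepiece.model"
private const val USE_GPU_FOR_EMBEDDINGS = true

// Creates embeddings from text, on device
val embedder: Embedder<String> = GeckoEmbeddingModel(
    GECKO_MODEL_PATH,
    Optional.of(TOKENIZER_MODEL_PATH),
    USE_GPU_FOR_EMBEDDINGS,
)
```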
The embedder takes three parameters: the on-device paths of the embedder model and the tokenizer, and a final flag that sets whether the embedder can use the device GPU when creating embeddings.
Setting this to true will speed up the creation of the embeddings if a GPU is available. Make sure to check the device’s capabilities before enabling it.
Next, create an instance of the language model:
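A sketch of that step, combining MediaPipe’s inference options with the RAG library’s MediaPipeLlmBackend wrapper. The model path matches the adb push from earlier, and the tuning values are the ones discussed later in this post; treat the helper name and exact wrapper API as assumptions to check against your library version:

```kotlin
import android.content.Context
import com.google.ai.edge.localagents.rag.models.MediaPipeLlmBackend
import com.google.mediapipe.tasks.genai.llminference.LlmInference.LlmInferenceOptions
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession.LlmInferenceSessionOptions

private const val GEMMA_MODEL_PATH = "/data/local/tmp/llm/gemma3-1b-it-int4.task"

fun createLanguageModel(context: Context): MediaPipeLlmBackend {
    // Options for loading the model itself
    val inferenceOptions = LlmInferenceOptions.builder()
        .setModelPath(GEMMA_MODEL_PATH)
        .setMaxTokens(1200)
        .build()

    // Options for each session, controlling how responses are generated
    val sessionOptions = LlmInferenceSessionOptions.builder()
        .setTemperature(0.6f)
        .setTopK(5000)
        .setTopP(1.0f)
        .build()

    // Wraps the MediaPipe LLM Inference API so the RAG chain can call it.
    // Depending on the library version you may also need to initialise the
    // backend before first use — check the docs for your version.
    return MediaPipeLlmBackend(context.applicationContext, inferenceOptions, sessionOptions)
}
```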
There are a lot of parameters involved above; don’t worry about them for now. We’ll come back to them later on.
With the language model created, you can focus back on the embeddings. The model needs a place to retrieve the embeddings from each time it receives a prompt.
MediaPipe’s RAG tooling provides an SqliteVectorStore, a common way to store embeddings on device. Let’s create one, along with the ChainConfig that ties everything together:
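A sketch, assuming languageModel (the backend built above) and the embedder from earlier; the Simon Says prompt template is just an illustration:

```kotlin
import com.google.ai.edge.localagents.rag.chains.ChainConfig
import com.google.ai.edge.localagents.rag.memory.DefaultSemanticTextMemory
import com.google.ai.edge.localagents.rag.memory.SqliteVectorStore
import com.google.ai.edge.localagents.rag.prompt.PromptBuilder

// {0} is filled with the retrieved chunks, {1} with the user's prompt
private const val PROMPT_TEMPLATE =
    "You are Simon in a game of Simon Says. Here are some tasks you can ask for: {0} " +
    "Using them as inspiration, respond to the player: {1}"

val chainConfig = ChainConfig.create(
    languageModel,                  // the MediaPipeLlmBackend created earlier
    PromptBuilder(PROMPT_TEMPLATE),
    // Gecko produces 768-dimensional embeddings, so the store is sized to match
    DefaultSemanticTextMemory(SqliteVectorStore(768), embedder),
)
```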
Here, everything begins to link together as the language model and embedder are both passed into the ChainConfig. The 768 is the dimensionality of each “vector” the database stores, matching the size of the embeddings Gecko produces. The PromptBuilder provides a prompt template that drives the RAG process.
Finally, let’s feed the text file we created earlier in the assets folder through the embedder. First, load the file from assets and split the text into a list of strings:
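Something like this, assuming the asset file created earlier:

```kotlin
import android.content.Context

// Reads simon_says_responses.txt from assets and splits it into individual
// responses using the <chunk_splitter> tag
fun loadResponses(context: Context): List<String> =
    context.assets.open("simon_says_responses.txt")
        .bufferedReader()
        .use { it.readText() }
        .split("<chunk_splitter>")
        .map { it.trim() }
        .filter { it.isNotEmpty() }
```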
Next, load the responses into the chainConfig created earlier.
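A sketch of that step, assuming the chain config exposes its semantic memory in the same way as Google’s RAG sample code:

```kotlin
import com.google.common.collect.ImmutableList
import kotlin.jvm.optionals.getOrNull
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Embeds each response and records it in the SQLite vector store.
// The returned future is awaited with get(), so run this off the main thread.
suspend fun memorizeResponses(responses: List<String>) = withContext(Dispatchers.IO) {
    chainConfig.semanticMemory.getOrNull()
        ?.recordBatchedMemoryItems(ImmutableList.copyOf(responses))
        ?.get()
}
```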
Depending on your device and the size of the text file, this can take anywhere from a couple of seconds to a few minutes to complete. A good rule of thumb is to prevent the language model from being used whilst this occurs. You could also run the operation on an IO thread to avoid blocking the main thread.
With that done, you’ve converted your text file into a set of embeddings kept in a vector store. You’ve also linked your language model to the store, so it can now retrieve information!
The next section will show how to use those embeddings by passing the language model your prompts.
With most of the complex setup complete, passing a prompt into the language model is surprisingly easy. First, you need to create a RetrievalAndInferenceChain using the chain config:
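For example, reusing the chainConfig from earlier:

```kotlin
import com.google.ai.edge.localagents.rag.chains.RetrievalAndInferenceChain

// Ties retrieval (vector store lookup) and inference (Gemma) together
val retrievalAndInferenceChain = RetrievalAndInferenceChain(chainConfig)
```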
Next, create a request and pass it into the chain.
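A sketch, assuming the RetrievalRequest and RetrievalConfig types from the RAG library; retrieving the top three chunks with a minimum similarity of 0.0 is just an example configuration:

```kotlin
import com.google.ai.edge.localagents.rag.retrieval.RetrievalConfig
import com.google.ai.edge.localagents.rag.retrieval.RetrievalConfig.TaskType
import com.google.ai.edge.localagents.rag.retrieval.RetrievalRequest
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

suspend fun askSimon(prompt: String): String = withContext(Dispatchers.IO) {
    // Retrieve up to 3 relevant chunks from the vector store to ground the answer
    val request = RetrievalRequest.create(
        prompt,
        RetrievalConfig.create(3, 0.0f, TaskType.QUESTION_ANSWERING),
    )
    // Passing null means we don't listen for partial (streamed) responses
    retrievalAndInferenceChain.invoke(request, null).get().text
}
```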
With that the language model will process the prompt. During processing it will refer to the embeddings in your vector store to provide what it thinks is the most accurate answer.
As you would expect, different prompts will produce different responses. Since the embeddings contain a range of Simon Says responses, it’s likely you will get a good one!
What about the cases where you receive an unexpected result? At that point you need to go back and fine-tune the parameters of the objects from the previous section.
Let’s do that in the next section.
If there’s one thing we’ve learned over the years from Generative AI, it’s that it isn’t an exact science.
You’ve no doubt seen stories where language models have “misbehaved” and produced a less than desired result, causing embarrassment and reputational damage to companies.
This is often the result of not enough testing being performed to check that the responses from a model are as expected. Testing not only helps avoid embarrassment, it also lets you experiment with your language model’s output so it works even better!
Let’s take a look at the levers we can adjust to fine-tune our RAG powered language model, starting with LlmInferenceOptions.Builder:
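As a reminder, the inference options from earlier looked something like this:

```kotlin
val inferenceOptions = LlmInferenceOptions.builder()
    .setModelPath(GEMMA_MODEL_PATH)
    .setMaxTokens(1200) // token budget shared by the prompt and the response
    .build()
```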
The first value that can be changed is setMaxTokens(). This is the total number of tokens, input and output combined, that the language model can handle. The larger the value, the more text the model can work with at once, which can lead to better answers.
In our example it’s set to 1200, meaning the input text and the generated output together can use up to 1200 tokens. If we wanted to handle less text and generate smaller responses, we could set maxTokens to a smaller value.
Be careful with this value: if a prompt and its response need more tokens than the limit allows, the model can fail unexpectedly and crash your app.
Let’s move on to LlmInferenceSessionOptions.Builder():
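Again, as a reminder, the session options looked something like this:

```kotlin
val sessionOptions = LlmInferenceSessionOptions.builder()
    .setTemperature(0.6f) // how adventurous the responses are
    .setTopK(5000)        // how many candidates are considered
    .setTopP(1.0f)        // cumulative probability cut-off for sampling
    .build()
```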
Here, you can set a few different parameters: .setTemperature(), .setTopK(), and .setTopP(). Let’s dive into each of them.
.setTemperature() can be thought of as how “random” the responses from the language model are. The lower the value, the more “predictable” the responses; the higher the value, the more “creative” they become, leading to more unexpected responses.
For this example it’s set to 0.6, meaning the model will provide semi-creative, but not wildly unpredictable, responses.
The temperature is a good value to experiment with, as you may find different values provide better responses depending on your use case. In a game of Simon Says, some creativity is welcome!
.setTopK() is a way of saying “only consider the top K candidates when choosing what to return to the user”. Whilst processing a prompt, language models score a large number of candidate continuations, potentially thousands!
Each of these candidates is given a probability reflecting how likely it is to be the right choice. To limit how many are considered, the topK value can be set to help focus the model. If you’re happy with less probable candidates being considered, set this value high.
Similar to the temperature property, this is a good one to experiment with. You may find the model works better with fewer or more candidates to consider, depending on your needs.
For a game of Simon Says, we want the model to consider a lot of different responses to keep the game fresh, so 5000 seems like a good value.
.setTopP() builds upon the limit set by topK by saying “only sample from the smallest set of candidates whose probabilities add up to P”, a technique known as nucleus sampling. The most probable candidates are kept until their combined probability reaches the threshold, and everything less likely is discarded.
To show an example, imagine the model has topP set to 0.4 and its most likely response already carries a probability of 0.45.
That response alone crosses the threshold, so it’s the only one kept for sampling. Less likely candidates, such as a response about a car with a probability of 0.3, would be discarded.
Similar to temperature and topK, topP allows you to define how creative your language model can be. A low P value keeps the model to its most probable responses, while a higher value lets less probable responses into the mix.
In our example, it’s set to 1.0, which effectively disables this cut-off and leaves topK and temperature to keep the responses sensible. That’s more than enough control for a children’s game of Simon Says!
Experimenting with these values will generate very different results from your language models. Try them out and see what happens!
I hope you’ve enjoyed this walkthrough of how to add a RAG powered language model to your Android App! If you’re looking to learn more about RAG and Android, here are a few links I recommend:
- Simon Says App: Clone and run the sample code for this blog post to see it in action. It shows how to set up the RAG pipeline with Gemma using Android architecture best practices.
- MediaPipe RAG: Check out the RAG section of the MediaPipe Google Developer docs. Highly recommended reading.
- Setting Temperature, TopK, and TopP in LLMs: Learn more about how setting the Temperature, TopK, and TopP values can help control the results from language models. Another highly recommended article.