ESP32-LLM: Running a Little Language Model on the ESP32

Video demo (LLM on ESP32): https://youtu.be/ILh38jd0GNU

I wanted to see if it was possible to run a Large Language Model (LLM) on the ESP32. Surprisingly, it is possible, though probably not very useful.

The "Large" Language Model used is actually quite small: a 260K-parameter tinyllamas checkpoint trained on the TinyStories dataset.

The LLM implementation is based on llama2.c, with minor optimizations to make it run faster on the ESP32.
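For context, nearly all of the inference time in llama2.c is spent in one small matrix-vector multiply routine, essentially the following (simplified from the upstream project). This is the loop the optimizations below target:

    void matmul(float* xout, float* x, float* w, int n, int d) {
        // W (d,n) @ x (n,) -> xout (d,)
        for (int i = 0; i < d; i++) {
            float val = 0.0f;
            for (int j = 0; j < n; j++) {
                val += w[i * n + j] * x[j];
            }
            xout[i] = val;
        }
    }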

LLMs require a great deal of memory. Even this small one still requires about 1 MB of RAM. I used the LILYGO T-Camera S3 because it is an ESP32-S3 board with 8 MB of embedded PSRAM and a screen.
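ESP-IDF exposes a capability-aware allocator that can place large buffers in external PSRAM, which is how a 1 MB model can fit on a chip with only a few hundred KB of internal SRAM. A minimal sketch (the helper name is mine, not the project's):

    #include <stddef.h>
    #include "esp_heap_caps.h"

    // Place the ~1 MB of model weights in the 8 MB external PSRAM
    // instead of the much smaller internal SRAM.
    float *alloc_weights(size_t n_floats) {
        return (float *)heap_caps_malloc(n_floats * sizeof(float),
                                         MALLOC_CAP_SPIRAM);
    }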

Optimizing Llama2.c for the ESP32

With the following changes to llama2.c, I am able to achieve 19.13 tok/s:

  1. Utilizing both cores of the ESP32 during math-heavy operations (see the combined sketch after this list).
  2. Utilizing special dot-product functions from the ESP-DSP library that are designed for the ESP32-S3. These functions use some of the few SIMD instructions the ESP32-S3 has.
  3. Maxing out the CPU speed at 240 MHz and the PSRAM speed at 80 MHz, and increasing the instruction cache size (see the sdkconfig sketch below).
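Here is a minimal sketch of how points 1 and 2 can be combined: splitting the matmul rows across both cores and using ESP-DSP's dot product for the inner loop. This is my illustration under stated assumptions, not the project's actual code; the task name and stack size are placeholders, and a real implementation would likely reuse a persistent worker task rather than spawning one per call:

    #include "freertos/FreeRTOS.h"
    #include "freertos/task.h"
    #include "dsps_dotprod.h"

    typedef struct {
        float *xout;          // output vector, length d
        const float *x;       // input vector, length n
        const float *w;       // weights, row-major (d, n)
        int n;
        int row_start, row_end;
        TaskHandle_t notify;  // task to notify when done
    } matmul_job_t;

    static void matmul_rows(void *arg) {
        matmul_job_t *job = (matmul_job_t *)arg;
        for (int i = job->row_start; i < job->row_end; i++) {
            // dsps_dotprod_f32 dispatches to the ESP32-S3 SIMD
            // implementation when built for that target.
            dsps_dotprod_f32(&job->w[i * job->n], job->x,
                             &job->xout[i], job->n);
        }
        xTaskNotifyGive(job->notify);
        vTaskDelete(NULL);
    }

    // Drop-in replacement for the scalar matmul shown earlier.
    void matmul(float *xout, const float *x, const float *w, int n, int d) {
        int half = d / 2;
        matmul_job_t lo = { xout, x, w, n, 0, half,
                            xTaskGetCurrentTaskHandle() };
        // First half of the rows on the other core...
        xTaskCreatePinnedToCore(matmul_rows, "mm_worker", 4096, &lo, 5,
                                NULL, 1);
        // ...second half inline on this core.
        for (int i = half; i < d; i++) {
            dsps_dotprod_f32(&w[i * n], x, &xout[i], n);
        }
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);  // wait for the worker
    }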
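For point 3, these settings live in the project's sdkconfig. The entries below are typical for a recent ESP-IDF on the ESP32-S3; exact option names vary between IDF versions, so treat this as a sketch rather than a verbatim config:

    # sdkconfig.defaults (illustrative)
    CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_240=y
    CONFIG_SPIRAM=y
    CONFIG_SPIRAM_SPEED_80M=y
    CONFIG_ESP32S3_INSTRUCTION_CACHE_32KB=y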

Building and flashing requires the ESP-IDF toolchain to be installed:

    idf.py build
    idf.py -p /dev/{DEVICE_PORT} flash