GPULlama3.java – Llama3.java on Steroids




Llama3 models written in native Java, automatically accelerated on GPUs with TornadoVM. This project allows you to run Llama3 inference efficiently, leveraging TornadoVM's parallel computing features for enhanced performance.

Builds on Llama3.java, based on the original Llama 3, 3.1, and 3.2 models, with TornadoVM support for parallelism and hardware acceleration.

Thanks to Alfonso² Peterssen for the original implementation of Llama3.java.

A previous integration of TornadoVM with Llama2 can be found in llama2.tornadovm.


[Interactive mode] Running on an RTX 5090, with nvtop at the bottom to track GPU utilization and memory usage.

Demo

[Instruct mode] Running on an RTX 5090

Demo

TornadoVM-Accelerated Inference Performance and Optimization Status

We are at the early stages of Java entering the AI world, with features being added to the JVM that enable faster execution, such as GPU acceleration, vector acceleration, and high-performance access to off-heap memory.

This repository provides the first Java-native implementation of Llama3 that automatically compiles and executes Java code on GPUs via TornadoVM. The baseline numbers presented below are a solid starting point, though not yet competitive with llama.cpp or native CUDA implementations. Our roadmap outlines the upcoming features that will dramatically improve these numbers, with the clear target of achieving performance parity with the fastest implementations.
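To give a flavour of what "automatically compiles and executes Java code on GPUs" means in practice, below is a minimal sketch of the TornadoVM programming model using a toy matrix-vector multiply (the building block of a transformer's linear layers). This is an illustrative example, not code taken from this repository; it assumes the TornadoVM 1.x API (TaskGraph, TornadoExecutionPlan, FloatArray, @Parallel), whose names may differ slightly between versions.

import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class MatVecSketch {

    // Plain Java kernel: TornadoVM JIT-compiles the @Parallel loop into an OpenCL/PTX kernel.
    static void matVec(FloatArray w, FloatArray x, FloatArray out, int rows, int cols) {
        for (@Parallel int i = 0; i < rows; i++) {
            float sum = 0f;
            for (int j = 0; j < cols; j++) {
                sum += w.get(i * cols + j) * x.get(j);
            }
            out.set(i, sum);
        }
    }

    public static void main(String[] args) {
        int rows = 512, cols = 512;
        FloatArray w = new FloatArray(rows * cols);
        FloatArray x = new FloatArray(cols);
        FloatArray out = new FloatArray(rows);
        for (int i = 0; i < rows * cols; i++) w.set(i, 0.01f);  // dummy weights for the demo
        for (int j = 0; j < cols; j++) x.set(j, 1.0f);

        // Describe the computation and its data movement as a task graph...
        TaskGraph graph = new TaskGraph("matvec")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, w, x)
                .task("t0", MatVecSketch::matVec, w, x, out, rows, cols)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, out);

        // ...then snapshot and execute it; TornadoVM selects a device and compiles the kernel.
        ImmutableTaskGraph itg = graph.snapshot();
        new TornadoExecutionPlan(itg).execute();

        System.out.println("out[0] = " + out.get(0));
    }
}

GPULlama3.java applies the same idea to the full transformer forward pass, which is why the inference code stays in plain Java while the heavy computation runs on the GPU.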

If you obtain additional performance data points (e.g., on new hardware or platforms), please let us know so that we can add them below.

In addition, if you are interested in learning more about the challenges of combining managed programming languages with GPU acceleration, you can read our book or consult the TornadoVM educational pages.

| Vendor / Backend | Hardware | Llama-3.2-1B-Instruct (FP16) | Llama-3.2-3B-Instruct (FP16) | Optimizations Support |
| --- | --- | --- | --- | --- |
| NVIDIA / OpenCL-PTX | RTX 3070 | 52 tokens/s | 22.96 tokens/s | |
| NVIDIA / OpenCL-PTX | RTX 4090 | 66.07 tokens/s | 35.51 tokens/s | |
| NVIDIA / OpenCL-PTX | RTX 5090 | 96.65 tokens/s | 47.68 tokens/s | |
| NVIDIA / OpenCL-PTX | L4 Tensor | 52.96 tokens/s | 22.68 tokens/s | |
| Intel / OpenCL | Arc A770 | 15.65 tokens/s | 7.02 tokens/s | (WIP) |
| Apple Silicon / OpenCL | M3 Pro | 14.04 tokens/s | 6.78 tokens/s | (WIP) |
| Apple Silicon / OpenCL | M4 Pro | 16.77 tokens/s | 8.56 tokens/s | (WIP) |
| AMD / OpenCL | Radeon RX | (WIP) | (WIP) | (WIP) |
⚠️ Note on Apple Silicon Performance

TornadoVM currently runs on Apple Silicon via OpenCL, which has been officially deprecated since macOS 10.14.

Despite being deprecated, OpenCL still runs on Apple Silicon, albeit with older drivers that do not support all of TornadoVM's optimizations. Performance is therefore not optimal, since TornadoVM does not yet have a Metal backend (it currently provides OpenCL, PTX, and SPIR-V backends). For the time being, we recommend using Apple Silicon for development and OpenCL/PTX-compatible NVIDIA GPUs for performance testing, until we add a Metal backend to TornadoVM and start optimizing it.


Ensure you have the following installed and configured:

  • Java 21: Required for Vector API support & TornadoVM.
  • TornadoVM with OpenCL or PTX backends.
  • Maven: For building the Java project.

When cloning this repository, use the --recursive flag to ensure that TornadoVM is properly included as a submodule:

# Clone the repository with all submodules
git clone --recursive https://github.com/beehive-lab/GPULlama3.java.git

# Navigate to the project directory
cd GPULlama3.java

# Update the submodules to match the exact commit point recorded in this repository
git submodule update --recursive

# Enter the TornadoVM submodule directory
cd external/tornadovm

# Optional: Create and activate a Python virtual environment if needed
python3 -m venv venv
source ./venv/bin/activate

# Install TornadoVM with a supported JDK 21 and select the backends (--backend opencl,ptx).
# To see the compatible JDKs run: ./bin/tornadovm-installer --listJDKs
# For example, to install with OpenJDK 21 and build the OpenCL backend, run:
./bin/tornadovm-installer --jdk jdk21 --backend opencl

# Source the TornadoVM environment variables
source setvars.sh

# Navigate back to the project root directory
cd ../../

# Make the llama-tornado script executable
chmod +x llama-tornado

# Source the project-specific environment paths -> this will ensure the correct paths are set for the project and the TornadoVM SDK
# Expect to see: [INFO] Environment configured for Llama3 with TornadoVM at: /home/YOUR_PATH_TO_TORNADOVM
source set_paths

# Build the project using Maven (skip tests for faster build)
# mvn clean package -DskipTests or just make
make

# Run the model (make sure you have downloaded the model file first - see below)
./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke"

The above model can be swapped with one of the other models, such as beehive-llama-3.2-3b-instruct-fp16.gguf or beehive-llama-3.2-8b-instruct-fp16.gguf, depending on your needs. Check the models below.

Download FP16 quantized .gguf files from:

Please be gentle with huggingface.co servers:

Note: FP16 models are first-class citizens in the current version.

# Llama 3.2 (1B) - FP16
wget https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-1b-instruct-fp16.gguf

# Llama 3.2 (3B) - FP16
wget https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-3b-instruct-fp16.gguf

# Llama 3 (8B) - FP16
wget https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-8b-instruct-fp16.gguf

[Experimental] You can also download the Q8_0 and Q4_0 models used in the original Llama3.java implementation; for now, they are dequantized to FP16 at load time for TornadoVM support:

# Llama 3.2 (1B) - Q4_0
curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf

# Llama 3.2 (3B) - Q4_0
curl -L -O https://huggingface.co/mukel/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf

# Llama 3 (8B) - Q4_0
curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf

# Llama 3.2 (1B) - Q8_0
curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf

# Llama 3.1 (8B) - Q4_0
curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_0.gguf

To execute Llama3 models with TornadoVM on GPUs, use the llama-tornado script with the --gpu flag.

Run a model with a text prompt:

./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "Explain the benefits of GPU acceleration."

GPU Execution (FP16 Model)

Enable GPU acceleration with an FP16 model:

llama-tornado --gpu --verbose-init --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke"

Troubleshooting GPU Memory Issues

You may encounter an out-of-memory error like:

Exception in thread "main" uk.ac.manchester.tornado.api.exceptions.TornadoOutOfMemoryException: Unable to allocate 100663320 bytes of memory. To increase the maximum device memory, use -Dtornado.device.memory=<X>GB

This indicates that the default GPU memory allocation (7GB) is insufficient for your model.

First, check your GPU specifications. If your GPU has high memory capacity, you can increase the GPU memory allocation using the --gpu-memory flag:

# For 3B models, try increasing to 15GB
./llama-tornado --gpu --model beehive-llama-3.2-3b-instruct-fp16.gguf --prompt "Tell me a joke" --gpu-memory 15GB

# For 8B models, you may need even more (20GB or higher)
./llama-tornado --gpu --model beehive-llama-3.2-8b-instruct-fp16.gguf --prompt "Tell me a joke" --gpu-memory 20GB
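Under the hood, the --gpu-memory flag sets the -Dtornado.device.memory JVM property mentioned in the exception above; you can confirm the exact flags that get passed with --show-command, described further below.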

GPU Memory Requirements by Model Size

| Model Size | Recommended GPU Memory |
| --- | --- |
| 1B models | 7GB (default) |
| 3B models | 15GB |
| 8B models | 20GB+ |
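As a rough sanity check (an estimate, not an official sizing): an FP16 weight takes 2 bytes, so a 3B-parameter model needs about 3 × 10⁹ × 2 bytes ≈ 6 GB for the weights alone, before the KV cache, activations, and TornadoVM buffers are accounted for, which is why the table recommends 15 GB rather than 6 GB.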

Note: If you still encounter memory issues, try:

  1. Using Q4_0 instead of Q8_0 quantization (requires less memory).
  2. Closing other GPU-intensive applications in your system.

Supported command-line options include:

➜ llama-tornado --help
usage: llama-tornado [-h] --model MODEL_PATH [--prompt PROMPT] [-sp SYSTEM_PROMPT] [--temperature TEMPERATURE]
                     [--top-p TOP_P] [--seed SEED] [-n MAX_TOKENS] [--stream STREAM] [--echo ECHO] [-i]
                     [--instruct] [--gpu] [--opencl] [--ptx] [--gpu-memory GPU_MEMORY] [--heap-min HEAP_MIN]
                     [--heap-max HEAP_MAX] [--debug] [--profiler] [--profiler-dump-dir PROFILER_DUMP_DIR]
                     [--print-bytecodes] [--print-threads] [--print-kernel] [--full-dump] [--show-command]
                     [--execute-after-show] [--opencl-flags OPENCL_FLAGS] [--max-wait-events MAX_WAIT_EVENTS]
                     [--verbose]

GPU-accelerated LLaMA.java model runner using TornadoVM

options:
  -h, --help            show this help message and exit
  --model MODEL_PATH    Path to the LLaMA model file (e.g., beehive-llama-3.2-8b-instruct-fp16.gguf) (default: None)

LLaMA Configuration:
  --prompt PROMPT       Input prompt for the model (default: None)
  -sp SYSTEM_PROMPT, --system-prompt SYSTEM_PROMPT
                        System prompt for the model (default: None)
  --temperature TEMPERATURE
                        Sampling temperature (0.0 to 2.0) (default: 0.1)
  --top-p TOP_P         Top-p sampling parameter (default: 0.95)
  --seed SEED           Random seed (default: current timestamp) (default: None)
  -n MAX_TOKENS, --max-tokens MAX_TOKENS
                        Maximum number of tokens to generate (default: 512)
  --stream STREAM       Enable streaming output (default: True)
  --echo ECHO           Echo the input prompt (default: False)

Mode Selection:
  -i, --interactive     Run in interactive/chat mode (default: False)
  --instruct            Run in instruction mode (default) (default: True)

Hardware Configuration:
  --gpu                 Enable GPU acceleration (default: False)
  --opencl              Use OpenCL backend (default) (default: None)
  --ptx                 Use PTX/CUDA backend (default: None)
  --gpu-memory GPU_MEMORY
                        GPU memory allocation (default: 7GB)
  --heap-min HEAP_MIN   Minimum JVM heap size (default: 20g)
  --heap-max HEAP_MAX   Maximum JVM heap size (default: 20g)

Debug and Profiling:
  --debug               Enable debug output (default: False)
  --profiler            Enable TornadoVM profiler (default: False)
  --profiler-dump-dir PROFILER_DUMP_DIR
                        Directory for profiler output (default: /home/mikepapadim/repos/gpu-llama3.java/prof.json)

TornadoVM Execution Verbose:
  --print-bytecodes     Print bytecodes (tornado.print.bytecodes=true) (default: False)
  --print-threads       Print thread information (tornado.threadInfo=true) (default: False)
  --print-kernel        Print kernel information (tornado.printKernel=true) (default: False)
  --full-dump           Enable full debug dump (tornado.fullDebug=true) (default: False)
  --verbose-init        Enable timers for TornadoVM initialization (llama.EnableTimingForTornadoVMInit=true) (default: False)

Command Display Options:
  --show-command        Display the full Java command that will be executed (default: False)
  --execute-after-show  Execute the command after showing it (use with --show-command) (default: False)

Advanced Options:
  --opencl-flags OPENCL_FLAGS
                        OpenCL compiler flags (default: -cl-denorms-are-zero -cl-no-signed-zeros -cl-finite-math-only)
  --max-wait-events MAX_WAIT_EVENTS
                        Maximum wait events for TornadoVM event pool (default: 32000)
  --verbose, -v         Verbose output (default: False)

Debug & Profiling Options

View TornadoVM's internal behavior:

# Print thread information during execution
./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads

# Show bytecode compilation details
./llama-tornado --gpu --model model.gguf --prompt "..." --print-bytecodes

# Display generated GPU kernel code
./llama-tornado --gpu --model model.gguf --prompt "..." --print-kernel

# Enable full debug output with all details
./llama-tornado --gpu --model model.gguf --prompt "..." --debug --full-dump

# Combine debug options
./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads --print-bytecodes --print-kernel

Easy Integration with Your Codebase or Tools

To integrate llama-tornado into your codebase, IDE (e.g., IntelliJ), or build system (e.g., Maven or Gradle), use the --show-command flag. This flag prints the exact Java command, with all the JVM flags invoked under the hood to enable GPU execution with TornadoVM, making it simple to replicate or embed those flags in any external tool or codebase.

llama-tornado --gpu --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke" --show-command
📋 The full Java command looks like this:
/home/mikepapadim/.sdkman/candidates/java/current/bin/java \
    -server \
    -XX:+UnlockExperimentalVMOptions \
    -XX:+EnableJVMCI \
    -Xms20g -Xmx20g \
    --enable-preview \
    -Djava.library.path=/home/mikepapadim/manchester/TornadoVM/bin/sdk/lib \
    -Djdk.module.showModuleResolution=false \
    --module-path .:/home/mikepapadim/manchester/TornadoVM/bin/sdk/share/java/tornado \
    -Dtornado.load.api.implementation=uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph \
    -Dtornado.load.runtime.implementation=uk.ac.manchester.tornado.runtime.TornadoCoreRuntime \
    -Dtornado.load.tornado.implementation=uk.ac.manchester.tornado.runtime.common.Tornado \
    -Dtornado.load.annotation.implementation=uk.ac.manchester.tornado.annotation.ASMClassVisitor \
    -Dtornado.load.annotation.parallel=uk.ac.manchester.tornado.api.annotations.Parallel \
    -Duse.tornadovm=true \
    -Dtornado.threadInfo=false \
    -Dtornado.debug=false \
    -Dtornado.fullDebug=false \
    -Dtornado.printKernel=false \
    -Dtornado.print.bytecodes=false \
    -Dtornado.device.memory=7GB \
    -Dtornado.profiler=false \
    -Dtornado.log.profiler=false \
    -Dtornado.profiler.dump.dir=/home/mikepapadim/repos/gpu-llama3.java/prof.json \
    -Dtornado.enable.fastMathOptimizations=true \
    -Dtornado.enable.mathOptimizations=false \
    -Dtornado.enable.nativeFunctions=fast \
    -Dtornado.loop.interchange=true \
    -Dtornado.eventpool.maxwaitevents=32000 \
    "-Dtornado.opencl.compiler.flags=-cl-denorms-are-zero -cl-no-signed-zeros -cl-finite-math-only" \
    --upgrade-module-path /home/mikepapadim/manchester/TornadoVM/bin/sdk/share/java/graalJars \
    @/home/mikepapadim/manchester/TornadoVM/bin/sdk/etc/exportLists/common-exports \
    @/home/mikepapadim/manchester/TornadoVM/bin/sdk/etc/exportLists/opencl-exports \
    --add-modules ALL-SYSTEM,tornado.runtime,tornado.annotation,tornado.drivers.common,tornado.drivers.opencl \
    -cp /home/mikepapadim/repos/gpu-llama3.java/target/gpu-llama3-1.0-SNAPSHOT.jar \
    com.example.LlamaApp \
    -m beehive-llama-3.2-1b-instruct-fp16.gguf \
    --temperature 0.1 \
    --top-p 0.95 \
    --seed 1746903566 \
    --max-tokens 512 \
    --stream true \
    --echo false \
    -p "tell me a joke" \
    --instruct

Current Features & Roadmap

  • Support for GGUF format models with full FP16 and partial support for Q8_0 and Q4_0 quantization.
  • Instruction-following and chat modes for various use cases.
  • Interactive CLI with --interactive and --instruct modes.
  • Flexible backend switching - choose OpenCL or PTX at runtime (need to build TornadoVM with both enabled).
  • Cross-platform compatibility:
    • ✅ NVIDIA GPUs (OpenCL & PTX)
    • ✅ Intel GPUs (OpenCL)
    • ✅ Apple GPUs (OpenCL)

Click here to view a more detailed list of the transformer optimizations implemented in TornadoVM.

Click here to see the roadmap of the project.


This work is partially funded by the following EU & UKRI grants (most recent first):


MIT
