A high-performance Large Language Model inference engine written in pure Object Pascal.
PasLLM is a native Pascal implementation for running LLMs locally with optimized quantization and inference capabilities. It supports multiple model architectures and features advanced 4-bit quantization formats for efficient model deployment.
It is currently CPU-only, with no GPU acceleration. GPU acceleration will be added in the future using my PasVulkan framework, but this will take time and effort. Until at least Q2 2026, I'm focusing on other professional projects, so please be patient. The same applies to support for multi-modal models and for newer architectures such as Mamba.
- Pure Object Pascal - No Python or external dependencies for inference
- Cross-Platform - Compatible with Delphi ≥11.2 and FreePascal ≥3.3.1
- Multiple Architectures - Support for Llama, Qwen, Phi, Gemma, Mixtral, and more
- Advanced Quantization - Custom Q4*NL formats (Q40NL, Q41NL, Q42NL, Q43NL) with superior tail reconstruction
- Optimized Performance - Native Pascal implementation with platform-specific optimizations
- CLI and GUI - Both command-line interface and visual applications (FMX, VCL, LCL)
PasLLM implements several custom 4-bit and 8-bit quantization formats designed for an optimal quality/size tradeoff:
- Q40NL - 4.5 bits/weight with non-linear decode (often better than Q40)
- Q41NL - Alternative non-linearity with increased tail emphasis
- Q42NL - Enhanced variant with improved reconstruction
- Q43NL - Advanced format with multiple optimization methods (gradient, coarse-fine, grid)
- Q40 - 4 bits/weight standard quantization (matches llama.cpp Q4_0 quality)
- Q80 - 8-bit quantization for higher quality (matches llama.cpp Q8_0 quality)
- Q3F8 - Eight 3-bit weights plus one 8-bit FP8 scale per block, i.e. 32 bits per 8 weights for 4 bits/weight
- FP8 - 8-bit floating point support
- FP16 - 16-bit floating point support
- BF16 - Brain Floating Point 16-bit support (effectively a 32-bit float with its lower 16 mantissa bits truncated)
- FP32 - Standard 32-bit floating point support (for reference and testing)
These formats achieve 99.5-99.97% of full precision quality while maintaining compact model sizes.
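For orientation, here is a minimal, self-contained Object Pascal sketch of how a generic scaled 4-bit block format of this kind decodes (a Q4_0-style layout: 32 weights per block, one scale, two 4-bit codes per byte, which is where figures like 4.5 bits/weight come from when the scale is stored as FP16). The record layout and names below are illustrative assumptions, not PasLLM's actual on-disk format:

```pascal
program Q4BlockDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

const
  BlockSize = 32; // weights per block

type
  // Illustrative block: 16 bytes of packed 4-bit codes plus one scale.
  // With an FP16 scale this is (32*4 + 16) / 32 = 4.5 bits per weight;
  // the scale is kept as Single here purely for simplicity.
  TQ4Block = record
    Scale: Single;
    Codes: array[0..(BlockSize div 2) - 1] of Byte; // two 4-bit codes per byte
  end;

// Quantize 32 floats into one block: stored nibbles 0..15 map to -8..+7 times Scale.
procedure QuantizeBlock(const Values: array of Single; out Block: TQ4Block);
var
  i, q: Integer;
  MaxAbs: Single;
begin
  MaxAbs := 0;
  for i := 0 to BlockSize - 1 do
    if Abs(Values[i]) > MaxAbs then
      MaxAbs := Abs(Values[i]);
  if MaxAbs = 0 then
    Block.Scale := 1
  else
    Block.Scale := MaxAbs / 7;
  FillChar(Block.Codes, SizeOf(Block.Codes), 0);
  for i := 0 to BlockSize - 1 do
  begin
    q := Round(Values[i] / Block.Scale);
    if q < -8 then q := -8;
    if q > 7 then q := 7;
    q := q + 8; // store as unsigned nibble 0..15
    if (i and 1) = 0 then
      Block.Codes[i shr 1] := Block.Codes[i shr 1] or (q and $0F)
    else
      Block.Codes[i shr 1] := Block.Codes[i shr 1] or ((q and $0F) shl 4);
  end;
end;

// Dequantize one block back to 32 floats.
procedure DequantizeBlock(const Block: TQ4Block; var Values: array of Single);
var
  i, q: Integer;
begin
  for i := 0 to BlockSize - 1 do
  begin
    if (i and 1) = 0 then
      q := Block.Codes[i shr 1] and $0F
    else
      q := Block.Codes[i shr 1] shr 4;
    Values[i] := (q - 8) * Block.Scale;
  end;
end;

var
  Input, Output: array[0..BlockSize - 1] of Single;
  Block: TQ4Block;
  i: Integer;
begin
  for i := 0 to BlockSize - 1 do
    Input[i] := Sin(i * 0.37); // arbitrary test data
  QuantizeBlock(Input, Block);
  DequantizeBlock(Block, Output);
  for i := 0 to BlockSize - 1 do
    WriteLn(Input[i]:10:6, '  ->  ', Output[i]:10:6);
end.
```

The Q4*NL variants listed above differ from this linear sketch chiefly in the decode step: the 4-bit codes are mapped through a non-linear curve rather than the plain (q - 8) * Scale, which is what the tail-reconstruction improvements refer to.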
Pre-quantized models are available at https://mega.nz/folder/krcgHCpZ#0tjLqup_Hc4THWC9itDrTg; downloaded models must be placed in the bin/models/ directory. Supported architectures include:
- Llama - 3/3.1/3.2 (1B, 3B, 8B variants, including abliterated/uncensored)
- Qwen 2.5 - 0.5B, 1.5B, 3B, 7B Instruct
- Qwen 3 - 0.6B, 1.7B, 4B, 8B, 14B, 32B (including thinking/coder/abliterated variants, 30B MoE models)
- Phi-3 - Mini, Medium (4K context)
- Gemma - 1.1 (2B) (support for Gemma 2 and 3 coming later)
- SmolLM 2 - 135M, 360M, 1.7B
- SmolLM 3 - 3B
- Mixtral - 8x7B Instruct
- EuroMoE - 2.6B (0.6B active)
- SimpleChat - 4B, 14B
- DeepSeek - R1 variants
- TinyLlama - 1.1B Chat
FreePascal:
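A minimal command-line sketch for building the CLI with FreePascal; the main source name pasllmcli.dpr is an assumption inferred from the Delphi project file mentioned below, and additional unit search paths (-Fu) may be required:

```
fpc -Mdelphi -O3 src/pasllmcli/pasllmcli.dpr
```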
Delphi: Open src/pasllmcli/pasllmcli.dproj in the Delphi IDE and build.
Models from Hugging Face can be converted to the PasLLM format using the convert.py script in the tools/ directory. Example usage:
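(The invocation below is purely illustrative; the argument order, the output file name, and the .pasllm extension are assumptions, so check tools/convert.py itself for its actual options and supported target formats.)

```
python tools/convert.py /path/to/huggingface-model model.pasllm
```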
- 4-bit Quantization Formats - Complete specification of Q4*NL formats
Dual-licensed under:
- AGPL 3.0 for open-source use
- Commercial license available for proprietary applications (contact: [email protected])
Benjamin Rosseaux (BeRo)
GitHub: @BeRo1985
Contact: [email protected]
- Code is compatible with both Delphi ≥11.2 and FreePascal ≥3.3.1
- Compiles on 32-bit and 64-bit platforms (x86-32, x86-64, ARM, ARM64), but 64-bit targets are preferred because of model sizes (32-bit builds may run out of memory) and are also the better tested and verified ones. 32-bit support is unofficial and used at your own risk.
- No platform-specific or third-party dependencies, apart from optional ones that can be compiled out via IFDEFs