🚀 Live Demo: https://webpizza-ai-poc.vercel.app/
⚠️ Experimental POC: This is a proof-of-concept for testing purposes only. It may contain bugs and errors. Loosely inspired by DataPizza AI.
100% Client-Side AI Document Chat - No servers, no APIs, complete privacy.
Chat with your PDF documents using AI that runs entirely in your browser via WebGPU.
- 🔒 100% Private: All processing happens in your browser - your documents never leave your device
- ⚡ Dual Engine: Choose between standard WebLLM and optimized WeInfer (~3.76x faster)
- 🤖 Multiple Models: Phi-3, Llama 3, Mistral 7B, Qwen, Gemma
- 📄 PDF Support: Upload and chat with your PDF documents (ingestion is sketched after this list)
- 🎯 RAG Pipeline: Advanced retrieval-augmented generation with vector search
- 💾 Local Storage: Documents cached in IndexedDB for instant access
- 🚀 WebGPU Accelerated: Leverage your GPU for fast inference
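
Under the hood, ingesting a document boils down to extracting its text and splitting it into chunks for embedding. A minimal TypeScript sketch using PDF.js (not the project's actual code; chunk size and overlap are illustrative):

```typescript
// Sketch: extract text from a PDF with PDF.js, then split it into overlapping
// chunks for embedding. In the browser, pdfjsLib.GlobalWorkerOptions.workerSrc
// must also be set before loading documents.
import * as pdfjsLib from 'pdfjs-dist';

async function extractPdfText(data: ArrayBuffer): Promise<string> {
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const pages: string[] = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    pages.push(content.items.map((item: any) => item.str ?? '').join(' '));
  }
  return pages.join('\n');
}

function chunkText(text: string, size = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
```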
- Frontend: Angular 20
- LLM Engines:
- WebLLM v0.2.79 (Standard)
- WeInfer v0.2.43 (Optimized with buffer reuse + async pipeline)
- Embeddings: Transformers.js (all-MiniLM-L6-v2)
- PDF Parsing: PDF.js v5.4.296
- Vector Store: IndexedDB with cosine similarity (see the retrieval sketch after this list)
- Compute: WebGPU / WebAssembly
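
As a rough sketch of how the embedding and vector-store pieces fit together (assuming the `@xenova/transformers` package; the record shape and top-k value are illustrative, not the project's actual code):

```typescript
// Sketch: embed text with Transformers.js (all-MiniLM-L6-v2) and rank stored
// chunks by cosine similarity. Chunk records would normally be persisted in IndexedDB.
import { pipeline } from '@xenova/transformers';

const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function embed(text: string): Promise<number[]> {
  // Mean-pooled, L2-normalized 384-dimensional sentence embedding.
  const output = await embedder(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data as Float32Array);
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

// Return the k most similar chunks for a query embedding.
function topK(query: number[], chunks: { text: string; vector: number[] }[], k = 4) {
  return chunks
    .map((c) => ({ ...c, score: cosineSimilarity(query, c.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```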
- Modern browser with WebGPU support (Chrome 113+, Edge 113+)
- 4GB+ RAM available
- Modern GPU (Intel HD 5500+, NVIDIA GTX 650+, AMD HD 7750+, Apple M1+)
```bash
# Install dependencies
npm install

# Start dev server
npm start

# Build for production
npm run build
```
- Open chrome://flags or edge://flags
- Search for "WebGPU"
- Enable "Unsafe WebGPU"
- Restart browser
Check your browser's WebGPU support at https://webgpureport.org/
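
You can also test for WebGPU from code before loading a model. A minimal TypeScript check:

```typescript
// Sketch: verify WebGPU is exposed and a GPU adapter can actually be obtained.
async function hasWebGPU(): Promise<boolean> {
  if (!('gpu' in navigator)) return false;            // API not exposed by the browser
  const adapter = await (navigator as any).gpu.requestAdapter();
  return adapter !== null;                            // null adapter = no usable GPU
}
```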
- Select Engine: Choose between WebLLM (standard) or WeInfer (optimized)
- Choose Model: Select an LLM based on your hardware capabilities
- Upload PDF: Drop your document (the first run downloads the selected model, ~1-4GB)
- Ask Questions: Chat with your document using natural language
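
The "Ask Questions" step is a standard retrieval-augmented prompt. A minimal sketch with the WebLLM API (model ID taken from the list below; the prompt wording is illustrative, not the project's actual code):

```typescript
// Sketch: load a model with WebLLM and answer a question using retrieved chunks as context.
import { CreateMLCEngine } from '@mlc-ai/web-llm';

const engine = await CreateMLCEngine('Phi-3-mini-4k-instruct-q4f16_1-MLC', {
  initProgressCallback: (report) => console.log(report.text), // download/compile progress
});

async function ask(question: string, contextChunks: string[]): Promise<string> {
  const reply = await engine.chat.completions.create({
    messages: [
      { role: 'system', content: 'Answer using only the provided context.' },
      {
        role: 'user',
        content: `Context:\n${contextChunks.join('\n---\n')}\n\nQuestion: ${question}`,
      },
    ],
  });
  return reply.choices[0].message.content ?? '';
}
```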
- ❌ No data collection
- ❌ No server uploads
- ❌ No tracking cookies
- ❌ No analytics
- ✅ 100% client-side processing
- ✅ Your data never leaves your device
See our Privacy Policy and Cookie Policy for details.
Approximate throughput with the standard WebLLM engine (varies with hardware):
- Phi-3 Mini: ~3-6 tokens/sec
- Llama 3.2 1B: ~8-12 tokens/sec
- Mistral 7B: ~2-4 tokens/sec

The WeInfer engine is ~3.76x faster across all models, thanks to:
- Buffer reuse optimization
- Asynchronous pipeline processing
- GPU sampling optimization
```bash
# Install Vercel CLI
npm i -g vercel

# Deploy
vercel
```
The project includes a vercel.json with the SPA routing and cross-origin headers WebGPU deployment needs (a minimal example follows the list below).
Ensure your hosting supports:
- SPA routing (all routes → index.html)
- Cross-Origin headers for WebGPU:
- Cross-Origin-Embedder-Policy: require-corp
- Cross-Origin-Opener-Policy: same-origin
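
A minimal vercel.json covering both points might look like this (a sketch, not necessarily the project's exact file):

```json
{
  "rewrites": [{ "source": "/(.*)", "destination": "/index.html" }],
  "headers": [
    {
      "source": "/(.*)",
      "headers": [
        { "key": "Cross-Origin-Embedder-Policy", "value": "require-corp" },
        { "key": "Cross-Origin-Opener-Policy", "value": "same-origin" }
      ]
    }
  ]
}
```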
| Browser | Version | WebGPU Support |
|---------|---------|----------------|
| Chrome  | 113+    | ✅ Full Support |
| Edge    | 113+    | ✅ Full Support |
| Safari  | 18+     | ⚠️ Experimental |
| Firefox | -       | ❌ Not Yet |
Available with the WebLLM engine (v0.2.79):
- Phi-3-mini-4k-instruct-q4f16_1-MLC (~2GB)
- Llama-3.2-1B-Instruct-q4f16_1-MLC (~1GB)
- Llama-3.2-3B-Instruct-q4f16_1-MLC (~1.5GB)
- Mistral-7B-Instruct-v0.3-q4f16_1-MLC (~4GB)
- Qwen2.5-1.5B-Instruct-q4f16_1-MLC (~1GB)

Available with the WeInfer engine (v0.2.43):
- Phi-3-mini-4k-instruct-q4f16_1-MLC (~2GB)
- Qwen2-1.5B-Instruct-q4f16_1-MLC (~1GB)
- Mistral-7B-Instruct-v0.3-q4f16_1-MLC (~4GB)
- Llama-3-8B-Instruct-q4f16_1-MLC (~4GB)
- gemma-2b-it-q4f16_1-MLC (~1.2GB)
- Check browser version (Chrome/Edge 113+)
- Enable chrome://flags#enable-unsafe-webgpu
- Update graphics drivers
- Test at https://webgpureport.org/
- Try a smaller model (Llama 1B, Qwen)
- Use the WeInfer engine for a ~3.76x speedup
- Close other tabs/applications
- Check GPU isn't throttling
- Use smaller models
- Close other browser tabs
- Increase browser memory limit
- Clear browser cache and restart
This is a proof-of-concept project. Contributions, issues, and feature requests are welcome!
MIT License - See LICENSE file for details
Emanuele Strazzullo
- Website: emanuelestrazzullo.dev
- LinkedIn: linkedin.com/in/emanuelestrazzullo
- MLC LLM - WebLLM inference engine
- WeInfer - Optimized WebLLM fork
- Transformers.js - Browser ML library
- PDF.js - PDF parsing
- Hugging Face - Model hosting
Made with ❤️ by Emanuele Strazzullo