Show HN: WebPizza – AI/RAG pipeline running in the browser with WebGPU


Deployed on Vercel

🚀 Live Demo: https://webpizza-ai-poc.vercel.app/

⚠️ Experimental POC: This is a proof-of-concept for testing purposes only. It may contain bugs and errors. Loosely inspired by DataPizza AI.

100% Client-Side AI Document Chat - No servers, no APIs, complete privacy.

Chat with your PDF documents using AI that runs entirely in your browser via WebGPU.

  • 🔒 100% Private: All processing happens in your browser - your documents never leave your device
  • Dual Engine: Choose between standard WebLLM and optimized WeInfer (~3.76x faster)
  • 🤖 Multiple Models: Phi-3, Llama 3, Mistral 7B, Qwen, Gemma
  • 📄 PDF Support: Upload and chat with your PDF documents
  • 🎯 RAG Pipeline: Advanced retrieval-augmented generation with vector search
  • 💾 Local Storage: Documents cached in IndexedDB for instant access
  • 🚀 WebGPU Accelerated: Leverage your GPU for fast inference
  • Frontend: Angular 20
  • LLM Engines:
    • WebLLM v0.2.79 (Standard)
    • WeInfer v0.2.43 (Optimized with buffer reuse + async pipeline)
  • Embeddings: Transformers.js (all-MiniLM-L6-v2)
  • PDF Parsing: PDF.js v5.4.296
  • Vector Store: IndexedDB with cosine similarity (see the retrieval sketch below)
  • Compute: WebGPU / WebAssembly
  • Modern browser with WebGPU support (Chrome 113+, Edge 113+)
  • 4GB+ RAM available
  • Modern GPU (Intel HD 5500+, NVIDIA GTX 650+, AMD HD 7750+, Apple M1+)
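
To make the retrieval piece concrete, here is a minimal TypeScript sketch of the approach described above: embed text with Transformers.js (all-MiniLM-L6-v2) and rank stored chunks by cosine similarity. Function names such as embed and retrieve are illustrative, not the project's actual API.

  import { pipeline } from '@xenova/transformers';

  // Load the embedding model once; Transformers.js caches the weights locally.
  const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

  // Turn a piece of text into a normalized embedding vector.
  async function embed(text: string): Promise<number[]> {
    const output = await embedder(text, { pooling: 'mean', normalize: true });
    return Array.from(output.data as Float32Array);
  }

  // Vectors are normalized, so cosine similarity reduces to a dot product.
  function cosine(a: number[], b: number[]): number {
    return a.reduce((sum, v, i) => sum + v * b[i], 0);
  }

  // Rank stored chunks against a query and keep the top-k matches.
  async function retrieve(
    query: string,
    chunks: { text: string; vector: number[] }[],
    k = 4,
  ): Promise<string[]> {
    const q = await embed(query);
    return chunks
      .map((c) => ({ text: c.text, score: cosine(q, c.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map((c) => c.text);
  }

In the app, the chunk vectors are persisted in IndexedDB, which is what makes cached documents available for instant access on the next visit.
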
  # Install dependencies
  npm install

  # Start dev server
  npm start

  # Build for production
  npm run build

Enable WebGPU (if needed)

  1. Open chrome://flags or edge://flags
  2. Search for "WebGPU"
  3. Enable "Unsafe WebGPU"
  4. Restart browser

Check your browser: https://webgpureport.org/
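
The same check can be done from script. A minimal sketch using the standard navigator.gpu API (types assume @webgpu/types):

  // Probe WebGPU support using the standard navigator.gpu API.
  async function checkWebGPU(): Promise<boolean> {
    if (!('gpu' in navigator)) {
      console.warn('WebGPU is not exposed by this browser.');
      return false;
    }
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) {
      console.warn('No suitable GPU adapter was found.');
      return false;
    }
    return true;
  }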

  1. Select Engine: Choose between WebLLM (standard) and WeInfer (optimized)
  2. Choose Model: Select an LLM based on your hardware capabilities
  3. Upload PDF: Drop your document (the first use downloads the selected model, ~1-4 GB)
  4. Ask Questions: Chat with your document using natural language (a minimal sketch of this flow follows below)
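
For orientation, here is a minimal sketch of how these steps map onto the @mlc-ai/web-llm API, with retrieved chunks injected into the prompt. Identifiers like askDocument are illustrative; the app's real wiring may differ.

  import { CreateMLCEngine } from '@mlc-ai/web-llm';

  // Steps 1-2: create an engine for the chosen model (weights download on first use).
  const engine = await CreateMLCEngine('Llama-3.2-1B-Instruct-q4f16_1-MLC', {
    initProgressCallback: (report) => console.log(report.text),
  });

  // Steps 3-4: answer a question using chunks retrieved from the uploaded PDF.
  async function askDocument(question: string, contextChunks: string[]): Promise<string> {
    const reply = await engine.chat.completions.create({
      messages: [
        { role: 'system', content: 'Answer using only the provided context.' },
        {
          role: 'user',
          content: `Context:\n${contextChunks.join('\n---\n')}\n\nQuestion: ${question}`,
        },
      ],
    });
    return reply.choices[0].message.content ?? '';
  }
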
  • ❌ No data collection
  • ❌ No server uploads
  • ❌ No tracking cookies
  • ❌ No analytics
  • ✅ 100% client-side processing
  • ✅ Your data never leaves your device

See our Privacy Policy and Cookie Policy for details.

Approximate throughput with the standard WebLLM engine:

  • Phi-3 Mini: ~3-6 tokens/sec
  • Llama 3.2 1B: ~8-12 tokens/sec
  • Mistral 7B: ~2-4 tokens/sec

With the optimized WeInfer engine:

  • ~3.76x faster across all models
  • Buffer reuse optimization (see the sketch below)
  • Asynchronous pipeline processing
  • GPU sampling optimization
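
As a rough illustration of the buffer-reuse idea (a conceptual sketch, not WeInfer's actual code), a single GPUBuffer can be allocated once and rewritten on every decode step instead of being recreated each time:

  // Conceptual sketch of buffer reuse with raw WebGPU.
  const MAX_BYTES = 4 * 1024 * 1024; // size of the largest tensor we expect to upload

  async function demo(): Promise<void> {
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) throw new Error('WebGPU unavailable');
    const device = await adapter.requestDevice();

    // Allocate once, outside the per-token loop.
    const reusableBuffer = device.createBuffer({
      size: MAX_BYTES,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    });

    // Each decode step overwrites the same buffer instead of allocating a new one,
    // avoiding repeated allocation/free overhead in the GPU driver.
    for (let step = 0; step < 8; step++) {
      const activations = new Float32Array(1024).fill(step);
      device.queue.writeBuffer(reusableBuffer, 0, activations);
    }
  }
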
  # Install Vercel CLI
  npm i -g vercel

  # Deploy
  vercel

The project includes vercel.json with optimal configuration for WebGPU and routing.

Ensure your hosting supports:

  • SPA routing (all routes → index.html)
  • Cross-Origin headers for WebGPU (see the example configuration after this list):
    • Cross-Origin-Embedder-Policy: require-corp
    • Cross-Origin-Opener-Policy: same-origin
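
A vercel.json along these lines would satisfy both requirements (an illustrative sketch; the repository's actual file may differ):

  {
    "rewrites": [
      { "source": "/(.*)", "destination": "/index.html" }
    ],
    "headers": [
      {
        "source": "/(.*)",
        "headers": [
          { "key": "Cross-Origin-Embedder-Policy", "value": "require-corp" },
          { "key": "Cross-Origin-Opener-Policy", "value": "same-origin" }
        ]
      }
    ]
  }
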
  Browser   Version   WebGPU Support
  Chrome    113+      ✅ Full Support
  Edge      113+      ✅ Full Support
  Safari    18+       ⚠️ Experimental
  Firefox   -         ❌ Not Yet

Available Models (WebLLM)

  • Phi-3-mini-4k-instruct-q4f16_1-MLC (~2GB)
  • Llama-3.2-1B-Instruct-q4f16_1-MLC (~1GB)
  • Llama-3.2-3B-Instruct-q4f16_1-MLC (~1.5GB)
  • Mistral-7B-Instruct-v0.3-q4f16_1-MLC (~4GB)
  • Qwen2.5-1.5B-Instruct-q4f16_1-MLC (~1GB)

Available Models (WeInfer)

  • Phi-3-mini-4k-instruct-q4f16_1-MLC (~2GB)
  • Qwen2-1.5B-Instruct-q4f16_1-MLC (~1GB)
  • Mistral-7B-Instruct-v0.3-q4f16_1-MLC (~4GB)
  • Llama-3-8B-Instruct-q4f16_1-MLC (~4GB)
  • gemma-2b-it-q4f16_1-MLC (~1.2GB)

If WebGPU is not detected:

  1. Check your browser version (Chrome/Edge 113+)
  2. Enable chrome://flags#enable-unsafe-webgpu
  3. Update your graphics drivers
  4. Test at https://webgpureport.org/

If inference is slow:

  • Try a smaller model (Llama 1B, Qwen)
  • Use the WeInfer engine for a ~3.76x speedup
  • Close other tabs/applications
  • Check that the GPU isn't throttling

If you run out of memory:

  • Use smaller models
  • Close other browser tabs
  • Increase the browser memory limit
  • Clear the browser cache and restart

This is a proof-of-concept project. Contributions, issues, and feature requests are welcome!

MIT License - See LICENSE file for details



Made with ❤️ by Emanuele Strazzullo
