Show HN: WebPizza – AI/RAG pipeline running in the browser with WebGPU


Deployed on Vercel

🚀 Live Demo: https://webpizza-ai-poc.vercel.app/

⚠️ Experimental POC: This is a proof-of-concept for testing purposes only. It may contain bugs and errors. Loosely inspired by DataPizza AI.

100% Client-Side AI Document Chat - No servers, no APIs, complete privacy.

Chat with your PDF documents using AI that runs entirely in your browser via WebGPU.

  • 🔒 100% Private: All processing happens in your browser - your documents never leave your device
  • Dual Engine: Choose between standard WebLLM and optimized WeInfer (~3.76x faster)
  • 🤖 Multiple Models: Phi-3, Llama 3, Mistral 7B, Qwen, Gemma
  • 📄 PDF Support: Upload and chat with your PDF documents
  • 🎯 RAG Pipeline: Advanced retrieval-augmented generation with vector search
  • 💾 Local Storage: Documents cached in IndexedDB for instant access
  • 🚀 WebGPU Accelerated: Leverage your GPU for fast inference
  • Frontend: Angular 20
  • LLM Engines:
    • WebLLM v0.2.79 (Standard)
    • WeInfer v0.2.43 (Optimized with buffer reuse + async pipeline)
  • Embeddings: Transformers.js (all-MiniLM-L6-v2)
  • PDF Parsing: PDF.js v5.4.296
  • Vector Store: IndexedDB with cosine similarity (see the retrieval sketch below)
  • Compute: WebGPU / WebAssembly
  • Modern browser with WebGPU support (Chrome 113+, Edge 113+)
  • 4GB+ RAM available
  • Modern GPU (Intel HD 5500+, NVIDIA GTX 650+, AMD HD 7750+, Apple M1+)
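
To make the retrieval piece concrete, here is a minimal TypeScript sketch of the approach described above: embed text with Transformers.js (all-MiniLM-L6-v2) and rank stored chunks by cosine similarity. Function names such as embed and retrieve are illustrative, not the project's actual API.

  import { pipeline } from '@xenova/transformers';

  // Load the embedding model once; Transformers.js caches the weights locally.
  const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

  // Turn a piece of text into a normalized embedding vector.
  async function embed(text: string): Promise<number[]> {
    const output = await embedder(text, { pooling: 'mean', normalize: true });
    return Array.from(output.data as Float32Array);
  }

  // Vectors are normalized, so cosine similarity reduces to a dot product.
  function cosine(a: number[], b: number[]): number {
    return a.reduce((sum, v, i) => sum + v * b[i], 0);
  }

  // Rank stored chunks against a query and keep the top-k matches.
  async function retrieve(
    query: string,
    chunks: { text: string; vector: number[] }[],
    k = 4,
  ): Promise<string[]> {
    const q = await embed(query);
    return chunks
      .map((c) => ({ text: c.text, score: cosine(q, c.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map((c) => c.text);
  }

In the app, the chunk vectors are persisted in IndexedDB, which is what makes cached documents available for instant access on the next visit.
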
  # Install dependencies
  npm install

  # Start dev server
  npm start

  # Build for production
  npm run build

Enable WebGPU (if needed)

  1. Open chrome://flags or edge://flags
  2. Search for "WebGPU"
  3. Enable "Unsafe WebGPU"
  4. Restart browser

Check your browser: https://webgpureport.org/
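
The same check can be done from script. A minimal sketch using the standard navigator.gpu API (types assume @webgpu/types):

  // Probe WebGPU support using the standard navigator.gpu API.
  async function checkWebGPU(): Promise<boolean> {
    if (!('gpu' in navigator)) {
      console.warn('WebGPU is not exposed by this browser.');
      return false;
    }
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) {
      console.warn('No suitable GPU adapter was found.');
      return false;
    }
    return true;
  }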

  1. Select Engine: Choose between WebLLM (standard) and WeInfer (optimized)
  2. Choose Model: Select an LLM based on your hardware capabilities
  3. Upload PDF: Drop your document (the first use downloads the selected model, ~1-4 GB)
  4. Ask Questions: Chat with your document using natural language (a minimal sketch of this flow follows below)
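
For orientation, here is a minimal sketch of how these steps map onto the @mlc-ai/web-llm API, with retrieved chunks injected into the prompt. Identifiers like askDocument are illustrative; the app's real wiring may differ.

  import { CreateMLCEngine } from '@mlc-ai/web-llm';

  // Steps 1-2: create an engine for the chosen model (weights download on first use).
  const engine = await CreateMLCEngine('Llama-3.2-1B-Instruct-q4f16_1-MLC', {
    initProgressCallback: (report) => console.log(report.text),
  });

  // Steps 3-4: answer a question using chunks retrieved from the uploaded PDF.
  async function askDocument(question: string, contextChunks: string[]): Promise<string> {
    const reply = await engine.chat.completions.create({
      messages: [
        { role: 'system', content: 'Answer using only the provided context.' },
        {
          role: 'user',
          content: `Context:\n${contextChunks.join('\n---\n')}\n\nQuestion: ${question}`,
        },
      ],
    });
    return reply.choices[0].message.content ?? '';
  }
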
  • ❌ No data collection
  • ❌ No server uploads
  • ❌ No tracking cookies
  • ❌ No analytics
  • ✅ 100% client-side processing
  • ✅ Your data never leaves your device

See our Privacy Policy and Cookie Policy for details.

Approximate throughput with the standard WebLLM engine:

  • Phi-3 Mini: ~3-6 tokens/sec
  • Llama 3.2 1B: ~8-12 tokens/sec
  • Mistral 7B: ~2-4 tokens/sec

With the optimized WeInfer engine:

  • ~3.76x faster across all models
  • Buffer reuse optimization (see the sketch below)
  • Asynchronous pipeline processing
  • GPU sampling optimization
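
As a rough illustration of the buffer-reuse idea (a conceptual sketch, not WeInfer's actual code), a single GPUBuffer can be allocated once and rewritten on every decode step instead of being recreated each time:

  // Conceptual sketch of buffer reuse with raw WebGPU.
  const MAX_BYTES = 4 * 1024 * 1024; // size of the largest tensor we expect to upload

  async function demo(): Promise<void> {
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) throw new Error('WebGPU unavailable');
    const device = await adapter.requestDevice();

    // Allocate once, outside the per-token loop.
    const reusableBuffer = device.createBuffer({
      size: MAX_BYTES,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    });

    // Each decode step overwrites the same buffer instead of allocating a new one,
    // avoiding repeated allocation/free overhead in the GPU driver.
    for (let step = 0; step < 8; step++) {
      const activations = new Float32Array(1024).fill(step);
      device.queue.writeBuffer(reusableBuffer, 0, activations);
    }
  }
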
  # Install Vercel CLI
  npm i -g vercel

  # Deploy
  vercel

The project includes vercel.json with optimal configuration for WebGPU and routing.

Ensure your hosting supports:

  • SPA routing (all routes → index.html)
  • Cross-Origin headers for WebGPU (see the example configuration after this list):
    • Cross-Origin-Embedder-Policy: require-corp
    • Cross-Origin-Opener-Policy: same-origin
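
A vercel.json along these lines would satisfy both requirements (an illustrative sketch; the repository's actual file may differ):

  {
    "rewrites": [
      { "source": "/(.*)", "destination": "/index.html" }
    ],
    "headers": [
      {
        "source": "/(.*)",
        "headers": [
          { "key": "Cross-Origin-Embedder-Policy", "value": "require-corp" },
          { "key": "Cross-Origin-Opener-Policy", "value": "same-origin" }
        ]
      }
    ]
  }
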
  Browser   Version   WebGPU Support
  Chrome    113+      ✅ Full Support
  Edge      113+      ✅ Full Support
  Safari    18+       ⚠️ Experimental
  Firefox   -         ❌ Not Yet

Available Models (WebLLM)

  • Phi-3-mini-4k-instruct-q4f16_1-MLC (~2GB)
  • Llama-3.2-1B-Instruct-q4f16_1-MLC (~1GB)
  • Llama-3.2-3B-Instruct-q4f16_1-MLC (~1.5GB)
  • Mistral-7B-Instruct-v0.3-q4f16_1-MLC (~4GB)
  • Qwen2.5-1.5B-Instruct-q4f16_1-MLC (~1GB)

Available Models (WeInfer)

  • Phi-3-mini-4k-instruct-q4f16_1-MLC (~2GB)
  • Qwen2-1.5B-Instruct-q4f16_1-MLC (~1GB)
  • Mistral-7B-Instruct-v0.3-q4f16_1-MLC (~4GB)
  • Llama-3-8B-Instruct-q4f16_1-MLC (~4GB)
  • gemma-2b-it-q4f16_1-MLC (~1.2GB)

If WebGPU is not detected:

  1. Check your browser version (Chrome/Edge 113+)
  2. Enable chrome://flags#enable-unsafe-webgpu
  3. Update your graphics drivers
  4. Test at https://webgpureport.org/

If inference is slow:

  • Try a smaller model (Llama 1B, Qwen)
  • Use the WeInfer engine for a ~3.76x speedup
  • Close other tabs/applications
  • Check that the GPU isn't throttling

If you run out of memory:

  • Use smaller models
  • Close other browser tabs
  • Increase the browser memory limit
  • Clear the browser cache and restart

This is a proof-of-concept project. Contributions, issues, and feature requests are welcome!

MIT License - See LICENSE file for details



Made with ❤️ by Emanuele Strazzullo
