When AI assistants (Claude, ChatGPT, etc.) crawl your documentation sites, they consume HTML bloat:
- Navigation, footers, and ads = 60-80% wasted tokens
- Harder to parse, slower responses
- Higher API costs
HTML-to-MD AI automatically detects AI crawlers and serves them clean, optimized Markdown instead.

| Feature | Description |
|---|---|
| 🚀 Edge Deployment | Runs on Cloudflare's global network (300+ locations) |
| 🤖 Smart Detection | Recognizes Claude, ChatGPT, Google AI, Bing AI, Perplexity, and more |
| 💾 Built-in Caching | Fast responses with configurable TTL (< 10ms cache hits) |
| 🎯 Zero Code Changes | Deploy and forget - perfect for static sites |
| 📝 Clean Output | GitHub-flavored Markdown with frontmatter |
| ⚙️ Highly Configurable | Custom content selectors, AI patterns, caching rules |
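
To make the "Markdown with frontmatter" output concrete, here is a minimal sketch of how converted Markdown might be wrapped in a YAML frontmatter block. The `PageMeta` fields and the `withFrontmatter` helper are hypothetical names chosen for illustration; the fields the worker actually emits may differ.

```typescript
// Hypothetical metadata shape; the project's real frontmatter fields may differ.
interface PageMeta {
  title: string;
  source: string; // URL of the original HTML page
}

// JSON string literals are valid YAML double-quoted scalars,
// so colons and quotes inside values stay safe.
function yamlString(value: string): string {
  return JSON.stringify(value);
}

// Prepend a YAML frontmatter block to already-converted Markdown.
function withFrontmatter(meta: PageMeta, markdown: string): string {
  const header = [
    "---",
    `title: ${yamlString(meta.title)}`,
    `source: ${yamlString(meta.source)}`,
    "---",
  ].join("\n");
  return `${header}\n\n${markdown.trim()}\n`;
}

const doc = withFrontmatter(
  { title: "Getting Started", source: "https://docs.example.com/start" },
  "# Getting Started\n\nInstall the CLI first.",
);
console.log(doc);
```

Quoting every value is a deliberately conservative choice: page titles often contain colons, which would otherwise break the YAML.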
- Cloudflare account (free tier works)
- Domain managed by Cloudflare
- Node.js v18+
That's it! 🎉 Your site now serves Markdown to AI crawlers automatically.
Edit wrangler.toml:
Give each site a unique worker name:
Deploy each separately:
AI coding assistants like Cursor and Windsurf use generic HTTP libraries (axios, got, node-fetch, undici) rather than a dedicated AI user-agent. To detect them, set DETECT_HTTP_CLIENTS = "true".
⚠️ WARNING: Generic HTTP clients are also used by regular applications!
When to enable:
- ✅ You control the domain and know your traffic patterns
- ✅ Your docs site has minimal programmatic API access
- ✅ You want to optimize for AI coding assistants like Cursor
When NOT to enable:
- ❌ Your site is accessed by scripts/bots using axios/fetch
- ❌ You have webhooks or API integrations
- ❌ You're unsure about your traffic patterns
Detected patterns when enabled:
- axios/* - Axios HTTP client
- node-fetch/* - Node.js fetch
- got/* - Got HTTP client
- undici/* - Undici HTTP client
Best practice: Start with DETECT_HTTP_CLIENTS = "false" (default), then enable after analyzing your traffic logs.
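
The opt-in behavior described above could look roughly like the following. This is a sketch under assumptions, not the worker's actual code: the `DetectionEnv` interface and `isGenericHttpClient` function are hypothetical, and it assumes the `axios/*`-style patterns match at the start of the User-Agent, case-insensitively.

```typescript
// Prefixes taken from the list above; anchoring at the start and
// ignoring case are assumptions about how the patterns are applied.
const HTTP_CLIENT_PATTERN = /^(axios|node-fetch|got|undici)\//i;

// Hypothetical env shape: Worker vars arrive as strings,
// so "true" has to be compared explicitly.
interface DetectionEnv {
  DETECT_HTTP_CLIENTS?: string;
}

function isGenericHttpClient(userAgent: string | null, env: DetectionEnv): boolean {
  // Off unless explicitly enabled, matching the "false" default.
  if (env.DETECT_HTTP_CLIENTS !== "true") return false;
  if (!userAgent) return false;
  return HTTP_CLIENT_PATTERN.test(userAgent.trim());
}

console.log(isGenericHttpClient("axios/1.6.0", { DETECT_HTTP_CLIENTS: "true" }));
console.log(isGenericHttpClient("axios/1.6.0", { DETECT_HTTP_CLIENTS: "false" }));
```

Keeping the flag check first means a missing or misspelled variable fails safe: generic clients keep receiving HTML.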
1. Detect - Check User-Agent for AI patterns
2. Fetch - Get HTML from origin server
3. Convert - Transform to clean Markdown
4. Cache - Store result at edge
5. Serve - Return optimized content
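
The five steps above can be sketched as a single request handler. Everything here is illustrative: `PipelineDeps`, `PipelineResult`, and `handle` are hypothetical names, a `Map` stands in for the edge cache, and the sketch assumes the cache is consulted before the origin fetch and that non-AI visitors receive the origin HTML unchanged.

```typescript
// Hypothetical dependencies, injected so the flow can be read in isolation.
interface PipelineDeps {
  isAiCrawler(userAgent: string): boolean; // 1. Detect
  fetchHtml(url: string): Promise<string>; // 2. Fetch
  toMarkdown(html: string): string;        // 3. Convert
  cache: Map<string, string>;              // 4. Cache (stand-in for the edge cache)
}

interface PipelineResult {
  contentType: "text/html" | "text/markdown";
  body: string;
  cacheHit: boolean;
}

async function handle(
  url: string,
  userAgent: string,
  deps: PipelineDeps,
): Promise<PipelineResult> {
  // Regular visitors get the origin HTML untouched (assumption).
  if (!deps.isAiCrawler(userAgent)) {
    return { contentType: "text/html", body: await deps.fetchHtml(url), cacheHit: false };
  }

  // Assumed ordering: check the cache before going to the origin.
  const cached = deps.cache.get(url);
  if (cached !== undefined) {
    return { contentType: "text/markdown", body: cached, cacheHit: true };
  }

  // Cache miss: fetch, convert, store, then serve (step 5).
  const markdown = deps.toMarkdown(await deps.fetchHtml(url));
  deps.cache.set(url, markdown);
  return { contentType: "text/markdown", body: markdown, cacheHit: false };
}
```

In a real Worker the converter would be a library such as Turndown and the cache would be Cloudflare's edge cache; both are swapped for simple stand-ins here so the control flow is easy to follow.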

| Provider | User-Agent Patterns | Status |
|---|---|---|
| Anthropic Claude | ClaudeBot, Claude-Web, claude-cli/* | ✅ Supported |
| OpenAI ChatGPT | GPTBot, ChatGPT-User, ChatGPT | ✅ Supported |
| Google AI | Google-Extended, Gemini, GoogleOther | ✅ Supported |
| Microsoft Bing AI | BingPreview, Copilot, Bing.*AI | ✅ Supported |
| Perplexity | PerplexityBot, Perplexity | ✅ Supported |
| Meta AI | Meta-ExternalAgent, FacebookBot | ✅ Supported |
| Apple Intelligence | Applebot-Extended | ✅ Supported |
| You.com | YouBot | ✅ Supported |
| Cohere | Cohere-AI | ✅ Supported |
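
As an illustration of how the patterns in this table might be matched, the sketch below maps each provider to a list of regular expressions and returns the first provider whose pattern appears in the User-Agent. The `AI_PATTERNS` table and `detectAiProvider` function are hypothetical, and substring matching is an assumption. Most patterns ignore case, but `Bing.*AI` is kept case-sensitive here so that it does not match any lowercase "ai" later in the string.

```typescript
// Patterns copied from the table above; how the worker applies them is an assumption.
const AI_PATTERNS: Record<string, RegExp[]> = {
  "Anthropic Claude": [/ClaudeBot/i, /Claude-Web/i, /claude-cli\//i],
  "OpenAI ChatGPT": [/GPTBot/i, /ChatGPT-User/i, /ChatGPT/i],
  "Google AI": [/Google-Extended/i, /Gemini/i, /GoogleOther/i],
  "Microsoft Bing AI": [/BingPreview/i, /Copilot/i, /Bing.*AI/],
  "Perplexity": [/PerplexityBot/i, /Perplexity/i],
  "Meta AI": [/Meta-ExternalAgent/i, /FacebookBot/i],
  "Apple Intelligence": [/Applebot-Extended/i],
  "You.com": [/YouBot/i],
  "Cohere": [/Cohere-AI/i],
};

// Return the provider name for a User-Agent, or null for regular traffic.
function detectAiProvider(userAgent: string | null): string | null {
  if (!userAgent) return null;
  for (const [provider, patterns] of Object.entries(AI_PATTERNS)) {
    if (patterns.some((pattern) => pattern.test(userAgent))) return provider;
  }
  return null;
}

console.log(detectAiProvider("ClaudeBot/1.0"));
console.log(detectAiProvider("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36"));
```

Returning the provider name rather than a boolean makes it easy to log which crawlers actually visit before tuning the pattern list.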

| Metric | Value |
|---|---|
| Cache Hit | < 10ms |
| Cache Miss | 100-200ms |
| Token Reduction | 60-80% |
| Edge Locations | 300+ globally |
| Cost | Free tier (100K requests/day) |
Worker not intercepting requests
- ✅ Check route in wrangler.toml matches your domain
- ✅ Verify DNS points to Cloudflare (orange cloud ☁️)
- ✅ Confirm deployment: wrangler deployments list
- ✅ Check worker status in Cloudflare dashboard

Content not converting correctly
- ✅ Adjust contentSelectors in src/index.ts to match your HTML structure
- ✅ Check removeSelectors aren't too aggressive
- ✅ Enable DEBUG=true and check logs: wrangler tail

Stale content being served
- ✅ Set ENABLE_CACHE=false for debugging
- ✅ Reduce CACHE_TTL for frequently updated content
- ✅ Purge cache in Cloudflare dashboard
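
When debugging the DEBUG, ENABLE_CACHE, and CACHE_TTL settings mentioned above, it helps to remember that Worker variables arrive as strings. The following sketch shows one way they might be parsed into typed settings. The variable names come from this README, but the `RawEnv` and `Settings` shapes, the parsing rules, the assumption that CACHE_TTL is in seconds, and the fallback of 3600 are all illustrative.

```typescript
// Hypothetical binding shape; only the variable names come from this README.
interface RawEnv {
  DEBUG?: string;
  ENABLE_CACHE?: string;
  CACHE_TTL?: string;
}

interface Settings {
  debug: boolean;
  enableCache: boolean;
  cacheTtlSeconds: number;
}

// Illustrative fallback, not the project's documented default.
const DEFAULT_TTL_SECONDS = 3600;

function parseSettings(env: RawEnv): Settings {
  const ttl = Number.parseInt(env.CACHE_TTL ?? "", 10);
  return {
    debug: env.DEBUG === "true",
    // Assume caching stays on unless it is explicitly disabled.
    enableCache: env.ENABLE_CACHE !== "false",
    // Fall back when CACHE_TTL is missing, non-numeric, or not positive.
    cacheTtlSeconds: Number.isFinite(ttl) && ttl > 0 ? ttl : DEFAULT_TTL_SECONDS,
  };
}

// Cache-Control value for a Markdown response under these settings.
function cacheControl(settings: Settings): string {
  return settings.enableCache
    ? `public, max-age=${settings.cacheTtlSeconds}`
    : "no-store";
}

console.log(cacheControl(parseSettings({ CACHE_TTL: "300" })));
console.log(cacheControl(parseSettings({ ENABLE_CACHE: "false" })));
```

Comparing against the literal strings avoids a common pitfall: the string "false" is truthy in JavaScript, so a plain truthiness check would leave caching enabled.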
This project is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
You are free to:
- ✅ Share, copy, and redistribute
- ✅ Adapt, remix, and build upon this work
Under these terms:
- 📝 Attribution — Give appropriate credit
- 🚫 NonCommercial — No commercial use without permission
For commercial licensing, please contact [kevinmcgahey1114 @ gmail.com].
See LICENSE for full details.
- Turndown - HTML to Markdown conversion
- Cloudflare Workers - Edge computing platform
- Inspired by the need to optimize AI assistant interactions