Simple Cloudflare Worker that serves Markdown to AI crawlers


When AI assistants (Claude, ChatGPT, etc.) crawl your documentation sites, they ingest the full HTML, bloat included:

  • Navigation, footers, and ads = 60-80% wasted tokens
  • Harder to parse, slower responses
  • Higher API costs

HTML-to-MD AI automatically detects AI crawlers and serves them clean, optimized Markdown instead.


| Feature | Description |
|---------|-------------|
| 🚀 Edge Deployment | Runs on Cloudflare's global network (300+ locations) |
| 🤖 Smart Detection | Recognizes Claude, ChatGPT, Google AI, Bing AI, Perplexity, and more |
| 💾 Built-in Caching | Fast responses with configurable TTL (< 10ms cache hits) |
| 🎯 Zero Code Changes | Deploy and forget - perfect for static sites |
| 📝 Clean Output | GitHub-flavored Markdown with frontmatter |
| ⚙️ Highly Configurable | Custom content selectors, AI patterns, caching rules |

Prerequisites:

  • Cloudflare account (free tier works)
  • Domain managed by Cloudflare
  • Node.js v18+

Quick start:

```shell
# Clone the repo
git clone https://github.com/thekevinm/html-to-md-ai.git
cd html-to-md-ai/packages/cloudflare-worker

# Install dependencies
npm install

# Configure
cp wrangler.example.toml wrangler.toml
# Edit wrangler.toml - set your domain route

# Login to Cloudflare
npx wrangler login

# Deploy
npm run deploy
```

That's it! 🎉 Your site now serves Markdown to AI crawlers automatically.


Edit wrangler.toml:

name = "html-to-md-ai-worker" route = "docs.yoursite.com/*" [vars] ORIGIN_URL = "" # Leave empty to use request URL CACHE_TTL = "3600" # 1 hour cache ENABLE_CACHE = "true" # Enable caching DEBUG = "false" # Debug logging

Give each site a unique worker name:

```toml
# Site 1
name = "html-to-md-site1"
route = "docs.site1.com/*"

# Site 2 (separate wrangler.toml)
name = "html-to-md-site2"
route = "docs.site2.com/*"
```

Deploy each separately:

```shell
wrangler deploy -c wrangler.toml.site1
wrangler deploy -c wrangler.toml.site2
```

🔧 Advanced: Detecting AI Coding Assistants

AI coding assistants like Cursor and Windsurf use generic HTTP libraries (axios, got, node-fetch, undici) instead of specific AI user-agents. To detect these:

```toml
[vars]
# Detect generic HTTP client user-agents
DETECT_HTTP_CLIENTS = "true"
```

⚠️ WARNING: Generic HTTP clients are also used by regular applications!

When to enable:

  • ✅ You control the domain and know your traffic patterns
  • ✅ Your docs site has minimal programmatic API access
  • ✅ You want to optimize for AI coding assistants like Cursor

When NOT to enable:

  • ❌ Your site is accessed by scripts/bots using axios/fetch
  • ❌ You have webhooks or API integrations
  • ❌ You're unsure about your traffic patterns

Detected patterns when enabled:

  • axios/* - Axios HTTP client
  • node-fetch/* - Node.js fetch
  • got/* - Got HTTP client
  • undici/* - Undici HTTP client

Best practice: Start with DETECT_HTTP_CLIENTS = "false" (default), then enable after analyzing your traffic logs.
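The opt-in check itself can be as simple as a prefix match against the four client libraries listed above. A hypothetical sketch (the pattern list mirrors this README; the actual detector may differ):

```typescript
// User-agent prefixes sent by the generic HTTP clients named in the docs.
// Assumption: version strings follow the "name/version" convention.
const HTTP_CLIENT_PATTERNS: RegExp[] = [
  /^axios\//i,      // Axios HTTP client
  /^node-fetch\//i, // Node.js fetch polyfill
  /^got\//i,        // Got HTTP client
  /^undici\//i,     // Undici HTTP client
];

function isGenericHttpClient(userAgent: string, detectHttpClients: boolean): boolean {
  if (!detectHttpClients) return false; // feature is off by default
  return HTTP_CLIENT_PATTERNS.some((re) => re.test(userAgent));
}
```

Note the guard: with `DETECT_HTTP_CLIENTS = "false"` these requests fall through to normal HTML handling, which is exactly why it is the safe default.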


```mermaid
graph LR
    A[AI Crawler] -->|User-Agent: ClaudeBot| B[Edge Worker]
    B -->|Check Cache| C{Cached?}
    C -->|Yes| D[Return Markdown]
    C -->|No| E[Fetch HTML]
    E --> F[Convert to MD]
    F --> G[Cache Result]
    G --> D
```
  1. Detect - Check User-Agent for AI patterns
  2. Fetch - Get HTML from origin server
  3. Convert - Transform to clean Markdown
  4. Cache - Store result at edge
  5. Serve - Return optimized content
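The branching in the steps above can be reduced to a pure decision function, sketched below for illustration (the real worker wires this into its `fetch()` handler, the Cache API, and the converter; `decide` and `Action` are hypothetical names):

```typescript
// Which path a request takes through the pipeline, given the two checks
// that matter: is it an AI crawler, and is a Markdown copy already cached?
type Action = "pass-through" | "serve-cached-markdown" | "convert-and-cache";

function decide(isAIBot: boolean, cacheHit: boolean): Action {
  if (!isAIBot) return "pass-through";          // step 1: browsers get origin HTML
  if (cacheHit) return "serve-cached-markdown"; // cached edge copy, < 10ms
  return "convert-and-cache";                   // steps 2-5: fetch, convert, cache, serve
}
```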

| AI Tool | User-Agent Patterns | Status |
|---------|---------------------|--------|
| Anthropic Claude | `ClaudeBot`, `Claude-Web`, `claude-cli/*` | ✅ Supported |
| OpenAI ChatGPT | `GPTBot`, `ChatGPT-User`, `ChatGPT` | ✅ Supported |
| Google AI | `Google-Extended`, `Gemini`, `GoogleOther` | ✅ Supported |
| Microsoft Bing AI | `BingPreview`, `Copilot`, `Bing.*AI` | ✅ Supported |
| Perplexity | `PerplexityBot`, `Perplexity` | ✅ Supported |
| Meta AI | `Meta-ExternalAgent`, `FacebookBot` | ✅ Supported |
| Apple Intelligence | `Applebot-Extended` | ✅ Supported |
| You.com | `YouBot` | ✅ Supported |
| Cohere | `Cohere-AI` | ✅ Supported |
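A detection function over these patterns might look like the sketch below. The regex list mirrors the table above; the project's actual `detector.ts` may organize this differently, and `detectAIBot` is an illustrative name:

```typescript
// Maps each supported AI tool to the user-agent patterns from the docs.
// The returned name is what ends up in the X-AI-Bot response header.
const AI_BOTS: Array<[name: string, pattern: RegExp]> = [
  ["Claude", /ClaudeBot|Claude-Web|claude-cli\//i],
  ["ChatGPT", /GPTBot|ChatGPT-User|ChatGPT/i],
  ["Google AI", /Google-Extended|Gemini|GoogleOther/i],
  ["Bing AI", /BingPreview|Copilot|Bing.*AI/i],
  ["Perplexity", /PerplexityBot|Perplexity/i],
  ["Meta AI", /Meta-ExternalAgent|FacebookBot/i],
  ["Apple Intelligence", /Applebot-Extended/i],
  ["You.com", /YouBot/i],
  ["Cohere", /Cohere-AI/i],
];

// Returns the matching bot name, or null for ordinary browsers.
function detectAIBot(userAgent: string): string | null {
  for (const [name, pattern] of AI_BOTS) {
    if (pattern.test(userAgent)) return name;
  }
  return null;
}
```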

```shell
# Regular browser (should get HTML)
curl https://your-site.com/docs

# AI crawler (should get Markdown)
curl -H "User-Agent: ClaudeBot" https://your-site.com/docs

# Check response headers
curl -I -H "User-Agent: ClaudeBot" https://your-site.com/docs
# Expected: Content-Type: text/markdown; X-AI-Bot: Claude
```
Local development:

```shell
cd packages/cloudflare-worker
npm run dev

# Test in another terminal
curl -H "User-Agent: claude-cli/1.0" http://localhost:8787/
```

| Metric | Value |
|--------|-------|
| Cache Hit | < 10ms |
| Cache Miss | 100-200ms |
| Token Reduction | 60-80% |
| Edge Locations | 300+ globally |
| Cost | Free tier (100K requests/day) |

```
html-to-md-ai/
├── packages/
│   ├── core/                  # Detection & conversion library
│   │   ├── src/
│   │   │   ├── detector.ts    # AI user-agent detection
│   │   │   ├── converter.ts   # HTML → Markdown
│   │   │   └── index.ts
│   │   └── package.json
│   │
│   └── cloudflare-worker/     # Main edge function ⭐
│       ├── src/
│       │   └── index.ts       # Worker implementation
│       ├── wrangler.toml      # Your config (gitignored)
│       ├── wrangler.example.toml
│       └── README.md
│
├── examples/
│   └── docusaurus/            # Docusaurus integration
│
├── README.md                  # This file
├── CHANGELOG.md               # Version history
└── LICENSE                    # CC BY-NC 4.0
```

Worker not intercepting requests
  • ✅ Check route in wrangler.toml matches your domain
  • ✅ Verify DNS points to Cloudflare (orange cloud ☁️)
  • ✅ Confirm deployment: wrangler deployments list
  • ✅ Check worker status in Cloudflare dashboard
Incomplete Markdown output
  • ✅ Adjust contentSelectors in src/index.ts to match your HTML structure
  • ✅ Check removeSelectors aren't too aggressive
  • ✅ Enable DEBUG=true and check logs: wrangler tail
Cache issues
  • ✅ Set ENABLE_CACHE=false for debugging
  • ✅ Reduce CACHE_TTL for frequently updated content
  • ✅ Purge cache in Cloudflare dashboard

This project is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

You are free to:

  • ✅ Share, copy, and redistribute
  • ✅ Adapt, remix, and build upon this work

Under these terms:

  • 📝 Attribution — Give appropriate credit
  • 🚫 NonCommercial — No commercial use without permission

For commercial licensing, please contact kevinmcgahey1114@gmail.com.

See LICENSE for full details.


  • Turndown - HTML to Markdown conversion
  • Cloudflare Workers - Edge computing platform
  • Inspired by the need to optimize AI assistant interactions
