Simple Cloudflare Worker that serves Markdown to AI crawlers


When AI assistants (Claude, ChatGPT, etc.) crawl your documentation sites, they ingest the full HTML, bloat included:

  • Navigation, footers, and ads = 60-80% wasted tokens
  • Harder to parse, slower responses
  • Higher API costs

HTML-to-MD AI automatically detects AI crawlers and serves them clean, optimized Markdown instead.


| Feature | Description |
|---------|-------------|
| 🚀 Edge Deployment | Runs on Cloudflare's global network (300+ locations) |
| 🤖 Smart Detection | Recognizes Claude, ChatGPT, Google AI, Bing AI, Perplexity, and more |
| 💾 Built-in Caching | Fast responses with configurable TTL (< 10ms cache hits) |
| 🎯 Zero Code Changes | Deploy and forget - perfect for static sites |
| 📝 Clean Output | GitHub-flavored Markdown with frontmatter |
| ⚙️ Highly Configurable | Custom content selectors, AI patterns, caching rules |

Prerequisites:

  • Cloudflare account (free tier works)
  • Domain managed by Cloudflare
  • Node.js v18+

Quick start:

```shell
# Clone the repo
git clone https://github.com/thekevinm/html-to-md-ai.git
cd html-to-md-ai/packages/cloudflare-worker

# Install dependencies
npm install

# Configure
cp wrangler.example.toml wrangler.toml
# Edit wrangler.toml - set your domain route

# Login to Cloudflare
npx wrangler login

# Deploy
npm run deploy
```

That's it! 🎉 Your site now serves Markdown to AI crawlers automatically.


Edit wrangler.toml:

name = "html-to-md-ai-worker" route = "docs.yoursite.com/*" [vars] ORIGIN_URL = "" # Leave empty to use request URL CACHE_TTL = "3600" # 1 hour cache ENABLE_CACHE = "true" # Enable caching DEBUG = "false" # Debug logging

Give each site a unique worker name:

```toml
# Site 1
name = "html-to-md-site1"
route = "docs.site1.com/*"

# Site 2 (separate wrangler.toml)
name = "html-to-md-site2"
route = "docs.site2.com/*"
```

Deploy each separately:

```shell
wrangler deploy -c wrangler.toml.site1
wrangler deploy -c wrangler.toml.site2
```

🔧 Advanced: Detecting AI Coding Assistants

AI coding assistants like Cursor and Windsurf use generic HTTP libraries (axios, got, node-fetch, undici) instead of specific AI user-agents. To detect these:

```toml
[vars]
# Detect generic HTTP client user-agents
DETECT_HTTP_CLIENTS = "true"
```

⚠️ WARNING: Generic HTTP clients are also used by regular applications!

When to enable:

  • ✅ You control the domain and know your traffic patterns
  • ✅ Your docs site has minimal programmatic API access
  • ✅ You want to optimize for AI coding assistants like Cursor

When NOT to enable:

  • ❌ Your site is accessed by scripts/bots using axios/fetch
  • ❌ You have webhooks or API integrations
  • ❌ You're unsure about your traffic patterns

Detected patterns when enabled:

  • axios/* - Axios HTTP client
  • node-fetch/* - Node.js fetch
  • got/* - Got HTTP client
  • undici/* - Undici HTTP client

Best practice: Start with DETECT_HTTP_CLIENTS = "false" (default), then enable after analyzing your traffic logs.
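The opt-in check itself can be as simple as a prefix match against the four client libraries listed above. A hypothetical sketch (the pattern list mirrors this README; the actual detector may differ):

```typescript
// User-agent prefixes sent by the generic HTTP clients named in the docs.
// Assumption: version strings follow the "name/version" convention.
const HTTP_CLIENT_PATTERNS: RegExp[] = [
  /^axios\//i,      // Axios HTTP client
  /^node-fetch\//i, // Node.js fetch polyfill
  /^got\//i,        // Got HTTP client
  /^undici\//i,     // Undici HTTP client
];

function isGenericHttpClient(userAgent: string, detectHttpClients: boolean): boolean {
  if (!detectHttpClients) return false; // feature is off by default
  return HTTP_CLIENT_PATTERNS.some((re) => re.test(userAgent));
}
```

Note the guard: with `DETECT_HTTP_CLIENTS = "false"` these requests fall through to normal HTML handling, which is exactly why it is the safe default.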


```mermaid
graph LR
    A[AI Crawler] -->|User-Agent: ClaudeBot| B[Edge Worker]
    B -->|Check Cache| C{Cached?}
    C -->|Yes| D[Return Markdown]
    C -->|No| E[Fetch HTML]
    E --> F[Convert to MD]
    F --> G[Cache Result]
    G --> D
```
  1. Detect - Check User-Agent for AI patterns
  2. Fetch - Get HTML from origin server
  3. Convert - Transform to clean Markdown
  4. Cache - Store result at edge
  5. Serve - Return optimized content
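The branching in the steps above can be reduced to a pure decision function, sketched below for illustration (the real worker wires this into its `fetch()` handler, the Cache API, and the converter; `decide` and `Action` are hypothetical names):

```typescript
// Which path a request takes through the pipeline, given the two checks
// that matter: is it an AI crawler, and is a Markdown copy already cached?
type Action = "pass-through" | "serve-cached-markdown" | "convert-and-cache";

function decide(isAIBot: boolean, cacheHit: boolean): Action {
  if (!isAIBot) return "pass-through";          // step 1: browsers get origin HTML
  if (cacheHit) return "serve-cached-markdown"; // cached edge copy, < 10ms
  return "convert-and-cache";                   // steps 2-5: fetch, convert, cache, serve
}
```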

| AI Tool | User-Agent Patterns | Status |
|---------|---------------------|--------|
| Anthropic Claude | `ClaudeBot`, `Claude-Web`, `claude-cli/*` | ✅ Supported |
| OpenAI ChatGPT | `GPTBot`, `ChatGPT-User`, `ChatGPT` | ✅ Supported |
| Google AI | `Google-Extended`, `Gemini`, `GoogleOther` | ✅ Supported |
| Microsoft Bing AI | `BingPreview`, `Copilot`, `Bing.*AI` | ✅ Supported |
| Perplexity | `PerplexityBot`, `Perplexity` | ✅ Supported |
| Meta AI | `Meta-ExternalAgent`, `FacebookBot` | ✅ Supported |
| Apple Intelligence | `Applebot-Extended` | ✅ Supported |
| You.com | `YouBot` | ✅ Supported |
| Cohere | `Cohere-AI` | ✅ Supported |
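A detection function over these patterns might look like the sketch below. The regex list mirrors the table above; the project's actual `detector.ts` may organize this differently, and `detectAIBot` is an illustrative name:

```typescript
// Maps each supported AI tool to the user-agent patterns from the docs.
// The returned name is what ends up in the X-AI-Bot response header.
const AI_BOTS: Array<[name: string, pattern: RegExp]> = [
  ["Claude", /ClaudeBot|Claude-Web|claude-cli\//i],
  ["ChatGPT", /GPTBot|ChatGPT-User|ChatGPT/i],
  ["Google AI", /Google-Extended|Gemini|GoogleOther/i],
  ["Bing AI", /BingPreview|Copilot|Bing.*AI/i],
  ["Perplexity", /PerplexityBot|Perplexity/i],
  ["Meta AI", /Meta-ExternalAgent|FacebookBot/i],
  ["Apple Intelligence", /Applebot-Extended/i],
  ["You.com", /YouBot/i],
  ["Cohere", /Cohere-AI/i],
];

// Returns the matching bot name, or null for ordinary browsers.
function detectAIBot(userAgent: string): string | null {
  for (const [name, pattern] of AI_BOTS) {
    if (pattern.test(userAgent)) return name;
  }
  return null;
}
```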

```shell
# Regular browser (should get HTML)
curl https://your-site.com/docs

# AI crawler (should get Markdown)
curl -H "User-Agent: ClaudeBot" https://your-site.com/docs

# Check response headers
curl -I -H "User-Agent: ClaudeBot" https://your-site.com/docs
# Expected: Content-Type: text/markdown; X-AI-Bot: Claude
```
Local development:

```shell
cd packages/cloudflare-worker
npm run dev

# Test in another terminal
curl -H "User-Agent: claude-cli/1.0" http://localhost:8787/
```

| Metric | Value |
|--------|-------|
| Cache Hit | < 10ms |
| Cache Miss | 100-200ms |
| Token Reduction | 60-80% |
| Edge Locations | 300+ globally |
| Cost | Free tier (100K requests/day) |

```
html-to-md-ai/
├── packages/
│   ├── core/                  # Detection & conversion library
│   │   ├── src/
│   │   │   ├── detector.ts    # AI user-agent detection
│   │   │   ├── converter.ts   # HTML → Markdown
│   │   │   └── index.ts
│   │   └── package.json
│   │
│   └── cloudflare-worker/     # Main edge function ⭐
│       ├── src/
│       │   └── index.ts       # Worker implementation
│       ├── wrangler.toml      # Your config (gitignored)
│       ├── wrangler.example.toml
│       └── README.md
│
├── examples/
│   └── docusaurus/            # Docusaurus integration
│
├── README.md                  # This file
├── CHANGELOG.md               # Version history
└── LICENSE                    # CC BY-NC 4.0
```

Worker not intercepting requests
  • ✅ Check route in wrangler.toml matches your domain
  • ✅ Verify DNS points to Cloudflare (orange cloud ☁️)
  • ✅ Confirm deployment: wrangler deployments list
  • ✅ Check worker status in Cloudflare dashboard
Incomplete Markdown output
  • ✅ Adjust contentSelectors in src/index.ts to match your HTML structure
  • ✅ Check removeSelectors aren't too aggressive
  • ✅ Enable DEBUG=true and check logs: wrangler tail
Cache issues
  • ✅ Set ENABLE_CACHE=false for debugging
  • ✅ Reduce CACHE_TTL for frequently updated content
  • ✅ Purge cache in Cloudflare dashboard

This project is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

You are free to:

  • ✅ Share, copy, and redistribute
  • ✅ Adapt, remix, and build upon this work

Under these terms:

  • 📝 Attribution — Give appropriate credit
  • 🚫 NonCommercial — No commercial use without permission

For commercial licensing, please contact kevinmcgahey1114@gmail.com.

See LICENSE for full details.


  • Turndown - HTML to Markdown conversion
  • Cloudflare Workers - Edge computing platform
  • Inspired by the need to optimize AI assistant interactions
