LLMs.txt Generator with Automated Monitoring

An intelligent web application that generates llms.txt files for websites and automatically monitors them for changes. This tool follows the llms.txt specification to create AI-friendly documentation files that help Large Language Models better understand website content.

🆕 Improved Documentation Structure

This README has been restructured for better user experience:

  • 📋 Grouped sections: Related topics are now organized together
  • 🎬 Quick Demo: Try it in 30 seconds with visual examples
  • 🚀 Progressive flow: Getting Started → Architecture → Advanced Topics
  • 🔧 Consolidated config: All settings in one comprehensive section
  • 📚 Better navigation: Grouped table of contents for easier browsing

Core Generation

  • Intelligent Website Crawling: Automatically discovers and analyzes website pages
  • AI-Enhanced Content: Uses OpenAI to improve descriptions and organization
  • Smart Categorization: Dynamic section organization based on content themes
  • Dual File Generation: Creates both llms.txt (curated) and llms-full.txt (comprehensive)
  • Existing File Detection: Automatically uses existing llms.txt files when found

Automated Monitoring (NEW!)

  • 🔄 Smart Change Detection: Monitors website structure changes automatically
  • 📅 Flexible Scheduling: From hourly to weekly check intervals
  • 🎯 Intelligent Updates: Only regenerates when significant changes detected
  • 📊 Change Analytics: Detailed reports on what changed and why
  • 🤖 Auto-scaling AI: Processing scales with website size

Web Interface

  • Beautiful UI: Responsive design built with Next.js and Tailwind CSS
  • Real-time Progress: Live feedback during crawling and generation
  • Monitoring Dashboard: Comprehensive interface for managing automated updates
  • Instant Downloads: Direct download of generated files

🎬 Quick Demo

# Clone and start (automated setup)
git clone <your-repo-url> && cd llm_txt_creator
./start.sh

# Open browser to http://localhost:3000
# Enter a website URL (e.g., https://docs.anthropic.com)
# Click "Generate llms.txt" and watch the magic! ✨

  1. Real-time crawling: Pages discovered and analyzed live
  2. AI enhancement: Content improved and categorized automatically
  3. Dual outputs: Both curated and comprehensive versions
  4. Monitoring setup: Add sites for automatic updates

Example output (llms.txt):

# Anthropic Documentation

> AI safety company building reliable, interpretable, and steerable AI systems

## Getting Started
- Quickstart Guide - Essential setup and first API calls
- Authentication - API key setup and security best practices

## API Reference
- Messages API - Core conversational AI interface
- Streaming - Real-time response handling
...

Tech Stack

Frontend:

  • Next.js 15 - React framework with App Router
  • TypeScript - Type-safe development
  • Tailwind CSS - Modern styling
  • Lucide React - Beautiful icons

Backend:

  • Vercel Functions - Serverless Python functions for production
  • FastAPI - Local development server with hot reload
  • OpenAI GPT-4 - AI-enhanced content processing
  • aiohttp - Async HTTP client for web crawling
  • BeautifulSoup4 - HTML parsing and content extraction (see the crawl sketch after this list)

Monitoring:

  • Vercel Cron Jobs - Automatic scheduling every 6 hours
  • Change Detection - Structure fingerprinting and diff analysis
  • Smart Thresholds - Updates only for significant changes (5%+)
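
As a rough illustration of how the aiohttp and BeautifulSoup4 pieces above typically fit together in a crawler like this, here is a minimal, hypothetical sketch (not the project's actual code):

import asyncio
from urllib.parse import urljoin, urlparse

import aiohttp
from bs4 import BeautifulSoup

# Hypothetical minimal crawler: fetch pages, record url + title,
# and follow same-domain links until max_pages is reached.
async def crawl(start_url, max_pages=20):
    seen, queue, pages = set(), [start_url], []
    root = urlparse(start_url).netloc
    async with aiohttp.ClientSession() as session:
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            async with session.get(url) as resp:
                html = await resp.text()
            soup = BeautifulSoup(html, "html.parser")
            title = soup.title.get_text(strip=True) if soup.title else url
            pages.append({"url": url, "title": title})
            # Queue same-domain links for later visits
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"]).split("#")[0]
                if urlparse(link).netloc == root:
                    queue.append(link)
    return pages

# Example: asyncio.run(crawl("https://docs.anthropic.com", max_pages=20))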

Project Structure

llm_txt_creator/              # Root project directory
├── app/                      # 🎨 FRONTEND (Next.js)
│   ├── page.tsx              # Main generator interface
│   ├── monitor/              # Monitoring dashboard
│   │   └── page.tsx          # /monitor route
│   ├── layout.tsx            # App-wide layout
│   └── globals.css           # Global styles
├── api/                      # ☁️ PRODUCTION API (Vercel Functions)
│   ├── generate.py           # Main generation endpoint
│   ├── scheduler.py          # Monitoring management
│   ├── cron.py               # Automated scheduling
│   └── health.py             # Health checks
├── backend/                  # 🔧 DEVELOPMENT API (FastAPI)
│   ├── main.py               # Main API (equivalent to api/generate.py)
│   ├── scheduler.py          # Scheduler service (equivalent to api/scheduler.py)
│   └── run_dev.py            # Development server runner
├── vercel.json               # Deployment config with cron jobs
├── package.json              # Frontend dependencies
└── requirements.txt          # Python dependencies

How The Two APIs Work Together

Development Mode (Local):

Frontend (localhost:3000)
        ↓
Backend FastAPI (localhost:8000) - Main API
        ↓
Backend FastAPI (localhost:8001) - Scheduler API

Production Mode (Vercel):

Frontend (yourapp.vercel.app)
        ↓
Vercel Functions (/api/generate, /api/scheduler, /api/cron)
        ↓
Automatic Cron Jobs (every 6 hours)

Prerequisites

  • Node.js 18+ (for frontend)
  • Python 3.9+ (for local backend development)
  • OpenAI API Key (for AI enhancement)
  • Vercel CLI (for deployment)
  1. Clone the repository

    git clone <your-repo-url>
    cd llm_txt_creator
  2. Install dependencies

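    With the package.json and requirements.txt shown in the project tree, a standard install is:

    npm install
    pip install -r requirements.txt

    The ./start.sh script in step 4 also installs the backend dependencies for you.
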
  3. Set up environment variables

    Edit .env with your OpenAI API key:

    NEXT_PUBLIC_API_URL=http://localhost:8000
    OPENAI_API_KEY=your_openai_api_key_here
  4. Start development servers

    Option A: Use the automated start script (Recommended)
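
    ./start.sh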

    This script automatically:

    • Creates Python virtual environment if needed
    • Installs all backend dependencies
    • Starts both FastAPI servers (ports 8000 & 8001)
    • Starts Next.js development server (port 3000)
    • Provides clear status messages and error handling

    Option B: Use the convenience script

    cd backend
    python run_dev.py

    Option C: Manual startup

    # Terminal 1 - Main API
    cd backend && python -m uvicorn main:app --host 0.0.0.0 --port 8000 --reload

    # Terminal 2 - Scheduler Service
    cd backend && python -m uvicorn scheduler:scheduler_app --host 0.0.0.0 --port 8001 --reload

    # Terminal 3 - Frontend
    npm run dev
  5. Open your browser at http://localhost:3000

Basic Usage

  1. Enter Website URL: Input the URL you want to analyze
  2. Configure Settings: Choose maximum pages to crawl (10-100)
  3. Generate Files: Click "Generate llms.txt" and wait for processing
  4. Download Results: Download both llms.txt and llms-full.txt files
  5. Review Analysis: View the pages analyzed and their importance scores

🔄 Automated Monitoring (NEW!)

Adding Sites to Monitoring

  1. Navigate to /monitor page
  2. Enter website URL (e.g., https://docs.anthropic.com)
  3. Choose check interval (recommended: 24 hours)
  4. Select max pages to crawl (recommended: 20 pages)
  5. Click "Add to Monitoring"

Change Detection:

  • Creates "fingerprints" of website structure (URLs, titles, sections) - see the sketch after this list
  • Detects new pages, removed pages, and modified content
  • Calculates change severity: Major (50%+), Moderate (20%+), Minor (5%+)
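
A minimal sketch of the fingerprint-and-diff idea (hypothetical helper functions, assuming each crawled page is a dict with url, title, and section keys; not the project's actual ChangeDetector):

def fingerprint(pages):
    # Map each URL to the (title, section) pair that describes it
    return {p["url"]: (p["title"], p.get("section", "")) for p in pages}

def diff(old, new):
    new_pages = sorted(set(new) - set(old))
    removed_pages = sorted(set(old) - set(new))
    modified_pages = [u for u in set(old) & set(new) if old[u] != new[u]]
    # The overall change ratio feeds the severity levels listed above
    changed = len(new_pages) + len(removed_pages) + len(modified_pages)
    return {
        "new_pages": new_pages,
        "removed_pages": removed_pages,
        "modified_pages": modified_pages,
        "change_ratio": changed / max(len(old), 1),
    }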

Smart Updates:

  • Only regenerates llms.txt when changes are significant (5%+ threshold)
  • AI processing scales with site size to prevent timeouts
  • Detailed change reports show exactly what changed

Automatic Scheduling:

  • Production: Cron jobs run every 6 hours automatically
  • Configurable: Set custom intervals from hourly to weekly
  • Manual Override: Force immediate checks anytime

Local development API examples:

# Generate llms.txt
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.anthropic.com",
    "max_pages": 20,
    "depth_limit": 3
  }'

# Add site to monitoring
curl -X POST "http://localhost:8001/scheduler" \
  -H "Content-Type: application/json" \
  -d '{
    "action": "add_site",
    "url": "https://docs.anthropic.com",
    "max_pages": 20,
    "check_interval": 86400
  }'

# Check for updates
curl -X POST "http://localhost:8001/scheduler" \
  -H "Content-Type: application/json" \
  -d '{"action": "check_updates"}'

Production API examples:

# Generate llms.txt
curl -X POST "https://your-app.vercel.app/api/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.anthropic.com",
    "max_pages": 20,
    "depth_limit": 3
  }'

# Monitoring endpoints
curl -X POST "https://your-app.vercel.app/api/scheduler" \
  -H "Content-Type: application/json" \
  -d '{
    "action": "add_site",
    "url": "https://docs.anthropic.com",
    "max_pages": 20,
    "check_interval": 86400
  }'

Local Development (.env)

# For local development (.env)
NEXT_PUBLIC_API_URL=http://localhost:8000

# OpenAI API key for AI enhancement (required)
OPENAI_API_KEY=your_openai_api_key_here

Production (Vercel Dashboard)

OPENAI_API_KEY=your_openai_api_key_here

# NEXT_PUBLIC_API_URL is detected automatically in production

Generation options (see the example call after this list):

  • max_pages: Maximum number of pages to crawl (default: 20)
  • depth_limit: Maximum crawl depth from the root URL (default: 3)
  • check_interval: Monitoring interval in seconds (default: 86400 = 24 hours)
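
The same options can also be passed from Python; a minimal sketch against the local /generate endpoint shown earlier (using the requests library and assuming a JSON response):

import requests

# Call the local generation endpoint with explicit options
# (endpoint and fields as in the curl examples above).
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "url": "https://docs.anthropic.com",
        "max_pages": 20,     # default: 20
        "depth_limit": 3,    # default: 3
    },
    timeout=600,
)
print(response.json())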

When an OpenAI API key is provided (see the sketch after this list):

  • Enhanced Descriptions: AI-improved page descriptions
  • Smart Categorization: Dynamic section organization
  • Content Cleanup: Removes redundancy and improves clarity
  • Scalable Processing: Adjusts AI usage based on website size
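
As a hedged sketch of the kind of call the enhancement step could make with the official openai Python client (hypothetical prompt and helper, not the project's actual code):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical enhancement helper: rewrite a crawled page's description.
def enhance_description(title, raw_text):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Write one concise llms.txt description for this page."},
            {"role": "user",
             "content": f"Title: {title}\n\nContent excerpt:\n{raw_text[:2000]}"},
        ],
    )
    return response.choices[0].message.content.strip()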

Monitoring Best Practices

Site types:

  • Documentation sites: Perfect for monitoring (docs., developers.)
  • News sites: Good for content updates (moderate frequency)
  • Large sites: Use smaller page limits (10-20 pages)

Recommended check intervals:

  • Critical docs: Every 6-12 hours
  • Regular updates: Daily (24 hours) - Recommended
  • Stable sites: Every 3 days
  • Archive sites: Weekly

AI processing by crawl size (see the sketch after this list):

  • Small crawls (≤20 pages): Full AI enhancement
  • Medium crawls (21-50 pages): AI with 8 pages max per section
  • Large crawls (51-100 pages): AI limited to 5 pages per section
  • Very large crawls (>100 pages): No AI enhancement (prevents timeouts)
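
A minimal sketch of such a size-based cutoff (hypothetical helper; thresholds mirror the tiers above):

def ai_budget(page_count):
    # Thresholds from the tiers listed above (hypothetical helper)
    if page_count <= 20:
        return {"use_ai": True, "max_pages_per_section": None}   # full enhancement
    if page_count <= 50:
        return {"use_ai": True, "max_pages_per_section": 8}
    if page_count <= 100:
        return {"use_ai": True, "max_pages_per_section": 5}
    return {"use_ai": False, "max_pages_per_section": 0}          # skip AI to avoid timeouts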

Production Deployment with Vercel

The project includes automated deployment to Vercel with cron job scheduling.

Quick Deploy:
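
./deploy-vercel.sh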

Manual Deploy:

# Login to Vercel
vercel login

# Deploy to production
vercel --prod

⚠️ Important Notes:

  • Vercel Free Plan: Function timeout limited to 60 seconds max, cron jobs limited to daily frequency
  • Vercel Pro Plan: Function timeout can be up to 300 seconds, unlimited cron frequency
  • For large websites (>50 pages), consider upgrading to Pro plan or use local development
  • Free plan: Cron jobs run daily at 12:00 PM UTC
  • Pro plan: Can run every 6 hours or any custom schedule

Environment Variables Required:

# Set in Vercel dashboard or via CLI
vercel env add OPENAI_API_KEY

The vercel.json includes:

{
  "functions": {
    "api/generate.py": { "maxDuration": 300 },
    "api/scheduler.py": { "maxDuration": 600 },
    "api/cron.py": { "maxDuration": 900 }
  },
  "crons": [
    { "path": "/api/cron", "schedule": "0 */6 * * *" }
  ]
}

Test Local Development

# Test main API
curl http://localhost:8000/health

# Test scheduler API
curl http://localhost:8001/cron

# Test frontend
open http://localhost:3000

Test Production Deployment

# Test Vercel functions
curl https://your-app.vercel.app/api/health

# Test monitoring dashboard
open https://your-app.vercel.app/monitor

📊 Understanding Change Detection

What gets detected:

  • New pages: Recently added documentation or content
  • Removed pages: Deleted or moved content
  • Modified pages: Title changes, section reassignments
  • Structural changes: Navigation reorganization, new product areas

Severity levels (a classification sketch follows the example report below):

  • Major (50%+): Large restructures, new product launches → Always update
  • Moderate (20%+): New documentation sections → Always update
  • Minor (5%+): New pages, title changes → Always update
  • Minimal (<5%): Minor tweaks → Skip update (prevents noise)

Example change report:

{
  "severity": "moderate",
  "new_pages": [
    "https://docs.example.com/new-api-guide",
    "https://docs.example.com/beta-features"
  ],
  "removed_pages": [
    "https://docs.example.com/deprecated-api"
  ],
  "modified_pages": [
    {
      "url": "https://docs.example.com/quickstart",
      "old_title": "Quick Start",
      "new_title": "Getting Started Guide",
      "old_section": "Documentation",
      "new_section": "Getting Started"
    }
  ]
}
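
Mapping a change ratio onto these severity levels is a simple threshold check; a minimal sketch (hypothetical helper, thresholds from the severity list above):

def classify(change_ratio):
    # Thresholds from the severity levels listed above (hypothetical helper)
    if change_ratio >= 0.50:
        return "major", True
    if change_ratio >= 0.20:
        return "moderate", True
    if change_ratio >= 0.05:
        return "minor", True
    return "minimal", False   # below the 5% threshold: skip the update

severity, should_update = classify(0.22)   # -> ("moderate", True)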

Vercel function limits:

  • Execution Time: 10 seconds (Hobby), 60 seconds (Pro), 900 seconds (cron)
  • Memory: Up to 1024MB
  • Payload Size: 4.5MB request/response limit

Built-in optimizations:

  • Concurrent checks: System handles multiple sites efficiently
  • Smart scheduling: Only checks sites when intervals are due (see the sketch after this list)
  • Change thresholds: Prevents unnecessary regeneration
  • Timeout management: Graceful degradation for large sites

Best practices:

  • Use appropriate page limits: See Configuration for recommendations
  • Monitor function execution times: Check Vercel dashboard for performance metrics
  • Consider Pro plan: For larger sites requiring longer execution times
  • Batch monitoring: System automatically batches multiple site checks efficiently
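
A rough sketch of the due-check behind smart scheduling (hypothetical data model; each monitored site is assumed to carry last_checked and check_interval values, matching the scheduler API fields shown earlier):

import time

# Hypothetical scheduler helper: return only the sites whose check
# interval has elapsed since they were last checked.
def sites_due(sites, now=None):
    now = now or time.time()
    return [
        s for s in sites
        if now - s.get("last_checked", 0) >= s.get("check_interval", 86400)
    ]

# Example: a site checked an hour ago on a 24-hour interval is not due yet.
sites = [{"url": "https://docs.anthropic.com",
          "last_checked": time.time() - 3600,
          "check_interval": 86400}]
print(sites_due(sites))   # -> []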

"Failed to fetch" errors locally:

  • Check that .env has NEXT_PUBLIC_API_URL=http://localhost:8000
  • Ensure backend servers are running on ports 8000 and 8001

"Site not being monitored":

  • Add the site first using the monitor interface
  • Check the URL format (include https://)

"No changes detected but site updated":

  • Check if changes are below 5% threshold
  • Force manual check to see latest status
  • Consider if changes are in content vs. structure

"Update failed":

  • Check if the website is accessible
  • Verify the site doesn't block crawlers
  • Look for SSL/security issues

Debug mode:

export DEBUG=1
export LOG_LEVEL=DEBUG
python run_dev.py

Future enhancements:

  • Database persistence: Store monitoring data permanently
  • Email notifications: Alert when sites update
  • Webhook integration: Push updates to external systems
  • Advanced scheduling: Per-site custom schedules
  • Change analytics: Track patterns and trends
  • Team collaboration: Shared monitoring dashboards

Key areas for contribution:

  • ChangeDetector: Improve change detection algorithms
  • AutoUpdater: Add new notification methods
  • LLMSTxtGenerator: Enhance content organization
  • Frontend: Better visualization and management tools

How to contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Test locally with python run_dev.py
  4. Test monitoring features on /monitor page
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

This project is licensed under the MIT License - see the LICENSE file for details.

  • Jeremy Howard for proposing the llms.txt standard
  • llmstxt.org for the specification
  • The open source community for the amazing tools used in this project

If you encounter any issues or have questions:

  1. Check this README for common solutions
  2. Review the API documentation at /docs endpoints (local development)
  3. Create an issue with detailed information
  4. Include error messages and steps to reproduce
  5. Mention whether you're running locally or on Vercel

Quick reference commands:

# Start development (automated script - recommended)
./start.sh

# Start development (convenience script)
cd backend && python run_dev.py

# Deploy to production (automated script - recommended)
./deploy-vercel.sh

# Deploy to production (manual)
vercel --prod

# Test APIs locally
curl http://localhost:8000/health
curl http://localhost:8001/cron

# Add site to monitoring
# Visit /monitor page or use API directly

# Local development (.env)
NEXT_PUBLIC_API_URL=http://localhost:8000
OPENAI_API_KEY=your_openai_api_key_here

# Production (Vercel Dashboard)
OPENAI_API_KEY=your_openai_api_key_here

⏰ Automated Monitoring Setup

Production Setup (Vercel) - Automatic

Good News: Cron jobs are automatically configured when you deploy to Vercel! 🎉

  1. Deploy to Vercel (using either method above)

    ./deploy-vercel.sh
    # or
    vercel --prod
  2. Cron jobs are automatically enabled:

    • Free Plan: Runs daily at 12:00 PM UTC (0 12 * * *)
    • Pro Plan: Can run every 6 hours (0 */6 * * *) or custom schedule
    • ✅ Checks all monitored sites for changes
    • ✅ Updates llms.txt files when significant changes detected
    • ✅ 60-second execution limit (Free) or 300+ seconds (Pro)
  3. Verify cron is working:

    # Check cron endpoint manually
    curl https://your-app.vercel.app/api/cron

    # Check Vercel dashboard
    # Go to: Project → Functions → View function logs
  4. Monitor cron activity:

    • Visit your app's /monitor page
    • Check "Last Update" timestamps
    • Look for "Auto-updated" entries in the monitoring dashboard

For local development, you can simulate cron behavior:

Option A: Manual cron trigger

# Trigger cron check manually
curl http://localhost:8001/cron

# Or visit in browser
open http://localhost:8001/cron

Option B: Set up local cron (macOS/Linux)

# Edit your crontab
crontab -e

# Add this line to run every hour during development
0 * * * * curl -s http://localhost:8001/cron >/dev/null 2>&1

# Or every 6 hours to match production
0 */6 * * * curl -s http://localhost:8001/cron >/dev/null 2>&1

Option C: Use a cron service

# Install a cron alternative like 'node-cron' for local development
npm install node-cron

# Create a simple Node.js script for local cron
# (Node 18+ provides fetch globally, matching this project's prerequisites)
cat > local-cron.js << 'EOF'
const cron = require('node-cron');

// Run every 6 hours
cron.schedule('0 */6 * * *', async () => {
  try {
    const response = await fetch('http://localhost:8001/cron');
    console.log(`Cron job completed: ${response.status}`);
  } catch (error) {
    console.error('Cron job failed:', error);
  }
});

console.log('Local cron scheduler started...');
EOF

# Run the local cron scheduler
node local-cron.js

Understanding Cron Schedule

Vercel Free Plan Schedule: 0 12 * * *

┌───────────── minute (0)
│ ┌─────────── hour (12 = 12:00 PM UTC)
│ │ ┌───────── day of month (*)
│ │ │ ┌─────── month (*)
│ │ │ │ ┌───── day of week (*)
│ │ │ │ │
│ │ │ │ │
0 12 * * *

This means:

  • 12:00 PM UTC daily (4 AM or 5 AM Pacific, depending on DST)

Vercel Pro Plan Schedule: 0 */6 * * *

┌───────────── minute (0)
│ ┌─────────── hour (*/6 = every 6 hours)
│ │ ┌───────── day of month (*)
│ │ │ ┌─────── month (*)
│ │ │ │ ┌───── day of week (*)
│ │ │ │ │
│ │ │ │ │
0 */6 * * *

This means:

  • 12:00 AM UTC (4 PM or 5 PM Pacific, depending on DST)
  • 6:00 AM UTC (10 PM or 11 PM Pacific)
  • 12:00 PM UTC (4 AM or 5 AM Pacific)
  • 6:00 PM UTC (10 AM or 11 AM Pacific)

Customizing Cron Schedule

⚠️ Vercel Plan Limitations:

  • Free (Hobby) Plan: Only daily schedules allowed (e.g., 0 12 * * *)
  • Pro Plan: Any schedule frequency supported

To change the monitoring frequency, edit vercel.json:

{
  "crons": [
    {
      "path": "/api/cron",
      "schedule": "0 0 * * *"
    }
  ]
}

Free plan compatible schedules:

  • 0 0 * * * - Daily at midnight UTC
  • 0 12 * * * - Daily at noon UTC (default)
  • 0 6 * * * - Daily at 6 AM UTC
  • 0 0 * * 1 - Weekly on Mondays

Pro plan additional schedules:

  • 0 */1 * * * - Every hour
  • 0 */2 * * * - Every 2 hours
  • 0 */6 * * * - Every 6 hours
  • 0 */12 * * * - Every 12 hours

After changing the schedule:

# Redeploy to apply changes
vercel --prod

Troubleshooting Cron Jobs

Cron not running:

# 1. Check if cron endpoint works manually
curl https://your-app.vercel.app/api/cron

# 2. Check Vercel function logs
# Visit Vercel Dashboard → Project → Functions → api/cron.py

# 3. Verify vercel.json syntax
cat vercel.json | python -m json.tool

No sites being checked:

  • Make sure you've added sites to monitoring via /monitor page
  • Check that sites have valid URLs (include https://)
  • Verify OpenAI API key is set in Vercel environment variables

Cron running but not updating:

  • Changes might be below 5% threshold (prevents noise)
  • Check the specific site manually: force update via /monitor page
  • Look at function logs for error messages

In the app:

  1. Visit /monitor page
  2. Look for "Last Update" column
  3. Check for recent timestamps
  4. Look for "Auto-updated" vs "Manual" in update history

In Vercel Dashboard:

  1. Go to your project
  2. Click "Functions" tab
  3. Click on api/cron.py
  4. View execution logs and duration

Expected behavior:

  • Cron runs every 6 hours
  • Only updates sites with significant changes (5%+)
  • Updates multiple sites efficiently in single execution
  • Completes within 15-minute timeout limit

Built with ❤️ for the llms.txt standard and automated monitoring!
