An intelligent web application that generates llms.txt files for websites and automatically monitors them for changes. This tool follows the llms.txt specification to create AI-friendly documentation files that help Large Language Models better understand website content.
This README has been restructured for better user experience:
- 📋 Grouped sections: Related topics are now organized together
- 🎬 Quick Demo: Try it in 30 seconds with visual examples
- 🚀 Progressive flow: Getting Started → Architecture → Advanced Topics
- 🔧 Consolidated config: All settings in one comprehensive section
- 📚 Better navigation: Grouped table of contents for easier browsing
- Intelligent Website Crawling: Automatically discovers and analyzes website pages
- AI-Enhanced Content: Uses OpenAI to improve descriptions and organization
- Smart Categorization: Dynamic section organization based on content themes
- Dual File Generation: Creates both llms.txt (curated) and llms-full.txt (comprehensive)
- Existing File Detection: Automatically uses existing llms.txt files when found
- 🔄 Smart Change Detection: Monitors website structure changes automatically
- 📅 Flexible Scheduling: From hourly to weekly check intervals
- 🎯 Intelligent Updates: Regenerates files only when significant changes are detected
- 📊 Change Analytics: Detailed reports on what changed and why
- 🤖 Auto-scaling AI: Processing scales with website size
- Beautiful UI: Responsive design built with Next.js and Tailwind CSS
- Real-time Progress: Live feedback during crawling and generation
- Monitoring Dashboard: Comprehensive interface for managing automated updates
- Instant Downloads: Direct download of generated files
- Real-time crawling: Pages discovered and analyzed live
- AI enhancement: Content improved and categorized automatically
- Dual outputs: Both curated and comprehensive versions
- Monitoring setup: Add sites for automatic updates
- Next.js 15 - React framework with App Router
- TypeScript - Type-safe development
- Tailwind CSS - Modern styling
- Lucide React - Beautiful icons
- Vercel Functions - Serverless Python functions for production
- FastAPI - Local development server with hot reload
- OpenAI GPT-4 - AI-enhanced content processing
- aiohttp - Async HTTP client for web crawling
- BeautifulSoup4 - HTML parsing and content extraction
- Vercel Cron Jobs - Automatic scheduling every 6 hours
- Change Detection - Structure fingerprinting and diff analysis
- Smart Thresholds - Updates only for significant changes (5%+)
Development Mode (Local):
Production Mode (Vercel):
- Node.js 18+ (for frontend)
- Python 3.9+ (for local backend development)
- OpenAI API Key (for AI enhancement)
- Vercel CLI (for deployment)
- Clone the repository

```bash
git clone <your-repo-url>
cd llm_txt_creator
```

- Install dependencies

- Set up environment variables

  Edit `.env` with your OpenAI API key:

```bash
NEXT_PUBLIC_API_URL=http://localhost:8000
OPENAI_API_KEY=your_openai_api_key_here
```
Start development servers
Option A: Use the automated start script (Recommended)
This script automatically:
- Creates Python virtual environment if needed
- Installs all backend dependencies
- Starts both FastAPI servers (ports 8000 & 8001)
- Starts Next.js development server (port 3000)
- Provides clear status messages and error handling
Option B: Use the convenience script

```bash
cd backend
python run_dev.py
```

Option C: Manual startup

```bash
# Terminal 1 - Main API
cd backend && python -m uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# Terminal 2 - Scheduler Service
cd backend && python -m uvicorn scheduler:scheduler_app --host 0.0.0.0 --port 8001 --reload

# Terminal 3 - Frontend
npm run dev
```
Open your browser
- Main App: http://localhost:3000
- Monitor Dashboard: http://localhost:3000/monitor
- API Docs: http://localhost:8000/docs
- Scheduler Docs: http://localhost:8001/docs
- Enter Website URL: Input the URL you want to analyze
- Configure Settings: Choose maximum pages to crawl (10-100)
- Generate Files: Click "Generate llms.txt" and wait for processing
- Download Results: Download both llms.txt and llms-full.txt files
- Review Analysis: View the pages analyzed and their importance scores
- Navigate to /monitor page
- Enter website URL (e.g., https://docs.anthropic.com)
- Choose check interval (recommended: 24 hours)
- Select max pages to crawl (recommended: 20 pages)
- Click "Add to Monitoring"
Change Detection:
- Creates "fingerprints" of website structure (URLs, titles, sections)
- Detects new pages, removed pages, and modified content
- Calculates change severity: Major (50%+), Moderate (20%+), Minor (5%+)
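The fingerprinting step above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the function name and the page-dict schema (`url`, `title`, `section`) are assumptions.

```python
import hashlib
import json

def fingerprint(pages):
    """Build a stable fingerprint of site structure from crawled pages.

    `pages` is assumed to be a list of dicts with "url", "title",
    and an optional "section" key (illustrative schema).
    """
    # Sort so that crawl order doesn't change the fingerprint
    canonical = json.dumps(
        sorted((p["url"], p["title"], p.get("section", "")) for p in pages)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

old = fingerprint([{"url": "/a", "title": "A"}])
new = fingerprint([{"url": "/a", "title": "A"}, {"url": "/b", "title": "B"}])
print(old != new)  # a new page changes the fingerprint
```

Comparing fingerprints between runs is a cheap way to decide whether a full diff of the page sets is even worth computing.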
Smart Updates:
- Only regenerates llms.txt when changes are significant (5%+ threshold)
- AI processing scales with site size to prevent timeouts
- Detailed change reports show exactly what changed
Automatic Scheduling:
- Production: Cron jobs run every 6 hours automatically
- Configurable: Set custom intervals from hourly to weekly
- Manual Override: Force immediate checks anytime
- `max_pages`: Maximum number of pages to crawl (default: 20)
- `depth_limit`: Maximum crawl depth from the root URL (default: 3)
- `check_interval`: Monitoring interval in seconds (default: 86400 = 24 hours)
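The defaults above can be pictured as a plain settings dict; the dict name and the merge helper below are illustrative assumptions, not the project's API.

```python
# Illustrative defaults matching the values documented above
DEFAULT_CONFIG = {
    "max_pages": 20,          # pages crawled per run
    "depth_limit": 3,         # link depth from the root URL
    "check_interval": 86400,  # seconds between checks (24 hours)
}

def with_overrides(**overrides):
    """Return a config dict with user overrides applied over the defaults."""
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise ValueError(f"unknown config keys: {unknown}")
    return {**DEFAULT_CONFIG, **overrides}

print(with_overrides(check_interval=3600)["check_interval"])  # 3600 = hourly
```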
When OpenAI API key is provided:
- Enhanced Descriptions: AI-improved page descriptions
- Smart Categorization: Dynamic section organization
- Content Cleanup: Removes redundancy and improves clarity
- Scalable Processing: Adjusts AI usage based on website size
- Documentation sites: Ideal for monitoring (e.g., docs.* and developers.* subdomains)
- News sites: Good for content updates (moderate frequency)
- Large sites: Use smaller page limits (10-20 pages)
- Critical docs: Every 6-12 hours
- Regular updates: Daily (24 hours) - Recommended
- Stable sites: Every 3 days
- Archive sites: Weekly
- Small crawls (≤20 pages): Full AI enhancement
- Medium crawls (21-50 pages): AI with 8 pages max per section
- Large crawls (51-100 pages): AI limited to 5 pages per section
- Very large crawls (>100 pages): No AI enhancement (prevents timeouts)
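The scaling tiers above map naturally to a small lookup function. The tiers come from this README; the function itself and its return shape are an illustrative sketch.

```python
def ai_budget(page_count):
    """Map crawl size to an AI-enhancement budget (tiers as documented)."""
    if page_count <= 20:
        return {"use_ai": True, "max_pages_per_section": None}  # full enhancement
    if page_count <= 50:
        return {"use_ai": True, "max_pages_per_section": 8}
    if page_count <= 100:
        return {"use_ai": True, "max_pages_per_section": 5}
    return {"use_ai": False, "max_pages_per_section": 0}  # prevents timeouts

print(ai_budget(35))  # {'use_ai': True, 'max_pages_per_section': 8}
```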
The project includes automated deployment to Vercel with cron job scheduling.
Quick Deploy:
Manual Deploy:
⚠️ Important Notes:
- Vercel Free Plan: Function timeout limited to 60 seconds max, cron jobs limited to daily frequency
- Vercel Pro Plan: Function timeout can be up to 300 seconds, unlimited cron frequency
- For large websites (>50 pages), consider upgrading to Pro plan or use local development
- Free plan: Cron jobs run daily at 12:00 PM UTC
- Pro plan: Can run every 6 hours or any custom schedule
Environment Variables Required:
The vercel.json includes:
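Based on the cron path and Free-plan schedule described elsewhere in this README (`api/cron.py`, daily at 12:00 PM UTC), the crons entry likely looks something like the following — treat it as a sketch, not the project's exact file:

```json
{
  "crons": [
    { "path": "/api/cron", "schedule": "0 12 * * *" }
  ]
}
```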
- New pages: Recently added documentation or content
- Removed pages: Deleted or moved content
- Modified pages: Title changes, section reassignments
- Structural changes: Navigation reorganization, new product areas
- Major (50%+): Large restructures, new product launches → Always update
- Moderate (20%+): New documentation sections → Always update
- Minor (5%+): New pages, title changes → Always update
- Minimal (<5%): Minor tweaks → Skip update (prevents noise)
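The thresholds above can be expressed as a small classifier. The severity bands are taken from this README; the function name and the added-plus-removed change ratio are illustrative assumptions about how severity is computed.

```python
def classify_change(old_urls, new_urls):
    """Classify structural change severity using the documented thresholds."""
    old_urls, new_urls = set(old_urls), set(new_urls)
    changed = len(old_urls ^ new_urls)       # pages added + pages removed
    ratio = changed / max(len(old_urls), 1)  # guard against empty baseline
    if ratio >= 0.50:
        return "major"     # always update
    if ratio >= 0.20:
        return "moderate"  # always update
    if ratio >= 0.05:
        return "minor"     # always update
    return "minimal"       # skip update (prevents noise)

# One page added to a four-page site -> 25% change
print(classify_change({"/a", "/b", "/c", "/d"}, {"/a", "/b", "/c", "/d", "/e"}))  # moderate
```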
- Execution Time: 10 seconds (Hobby), 60 seconds (Pro), 900 seconds (cron)
- Memory: Up to 1024MB
- Payload Size: 4.5MB request/response limit
- Concurrent checks: System handles multiple sites efficiently
- Smart scheduling: Only checks sites when intervals are due
- Change thresholds: Prevents unnecessary regeneration
- Timeout management: Graceful degradation for large sites
- Use appropriate page limits: See Configuration for recommendations
- Monitor function execution times: Check Vercel dashboard for performance metrics
- Consider Pro plan: For larger sites requiring longer execution times
- Batch monitoring: System automatically batches multiple site checks efficiently
"Failed to fetch" errors locally:
- Check that .env has NEXT_PUBLIC_API_URL=http://localhost:8000
- Ensure backend servers are running on ports 8000 and 8001
"Site not being monitored":
- Add the site first using the monitor interface
- Check the URL format (include https://)
"No changes detected but site updated":
- Check if changes are below 5% threshold
- Force manual check to see latest status
- Consider if changes are in content vs. structure
"Update failed":
- Check if the website is accessible
- Verify the site doesn't block crawlers
- Look for SSL/security issues
- Database persistence: Store monitoring data permanently
- Email notifications: Alert when sites update
- Webhook integration: Push updates to external systems
- Advanced scheduling: Per-site custom schedules
- Change analytics: Track patterns and trends
- Team collaboration: Shared monitoring dashboards
- ChangeDetector: Improve change detection algorithms
- AutoUpdater: Add new notification methods
- LLMSTxtGenerator: Enhance content organization
- Frontend: Better visualization and management tools
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Test locally with `python run_dev.py`
- Test monitoring features on the `/monitor` page
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Jeremy Howard for proposing the llms.txt standard
- llmstxt.org for the specification
- The open source community for the amazing tools used in this project
If you encounter any issues or have questions:
- Check this README for common solutions
- Review the API documentation at /docs endpoints (local development)
- Create an issue with detailed information
- Include error messages and steps to reproduce
- Mention whether you're running locally or on Vercel
- Main App: http://localhost:3000
- Monitor Dashboard: http://localhost:3000/monitor
- Main API Docs: http://localhost:8000/docs
- Scheduler API Docs: http://localhost:8001/docs
Good News: Cron jobs are automatically configured when you deploy to Vercel! 🎉
- Deploy to Vercel (using either method above)

```bash
./deploy-vercel.sh
# or
vercel --prod
```
Cron jobs are automatically enabled:
- ✅ Free Plan: Runs daily at 12:00 PM UTC (0 12 * * *)
- ✅ Pro Plan: Can run every 6 hours (0 */6 * * *) or custom schedule
- ✅ Checks all monitored sites for changes
- ✅ Updates llms.txt files when significant changes detected
- ✅ 60-second execution limit (Free) or 300+ seconds (Pro)
- Verify cron is working:

```bash
# Check cron endpoint manually
curl https://your-app.vercel.app/api/cron

# Check Vercel dashboard:
# Go to: Project → Functions → View function logs
```
Monitor cron activity:
- Visit your app's /monitor page
- Check "Last Update" timestamps
- Look for "Auto-updated" entries in the monitoring dashboard
For local development, you can simulate cron behavior:
Option A: Manual cron trigger
Option B: Set up local cron (macOS/Linux)
Option C: Use a cron service
Vercel Free Plan Schedule: 0 12 * * *
This means:
- 12:00 PM UTC daily (4 AM or 5 AM Pacific, depending on DST)
Vercel Pro Plan Schedule: 0 */6 * * *
This means:
- 12:00 AM UTC (4 PM or 5 PM Pacific, depending on DST)
- 6:00 AM UTC (10 PM or 11 PM Pacific)
- 12:00 PM UTC (4 AM or 5 AM Pacific)
- 6:00 PM UTC (10 AM or 11 AM Pacific)
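The UTC-to-Pacific conversions above can be double-checked with Python's standard-library `zoneinfo`; for example, a winter run of the daily noon-UTC cron lands at 4 AM Pacific (PST, UTC-8):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# A January (PST) run of the daily "0 12 * * *" cron
run_utc = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
local = run_utc.astimezone(ZoneInfo("America/Los_Angeles"))
print(local.strftime("%H:%M %Z"))  # 04:00 PST
```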
⚠️ Vercel Plan Limitations:
- Free (Hobby) Plan: Only daily schedules allowed (e.g., 0 12 * * *)
- Pro Plan: Any schedule frequency supported
To change the monitoring frequency, edit vercel.json:
Free plan compatible schedules:
- 0 0 * * * - Daily at midnight UTC
- 0 12 * * * - Daily at noon UTC (default)
- 0 6 * * * - Daily at 6 AM UTC
- 0 0 * * 1 - Weekly on Mondays
Pro plan additional schedules:
- 0 */1 * * * - Every hour
- 0 */2 * * * - Every 2 hours
- 0 */6 * * * - Every 6 hours
- 0 */12 * * * - Every 12 hours
After changing the schedule:
Cron not running:
No sites being checked:
- Make sure you've added sites to monitoring via /monitor page
- Check that sites have valid URLs (include https://)
- Verify OpenAI API key is set in Vercel environment variables
Cron running but not updating:
- Changes might be below 5% threshold (prevents noise)
- Check the specific site manually: force update via /monitor page
- Look at function logs for error messages
In the app:
- Visit /monitor page
- Look for "Last Update" column
- Check for recent timestamps
- Look for "Auto-updated" vs "Manual" in update history
In Vercel Dashboard:
- Go to your project
- Click "Functions" tab
- Click on api/cron.py
- View execution logs and duration
Expected behavior:
- Cron runs on the configured schedule (daily on the Free plan, every 6 hours on Pro)
- Only updates sites with significant changes (5%+)
- Updates multiple sites efficiently in single execution
- Completes within 15-minute timeout limit
Built with ❤️ for the llms.txt standard and automated monitoring!