A Python microframework for web scraping that provides an API
A Python-based modular web scraping framework focused on efficient single-URL crawling, supporting asynchronous processing, API services, and highly customizable spider modules.
🎯 Single URL Focus - Efficient single webpage crawling
⚡ Asynchronous Processing - Support for async task queues and concurrent processing
🔧 Modular Design - Extensible spider module system
🔄 Auto Retry - Intelligent retry mechanism and delay control
🧪 Custom Parsing - Users have full control over data extraction logic in parse methods
🤖 AI-Friendly - Simple interface design makes it perfect for AI-assisted spider development
# Terminal 1: Start API server
python cmd/server.py --port 8080
# Terminal 2: Start task executor
python cmd/crawl.py
# Single URL crawling
curl -X POST "http://127.0.0.1:8080/api/v1/crawl/single" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "spider_name": "default"}'# Query task status
curl "http://127.0.0.1:8080/api/v1/task/{task_id}/status"# Get task results
curl "http://127.0.0.1:8080/api/v1/task/{task_id}/result"
# Verify system functionality
python -m tests.test_ip_crawl
Basic Configuration (config.py)
# Request Configuration
DEFAULT_TIMEOUT = 30              # Request timeout (seconds)
DEFAULT_RETRY_COUNT = 3           # Default retry count
DEFAULT_DELAY = 1.0               # Request interval (seconds)
DEFAULT_TASKS_DIR = "data/tasks"  # Task directory

# API Configuration
API_HOST = "127.0.0.1"            # API host
API_PORT = 8000                   # API port
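For illustration, the retry and delay settings could wrap a single request roughly like this (a sketch only, not the framework's actual retry implementation; assumes the requests library and the config.py values above):

import time
import requests
from config import DEFAULT_TIMEOUT, DEFAULT_RETRY_COUNT, DEFAULT_DELAY

def fetch_with_retry(url: str) -> str:
    # Retry up to DEFAULT_RETRY_COUNT times, waiting DEFAULT_DELAY seconds between attempts
    last_error = None
    for _ in range(DEFAULT_RETRY_COUNT):
        try:
            response = requests.get(url, timeout=DEFAULT_TIMEOUT)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            last_error = exc
            time.sleep(DEFAULT_DELAY)
    raise RuntimeError(f"Failed to fetch {url} after {DEFAULT_RETRY_COUNT} attempts") from last_error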
No complex inheritance chains or framework-specific patterns
🎯 AI-Friendly Design:
# AI can easily generate spiders like this:
from bs4 import BeautifulSoup

class AIGeneratedSpider(BaseSpider):
    def parse(self, raw_content: str, url: str, headers: dict) -> dict:
        # AI can focus purely on data extraction logic
        # No need to understand complex framework internals
        soup = BeautifulSoup(raw_content, 'html.parser')
        return {"data": "extracted by AI"}
🚀 Perfect for AI Prompts:
"Create a spider that extracts product information from e-commerce pages"
"Generate a news article spider that gets title, content, and publish date"
"Build a spider for social media posts with likes, comments, and shares"
The framework handles all the complexity (HTTP requests, retries, async processing, task management) while AI focuses on the core parsing logic!
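As an example, the news article prompt above might produce a spider along these lines (a sketch only; the BaseSpider import path and the HTML selectors are assumptions that depend on the framework layout and the target site):

from bs4 import BeautifulSoup
from spiders.base import BaseSpider  # import path is an assumption

class NewsArticleSpider(BaseSpider):
    def parse(self, raw_content: str, url: str, headers: dict) -> dict:
        # Extract title, body text, and publish date from a news article page
        soup = BeautifulSoup(raw_content, "html.parser")
        title = soup.find("h1")
        article = soup.find("article")
        published = soup.find("time")
        return {
            "url": url,
            "title": title.get_text(strip=True) if title else None,
            "content": article.get_text(" ", strip=True) if article else None,
            "publish_date": published.get("datetime") if published else None,
        }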