AI-powered web scraping with intelligent extraction
Transform any website into structured data using Playwright automation and GPT-4o extraction. Built for modern web applications, RAG pipelines, and data workflows.
🤖 LLM Extraction - Convert web content to structured JSON using OpenAI
📦 Batch Processing - Process multiple URLs efficiently with controlled concurrency
🧬 API-First - REST endpoints secured with API keys, documented with Swagger
🎭 Browser Automation - Full Playwright support with stealth mode
📝 Multiple Formats - Output as HTML, Markdown, or plain text
📥 Download Options - Individual files, ZIP archives, or consolidated JSON
⚡ Smart Caching - File-based caching with configurable TTL
🔄 Job Queue - Background processing with BullMQ and Redis
🕷️ Web Crawling - Multi-page crawling with configurable strategies
🐳 Docker Ready - One-command deployment
git clone https://github.com/stretchcloud/deepscrape.git
cd deepscrape
npm install
cp .env.example .env
Edit .env with your settings:
API_KEY=your-secret-key
OPENAI_API_KEY=your-openai-key
REDIS_HOST=localhost
CACHE_ENABLED=true
Start the server, then verify it is running: curl http://localhost:3000/health
curl -X POST http://localhost:3000/api/scrape \
-H " Content-Type: application/json" \
-H " X-API-Key: your-secret-key" \
-d ' {
"url": "https://example.com",
"options": { "extractorFormat": "markdown" }
}' | jq -r '.content' > content.md
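The same call from Node.js 18+ (a minimal TypeScript sketch using the built-in fetch; it assumes the response carries the scraped page in a content field, as the jq filter above does):

// Minimal sketch: POST to /api/scrape and read the content field.
async function scrape(url: string): Promise<string> {
  const res = await fetch('http://localhost:3000/api/scrape', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.API_KEY ?? 'your-secret-key',
    },
    body: JSON.stringify({ url, options: { extractorFormat: 'markdown' } }),
  });
  const body = (await res.json()) as { content: string };
  return body.content;
}

scrape('https://example.com').then((markdown) => console.log(markdown));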
Extract structured data using JSON Schema:
curl -X POST http://localhost:3000/api/extract-schema \
-H " Content-Type: application/json" \
-H " X-API-Key: your-secret-key" \
-d ' {
"url": "https://news.example.com/article",
"schema": {
"type": "object",
"properties": {
"title": {
"type": "string",
"description": "Article headline"
},
"author": {
"type": "string",
"description": "Author name"
},
"publishDate": {
"type": "string",
"description": "Publication date"
}
},
"required": ["title"]
}
}' | jq -r '.extractedData' > schemadata.md
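Because the schema drives the shape of extractedData, it is easy to mirror it as a TypeScript type for downstream code (a sketch; the field names simply follow the schema above):

// Sketch: mirror the JSON Schema above as a TypeScript type.
interface ArticleData {
  title: string; // required by the schema
  author?: string;
  publishDate?: string;
}

// response is the parsed JSON body returned by /api/extract-schema,
// whose extractedData field is the one selected by the jq filter above.
function toArticle(response: { extractedData: unknown }): ArticleData {
  return response.extractedData as ArticleData;
}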
Scrapes a URL and uses an LLM (GPT-4o) to generate a concise summary of its content.
curl -X POST http://localhost:3000/api/summarize \
-H " Content-Type: application/json" \
-H " X-API-Key: test-key" \
-d ' {
"url": "https://en.wikipedia.org/wiki/Large_language_model",
"maxLength": 300,
"options": {
"temperature": 0.3,
"waitForSelector": "body",
"extractorFormat": "markdown"
}
}' | jq -r '.summary' > summary-output.md
Technical Documentation Analysis
Extract key information from technical documentation:
curl -X POST http://localhost:3000/api/extract-schema \
-H " Content-Type: application/json" \
-H " X-API-Key: test-key" \
-d ' {
"url": "https://docs.github.com/en/rest/overview/permissions-required-for-github-apps",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"overview": {"type": "string"},
"permissionCategories": {"type": "array", "items": {"type": "string"}},
"apiEndpoints": {
"type": "array",
"items": {
"type": "object",
"properties": {
"endpoint": {"type": "string"},
"requiredPermissions": {"type": "array", "items": {"type": "string"}}
}
}
}
},
"required": ["title", "overview"]
},
"options": {
"extractorFormat": "markdown"
}
}' | jq -r '.extractedData' > output.md
Comparative Analysis from Academic Papers
Extract and compare methodologies from research papers:
curl -X POST http://localhost:3000/api/extract-schema \
-H " Content-Type: application/json" \
-H " X-API-Key: test-key" \
-d ' {
"url": "https://arxiv.org/abs/2005.14165",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"authors": {"type": "array", "items": {"type": "string"}},
"abstract": {"type": "string"},
"methodology": {"type": "string"},
"results": {"type": "string"},
"keyContributions": {"type": "array", "items": {"type": "string"}},
"citations": {"type": "number"}
}
},
"options": {
"extractorFormat": "markdown"
}
}' | jq -r '.extractedData' > output.md
Complex Data Analysis from Medium Articles
Extract a complex data structure from any Medium article:
curl -X POST http://localhost:3000/api/extract-schema \
-H " Content-Type: application/json" \
-H " X-API-Key: test-key" \
-d ' {
"url": "https://johnchildseddy.medium.com/typescript-llms-lessons-learned-from-9-months-in-production-4910485e3272",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"author": {"type": "string"},
"keyInsights": {"type": "array", "items": {"type": "string"}},
"technicalChallenges": {"type": "array", "items": {"type": "string"}},
"businessImpact": {"type": "string"}
}
},
"options": {
"extractorFormat": "markdown"
}
}' | jq -r '.extractedData' > output.md
Process multiple URLs efficiently with controlled concurrency, automatic retries, and comprehensive download options.
curl -X POST http://localhost:3000/api/batch/scrape \
-H " Content-Type: application/json" \
-H " X-API-Key: your-secret-key" \
-d ' {
"urls": [
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/quickstart",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/deploy-vais-prompt",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/express-mode/overview",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/express-mode/vertex-ai-studio-express-mode-quickstart",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/express-mode/vertex-ai-express-mode-api-quickstart"
],
"concurrency": 3,
"options": {
"extractorFormat": "markdown",
"waitForTimeout": 2000,
"stealthMode": true
}
}'
Response:
{
"success" : true ,
"batchId" : " 550e8400-e29b-41d4-a716-446655440000" ,
"totalUrls" : 5 ,
"estimatedTime" : 50000 ,
"statusUrl" : " http://localhost:3000/api/batch/scrape/550e8400.../status"
}
curl -X GET http://localhost:3000/api/batch/scrape/{batchId}/status \
-H " X-API-Key: your-secret-key"
Response:
{
"success" : true ,
"batchId" : " 550e8400-e29b-41d4-a716-446655440000" ,
"status" : " completed" ,
"totalUrls" : 5 ,
"completedUrls" : 4 ,
"failedUrls" : 1 ,
"progress" : 100 ,
"processingTime" : 45230 ,
"results" : [... ]
}
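To drive this from code, a simple polling loop against the status endpoint works (a TypeScript sketch; field names follow the responses above, and any status other than "completed" is treated as still running):

// Sketch: poll the batch status endpoint until processing finishes.
async function waitForBatch(statusUrl: string, apiKey: string, intervalMs = 5000) {
  for (;;) {
    const res = await fetch(statusUrl, { headers: { 'X-API-Key': apiKey } });
    const status = (await res.json()) as {
      status: string;
      progress: number;
      completedUrls: number;
      totalUrls: number;
    };
    console.log(`progress ${status.progress}% (${status.completedUrls}/${status.totalUrls} URLs)`);
    if (status.status === 'completed') return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}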
1. Download as ZIP Archive (Recommended)
# Download all results as markdown files in a ZIP
curl -X GET " http://localhost:3000/api/batch/scrape/{batchId}/download/zip?format=markdown" \
-H " X-API-Key: your-secret-key" \
--output " batch_results.zip"
# Extract the ZIP to get individual files
unzip batch_results.zip
ZIP Contents:
1_example_com_page1.md
2_example_com_page2.md
3_example_com_page3.md
4_docs_example_com_api.md
batch_summary.json
2. Download Individual Results
# Get job IDs from status endpoint, then download individual files
curl -X GET " http://localhost:3000/api/batch/scrape/{batchId}/download/{jobId}?format=markdown" \
-H " X-API-Key: your-secret-key" \
--output " page1.md"
3. Download Consolidated JSON
# All results in a single JSON file
curl -X GET " http://localhost:3000/api/batch/scrape/{batchId}/download/json" \
-H " X-API-Key: your-secret-key" \
--output " batch_results.json"
curl -X POST http://localhost:3000/api/batch/scrape \
-H " Content-Type: application/json" \
-H " X-API-Key: your-secret-key" \
-d ' {
"urls": ["https://example.com", "https://example.org"],
"concurrency": 5,
"timeout": 300000,
"maxRetries": 3,
"failFast": false,
"webhook": "https://your-app.com/webhook",
"options": {
"extractorFormat": "markdown",
"useBrowser": true,
"stealthMode": true,
"waitForTimeout": 5000,
"blockAds": true,
"actions": [
{"type": "click", "selector": ".accept-cookies", "optional": true},
{"type": "wait", "timeout": 2000}
]
}
}'
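The webhook option points at an endpoint you host; a minimal Express receiver might look like the sketch below (the payload shape and delivery timing are not documented in this section, so the handler just logs the body and acknowledges the call):

// Sketch: a minimal receiver for the webhook URL passed in the batch request.
import express from 'express';

const app = express();
app.use(express.json());

app.post('/webhook', (req, res) => {
  // Payload shape is not specified here; log it for inspection.
  console.log('batch webhook payload:', JSON.stringify(req.body, null, 2));
  res.sendStatus(200); // acknowledge receipt
});

app.listen(4000, () => console.log('webhook receiver listening on :4000'));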
curl -X DELETE http://localhost:3000/api/batch/scrape/{batchId} \
-H " X-API-Key: your-secret-key"
Start a multi-page crawl (automatically exports markdown files):
curl -X POST http://localhost:3000/api/crawl \
-H " Content-Type: application/json" \
-H " X-API-Key: your-secret-key" \
-d ' {
"url": "https://docs.example.com",
"limit": 50,
"maxDepth": 3,
"strategy": "bfs",
"includePaths": ["^/docs/.*"],
"scrapeOptions": {
"extractorFormat": "markdown"
}
}'
Response includes output directory:
{
"success" : true ,
"id" : " abc123-def456" ,
"url" : " http://localhost:3000/api/crawl/abc123-def456" ,
"message" : " Crawl initiated successfully. Individual pages will be exported as markdown files." ,
"outputDirectory" : " ./crawl-output/abc123-def456"
}
Check crawl status (includes exported files info):
curl http://localhost:3000/api/crawl/{job-id} \
-H " X-API-Key: your-secret-key"
Status response shows exported files:
{
"success" : true ,
"status" : " completed" ,
"crawl" : {... },
"jobs" : [... ],
"count" : 15 ,
"exportedFiles" : {
"count" : 15 ,
"outputDirectory" : " ./crawl-output/abc123-def456" ,
"files" : [" ./crawl-output/abc123-def456/2024-01-15_abc123_example.com_page1.md" , ... ]
}
}
| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/scrape | POST | Scrape single URL |
| /api/extract-schema | POST | Extract structured data |
| /api/summarize | POST | Generate content summary |
| /api/batch/scrape | POST | Start batch processing |
| /api/batch/scrape/:id/status | GET | Get batch status |
| /api/batch/scrape/:id/download/zip | GET | Download batch as ZIP |
| /api/batch/scrape/:id/download/json | GET | Download batch as JSON |
| /api/batch/scrape/:id/download/:jobId | GET | Download individual result |
| /api/batch/scrape/:id | DELETE | Cancel batch processing |
| /api/crawl | POST | Start web crawl |
| /api/crawl/:id | GET | Get crawl status |
| /api/cache | DELETE | Clear cache |
# Core
API_KEY=your-secret-key
PORT=3000
# OpenAI
OPENAI_API_KEY=your-key
OPENAI_DEPLOYMENT_NAME=gpt-4o
LLM_TEMPERATURE=0.2
# Cache
CACHE_ENABLED=true
CACHE_TTL=3600
CACHE_DIRECTORY=./cache
# Redis (for job queue)
REDIS_HOST=localhost
REDIS_PORT=6379
# Crawl file export
CRAWL_OUTPUT_DIR=./crawl-output
interface ScraperOptions {
extractorFormat?: 'html' | 'markdown' | 'text'
waitForSelector?: string
waitForTimeout?: number
actions?: BrowserAction[] // click, scroll, wait, fill
skipCache?: boolean
cacheTtl?: number
stealthMode?: boolean
proxy?: string
userAgent?: string
}
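For example, an options payload combining several of these fields (a sketch; the action objects reuse the shapes shown in the batch example above, and the selectors are placeholders):

// Sketch: a ScraperOptions value built from the fields declared above.
const options: ScraperOptions = {
  extractorFormat: 'markdown',
  waitForSelector: '#content', // placeholder selector
  waitForTimeout: 5000,
  stealthMode: true,
  cacheTtl: 600,
  actions: [
    { type: 'click', selector: '.accept-cookies', optional: true },
    { type: 'wait', timeout: 2000 },
  ],
};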
# Build and run
docker build -t deepscrape .
docker run -d -p 3000:3000 --env-file .env deepscrape
# Or use docker-compose
docker-compose up -d
Interact with dynamic content:
{
"url" : " https://example.com" ,
"options" : {
"actions" : [
{ "type" : " click" , "selector" : " #load-more" },
{ "type" : " wait" , "timeout" : 2000 },
{ "type" : " scroll" , "position" : 1000 }
]
}
}
BFS (default) - Breadth-first exploration
DFS - Depth-first for deep content
Best-First - Priority-based on content relevance
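The strategy is selected with the strategy field of the crawl request. For example, a depth-first variant of the earlier request (a sketch; the lowercase value "dfs" is an assumption, mirroring the "bfs" value shown above):

// Sketch: crawl request body selecting depth-first traversal.
// The "dfs" value is assumed to mirror the lowercase "bfs" shown earlier.
const crawlRequest = {
  url: 'https://docs.example.com',
  limit: 50,
  maxDepth: 5, // deeper trees suit a depth-first walk
  strategy: 'dfs',
  scrapeOptions: { extractorFormat: 'markdown' },
};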
Use clear description fields in your JSON Schema
Start with simple schemas and iterate
Use lower temperature values for more consistent results
Include examples in descriptions for better accuracy
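For instance, a schema fragment applying the last two tips (the field names are illustrative only, not part of the API):

// Sketch: descriptive fields with embedded example values, per the tips above.
const productSchema = {
  type: 'object',
  properties: {
    price: {
      type: 'string',
      description: 'Listed price including the currency symbol, e.g. "$29.99"',
    },
    releaseDate: {
      type: 'string',
      description: 'Release date in ISO 8601 format, e.g. "2024-01-15"',
    },
  },
  required: ['price'],
};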
Each crawled page is automatically exported as a markdown file with:
Filename format: YYYY-MM-DD_crawlId_hostname_path.md
YAML frontmatter with metadata (URL, title, crawl date, status)
Organized structure: ./crawl-output/{crawl-id}/
Automatic summary: Generated when the crawl completes
Example file structure:
crawl-output/
├── abc123-def456/
│ ├── 2024-01-15_abc123_docs.example.com_getting-started.md
│ ├── 2024-01-15_abc123_docs.example.com_api-reference.md
│ ├── 2024-01-15_abc123_docs.example.com_tutorials.md
│ ├── abc123-def456_summary.md
│ ├── abc123-def456_consolidated.md # 🆕 All pages in one file
│ └── abc123-def456_consolidated.json # 🆕 Structured JSON export
└── xyz789-ghi012/
└── ...
Consolidated Export Features:
Single Markdown: All crawled pages combined into one readable file
JSON Export: Structured data with metadata for programmatic use
Auto-Generated: Created automatically when the crawl completes
Rich Metadata: Preserves all page metadata and crawl statistics
File content example:
---
url : " https://docs.example.com/getting-started"
title : " Getting Started Guide"
crawled_at : " 2024-01-15T10:30:00.000Z"
status : 200
content_type : " markdown"
load_time : 1250ms
browser_mode : false
---
# Getting Started Guide
Welcome to the getting started guide...
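Because every exported page carries YAML frontmatter, the files are straightforward to feed into a RAG or indexing pipeline. A sketch using the gray-matter package (an assumed dependency; any YAML frontmatter parser works):

// Sketch: load exported crawl files and split frontmatter from the markdown body.
import { readdir, readFile } from 'node:fs/promises';
import { join } from 'node:path';
import matter from 'gray-matter'; // assumed dependency

async function loadCrawlOutput(dir: string) {
  const docs: { url: string; title: string; content: string }[] = [];
  for (const name of await readdir(dir)) {
    if (!name.endsWith('.md')) continue;
    const raw = await readFile(join(dir, name), 'utf8');
    const { data, content } = matter(raw); // data = frontmatter, content = body
    docs.push({ url: data.url, title: data.title, content });
  }
  return docs;
}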
┌───────────────┐ REST ┌────────────────────────┐
│ Client │────────────▶│ Express API Gateway │
└───────────────┘ └────────┬───────────────┘
│ (Job Payload)
▼
┌───────────────────────┐
│ BullMQ Job Queue │ (Redis)
└────────┬──────────────┘
│
pulls job │ pushes result
▼
┌─────────────────┐ Playwright ┌─────────────────┐ GPT-4o ┌──────────────┐
│ Scraper Worker │──────────▶│ Extractor │────────▶│ OpenAI │
└─────────────────┘ └─────────────────┘ └──────────────┘
(Headless Browser) (HTML → MD/Text/JSON) (LLM API)
│
▼
Cache Layer (FS/Redis)
📦 Batch processing with controlled concurrency
📥 Multiple download formats (ZIP, JSON, individual files)
🚸 Browser pooling & warm-up
🧠 Automatic schema generation (LLM)
📊 Prometheus metrics & Grafana dashboard
🌐 Cloud-native cache backends (S3/Redis)
🌈 Web UI playground
🔔 Advanced webhook payloads with retry logic
📈 Batch processing analytics and insights
Apache 2.0 - see LICENSE file
Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request
Star ⭐ this repo if you find it useful!