AI-powered web scraping with intelligent extraction
Transform any website into structured data using Playwright automation and GPT-4o extraction. Built for modern web applications, RAG pipelines, and data workflows.
🤖 LLM Extraction - Convert web content to structured JSON using OpenAI
📦 Batch Processing - Process multiple URLs efficiently with controlled concurrency
🧬 API-First - REST endpoints secured with API keys, documented with Swagger
🎭 Browser Automation - Full Playwright support with stealth mode
📝 Multiple Formats - Output as HTML, Markdown, or plain text
📥 Download Options - Individual files, ZIP archives, or consolidated JSON
⚡ Smart Caching - File-based caching with configurable TTL
🔄 Job Queue - Background processing with BullMQ and Redis
🕷️ Web Crawling - Multi-page crawling with configurable strategies
🐳 Docker Ready - One-command deployment
git clone https://github.com/stretchcloud/deepscrape.git
cd deepscrape
npm install
cp .env.example .env
Edit .env with your settings:
API_KEY=your-secret-key
OPENAI_API_KEY=your-openai-key
REDIS_HOST=localhost
CACHE_ENABLED=true
Start the server, then verify it is running: curl http://localhost:3000/health
curl -X POST http://localhost:3000/api/scrape \
-H " Content-Type: application/json" \
-H " X-API-Key: your-secret-key" \
-d ' {
"url": "https://example.com",
"options": { "extractorFormat": "markdown" }
}' | jq -r '.content' > content.md
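The same call from Node.js 18+ (a minimal TypeScript sketch using the built-in fetch; it assumes the response carries the scraped page in a content field, as the jq filter above does):

// Minimal sketch: POST to /api/scrape and read the content field.
async function scrape(url: string): Promise<string> {
  const res = await fetch('http://localhost:3000/api/scrape', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.API_KEY ?? 'your-secret-key',
    },
    body: JSON.stringify({ url, options: { extractorFormat: 'markdown' } }),
  });
  const body = (await res.json()) as { content: string };
  return body.content;
}

scrape('https://example.com').then((markdown) => console.log(markdown));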
Extract structured data using JSON Schema:
curl -X POST http://localhost:3000/api/extract-schema \
-H " Content-Type: application/json" \
-H " X-API-Key: your-secret-key" \
-d ' {
"url": "https://news.example.com/article",
"schema": {
"type": "object",
"properties": {
"title": {
"type": "string",
"description": "Article headline"
},
"author": {
"type": "string",
"description": "Author name"
},
"publishDate": {
"type": "string",
"description": "Publication date"
}
},
"required": ["title"]
}
}' | jq -r '.extractedData' > schemadata.md
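Because the schema drives the shape of extractedData, it is easy to mirror it as a TypeScript type for downstream code (a sketch; the field names simply follow the schema above):

// Sketch: mirror the JSON Schema above as a TypeScript type.
interface ArticleData {
  title: string; // required by the schema
  author?: string;
  publishDate?: string;
}

// response is the parsed JSON body returned by /api/extract-schema,
// whose extractedData field is the one selected by the jq filter above.
function toArticle(response: { extractedData: unknown }): ArticleData {
  return response.extractedData as ArticleData;
}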
Scrapes a URL and uses an LLM (GPT-4o) to generate a concise summary of its content.
curl -X POST http://localhost:3000/api/summarize \
-H " Content-Type: application/json" \
-H " X-API-Key: test-key" \
-d ' {
"url": "https://en.wikipedia.org/wiki/Large_language_model",
"maxLength": 300,
"options": {
"temperature": 0.3,
"waitForSelector": "body",
"extractorFormat": "markdown"
}
}' | jq -r '.summary' > summary-output.md
Technical Documentation Analysis
Extract key information from technical documentation:
curl -X POST http://localhost:3000/api/extract-schema \
-H " Content-Type: application/json" \
-H " X-API-Key: test-key" \
-d ' {
"url": "https://docs.github.com/en/rest/overview/permissions-required-for-github-apps",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"overview": {"type": "string"},
"permissionCategories": {"type": "array", "items": {"type": "string"}},
"apiEndpoints": {
"type": "array",
"items": {
"type": "object",
"properties": {
"endpoint": {"type": "string"},
"requiredPermissions": {"type": "array", "items": {"type": "string"}}
}
}
}
},
"required": ["title", "overview"]
},
"options": {
"extractorFormat": "markdown"
}
}' | jq -r '.extractedData' > output.md
Comparative Analysis from Academic Papers
Extract and compare methodologies from research papers:
curl -X POST http://localhost:3000/api/extract-schema \
-H " Content-Type: application/json" \
-H " X-API-Key: test-key" \
-d ' {
"url": "https://arxiv.org/abs/2005.14165",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"authors": {"type": "array", "items": {"type": "string"}},
"abstract": {"type": "string"},
"methodology": {"type": "string"},
"results": {"type": "string"},
"keyContributions": {"type": "array", "items": {"type": "string"}},
"citations": {"type": "number"}
}
},
"options": {
"extractorFormat": "markdown"
}
}' | jq -r '.extractedData' > output.md
Complex Data Analysis from Medium Articles
Extract a complex data structure from any Medium article:
curl -X POST http://localhost:3000/api/extract-schema \
-H " Content-Type: application/json" \
-H " X-API-Key: test-key" \
-d ' {
"url": "https://johnchildseddy.medium.com/typescript-llms-lessons-learned-from-9-months-in-production-4910485e3272",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"author": {"type": "string"},
"keyInsights": {"type": "array", "items": {"type": "string"}},
"technicalChallenges": {"type": "array", "items": {"type": "string"}},
"businessImpact": {"type": "string"}
}
},
"options": {
"extractorFormat": "markdown"
}
}' | jq -r '.extractedData' > output.md
Process multiple URLs efficiently with controlled concurrency, automatic retries, and comprehensive download options.
curl -X POST http://localhost:3000/api/batch/scrape \
-H " Content-Type: application/json" \
-H " X-API-Key: your-secret-key" \
-d ' {
"urls": [
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/quickstart",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/deploy-vais-prompt",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/express-mode/overview",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/express-mode/vertex-ai-studio-express-mode-quickstart",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/express-mode/vertex-ai-express-mode-api-quickstart"
],
"concurrency": 3,
"options": {
"extractorFormat": "markdown",
"waitForTimeout": 2000,
"stealthMode": true
}
}'
Response:
{
"success" : true ,
"batchId" : " 550e8400-e29b-41d4-a716-446655440000" ,
"totalUrls" : 5 ,
"estimatedTime" : 50000 ,
"statusUrl" : " http://localhost:3000/api/batch/scrape/550e8400.../status"
}
curl -X GET http://localhost:3000/api/batch/scrape/{batchId}/status \
-H " X-API-Key: your-secret-key"
Response:
{
"success" : true ,
"batchId" : " 550e8400-e29b-41d4-a716-446655440000" ,
"status" : " completed" ,
"totalUrls" : 5 ,
"completedUrls" : 4 ,
"failedUrls" : 1 ,
"progress" : 100 ,
"processingTime" : 45230 ,
"results" : [... ]
}
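To drive this from code, a simple polling loop against the status endpoint works (a TypeScript sketch; field names follow the responses above, and any status other than "completed" is treated as still running):

// Sketch: poll the batch status endpoint until processing finishes.
async function waitForBatch(statusUrl: string, apiKey: string, intervalMs = 5000) {
  for (;;) {
    const res = await fetch(statusUrl, { headers: { 'X-API-Key': apiKey } });
    const status = (await res.json()) as {
      status: string;
      progress: number;
      completedUrls: number;
      totalUrls: number;
    };
    console.log(`progress ${status.progress}% (${status.completedUrls}/${status.totalUrls} URLs)`);
    if (status.status === 'completed') return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}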
1. Download as ZIP Archive (Recommended)
# Download all results as markdown files in a ZIP
curl -X GET " http://localhost:3000/api/batch/scrape/{batchId}/download/zip?format=markdown" \
-H " X-API-Key: your-secret-key" \
--output " batch_results.zip"
# Extract the ZIP to get individual files
unzip batch_results.zip
ZIP Contents:
1_example_com_page1.md
2_example_com_page2.md
3_example_com_page3.md
4_docs_example_com_api.md
batch_summary.json
2. Download Individual Results
# Get job IDs from status endpoint, then download individual files
curl -X GET " http://localhost:3000/api/batch/scrape/{batchId}/download/{jobId}?format=markdown" \
-H " X-API-Key: your-secret-key" \
--output " page1.md"
3. Download Consolidated JSON
# All results in a single JSON file
curl -X GET " http://localhost:3000/api/batch/scrape/{batchId}/download/json" \
-H " X-API-Key: your-secret-key" \
--output " batch_results.json"
curl -X POST http://localhost:3000/api/batch/scrape \
-H " Content-Type: application/json" \
-H " X-API-Key: your-secret-key" \
-d ' {
"urls": ["https://example.com", "https://example.org"],
"concurrency": 5,
"timeout": 300000,
"maxRetries": 3,
"failFast": false,
"webhook": "https://your-app.com/webhook",
"options": {
"extractorFormat": "markdown",
"useBrowser": true,
"stealthMode": true,
"waitForTimeout": 5000,
"blockAds": true,
"actions": [
{"type": "click", "selector": ".accept-cookies", "optional": true},
{"type": "wait", "timeout": 2000}
]
}
}'
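The webhook option points at an endpoint you host; a minimal Express receiver might look like the sketch below (the payload shape and delivery timing are not documented in this section, so the handler just logs the body and acknowledges the call):

// Sketch: a minimal receiver for the webhook URL passed in the batch request.
import express from 'express';

const app = express();
app.use(express.json());

app.post('/webhook', (req, res) => {
  // Payload shape is not specified here; log it for inspection.
  console.log('batch webhook payload:', JSON.stringify(req.body, null, 2));
  res.sendStatus(200); // acknowledge receipt
});

app.listen(4000, () => console.log('webhook receiver listening on :4000'));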
curl -X DELETE http://localhost:3000/api/batch/scrape/{batchId} \
-H " X-API-Key: your-secret-key"
Start a multi-page crawl (automatically exports markdown files):
curl -X POST http://localhost:3000/api/crawl \
-H " Content-Type: application/json" \
-H " X-API-Key: your-secret-key" \
-d ' {
"url": "https://docs.example.com",
"limit": 50,
"maxDepth": 3,
"strategy": "bfs",
"includePaths": ["^/docs/.*"],
"scrapeOptions": {
"extractorFormat": "markdown"
}
}'
Response includes output directory:
{
"success" : true ,
"id" : " abc123-def456" ,
"url" : " http://localhost:3000/api/crawl/abc123-def456" ,
"message" : " Crawl initiated successfully. Individual pages will be exported as markdown files." ,
"outputDirectory" : " ./crawl-output/abc123-def456"
}
Check crawl status (includes exported files info):
curl http://localhost:3000/api/crawl/{job-id} \
-H " X-API-Key: your-secret-key"
Status response shows exported files:
{
"success" : true ,
"status" : " completed" ,
"crawl" : {... },
"jobs" : [... ],
"count" : 15 ,
"exportedFiles" : {
"count" : 15 ,
"outputDirectory" : " ./crawl-output/abc123-def456" ,
"files" : [" ./crawl-output/abc123-def456/2024-01-15_abc123_example.com_page1.md" , ... ]
}
}
| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/scrape | POST | Scrape single URL |
| /api/extract-schema | POST | Extract structured data |
| /api/summarize | POST | Generate content summary |
| /api/batch/scrape | POST | Start batch processing |
| /api/batch/scrape/:id/status | GET | Get batch status |
| /api/batch/scrape/:id/download/zip | GET | Download batch as ZIP |
| /api/batch/scrape/:id/download/json | GET | Download batch as JSON |
| /api/batch/scrape/:id/download/:jobId | GET | Download individual result |
| /api/batch/scrape/:id | DELETE | Cancel batch processing |
| /api/crawl | POST | Start web crawl |
| /api/crawl/:id | GET | Get crawl status |
| /api/cache | DELETE | Clear cache |
# Core
API_KEY=your-secret-key
PORT=3000
# OpenAI
OPENAI_API_KEY=your-key
OPENAI_DEPLOYMENT_NAME=gpt-4o
LLM_TEMPERATURE=0.2
# Cache
CACHE_ENABLED=true
CACHE_TTL=3600
CACHE_DIRECTORY=./cache
# Redis (for job queue)
REDIS_HOST=localhost
REDIS_PORT=6379
# Crawl file export
CRAWL_OUTPUT_DIR=./crawl-output
interface ScraperOptions {
extractorFormat?: 'html' | 'markdown' | 'text'
waitForSelector?: string
waitForTimeout?: number
actions?: BrowserAction[] // click, scroll, wait, fill
skipCache?: boolean
cacheTtl?: number
stealthMode?: boolean
proxy?: string
userAgent?: string
}
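For example, an options payload combining several of these fields (a sketch; the action objects reuse the shapes shown in the batch example above, and the selectors are placeholders):

// Sketch: a ScraperOptions value built from the fields declared above.
const options: ScraperOptions = {
  extractorFormat: 'markdown',
  waitForSelector: '#content', // placeholder selector
  waitForTimeout: 5000,
  stealthMode: true,
  cacheTtl: 600,
  actions: [
    { type: 'click', selector: '.accept-cookies', optional: true },
    { type: 'wait', timeout: 2000 },
  ],
};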
# Build and run
docker build -t deepscrape .
docker run -d -p 3000:3000 --env-file .env deepscrape
# Or use docker-compose
docker-compose up -d
Interact with dynamic content:
{
"url" : " https://example.com" ,
"options" : {
"actions" : [
{ "type" : " click" , "selector" : " #load-more" },
{ "type" : " wait" , "timeout" : 2000 },
{ "type" : " scroll" , "position" : 1000 }
]
}
}
BFS (default) - Breadth-first exploration
DFS - Depth-first for deep content
Best-First - Priority-based on content relevance
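The strategy is selected with the strategy field of the crawl request. For example, a depth-first variant of the earlier request (a sketch; the lowercase value "dfs" is an assumption, mirroring the "bfs" value shown above):

// Sketch: crawl request body selecting depth-first traversal.
// The "dfs" value is assumed to mirror the lowercase "bfs" shown earlier.
const crawlRequest = {
  url: 'https://docs.example.com',
  limit: 50,
  maxDepth: 5, // deeper trees suit a depth-first walk
  strategy: 'dfs',
  scrapeOptions: { extractorFormat: 'markdown' },
};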
Use clear description fields in your JSON Schema
Start with simple schemas and iterate
Use lower temperature values for more consistent results
Include examples in descriptions for better accuracy
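For instance, a schema fragment applying the last two tips (the field names are illustrative only, not part of the API):

// Sketch: descriptive fields with embedded example values, per the tips above.
const productSchema = {
  type: 'object',
  properties: {
    price: {
      type: 'string',
      description: 'Listed price including the currency symbol, e.g. "$29.99"',
    },
    releaseDate: {
      type: 'string',
      description: 'Release date in ISO 8601 format, e.g. "2024-01-15"',
    },
  },
  required: ['price'],
};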
Each crawled page is automatically exported as a markdown file with:
Filename format: YYYY-MM-DD_crawlId_hostname_path.md
YAML frontmatter with metadata (URL, title, crawl date, status)
Organized structure: ./crawl-output/{crawl-id}/
Automatic summary: Generated when the crawl completes
Example file structure:
crawl-output/
├── abc123-def456/
│ ├── 2024-01-15_abc123_docs.example.com_getting-started.md
│ ├── 2024-01-15_abc123_docs.example.com_api-reference.md
│ ├── 2024-01-15_abc123_docs.example.com_tutorials.md
│ ├── abc123-def456_summary.md
│ ├── abc123-def456_consolidated.md # 🆕 All pages in one file
│ └── abc123-def456_consolidated.json # 🆕 Structured JSON export
└── xyz789-ghi012/
└── ...
Consolidated Export Features:
Single Markdown: All crawled pages combined into one readable file
JSON Export: Structured data with metadata for programmatic use
Auto-Generated: Created automatically when the crawl completes
Rich Metadata: Preserves all page metadata and crawl statistics
File content example:
---
url : " https://docs.example.com/getting-started"
title : " Getting Started Guide"
crawled_at : " 2024-01-15T10:30:00.000Z"
status : 200
content_type : " markdown"
load_time : 1250ms
browser_mode : false
---
# Getting Started Guide
Welcome to the getting started guide...
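Because every exported page carries YAML frontmatter, the files are straightforward to feed into a RAG or indexing pipeline. A sketch using the gray-matter package (an assumed dependency; any YAML frontmatter parser works):

// Sketch: load exported crawl files and split frontmatter from the markdown body.
import { readdir, readFile } from 'node:fs/promises';
import { join } from 'node:path';
import matter from 'gray-matter'; // assumed dependency

async function loadCrawlOutput(dir: string) {
  const docs: { url: string; title: string; content: string }[] = [];
  for (const name of await readdir(dir)) {
    if (!name.endsWith('.md')) continue;
    const raw = await readFile(join(dir, name), 'utf8');
    const { data, content } = matter(raw); // data = frontmatter, content = body
    docs.push({ url: data.url, title: data.title, content });
  }
  return docs;
}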
┌───────────────┐ REST ┌────────────────────────┐
│ Client │────────────▶│ Express API Gateway │
└───────────────┘ └────────┬───────────────┘
│ (Job Payload)
▼
┌───────────────────────┐
│ BullMQ Job Queue │ (Redis)
└────────┬──────────────┘
│
pulls job │ pushes result
▼
┌─────────────────┐ Playwright ┌─────────────────┐ GPT-4o ┌──────────────┐
│ Scraper Worker │──────────▶│ Extractor │────────▶│ OpenAI │
└─────────────────┘ └─────────────────┘ └──────────────┘
(Headless Browser) (HTML → MD/Text/JSON) (LLM API)
│
▼
Cache Layer (FS/Redis)
📦 Batch processing with controlled concurrency
📥 Multiple download formats (ZIP, JSON, individual files)
🚸 Browser pooling & warm-up
🧠 Automatic schema generation (LLM)
📊 Prometheus metrics & Grafana dashboard
🌐 Cloud-native cache backends (S3/Redis)
🌈 Web UI playground
🔔 Advanced webhook payloads with retry logic
📈 Batch processing analytics and insights
Apache 2.0 - see LICENSE file
Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request
Star ⭐ this repo if you find it useful!