Tool that tells you how hard a website is to scrape

Know before you scrape. Analyze any website's anti-bot protections in seconds.

Stop wasting hours building scrapers only to discover the site has Cloudflare + JavaScript rendering + CAPTCHA + rate limiting. caniscrape does reconnaissance upfront so you know exactly what you're dealing with before writing a single line of code.

caniscrape analyzes a URL and tells you:

  • What protections are active (WAF, CAPTCHA, rate limits, TLS fingerprinting, honeypots)
  • Difficulty score (0-10 scale: Easy → Very Hard)
  • Specific recommendations on what tools/proxies you'll need
  • Estimated complexity so you can decide: build it yourself or use a service

Required dependencies:

# Install wafw00f (WAF detection)
pipx install wafw00f

# Install Playwright browsers (for JS detection)
playwright install chromium

Basic usage:

caniscrape https://example.com

[Screenshot: caniscrape output]

What it checks:

WAF detection:
  • Identifies Web Application Firewalls (Cloudflare, Akamai, Imperva, DataDome, PerimeterX, etc.)

Rate limiting:
  • Tests with burst and sustained traffic patterns
  • Detects HTTP 429s, timeouts, throttling, soft bans
  • Determines blocking threshold (requests/min)

JavaScript rendering:
  • Compares content with/without JS execution
  • Detects SPAs (React, Vue, Angular)
  • Calculates percentage of content missing without JS

CAPTCHA detection:
  • Scans for reCAPTCHA, hCaptcha, Cloudflare Turnstile
  • Tests if CAPTCHA appears on load or after rate limiting
  • Monitors network traffic for challenge endpoints

TLS fingerprinting:
  • Compares standard Python clients vs browser-like clients
  • Detects if site blocks based on TLS handshake signatures (see the first sketch after this list)

Honeypots and behavioral analysis:
  • Scans for invisible "honeypot" links (bot traps) (see the second sketch after this list)
  • Detects if site is monitoring mouse/scroll behavior

robots.txt:
  • Checks scraping permissions
  • Extracts recommended crawl-delay
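
To make the TLS check concrete, here is a minimal sketch of the comparison it implies, assuming the stdlib urllib client stands in for the "standard Python client" and curl_cffi's Chrome impersonation stands in for the "browser-like client". The function name and the status-code heuristic are illustrative, not caniscrape's actual implementation.

import urllib.error
import urllib.request

from curl_cffi import requests as curl_requests

def tls_fingerprinting_likely(url: str) -> bool:
    """Heuristic: blocked with a generic TLS handshake, allowed with a browser-like one."""
    headers = {"User-Agent": "Mozilla/5.0"}
    # 1. Standard Python client -> generic (non-browser) TLS fingerprint.
    try:
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=15) as resp:
            plain_status = resp.status
    except urllib.error.HTTPError as exc:
        plain_status = exc.code
    # 2. curl_cffi impersonating Chrome -> browser-like TLS fingerprint.
    browser_status = curl_requests.get(url, impersonate="chrome", timeout=15).status_code
    # Same headers, different handshake: if only the generic client is blocked
    # or challenged, TLS fingerprinting is the likely cause.
    return plain_status in (403, 429, 503) and browser_status == 200

If a site trips this heuristic, the --impersonate flag described below applies the same curl_cffi impersonation during analysis.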
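
The honeypot scan can likewise be approximated with beautifulsoup4, which is already a core dependency. This sketch only looks at inline styles and hidden attributes; the marker list and function name are assumptions for illustration, and links hidden via external CSS would need a rendered page (e.g. via Playwright) to catch.

from bs4 import BeautifulSoup

# Inline-style markers commonly used to hide bot-trap links (illustrative list).
HIDDEN_MARKERS = ("display:none", "visibility:hidden")

def honeypot_links(html: str) -> list[str]:
    """Return hrefs of anchor tags that appear hidden from human visitors."""
    soup = BeautifulSoup(html, "html.parser")
    traps = []
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").replace(" ", "").lower()
        hidden = a.has_attr("hidden") or a.get("aria-hidden") == "true"
        if hidden or any(marker in style for marker in HIDDEN_MARKERS):
            traps.append(a["href"])
    return traps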

# Find ALL WAFs (slower, may trigger rate limits)
caniscrape https://example.com --find-all

# Use curl_cffi for better stealth (slower but more likely to succeed)
caniscrape https://example.com --impersonate

# Check 2/3 of links (more accurate, slower)
caniscrape https://example.com --thorough

# Check ALL links (most accurate, very slow on large sites)
caniscrape https://example.com --deep
# Combine flags as needed
caniscrape https://example.com --impersonate --find-all --thorough

The tool calculates a 0-10 difficulty score based on:

Factor                        Impact
CAPTCHA on page load          +5 points
CAPTCHA after rate limit      +4 points
DataDome/PerimeterX WAF       +4 points
Akamai/Imperva WAF            +3 points
Aggressive rate limiting      +3 points
Cloudflare WAF                +2 points
Honeypot traps detected       +2 points
TLS fingerprinting active     +1 point

Score interpretation:

  • 0-2: Easy (basic scraping will work)
  • 3-4: Medium (need some precautions)
  • 5-7: Hard (requires advanced techniques)
  • 8-10: Very Hard (consider using a service)
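
The factor table and the interpretation bands above amount to a small additive model. Below is a rough sketch of that arithmetic; the signal names and the cap at 10 are assumptions made for the example, while the weights and bands come straight from the tables.

# Weights from the factor table above; keys are illustrative signal names.
SCORE_WEIGHTS = {
    "captcha_on_load": 5,
    "captcha_after_rate_limit": 4,
    "waf_datadome_perimeterx": 4,
    "waf_akamai_imperva": 3,
    "aggressive_rate_limiting": 3,
    "waf_cloudflare": 2,
    "honeypots_detected": 2,
    "tls_fingerprinting": 1,
}

def difficulty(signals: set[str]) -> tuple[int, str]:
    """Sum the weights of detected signals, cap at 10, and map to a band."""
    score = min(10, sum(SCORE_WEIGHTS[s] for s in signals))
    if score <= 2:
        band = "Easy"
    elif score <= 4:
        band = "Medium"
    elif score <= 7:
        band = "Hard"
    else:
        band = "Very Hard"
    return score, band

# Example: Cloudflare WAF plus a CAPTCHA after rate limiting -> (6, "Hard")
print(difficulty({"waf_cloudflare", "captcha_after_rate_limit"}))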

Requirements:

  • Python 3.9+
  • pip or pipx
# 1. Install caniscrape
pip install caniscrape

# 2. Install wafw00f (WAF detection)
# Option A: Using pipx (recommended)
python -m pip install --user pipx
pipx install wafw00f

# Option B: Using pip
pip install wafw00f

# 3. Install Playwright browsers (for JS/CAPTCHA/behavioral detection)
playwright install chromium

Core dependencies (installed automatically):

  • click - CLI framework
  • rich - Terminal formatting
  • aiohttp - Async HTTP requests
  • beautifulsoup4 - HTML parsing
  • playwright - Headless browser automation
  • curl_cffi - Browser impersonation

External tools (install separately):

  • wafw00f - WAF detection

When to use it:

  • Before building a scraper: Check if it's even feasible
  • Debugging scraper issues: Identify what protection broke your scraper
  • Client estimates: Give accurate time/cost estimates for scraping projects
  • Pipeline planning: Know what infrastructure you'll need (proxies, CAPTCHA solvers)
  • Cost estimation: Calculate proxy/CAPTCHA costs before committing to a data source
  • Site selection: Find the easiest data sources for your research
  • Compliance: Check robots.txt before scraping (see the sketch after this list)
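
For the compliance check, the Python standard library already handles robots.txt. The sketch below uses urllib.robotparser to read permissions and crawl-delay; the helper name and the default user agent are assumptions, and caniscrape's own handling may differ.

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def robots_summary(base_url: str, path: str = "/", user_agent: str = "*") -> dict:
    """Report whether a path may be fetched and any crawl-delay from robots.txt."""
    parser = RobotFileParser()
    parser.set_url(urljoin(base_url, "/robots.txt"))
    parser.read()  # fetches and parses the live robots.txt
    return {
        "can_fetch": parser.can_fetch(user_agent, urljoin(base_url, path)),
        "crawl_delay": parser.crawl_delay(user_agent),  # None if not specified
    }

print(robots_summary("https://example.com"))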

⚠️ Limitations & Disclaimers

The tool may not detect:

  • Dynamic protections: Some sites only trigger defenses under specific conditions
  • Behavioral AI: Advanced ML-based bot detection that adapts in real time
  • Account-based restrictions: Protections that only activate for logged-in users

Keep in mind:

  • This tool is for reconnaissance only - it does not bypass protections
  • Always respect robots.txt and terms of service
  • Some sites may consider aggressive scanning hostile - use --find-all and --deep sparingly
  • You are responsible for how you use this tool and any scrapers you build
  • Analysis takes 30-60 seconds per URL
  • Some checks require making multiple requests (which may themselves trigger rate limits)
  • Results are a snapshot - protections can change over time

Found a bug? Have a feature request? Contributions are welcome!

  1. Fork the repo
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

MIT License - see LICENSE file for details

Built on top of open-source tools including wafw00f, Playwright, and curl_cffi.

Questions? Feedback? Open an issue on GitHub.


Remember: This tool tells you HOW HARD it will be to scrape. It doesn't do the scraping for you. Use it to make informed decisions before you start building.
