Show HN: AnyCrawl v0.0.1-alpha.5 – custom user-agent and richer scraping API


AnyCrawl is a high-performance web crawling and scraping application that excels in multiple domains:

  • SERP Crawling: Support for multiple search engines with batch processing capabilities
  • Web Crawling: Efficient single-page content extraction
  • Site Crawling: Comprehensive full-site crawling with intelligent traversal
  • High Performance: Multi-threading and multi-process architecture
  • Batch Processing: Efficient handling of batch crawling tasks

AnyCrawl is built on a modern architecture and optimized for LLM (Large Language Model) workflows.

📖 For detailed documentation, visit Docs

To run AnyCrawl with Docker Compose:

```bash
docker compose up --build
```
| Variable | Description | Default | Example |
|---|---|---|---|
| `NODE_ENV` | Runtime environment | `production` | `production`, `development` |
| `ANYCRAWL_API_PORT` | API service port | `8080` | `8080` |
| `ANYCRAWL_HEADLESS` | Use headless mode for browser engines | `true` | `true`, `false` |
| `ANYCRAWL_PROXY_URL` | Proxy server URL (supports HTTP and SOCKS) | (none) | `http://proxy:8080` |
| `ANYCRAWL_IGNORE_SSL_ERROR` | Ignore SSL certificate errors | `true` | `true`, `false` |
| `ANYCRAWL_KEEP_ALIVE` | Keep connections alive between requests | `true` | `true`, `false` |
| `ANYCRAWL_AVAILABLE_ENGINES` | Available scraping engines (comma-separated) | `cheerio,playwright,puppeteer` | `playwright,puppeteer` |
| `ANYCRAWL_API_DB_TYPE` | Database type | `sqlite` | `sqlite`, `postgresql` |
| `ANYCRAWL_API_DB_CONNECTION` | Database connection string/path | `/usr/src/app/db/database.db` | `/path/to/db.sqlite`, `postgresql://user:pass@localhost/db` |
| `ANYCRAWL_REDIS_URL` | Redis connection URL | `redis://redis:6379` | `redis://localhost:6379` |
| `ANYCRAWL_API_AUTH_ENABLED` | Enable API authentication | `false` | `true`, `false` |
| `ANYCRAWL_API_CREDITS_ENABLED` | Enable credit system | `false` | `true`, `false` |
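For self-hosted deployments, these variables can be supplied via a `.env` file (or the `environment` section of your compose file). A minimal sketch with illustrative values, keeping the defaults from the table above:

```
NODE_ENV=production
ANYCRAWL_API_PORT=8080
ANYCRAWL_HEADLESS=true
ANYCRAWL_AVAILABLE_ENGINES=cheerio,playwright,puppeteer
ANYCRAWL_REDIS_URL=redis://redis:6379
ANYCRAWL_API_AUTH_ENABLED=false
```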

💡 You can use Playground to test APIs and generate code examples for your preferred programming language.

Note: If you are self-hosting AnyCrawl, make sure to replace https://api.anycrawl.dev with your own server URL.

```bash
curl -X POST http://localhost:8080/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "url": "https://example.com",
    "engine": "cheerio"
  }'
```
| Parameter | Type | Description | Default |
|---|---|---|---|
| `url` | string (required) | The URL to be scraped. Must be a valid URL starting with `http://` or `https://` | - |
| `engine` | string | Scraping engine to use. Options: `cheerio` (static HTML parsing, fastest), `playwright` (JavaScript rendering with modern engine), `puppeteer` (JavaScript rendering with Chrome) | `cheerio` |
| `proxy` | string | Proxy URL for the request. Supports HTTP and SOCKS proxies. Format: `http://[username]:[password]@proxy:port` | (none) |
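The same request can be issued from code. A minimal stdlib-only Python sketch mirroring the curl example above; the base URL and API key are placeholders (replace them with your own server URL and key):

```python
import json
import urllib.request

def build_scrape_request(url, engine="cheerio",
                         base_url="http://localhost:8080",
                         api_key="YOUR_ANYCRAWL_API_KEY"):
    """Construct a POST /v1/scrape request matching the documented parameters."""
    body = json.dumps({"url": url, "engine": engine}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/scrape",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

def scrape(url, **kwargs):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_scrape_request(url, **kwargs)) as resp:
        return json.load(resp)

# result = scrape("https://example.com")  # requires a running AnyCrawl instance
```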

Search Engine Results (SERP)

```bash
curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
    "query": "AnyCrawl",
    "limit": 10,
    "engine": "google",
    "lang": "all"
  }'
```
| Parameter | Type | Description | Default |
|---|---|---|---|
| `query` | string (required) | Search query to be executed | - |
| `engine` | string | Search engine to use. Options: `google` | `google` |
| `pages` | integer | Number of search result pages to retrieve | `1` |
| `lang` | string | Language code for search results (e.g., `en`, `zh`, `all`) | `en-US` |
Supported search engines:

  • Google
  1. Q: Can I use proxies? A: Yes, AnyCrawl supports both HTTP and SOCKS proxies. Configure them through the ANYCRAWL_PROXY_URL environment variable.

  2. Q: How to handle JavaScript-rendered content? A: AnyCrawl supports Puppeteer and Playwright for JavaScript rendering needs.
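The two answers above combine naturally in a single scrape request: pick a JavaScript-rendering engine and route the request through a per-request proxy. A sketch of the request body, using the parameters documented above (the proxy URL and credentials are placeholders):

```python
import json

# Body for POST /v1/scrape: render JavaScript via playwright and
# route the request through a proxy (HTTP and SOCKS are supported).
payload = json.dumps({
    "url": "https://example.com",
    "engine": "playwright",                  # JavaScript rendering
    "proxy": "http://user:pass@proxy:8080",  # placeholder proxy URL
})
print(payload)
```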

We welcome contributions! Please see our Contributing Guide for details.

This project is licensed under the MIT License - see the LICENSE file for details.

Our mission is to build foundational products for the AI ecosystem, providing essential tools that empower both individuals and enterprises to develop AI applications. We are committed to accelerating the advancement of AI technology by delivering robust, scalable infrastructure that serves as the cornerstone for innovation in artificial intelligence.


Built with ❤️ by the Any4AI team
