Webcloner-JS – A powerful, stealthy website cloner/scraper


A powerful, stealthy website cloner/scraper built with TypeScript that downloads entire websites for offline use. Supports HTTP proxy authentication, comprehensive asset downloading (CSS, JS, images, SVG sprites, fonts, etc.), and intelligent URL rewriting.

  • 🚀 Complete Website Cloning - Downloads HTML, CSS, JavaScript, images, fonts, and all other assets
  • 🔒 HTTP Proxy Support - Connect through HTTP proxies with username/password authentication
  • 🎯 SVG Sprite Support - Properly handles SVG sprites with xlink:href references
  • 🔄 Smart URL Rewriting - Converts all URLs to relative local paths for offline browsing
  • 🕷️ Stealthy Crawling - Configurable delays, random user agents, and realistic headers
  • 📦 Asset Discovery - Extracts assets from (see the sketch after this list):
    • HTML tags (img, script, link, etc.)
    • CSS files (background images, fonts, etc.)
    • Inline styles
    • SVG sprites and references
    • srcset attributes
    • Data attributes (data-src, data-lazy-src)
  • 🎨 CSS Processing - Parses CSS files to download referenced assets
  • 🌐 External Link Handling - Optional following of external links
  • 📊 Progress Tracking - Real-time statistics and detailed logging
  • ⚙️ Highly Configurable - Control depth, patterns, delays, and more
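Asset discovery is the feature doing most of the work above. The README does not say which HTML parser the project uses, so the following is a minimal illustrative sketch only, assuming cheerio for DOM traversal; `discoverAssets` and `discoverCssAssets` are hypothetical helper names, not the project's API.

```typescript
// Illustrative sketch only; the project's actual parser and function names may differ.
import * as cheerio from "cheerio";

/** Collect candidate asset URLs from an HTML document (hypothetical helper). */
export function discoverAssets(html: string, baseUrl: string): string[] {
  const $ = cheerio.load(html);
  const urls = new Set<string>();

  const add = (value?: string) => {
    if (!value || value.startsWith("data:") || value.startsWith("#")) return;
    try {
      urls.add(new URL(value, baseUrl).href); // resolve relative URLs against the page
    } catch {
      /* ignore values that are not valid URLs */
    }
  };

  // Plain tag attributes: images, scripts, stylesheets, media.
  $("img[src], script[src], source[src], video[src], audio[src]").each((_, el) =>
    add($(el).attr("src"))
  );
  $('link[rel="stylesheet"][href], link[rel="icon"][href]').each((_, el) =>
    add($(el).attr("href"))
  );

  // Lazy-loading data attributes.
  $("[data-src], [data-lazy-src]").each((_, el) =>
    add($(el).attr("data-src") ?? $(el).attr("data-lazy-src"))
  );

  // srcset: comma-separated "url width" pairs.
  $("[srcset]").each((_, el) => {
    for (const part of ($(el).attr("srcset") ?? "").split(",")) {
      add(part.trim().split(/\s+/)[0]);
    }
  });

  // SVG sprite references: keep only the file part, drop the #fragment.
  $("use").each((_, el) => {
    const ref = $(el).attr("xlink:href") ?? $(el).attr("href");
    add(ref?.split("#")[0]);
  });

  return [...urls];
}

/** Pull url(...) references out of a stylesheet (covers fonts and background images). */
export function discoverCssAssets(css: string, baseUrl: string): string[] {
  const urls: string[] = [];
  const re = /url\(\s*['"]?([^'")]+)['"]?\s*\)/g;
  for (const match of css.matchAll(re)) {
    const value = match[1];
    if (value.startsWith("data:")) continue; // inline data URIs need no download
    try {
      urls.push(new URL(value, baseUrl).href);
    } catch {
      /* skip invalid values */
    }
  }
  return urls;
}
```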
```bash
# Install dependencies
npm install

# Build the project
npm run build

# Or use directly with ts-node
npm run dev -- <url> [options]
```
```bash
# Clone a website to default directory (./cloned-site)
npm run dev -- https://example.com

# Specify output directory
npm run dev -- https://example.com -o ./my-site

# Set crawl depth
npm run dev -- https://example.com -d 5
```
```bash
# Using proxy with authentication
npm run dev -- https://example.com \
  --proxy-host proxy.example.com \
  --proxy-port 8080 \
  --proxy-user myusername \
  --proxy-pass mypassword

# Using proxy without authentication
npm run dev -- https://example.com \
  --proxy-host proxy.example.com \
  --proxy-port 8080
```
```bash
# Full example with all options
npm run dev -- https://example.com \
  -o ./output \
  -d 3 \
  --delay 200 \
  --follow-external \
  --user-agent "Mozilla/5.0 Custom Agent" \
  --include ".*\\.example\\.com.*" ".*\\.cdn\\.com.*" \
  --exclude ".*\\.pdf$" ".*login.*" \
  --header "Authorization: Bearer token123" \
  --header "X-Custom-Header: value" \
  --proxy-host proxy.example.com \
  --proxy-port 8080 \
  --proxy-user username \
  --proxy-pass password
```
| Option | Description | Default |
|---|---|---|
| `<url>` | Target website URL to clone | Required |
| `-o, --output <dir>` | Output directory | `./cloned-site` |
| `-d, --depth <number>` | Maximum crawl depth | `3` |
| `--delay <ms>` | Delay between requests (milliseconds) | `100` |
| `--proxy-host <host>` | Proxy server host | - |
| `--proxy-port <port>` | Proxy server port | - |
| `--proxy-user <username>` | Proxy authentication username | - |
| `--proxy-pass <password>` | Proxy authentication password | - |
| `--user-agent <agent>` | Custom user agent string | Random |
| `--follow-external` | Follow external links | `false` |
| `--include <patterns...>` | Include URL patterns (regex) | All |
| `--exclude <patterns...>` | Exclude URL patterns (regex) | None |
| `--header <header...>` | Custom headers (format: "Key: Value") | - |
```typescript
import { WebsiteCloner } from "./src/cloner.js";

const cloner = new WebsiteCloner({
  targetUrl: "https://example.com",
  outputDir: "./cloned-site",
  maxDepth: 3,
  delay: 100,
  proxy: {
    host: "proxy.example.com",
    port: 8080,
    username: "user",
    password: "pass",
  },
  userAgent: "Custom User Agent",
  followExternalLinks: false,
  includePatterns: [".*\\.example\\.com.*"],
  excludePatterns: [".*\\.pdf$"],
  headers: {
    Authorization: "Bearer token",
  },
});

await cloner.clone();
```
  1. Initial Request - Downloads the target URL's HTML content
  2. Asset Extraction - Parses HTML to find all assets:
    • Stylesheets (<link rel="stylesheet">)
    • Scripts (<script src>)
    • Images (<img>, srcset, background images)
    • SVG sprites (<use xlink:href>)
    • Fonts (from CSS @font-face)
    • Videos, audio, iframes, etc.
  3. Asset Download - Downloads each asset with proper referer headers
  4. CSS Processing - Parses CSS files to find and download referenced assets
  5. URL Rewriting - Converts all absolute URLs to relative local paths (see the sketch after this list)
  6. Link Crawling - Follows links within the same domain (respecting depth limit)
  7. File Organization - Saves files maintaining directory structure
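Step 5 is what makes the clone browsable offline. The project's actual rewriting code is not shown in the README; the following is a minimal sketch of the idea, assuming each remote URL has already been mapped to a file on disk, with `toRelativeHref` as a hypothetical helper name.

```typescript
// Minimal sketch of URL rewriting, not the project's actual implementation.
import * as path from "node:path";

/**
 * Given the local file that contains the reference and the local file the
 * asset was saved to, compute the relative path to write into the HTML/CSS.
 */
export function toRelativeHref(fromFile: string, toFile: string): string {
  // path.relative works on directories, so start from the referencing file's folder.
  const rel = path.relative(path.dirname(fromFile), toFile);
  // Browsers expect forward slashes and an explicit "./" for same-directory links.
  const posix = rel.split(path.sep).join("/");
  return posix.startsWith(".") ? posix : `./${posix}`;
}

// Example: a page saved at cloned-site/blog/post.html referencing an image
// saved at cloned-site/assets/images/logo.png => "../assets/images/logo.png"
console.log(toRelativeHref("cloned-site/blog/post.html", "cloned-site/assets/images/logo.png"));
```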

The cloner properly handles SVG sprites referenced with xlink:href:

```html
<!-- Original -->
<svg class="icon">
  <use xlink:href="./assets/sprite.svg#icon-name"></use>
</svg>

<!-- After cloning (with proper relative path) -->
<svg class="icon">
  <use xlink:href="../assets/sprite.svg#icon-name"></use>
</svg>
```
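The fragment after `#` identifies the symbol inside the sprite, so only the file part of the reference needs rewriting. A hedged sketch of that rewrite, again assuming cheerio; `rewriteSpriteRefs` and `localPathFor` are hypothetical names, not the project's API.

```typescript
// Sketch only, not the project's actual code: rewrite <use xlink:href> so the file
// part points at the downloaded sprite while the #fragment (symbol id) is preserved.
import * as cheerio from "cheerio";
import * as path from "node:path";

export function rewriteSpriteRefs(
  html: string,
  pageFile: string,                                        // local path of the page being rewritten
  localPathFor: (remoteUrl: string) => string | undefined  // hypothetical URL-to-file lookup
): string {
  const $ = cheerio.load(html);
  $("use").each((_, el) => {
    const ref = $(el).attr("xlink:href") ?? $(el).attr("href");
    if (!ref) return;
    const [file, fragment = ""] = ref.split("#");
    const local = localPathFor(file);
    if (!local) return; // sprite was not downloaded; leave the reference untouched
    const rel = path.relative(path.dirname(pageFile), local).split(path.sep).join("/");
    $(el).attr("xlink:href", `${rel}#${fragment}`);
  });
  return $.html();
}
```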
```
cloned-site/
├── index.html              # Main page
├── about.html              # Other pages
├── assets/
│   ├── css/
│   │   └── style.css
│   ├── js/
│   │   └── script.js
│   ├── images/
│   │   ├── logo.png
│   │   └── sprite.svg
│   └── fonts/
│       └── font.woff2
├── external/               # External domain assets (if enabled)
│   └── cdn_example_com/
│       └── library.js
└── url-mapping.json        # URL to local path mapping
```
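url-mapping.json records which local file each remote URL was saved to. Its exact schema is not documented here, so the snippet below is only a guess at how it might be read back, assuming a flat URL-to-path object; `loadMapping` is a hypothetical helper.

```typescript
// Assumes url-mapping.json is a flat { "remote URL": "local path" } object;
// the real schema may differ.
import { readFile } from "node:fs/promises";

type UrlMapping = Record<string, string>;

async function loadMapping(dir: string): Promise<UrlMapping> {
  const raw = await readFile(`${dir}/url-mapping.json`, "utf8");
  return JSON.parse(raw) as UrlMapping;
}

const mapping = await loadMapping("./cloned-site");
// e.g. look up where the homepage was saved locally
console.log(mapping["https://example.com/"]);
```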
  • Random User Agents - Rotates between realistic browser user agents
  • Realistic Headers - Includes Accept, Accept-Language, Accept-Encoding, etc.
  • Referer Headers - Sends proper referer for each request
  • Configurable Delays - Adds delays between requests to avoid detection
  • Proxy Support - Routes traffic through HTTP proxies
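Put together, a stealth-minded request cycle amounts to picking a random user agent, attaching browser-like headers and a referer, routing through the proxy, and sleeping between requests. The README does not name the HTTP client the project uses; the sketch below assumes axios and its built-in proxy option, and `fetchStealthily` plus the proxy values are placeholders.

```typescript
// Illustrative sketch (assumes axios); the project's actual client and fields may differ.
import axios from "axios";

const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
];

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchStealthily(url: string, referer: string, delayMs = 100) {
  await sleep(delayMs); // configurable delay between requests
  return axios.get(url, {
    headers: {
      "User-Agent": USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)],
      Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "Accept-Language": "en-US,en;q=0.9",
      "Accept-Encoding": "gzip, deflate, br",
      Referer: referer,
    },
    // HTTP proxy with basic auth (placeholder values).
    proxy: {
      host: "proxy.example.com",
      port: 8080,
      auth: { username: "user", password: "pass" },
    },
    responseType: "arraybuffer", // works for both text and binary assets
  });
}
```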
  • Failed downloads are logged but don't stop the cloning process
  • Statistics show successful and failed downloads
  • Detailed error messages for debugging
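In other words, a failed asset is recorded and skipped rather than aborting the run. A minimal sketch of that pattern, with hypothetical names (`downloadAll`, `CloneStats`) rather than the project's real API:

```typescript
// Hypothetical bookkeeping; the real cloner's stats fields may differ.
interface CloneStats {
  downloaded: number;
  failed: number;
  errors: { url: string; message: string }[];
}

async function downloadAll(
  urls: string[],
  download: (url: string) => Promise<void>
): Promise<CloneStats> {
  const stats: CloneStats = { downloaded: 0, failed: 0, errors: [] };
  for (const url of urls) {
    try {
      await download(url);
      stats.downloaded++;
    } catch (err) {
      // Log and keep going: one bad asset should not stop the clone.
      stats.failed++;
      stats.errors.push({ url, message: err instanceof Error ? err.message : String(err) });
      console.warn(`Failed to download ${url}`);
    }
  }
  return stats;
}
```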
  1. Adjust Delay - Lower delay for faster cloning (but less stealthy)
  2. Limit Depth - Reduce depth for large sites
  3. Use Patterns - Include/exclude patterns to focus on specific content
  4. Proxy Selection - Use fast, reliable proxies for better performance
  • JavaScript-rendered content requires the page to be pre-rendered
  • Dynamic content loaded via AJAX may not be captured
  • Some anti-scraping measures may block requests
  • Very large sites may take significant time to clone

⚠️ Important: Always respect website terms of service and robots.txt. This tool is for:

  • Backing up your own websites
  • Archiving public domain content
  • Educational purposes
  • Authorized testing

Do not use this tool to:

  • Violate copyright laws
  • Bypass paywalls or authentication
  • Overload servers with requests
  • Access restricted content without permission

License: ISC

Contributions are welcome! Please feel free to submit issues or pull requests.

"Failed to download" errors

  • Check if the website blocks scrapers
  • Try increasing the delay
  • Use a different user agent
  • Check proxy configuration

Missing pages or assets

  • Increase crawl depth
  • Check include/exclude patterns
  • Some assets may be loaded dynamically via JavaScript

Proxy connection issues

  • Verify proxy credentials
  • Check proxy host and port
  • Ensure proxy supports HTTP/HTTPS

Clone a blog with limited depth

```bash
npm run dev -- https://blog.example.com -d 2 -o ./blog-backup
```
Clone through an authenticated proxy

```bash
npm run dev -- https://example.com \
  --proxy-host 192.168.1.100 \
  --proxy-port 3128 \
  --proxy-user admin \
  --proxy-pass secret123
```

Clone only specific sections

```bash
npm run dev -- https://example.com \
  --include ".*example\\.com/docs.*" \
  --exclude ".*\\.pdf$"
```

Clone with custom headers

```bash
npm run dev -- https://api.example.com \
  --header "Authorization: Bearer YOUR_TOKEN" \
  --header "X-API-Key: YOUR_KEY"
```