Webcloner-JS – A powerful, stealthy website cloner/scraper


A powerful, stealthy website cloner/scraper built with TypeScript that downloads entire websites for offline use. Supports HTTP proxy authentication, comprehensive asset downloading (CSS, JS, images, SVG sprites, fonts, etc.), and intelligent URL rewriting.

  • 🚀 Complete Website Cloning - Downloads HTML, CSS, JavaScript, images, fonts, and all other assets
  • 🔒 HTTP Proxy Support - Connect through HTTP proxies with username/password authentication
  • 🎯 SVG Sprite Support - Properly handles SVG sprites with xlink:href references
  • 🔄 Smart URL Rewriting - Converts all URLs to relative local paths for offline browsing
  • 🕷️ Stealthy Crawling - Configurable delays, random user agents, and realistic headers
  • 📦 Asset Discovery - Extracts assets from (see the sketch after this list):
    • HTML tags (img, script, link, etc.)
    • CSS files (background images, fonts, etc.)
    • Inline styles
    • SVG sprites and references
    • srcset attributes
    • Data attributes (data-src, data-lazy-src)
  • 🎨 CSS Processing - Parses CSS files to download referenced assets
  • 🌐 External Link Handling - Optional following of external links
  • 📊 Progress Tracking - Real-time statistics and detailed logging
  • ⚙️ Highly Configurable - Control depth, patterns, delays, and more
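Asset discovery is the feature doing most of the work above. The README does not say which HTML parser the project uses, so the following is a minimal illustrative sketch only, assuming cheerio for DOM traversal; `discoverAssets` and `discoverCssAssets` are hypothetical helper names, not the project's API.

```typescript
// Illustrative sketch only; the project's actual parser and function names may differ.
import * as cheerio from "cheerio";

/** Collect candidate asset URLs from an HTML document (hypothetical helper). */
export function discoverAssets(html: string, baseUrl: string): string[] {
  const $ = cheerio.load(html);
  const urls = new Set<string>();

  const add = (value?: string) => {
    if (!value || value.startsWith("data:") || value.startsWith("#")) return;
    try {
      urls.add(new URL(value, baseUrl).href); // resolve relative URLs against the page
    } catch {
      /* ignore values that are not valid URLs */
    }
  };

  // Plain tag attributes: images, scripts, stylesheets, media.
  $("img[src], script[src], source[src], video[src], audio[src]").each((_, el) =>
    add($(el).attr("src"))
  );
  $('link[rel="stylesheet"][href], link[rel="icon"][href]').each((_, el) =>
    add($(el).attr("href"))
  );

  // Lazy-loading data attributes.
  $("[data-src], [data-lazy-src]").each((_, el) =>
    add($(el).attr("data-src") ?? $(el).attr("data-lazy-src"))
  );

  // srcset: comma-separated "url width" pairs.
  $("[srcset]").each((_, el) => {
    for (const part of ($(el).attr("srcset") ?? "").split(",")) {
      add(part.trim().split(/\s+/)[0]);
    }
  });

  // SVG sprite references: keep only the file part, drop the #fragment.
  $("use").each((_, el) => {
    const ref = $(el).attr("xlink:href") ?? $(el).attr("href");
    add(ref?.split("#")[0]);
  });

  return [...urls];
}

/** Pull url(...) references out of a stylesheet (covers fonts and background images). */
export function discoverCssAssets(css: string, baseUrl: string): string[] {
  const urls: string[] = [];
  const re = /url\(\s*['"]?([^'")]+)['"]?\s*\)/g;
  for (const match of css.matchAll(re)) {
    const value = match[1];
    if (value.startsWith("data:")) continue; // inline data URIs need no download
    try {
      urls.push(new URL(value, baseUrl).href);
    } catch {
      /* skip invalid values */
    }
  }
  return urls;
}
```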
```bash
# Install dependencies
npm install

# Build the project
npm run build

# Or use directly with ts-node
npm run dev -- <url> [options]
```
```bash
# Clone a website to default directory (./cloned-site)
npm run dev -- https://example.com

# Specify output directory
npm run dev -- https://example.com -o ./my-site

# Set crawl depth
npm run dev -- https://example.com -d 5
```
```bash
# Using proxy with authentication
npm run dev -- https://example.com \
  --proxy-host proxy.example.com \
  --proxy-port 8080 \
  --proxy-user myusername \
  --proxy-pass mypassword

# Using proxy without authentication
npm run dev -- https://example.com \
  --proxy-host proxy.example.com \
  --proxy-port 8080
```
```bash
# Full example with all options
npm run dev -- https://example.com \
  -o ./output \
  -d 3 \
  --delay 200 \
  --follow-external \
  --user-agent "Mozilla/5.0 Custom Agent" \
  --include ".*\\.example\\.com.*" ".*\\.cdn\\.com.*" \
  --exclude ".*\\.pdf$" ".*login.*" \
  --header "Authorization: Bearer token123" \
  --header "X-Custom-Header: value" \
  --proxy-host proxy.example.com \
  --proxy-port 8080 \
  --proxy-user username \
  --proxy-pass password
```
| Option | Description | Default |
|---|---|---|
| `<url>` | Target website URL to clone | Required |
| `-o, --output <dir>` | Output directory | `./cloned-site` |
| `-d, --depth <number>` | Maximum crawl depth | `3` |
| `--delay <ms>` | Delay between requests (milliseconds) | `100` |
| `--proxy-host <host>` | Proxy server host | - |
| `--proxy-port <port>` | Proxy server port | - |
| `--proxy-user <username>` | Proxy authentication username | - |
| `--proxy-pass <password>` | Proxy authentication password | - |
| `--user-agent <agent>` | Custom user agent string | Random |
| `--follow-external` | Follow external links | `false` |
| `--include <patterns...>` | Include URL patterns (regex) | All |
| `--exclude <patterns...>` | Exclude URL patterns (regex) | None |
| `--header <header...>` | Custom headers (format: "Key: Value") | - |
```typescript
import { WebsiteCloner } from "./src/cloner.js";

const cloner = new WebsiteCloner({
  targetUrl: "https://example.com",
  outputDir: "./cloned-site",
  maxDepth: 3,
  delay: 100,
  proxy: {
    host: "proxy.example.com",
    port: 8080,
    username: "user",
    password: "pass",
  },
  userAgent: "Custom User Agent",
  followExternalLinks: false,
  includePatterns: [".*\\.example\\.com.*"],
  excludePatterns: [".*\\.pdf$"],
  headers: {
    Authorization: "Bearer token",
  },
});

await cloner.clone();
```
  1. Initial Request - Downloads the target URL's HTML content
  2. Asset Extraction - Parses HTML to find all assets:
    • Stylesheets (<link rel="stylesheet">)
    • Scripts (<script src>)
    • Images (<img>, srcset, background images)
    • SVG sprites (<use xlink:href>)
    • Fonts (from CSS @font-face)
    • Videos, audio, iframes, etc.
  3. Asset Download - Downloads each asset with proper referer headers
  4. CSS Processing - Parses CSS files to find and download referenced assets
  5. URL Rewriting - Converts all absolute URLs to relative local paths (see the sketch after this list)
  6. Link Crawling - Follows links within the same domain (respecting depth limit)
  7. File Organization - Saves files maintaining directory structure
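Step 5 is what makes the clone browsable offline. The project's actual rewriting code is not shown in the README; the following is a minimal sketch of the idea, assuming each remote URL has already been mapped to a file on disk, with `toRelativeHref` as a hypothetical helper name.

```typescript
// Minimal sketch of URL rewriting, not the project's actual implementation.
import * as path from "node:path";

/**
 * Given the local file that contains the reference and the local file the
 * asset was saved to, compute the relative path to write into the HTML/CSS.
 */
export function toRelativeHref(fromFile: string, toFile: string): string {
  // path.relative works on directories, so start from the referencing file's folder.
  const rel = path.relative(path.dirname(fromFile), toFile);
  // Browsers expect forward slashes and an explicit "./" for same-directory links.
  const posix = rel.split(path.sep).join("/");
  return posix.startsWith(".") ? posix : `./${posix}`;
}

// Example: a page saved at cloned-site/blog/post.html referencing an image
// saved at cloned-site/assets/images/logo.png => "../assets/images/logo.png"
console.log(toRelativeHref("cloned-site/blog/post.html", "cloned-site/assets/images/logo.png"));
```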

The cloner properly handles SVG sprites referenced with xlink:href:

```html
<!-- Original -->
<svg class="icon">
  <use xlink:href="./assets/sprite.svg#icon-name"></use>
</svg>

<!-- After cloning (with proper relative path) -->
<svg class="icon">
  <use xlink:href="../assets/sprite.svg#icon-name"></use>
</svg>
```
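The fragment after `#` identifies the symbol inside the sprite, so only the file part of the reference needs rewriting. A hedged sketch of that rewrite, again assuming cheerio; `rewriteSpriteRefs` and `localPathFor` are hypothetical names, not the project's API.

```typescript
// Sketch only, not the project's actual code: rewrite <use xlink:href> so the file
// part points at the downloaded sprite while the #fragment (symbol id) is preserved.
import * as cheerio from "cheerio";
import * as path from "node:path";

export function rewriteSpriteRefs(
  html: string,
  pageFile: string,                                        // local path of the page being rewritten
  localPathFor: (remoteUrl: string) => string | undefined  // hypothetical URL-to-file lookup
): string {
  const $ = cheerio.load(html);
  $("use").each((_, el) => {
    const ref = $(el).attr("xlink:href") ?? $(el).attr("href");
    if (!ref) return;
    const [file, fragment = ""] = ref.split("#");
    const local = localPathFor(file);
    if (!local) return; // sprite was not downloaded; leave the reference untouched
    const rel = path.relative(path.dirname(pageFile), local).split(path.sep).join("/");
    $(el).attr("xlink:href", `${rel}#${fragment}`);
  });
  return $.html();
}
```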
```
cloned-site/
├── index.html              # Main page
├── about.html              # Other pages
├── assets/
│   ├── css/
│   │   └── style.css
│   ├── js/
│   │   └── script.js
│   ├── images/
│   │   ├── logo.png
│   │   └── sprite.svg
│   └── fonts/
│       └── font.woff2
├── external/               # External domain assets (if enabled)
│   └── cdn_example_com/
│       └── library.js
└── url-mapping.json        # URL to local path mapping
```
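url-mapping.json records which local file each remote URL was saved to. Its exact schema is not documented here, so the snippet below is only a guess at how it might be read back, assuming a flat URL-to-path object; `loadMapping` is a hypothetical helper.

```typescript
// Assumes url-mapping.json is a flat { "remote URL": "local path" } object;
// the real schema may differ.
import { readFile } from "node:fs/promises";

type UrlMapping = Record<string, string>;

async function loadMapping(dir: string): Promise<UrlMapping> {
  const raw = await readFile(`${dir}/url-mapping.json`, "utf8");
  return JSON.parse(raw) as UrlMapping;
}

const mapping = await loadMapping("./cloned-site");
// e.g. look up where the homepage was saved locally
console.log(mapping["https://example.com/"]);
```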
  • Random User Agents - Rotates between realistic browser user agents
  • Realistic Headers - Includes Accept, Accept-Language, Accept-Encoding, etc.
  • Referer Headers - Sends proper referer for each request
  • Configurable Delays - Adds delays between requests to avoid detection
  • Proxy Support - Routes traffic through HTTP proxies
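Put together, a stealth-minded request cycle amounts to picking a random user agent, attaching browser-like headers and a referer, routing through the proxy, and sleeping between requests. The README does not name the HTTP client the project uses; the sketch below assumes axios and its built-in proxy option, and `fetchStealthily` plus the proxy values are placeholders.

```typescript
// Illustrative sketch (assumes axios); the project's actual client and fields may differ.
import axios from "axios";

const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
];

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchStealthily(url: string, referer: string, delayMs = 100) {
  await sleep(delayMs); // configurable delay between requests
  return axios.get(url, {
    headers: {
      "User-Agent": USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)],
      Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "Accept-Language": "en-US,en;q=0.9",
      "Accept-Encoding": "gzip, deflate, br",
      Referer: referer,
    },
    // HTTP proxy with basic auth (placeholder values).
    proxy: {
      host: "proxy.example.com",
      port: 8080,
      auth: { username: "user", password: "pass" },
    },
    responseType: "arraybuffer", // works for both text and binary assets
  });
}
```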
  • Failed downloads are logged but don't stop the cloning process
  • Statistics show successful and failed downloads
  • Detailed error messages for debugging
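In other words, a failed asset is recorded and skipped rather than aborting the run. A minimal sketch of that pattern, with hypothetical names (`downloadAll`, `CloneStats`) rather than the project's real API:

```typescript
// Hypothetical bookkeeping; the real cloner's stats fields may differ.
interface CloneStats {
  downloaded: number;
  failed: number;
  errors: { url: string; message: string }[];
}

async function downloadAll(
  urls: string[],
  download: (url: string) => Promise<void>
): Promise<CloneStats> {
  const stats: CloneStats = { downloaded: 0, failed: 0, errors: [] };
  for (const url of urls) {
    try {
      await download(url);
      stats.downloaded++;
    } catch (err) {
      // Log and keep going: one bad asset should not stop the clone.
      stats.failed++;
      stats.errors.push({ url, message: err instanceof Error ? err.message : String(err) });
      console.warn(`Failed to download ${url}`);
    }
  }
  return stats;
}
```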
  1. Adjust Delay - Lower delay for faster cloning (but less stealthy)
  2. Limit Depth - Reduce depth for large sites
  3. Use Patterns - Include/exclude patterns to focus on specific content
  4. Proxy Selection - Use fast, reliable proxies for better performance
  • JavaScript-rendered content requires the page to be pre-rendered
  • Dynamic content loaded via AJAX may not be captured
  • Some anti-scraping measures may block requests
  • Very large sites may take significant time to clone

⚠️ Important: Always respect website terms of service and robots.txt. This tool is for:

  • Backing up your own websites
  • Archiving public domain content
  • Educational purposes
  • Authorized testing

Do not use this tool to:

  • Violate copyright laws
  • Bypass paywalls or authentication
  • Overload servers with requests
  • Access restricted content without permission

License: ISC

Contributions are welcome! Please feel free to submit issues or pull requests.

"Failed to download" errors

  • Check if the website blocks scrapers
  • Try increasing the delay
  • Use a different user agent
  • Check proxy configuration

Missing pages or assets

  • Increase crawl depth
  • Check include/exclude patterns
  • Some assets may be loaded dynamically via JavaScript

Proxy connection issues

  • Verify proxy credentials
  • Check proxy host and port
  • Ensure proxy supports HTTP/HTTPS

Clone a blog with limited depth

```bash
npm run dev -- https://blog.example.com -d 2 -o ./blog-backup
```
Clone through an authenticated proxy

```bash
npm run dev -- https://example.com \
  --proxy-host 192.168.1.100 \
  --proxy-port 3128 \
  --proxy-user admin \
  --proxy-pass secret123
```

Clone only specific sections

```bash
npm run dev -- https://example.com \
  --include ".*example\\.com/docs.*" \
  --exclude ".*\\.pdf$"
```

Clone with custom headers

```bash
npm run dev -- https://api.example.com \
  --header "Authorization: Bearer YOUR_TOKEN" \
  --header "X-API-Key: YOUR_KEY"
```