A powerful, stealthy website cloner/scraper built with TypeScript that downloads entire websites for offline use. Supports HTTP proxy authentication, comprehensive asset downloading (CSS, JS, images, SVG sprites, fonts, etc.), and intelligent URL rewriting.
- 🚀 Complete Website Cloning - Downloads HTML, CSS, JavaScript, images, fonts, and all other assets
- 🔒 HTTP Proxy Support - Connect through HTTP proxies with username/password authentication
- 🎯 SVG Sprite Support - Properly handles SVG sprites with xlink:href references
- 🔄 Smart URL Rewriting - Converts all URLs to relative local paths for offline browsing
- 🕷️ Stealthy Crawling - Configurable delays, random user agents, and realistic headers
- 📦 Asset Discovery - Extracts assets from:
  - HTML tags (img, script, link, etc.)
  - CSS files (background images, fonts, etc.)
  - Inline styles
  - SVG sprites and references
  - srcset attributes
  - Data attributes (data-src, data-lazy-src)
- 🎨 CSS Processing - Parses CSS files to download referenced assets
- 🌐 External Link Handling - Optional following of external links
- 📊 Progress Tracking - Real-time statistics and detailed logging
- ⚙️ Highly Configurable - Control depth, patterns, delays, and more
```bash
# Install dependencies
npm install

# Build the project
npm run build

# Or use directly with ts-node
npm run dev -- <url> [options]
```
```bash
# Clone a website to the default directory (./cloned-site)
npm run dev -- https://example.com

# Specify output directory
npm run dev -- https://example.com -o ./my-site

# Set crawl depth
npm run dev -- https://example.com -d 5

# Using a proxy with authentication
npm run dev -- https://example.com \
  --proxy-host proxy.example.com \
  --proxy-port 8080 \
  --proxy-user myusername \
  --proxy-pass mypassword

# Using a proxy without authentication
npm run dev -- https://example.com \
  --proxy-host proxy.example.com \
  --proxy-port 8080

# Full example with all options
npm run dev -- https://example.com \
  -o ./output \
  -d 3 \
  --delay 200 \
  --follow-external \
  --user-agent "Mozilla/5.0 Custom Agent" \
  --include ".*\\.example\\.com.*" ".*\\.cdn\\.com.*" \
  --exclude ".*\\.pdf$" ".*login.*" \
  --header "Authorization: Bearer token123" \
  --header "X-Custom-Header: value" \
  --proxy-host proxy.example.com \
  --proxy-port 8080 \
  --proxy-user username \
  --proxy-pass password
```
Command-line options:

| Option | Description | Default |
|--------|-------------|---------|
| <url> | Target website URL to clone | Required |
| -o, --output <dir> | Output directory | ./cloned-site |
| -d, --depth <number> | Maximum crawl depth | 3 |
| --delay <ms> | Delay between requests (milliseconds) | 100 |
| --proxy-host <host> | Proxy server host | - |
| --proxy-port <port> | Proxy server port | - |
| --proxy-user <username> | Proxy authentication username | - |
| --proxy-pass <password> | Proxy authentication password | - |
| --user-agent <agent> | Custom user agent string | Random |
| --follow-external | Follow external links | false |
| --include <patterns...> | Include URL patterns (regex) | All |
| --exclude <patterns...> | Exclude URL patterns (regex) | None |
| --header <header...> | Custom headers (format: "Key: Value") | - |
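The --include and --exclude options take plain regular expressions that are matched against each discovered URL. As a rough illustration of how such filtering could work (the function name below is hypothetical, not part of the CLI):

```typescript
// Hypothetical sketch: a URL is kept only if it matches at least one include
// pattern (when any are given) and matches no exclude pattern.
function shouldCrawl(url: string, include: string[], exclude: string[]): boolean {
  const included = include.length === 0 || include.some((p) => new RegExp(p).test(url));
  const excluded = exclude.some((p) => new RegExp(p).test(url));
  return included && !excluded;
}

shouldCrawl("https://example.com/docs/intro", [".*example\\.com/docs.*"], [".*\\.pdf$"]); // true
shouldCrawl("https://example.com/manual.pdf", [], [".*\\.pdf$"]);                         // false
```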
The cloner can also be used programmatically:

```typescript
import { WebsiteCloner } from "./src/cloner.js";

const cloner = new WebsiteCloner({
  targetUrl: "https://example.com",
  outputDir: "./cloned-site",
  maxDepth: 3,
  delay: 100,
  proxy: {
    host: "proxy.example.com",
    port: 8080,
    username: "user",
    password: "pass",
  },
  userAgent: "Custom User Agent",
  followExternalLinks: false,
  includePatterns: [".*\\.example\\.com.*"],
  excludePatterns: [".*\\.pdf$"],
  headers: {
    Authorization: "Bearer token",
  },
});

await cloner.clone();
```
The cloning process works in the following steps:
- Initial Request - Downloads the target URL's HTML content
- Asset Extraction - Parses HTML to find all assets:
  - Stylesheets (<link rel="stylesheet">)
  - Scripts (<script src>)
  - Images (<img>, srcset, background images)
  - SVG sprites (<use xlink:href>)
  - Fonts (from CSS @font-face)
  - Videos, audio, iframes, etc.
- Asset Download - Downloads each asset with proper referer headers
- CSS Processing - Parses CSS files to find and download referenced assets
- URL Rewriting - Converts all absolute URLs to relative local paths (see the sketch after this list)
- Link Crawling - Follows links within the same domain (respecting depth limit)
- File Organization - Saves files maintaining directory structure
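To make the URL Rewriting step concrete, here is a minimal sketch of how an absolute asset URL might be converted into a relative local path, assuming the clone mirrors the site's path structure. The helper name is illustrative, not the cloner's actual API:

```typescript
import * as path from "node:path";

// Hypothetical helper: given the page being rewritten and an asset URL found in it,
// compute the relative path the cloned HTML should reference instead.
function toRelativeLocalPath(pageUrl: string, assetUrl: string): string {
  const page = new URL(pageUrl);
  const asset = new URL(assetUrl, pageUrl); // resolves relative references too

  // The clone mirrors the site's path structure, so the relative path between
  // the two pathnames is what the rewritten HTML needs.
  const pageDir = path.posix.dirname(page.pathname); // e.g. "/blog"
  const relative = path.posix.relative(pageDir, asset.pathname);
  return relative.startsWith(".") ? relative : `./${relative}`;
}

// "/blog/post.html" referencing "/assets/css/style.css" becomes "../assets/css/style.css"
console.log(toRelativeLocalPath(
  "https://example.com/blog/post.html",
  "/assets/css/style.css",
));
```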
The cloner properly handles SVG sprites referenced with xlink:href:
```html
<!-- Original -->
<svg class="icon">
  <use xlink:href="./assets/sprite.svg#icon-name"></use>
</svg>

<!-- After cloning (with proper relative path) -->
<svg class="icon">
  <use xlink:href="../assets/sprite.svg#icon-name"></use>
</svg>
```
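Under the hood this amounts to rewriting the path portion of each xlink:href while keeping the #icon-name fragment intact. A minimal sketch, assuming a cheerio-style HTML parser (the parser choice and the resolveLocalPath helper are assumptions, not the cloner's actual internals):

```typescript
import * as cheerio from "cheerio";

// Rewrites <use xlink:href="..."> sprite references so the path points at the
// cloned copy while the "#icon-name" fragment is preserved.
// `resolveLocalPath` stands in for whatever maps an original URL to a local path.
function rewriteSpriteRefs(html: string, resolveLocalPath: (url: string) => string): string {
  const $ = cheerio.load(html);
  $("use").each((_, el) => {
    const ref = $(el).attr("xlink:href") ?? $(el).attr("href");
    if (!ref) return;
    const [spriteUrl, fragment] = ref.split("#");
    if (!spriteUrl) return; // same-document reference like "#icon-name"
    const local = resolveLocalPath(spriteUrl);
    $(el).attr("xlink:href", fragment ? `${local}#${fragment}` : local);
  });
  return $.html();
}
```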
The cloned site is organized like this:

```
cloned-site/
├── index.html              # Main page
├── about.html              # Other pages
├── assets/
│   ├── css/
│   │   └── style.css
│   ├── js/
│   │   └── script.js
│   ├── images/
│   │   ├── logo.png
│   │   └── sprite.svg
│   └── fonts/
│       └── font.woff2
├── external/               # External domain assets (if enabled)
│   └── cdn_example_com/
│       └── library.js
└── url-mapping.json        # URL to local path mapping
```
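The url-mapping.json file records which local file each original URL was saved to, which is handy for post-processing the clone. The exact schema is not documented here, so the shape assumed below is an illustration only:

```typescript
import { readFile } from "node:fs/promises";

// Assumed shape: original URL -> local path relative to the output directory.
type UrlMapping = Record<string, string>;

const mapping: UrlMapping = JSON.parse(
  await readFile("./cloned-site/url-mapping.json", "utf8"),
);
console.log(mapping["https://example.com/assets/css/style.css"]); // e.g. "assets/css/style.css"
```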
Stealth features built into the crawler:
- Random User Agents - Rotates between realistic browser user agents (see the sketch after this list)
- Realistic Headers - Includes Accept, Accept-Language, Accept-Encoding, etc.
- Referer Headers - Sends proper referer for each request
- Configurable Delays - Adds delays between requests to avoid detection
- Proxy Support - Routes traffic through HTTP proxies
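For illustration, user-agent rotation and browser-like headers boil down to something like the sketch below; the exact strings and the helper are illustrative, not the cloner's internal list:

```typescript
// Illustrative only: pick a realistic user agent and build browser-like headers.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
];

function buildHeaders(referer?: string): Record<string, string> {
  return {
    "User-Agent": USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)],
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    ...(referer ? { Referer: referer } : {}),
  };
}
```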
Error handling:
- Failed downloads are logged but don't stop the cloning process (see the sketch after this list)
- Statistics show successful and failed downloads
- Detailed error messages for debugging
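In practice this is just a per-asset try/catch that records the outcome instead of aborting the whole run; a minimal sketch (the stats shape here is an assumption):

```typescript
// Minimal sketch: keep going when individual downloads fail, but record them.
const stats = { downloaded: 0, failed: 0 };

async function downloadSafely(url: string, fetchAsset: (u: string) => Promise<void>): Promise<void> {
  try {
    await fetchAsset(url);
    stats.downloaded++;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    stats.failed++;
    console.warn(`Failed to download ${url}: ${message}`); // logged, not fatal
  }
}
```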
Performance tips:
- Adjust Delay - Lower the delay for faster cloning (but less stealthy)
- Limit Depth - Reduce depth for large sites
- Use Patterns - Include/exclude patterns to focus on specific content
- Proxy Selection - Use fast, reliable proxies for better performance
Known limitations:
- Content rendered client-side by JavaScript is not executed, so it is only captured if the page is pre-rendered
- Dynamic content loaded via AJAX may not be captured
- Some anti-scraping measures may block requests
- Very large sites may take significant time to clone
⚠️ Important: Always respect website terms of service and robots.txt. This tool is for:
- Backing up your own websites
- Archiving public domain content
- Educational purposes
- Authorized testing
Do not use this tool to:
- Violate copyright laws
- Bypass paywalls or authentication
- Overload servers with requests
- Access restricted content without permission
License: ISC
Contributions are welcome! Please feel free to submit issues or pull requests.
Troubleshooting:
- Downloads failing or requests being blocked:
  - Check whether the website blocks scrapers
  - Try increasing the delay
  - Use a different user agent
  - Check the proxy configuration
- Assets missing from the clone:
  - Increase the crawl depth
  - Check the include/exclude patterns
  - Some assets may be loaded dynamically via JavaScript and cannot be captured
- Proxy errors:
  - Verify the proxy credentials
  - Check the proxy host and port
  - Ensure the proxy supports HTTP/HTTPS (a quick standalone test is sketched below)
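To rule out the cloner itself, the proxy can be tested in isolation. The sketch below uses axios with https-proxy-agent purely as a standalone check; it says nothing about the cloner's own HTTP stack:

```typescript
import axios from "axios";
import { HttpsProxyAgent } from "https-proxy-agent";

// Standalone check: can we fetch a page through the proxy with these credentials?
const agent = new HttpsProxyAgent("http://myusername:mypassword@proxy.example.com:8080");

const response = await axios.get("https://example.com", {
  httpsAgent: agent,
  proxy: false, // disable axios's own proxy handling; the agent does the work
  timeout: 10_000,
});
console.log(response.status); // 200 means the proxy and credentials are working
```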
More examples:

```bash
# Clone a blog two levels deep into ./blog-backup
npm run dev -- https://blog.example.com -d 2 -o ./blog-backup

# Clone through an authenticated HTTP proxy
npm run dev -- https://example.com \
  --proxy-host 192.168.1.100 \
  --proxy-port 3128 \
  --proxy-user admin \
  --proxy-pass secret123

# Clone only the docs section, skipping PDFs
npm run dev -- https://example.com \
  --include ".*example\\.com/docs.*" \
  --exclude ".*\\.pdf$"

# Clone a site that requires custom headers
npm run dev -- https://api.example.com \
  --header "Authorization: Bearer YOUR_TOKEN" \
  --header "X-API-Key: YOUR_KEY"
```