Network Scanner script to automate Adblock rule generation


A Puppeteer-based tool that scans websites for third-party (or optionally first-party) network requests matching specified patterns and generates Adblock-formatted blocking rules.

  • Scan websites and detect matching third-party or first-party resources
  • Output Adblock-formatted blocking rules
  • Support for multiple filters per site
  • Grouped titles (! <url>) before each site's matches
  • Ignore unwanted domains (global and per-site)
  • Block unwanted domains during scan (simulate adblock)
  • Support Chrome, Firefox, Safari user agents (desktop or mobile)
  • Advanced fingerprint spoofing and referrer header simulation
  • Delay, timeout, reload options per site
  • Verbose and debug modes
  • Dump matched full URLs into matched_urls.log
  • Save output in normal Adblock format or localhost (127.0.0.1/0.0.0.0)
  • Subdomain handling (collapse to root or full subdomain)
  • Optionally match only first-party, third-party, or both
  • Enhanced redirect handling with JavaScript and meta refresh detection

| Argument | Description |
| --- | --- |
| -o, --output <file> | Output file for rules. If omitted, prints to console |
| --compare <file> | Remove rules that already exist in this file before output |
| --color, --colour | Enable colored console output for status messages |
| --append | Append new rules to the output file instead of overwriting (requires -o) |

| Argument | Description |
| --- | --- |
| --localhost[=IP] | Output as IP domain.com (default: 127.0.0.1). Examples: --localhost, --localhost=0.0.0.0, --localhost=192.168.1.1 |
| --plain | Output just domains (no adblock formatting) |
| --dnsmasq | Output as local=/domain.com/ (dnsmasq format) |
| --dnsmasq-old | Output as server=/domain.com/ (dnsmasq old format) |
| --unbound | Output as local-zone: "domain.com." always_null (unbound format) |
| --privoxy | Output as { +block } .domain.com (Privoxy format) |
| --pihole | Output as (^\|\\.)domain\\.com$ (Pi-hole regex format) |
| --adblock-rules | Generate adblock filter rules with resource type modifiers (requires -o) |
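
For reference, a matched domain such as ads.example.com (a placeholder) would be written roughly as follows in each output mode. The default Adblock line shown is the conventional ||domain^ form, which is an assumption here; the remaining lines follow the formats stated above.

    default (Adblock):  ||ads.example.com^
    --localhost:        127.0.0.1 ads.example.com
    --plain:            ads.example.com
    --dnsmasq:          local=/ads.example.com/
    --dnsmasq-old:      server=/ads.example.com/
    --unbound:          local-zone: "ads.example.com." always_null
    --privoxy:          { +block } .ads.example.com
    --pihole:           (^|\.)ads\.example\.com$
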

| Argument | Description |
| --- | --- |
| --verbose | Force verbose mode globally |
| --debug | Force debug mode globally |
| --silent | Suppress normal console logs |
| --titles | Add a ! <url> title before each site's group |
| --dumpurls | Dump matched URLs into matched_urls.log |
| --remove-tempfiles | Remove Chrome/Puppeteer temporary files before exit |
| --compress-logs | Compress log files with gzip (requires --dumpurls) |
| --sub-domains | Output full subdomains instead of collapsing to root |
| --no-interact | Disable page interactions globally |
| --custom-json <file> | Use a custom config JSON file instead of config.json |
| --headful | Launch browser with GUI (not headless) |
| --cdp | Enable Chrome DevTools Protocol logging (per-page if enabled) |
| --remove-dupes | Remove duplicate domains from output (only with -o) |
| --dry-run | Console output only: show matching regex, titles, whois/dig/searchstring results, and adblock rules |
| --eval-on-doc | Globally enable evaluateOnNewDocument() for Fetch/XHR interception |
| --help, -h | Show this help menu |
| --version | Show script version |
| --max-concurrent <number> | Maximum concurrent site processing (1-50, overrides config/default) |
| --cleanup-interval <number> | Browser restart interval in URLs processed (1-1000, overrides config/default) |

| Argument | Description |
| --- | --- |
| --cache-requests | Cache HTTP requests to avoid re-requesting the same URLs within a scan |
| --validate-config | Validate the config.json file and exit |
| --validate-rules [file] | Validate rule file format (uses the --output/--compare files if no file is specified) |
| --clean-rules [file] | Clean rule files by removing invalid lines and optionally duplicates (uses the --output/--compare files if no file is specified) |
| --test-validation | Run domain validation tests and exit |
| --clear-cache | Clear the persistent cache before scanning (improves fresh-start performance) |
| --ignore-cache | Bypass all smart caching functionality during scanning |

Example config.json:

{ "ignoreDomains": [ "googleapis.com", "googletagmanager.com" ], "sites": [ { "url": "https://example.com/", "userAgent": "chrome", "filterRegex": "ads|analytics", "resourceTypes": ["script", "xhr", "image"], "reload": 2, "delay": 5000, "timeout": 30000, "verbose": 1, "debug": 1, "interact": true, "fingerprint_protection": "random", "referrer_headers": { "mode": "random_search", "search_terms": ["example reviews", "best deals"] }, "custom_headers": { "X-Custom-Header": "value" }, "firstParty": 0, "thirdParty": 1, "subDomains": 0, "blocked": [ "googletagmanager.com", ".*tracking.*" ] } ] }

Per-Site Configuration Options

| Field | Values | Default | Description |
| --- | --- | --- | --- |
| url | String or Array | - | Website URL(s) to scan |
| userAgent | chrome, chrome_mac, chrome_linux, firefox, firefox_mac, firefox_linux, safari | - | User agent for the page |
| filterRegex | String or Array | .* | Regex or list of regexes to match requests |
| regex_and | Boolean | false | Use AND logic for multiple filterRegex patterns: ALL patterns must match the same URL |
| comments | String or Array | - | Comments or references |
| resourceTypes | Array | ["script", "xhr", "image", "stylesheet"] | Resource types to monitor |
| reload | Integer | 1 | Number of times to reload the page |
| delay | Milliseconds | 4000 | Wait time after loading/reloading |
| timeout | Milliseconds | 30000 | Timeout for page load |
| verbose | 0 or 1 | 0 | Enable verbose output per site |
| debug | 0 or 1 | 0 | Dump matching URLs for the site |
| interact | true or false | false | Simulate user interaction (hover, click) |
| firstParty | 0 or 1 | 0 | Match first-party requests |
| thirdParty | 0 or 1 | 1 | Match third-party requests |
| subDomains | 0 or 1 | 0 | 1 = preserve subdomains in output |
| blocked | Array | - | Domains or regexes to block during scanning |
| even_blocked | Boolean | false | Add matching rules even if requests are blocked |
| bypass_cache | Boolean | false | Skip all caching for this site's URLs |
| window_cleanup | Boolean or String | false | Close old/unused browser windows/tabs after the entire URL group completes |

Window cleanup modes: false (disabled), true (conservative - closes obvious leftovers), "all" (aggressive - closes all content pages). Both active modes preserve the main Puppeteer window and wait 16 seconds before cleanup to avoid interfering with active operations.

Redirect Handling Options

| Field | Values | Default | Description |
| --- | --- | --- | --- |
| follow_redirects | Boolean | true | Follow redirects to new domains |
| max_redirects | Integer | 10 | Maximum number of redirects to follow |
| js_redirect_timeout | Milliseconds | 5000 | Time to wait for JavaScript redirects |
| detect_js_patterns | Boolean | true | Analyze page source for redirect patterns |
| redirect_timeout_multiplier | Number | 1.5 | Timeout increase for redirected URLs |

When a page redirects to a new domain, first-party/third-party detection is based on the final redirected domain, and all intermediate redirect domains (like bit.ly, t.co) are automatically excluded from the generated rules.
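
As a sketch, a site entry tuning these redirect options might look like the following; the URL is a placeholder and the non-default values are purely illustrative.

    {
      "url": "https://link-shortener-landing.com",
      "filterRegex": "ads|track",
      "follow_redirects": true,
      "max_redirects": 5,
      "js_redirect_timeout": 8000,
      "detect_js_patterns": true,
      "redirect_timeout_multiplier": 2
    }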

Advanced Stealth & Fingerprinting

| Field | Values | Default | Description |
| --- | --- | --- | --- |
| fingerprint_protection | true, false, "random" | false | Enable navigator/device spoofing |
| referrer_headers | String, Array, or Object | - | Set referrer header for realistic traffic sources |
| custom_headers | Object | - | Add custom HTTP headers to requests |

Simple formats:

"referrer_headers": "https://google.com/search?q=example" "referrer_headers": ["url1", "url2"]

Smart modes:

"referrer_headers": {"mode": "random_search", "search_terms": ["reviews"]} "referrer_headers": {"mode": "social_media"} "referrer_headers": {"mode": "direct_navigation"} "referrer_headers": {"mode": "custom", "custom": ["https://news.ycombinator.com/"]}

Cloudflare & flowProxy Options

| Field | Values | Default | Description |
| --- | --- | --- | --- |
| cloudflare_phish | Boolean | false | Auto-click through Cloudflare phishing warnings |
| cloudflare_bypass | Boolean | false | Auto-solve Cloudflare "Verify you are human" challenges |
| cloudflare_parallel_detection | Boolean | true | Use parallel detection for faster Cloudflare checks |
| cloudflare_max_retries | Integer | 3 | Maximum retry attempts for Cloudflare operations |
| cloudflare_cache_ttl | Milliseconds | 300000 | TTL for the Cloudflare detection cache (5 minutes) |
| cloudflare_retry_on_error | Boolean | true | Enable retry logic for Cloudflare operations |
| flowproxy_detection | Boolean | false | Enable flowProxy protection detection and handling |
| flowproxy_page_timeout | Milliseconds | 45000 | Page timeout for flowProxy sites |
| flowproxy_nav_timeout | Milliseconds | 45000 | Navigation timeout for flowProxy sites |
| flowproxy_js_timeout | Milliseconds | 15000 | JavaScript challenge timeout |
| flowproxy_delay | Milliseconds | 30000 | Delay for rate limiting |
| flowproxy_additional_delay | Milliseconds | 5000 | Additional processing delay |
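
A sketch of a site entry enabling the Cloudflare and flowProxy handling described above; the URL is a placeholder and the non-default values are only illustrative.

    {
      "url": "https://challenge-protected-site.com",
      "filterRegex": "analytics|ads",
      "cloudflare_bypass": true,
      "cloudflare_phish": true,
      "cloudflare_max_retries": 3,
      "flowproxy_detection": true,
      "flowproxy_page_timeout": 45000,
      "flowproxy_delay": 30000
    }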

WHOIS/DNS Analysis Options

| Field | Values | Default | Description |
| --- | --- | --- | --- |
| whois | Array | - | Check whois data for ALL specified terms (AND logic) |
| whois-or | Array | - | Check whois data for ANY specified term (OR logic) |
| whois_delay | Integer | 3000 | Delay between whois requests to avoid throttling |
| whois_server | String or Array | - | Custom whois server(s): a single server or a randomized list |
| whois_server_mode | String | "random" | Server selection mode: "random" or "cycle" |
| whois_max_retries | Integer | 2 | Maximum retry attempts per domain |
| whois_timeout_multiplier | Number | 1.5 | Timeout increase multiplier per retry |
| whois_use_fallback | Boolean | true | Add TLD-specific fallback servers |
| whois_retry_on_timeout | Boolean | true | Retry on timeout errors |
| whois_retry_on_error | Boolean | true | Retry on connection/other errors |
| dig | Array | - | Check dig output for ALL specified terms (AND logic) |
| dig-or | Array | - | Check dig output for ANY specified term (OR logic) |
| dig_subdomain | Boolean | false | Use subdomain for dig lookup instead of root domain |
| digRecordType | String | "A" | DNS record type for dig (A, CNAME, MX, etc.) |
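
A sketch of a site entry using the WHOIS/DNS analysis fields above; the URL, search terms, and values are placeholders, and the AND/OR semantics follow the table.

    {
      "url": "https://example-news-site.com",
      "filterRegex": "cdn|static",
      "whois": ["Example Registrar", "Privacy Protect"],
      "dig-or": ["cloudfront", "akamai"],
      "digRecordType": "CNAME",
      "whois_delay": 3000,
      "whois_server_mode": "cycle"
    }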

Content Search Options

| Field | Values | Default | Description |
| --- | --- | --- | --- |
| searchstring | String or Array | - | Text to search for in response content (OR logic) |
| searchstring_and | String or Array | - | Text to search for with AND logic: ALL terms must be present |
| curl | Boolean | false | Use curl to download content for analysis |
| grep | Boolean | false | Use grep instead of JavaScript for pattern matching (requires curl=true) |

Other Per-Site Options

| Field | Values | Default | Description |
| --- | --- | --- | --- |
| goto_options | Object | {"waitUntil": "load"} | Custom page.goto() options |
| clear_sitedata | Boolean | false | Clear all cookies, cache, and storage before each load |
| forcereload | Boolean | false | Force an additional reload after reloads |
| isBrave | Boolean | false | Spoof Brave browser detection |
| evaluateOnNewDocument | Boolean | false | Inject fetch/XHR interceptor into the page |
| cdp | Boolean | false | Enable CDP logging for this site |
| cdp_specific | Array | - | Enable CDP logging only for specific domains in the URL list |
| css_blocked | Array | - | CSS selectors to hide elements |
| source | Boolean | false | Save page source HTML after load |
| screenshot | Boolean | false | Capture screenshot on load failure |
| headful | Boolean | false | Launch browser with GUI for this site |
| localhost | String | - | Force custom IP format for this site (e.g., "127.0.0.1", "0.0.0.0", "192.168.1.1") |
| adblock_rules | Boolean | false | Generate adblock filter rules with resource types for this site |
| interact_duration | Milliseconds | 2000 | Duration of interaction simulation |
| interact_scrolling | Boolean | true | Enable scrolling simulation |
| interact_clicks | Boolean | false | Enable element clicking simulation |
| interact_typing | Boolean | false | Enable typing simulation |
| interact_intensity | String | "medium" | Interaction simulation intensity: "low", "medium", "high" |
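
A sketch of a site entry combining the interaction and debugging fields above; the URL, selectors, and values are illustrative.

    {
      "url": "https://infinite-scroll-site.com",
      "filterRegex": "ads|beacon",
      "interact": true,
      "interact_duration": 5000,
      "interact_scrolling": true,
      "interact_clicks": true,
      "interact_intensity": "high",
      "css_blocked": ["#cookie-banner", ".newsletter-popup"],
      "screenshot": true
    }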

Global Configuration Options

These options go at the root level of your config.json:

| Field | Values | Default | Description |
| --- | --- | --- | --- |
| ignoreDomains | Array | - | Domains to completely ignore (supports wildcards like *.ads.com) |
| blocked | Array | - | Global regex patterns to block requests (combined with per-site blocked) |
| whois_server_mode | String | "random" | Default server selection mode for all sites |
| ignore_similar | Boolean | true | Ignore domains similar to already-found domains |
| ignore_similar_threshold | Integer | 80 | Similarity threshold percentage for ignore_similar |
| ignore_similar_ignored_domains | Boolean | true | Ignore domains similar to the ignoreDomains list |
| max_concurrent_sites | Integer | 6 | Maximum concurrent site processing (1-50) |
| resource_cleanup_interval | Integer | 80 | Browser restart interval in URLs processed (1-1000) |
| cache_path | String | ".cache" | Directory path for persistent cache storage |
| cache_max_size | Integer | 5000 | Maximum number of entries in the cache |
| cache_autosave_minutes | Integer | 1 | Interval for automatic cache saves (minutes) |
| cache_requests | Boolean | false | Enable HTTP request response caching |
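
A sketch of a config.json root combining these global options with a minimal sites list; values match the documented defaults except where noted, and the entries themselves are illustrative.

    {
      "ignoreDomains": ["googleapis.com", "*.ads.com"],
      "blocked": [".*doubleclick.*"],
      "ignore_similar": true,
      "ignore_similar_threshold": 80,
      "max_concurrent_sites": 6,
      "resource_cleanup_interval": 80,
      "cache_path": ".cache",
      "cache_max_size": 5000,
      "cache_requests": false,
      "sites": [
        { "url": "https://example.com/", "filterRegex": "ads|analytics" }
      ]
    }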

Special Characters in searchstring

The searchstring parameter supports all characters, including special symbols. Only double quotes (and literal backslashes) need JSON escaping:

{ "searchstring": [ ")}return n}function N(n,e,r){try{\"function\"==typeof", "addEventListener(\"click\",function(){", "{\"status\":\"success\",\"data\":[", "console.log('Debug: ' + value);", "`API endpoint: ${baseUrl}/users`", "@media screen and (max-width: 768px)", "if(e&&e.preventDefault){e.preventDefault()}", "__webpack_require__(/*! ./module */ \"./src/module.js\")", "console.log('Hello world')", "#header { background-color: #ff0000; }", "$(document).ready(function() {", "completion: 85% @ $1,500 budget", "SELECT * FROM users WHERE id = *", "regex: ^[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,}$", "typeof window !== 'undefined'" ] }

Character escaping rules:

  • " becomes \" (required in JSON)
  • \ becomes \\ (if searching for literal backslashes)
  • All other characters are used literally: ' ` @ # $ % * ^ [ ] { } ( ) ; = ! ? :

Usage Examples

    # Scan with default config and output to console
    node nwss.js

    # Scan and save rules to file
    node nwss.js -o blocklist.txt

    # Append new rules to existing file
    node nwss.js --append -o blocklist.txt

    # Clean existing rules and append new ones
    node nwss.js --clean-rules --append -o blocklist.txt

    # Debug mode with URL dumping and colored output
    node nwss.js --debug --dumpurls --color -o rules.txt

    # Dry run to see what would be matched
    node nwss.js --dry-run --debug

    # Validate configuration before running
    node nwss.js --validate-config

    # Clean rule files
    node nwss.js --clean-rules existing_rules.txt

    # Maximum stealth scanning
    node nwss.js --debug --color -o stealth_rules.txt

    # High-performance scanning with custom concurrency
    node nwss.js --max-concurrent 12 --cleanup-interval 300 -o rules.txt

Stealth Configuration Examples

Memory Management with Window Cleanup

{ "url": [ "https://popup-heavy-site1.com", "https://popup-heavy-site2.com", "https://popup-heavy-site3.com" ], "filterRegex": "\\.(space|website|tech)\\b", "window_cleanup": "all", "interact": true, "reload": 2, "resourceTypes": ["script", "fetch"], "comments": "Aggressive cleanup for sites that open many popups" }

Conservative Memory Management

{ "url": "https://complex-site.com", "filterRegex": "analytics|tracking", "window_cleanup": true, "interact": true, "delay": 8000, "reload": 3, "comments": [ "Conservative cleanup preserves potentially active content", "Good for sites with complex iframe structures" ] }
{ "url": "https://shopping-site.com", "userAgent": "chrome", "fingerprint_protection": "random", "referrer_headers": { "mode": "random_search", "search_terms": ["product reviews", "best deals", "price comparison"] }, "interact": true, "delay": 6000, "filterRegex": "analytics|tracking|ads" }
{ "url": "https://news-site.com", "userAgent": "firefox", "fingerprint_protection": true, "referrer_headers": {"mode": "social_media"}, "custom_headers": { "Accept-Language": "en-US,en;q=0.9" }, "filterRegex": "doubleclick|googletagmanager" }

Tech Blog with Custom Referrers

{ "url": "https://tech-blog.com", "fingerprint_protection": "random", "referrer_headers": { "mode": "custom", "custom": [ "https://news.ycombinator.com/", "https://www.reddit.com/r/programming/", "https://lobste.rs/" ] } }

The scanner includes intelligent window management to prevent memory accumulation during long scans:

  • Conservative cleanup (window_cleanup: true): Selectively closes pages that appear to be leftovers from previous scans
  • Aggressive cleanup (window_cleanup: "all"): Closes all content pages from previous operations for maximum memory recovery
  • Main window preservation: Both modes always preserve the main Puppeteer browser window to maintain stability
  • Popup window handling: Automatically detects and closes popup windows created by previous site scans
  • Timing protection: 16-second delay ensures no active operations are interrupted during cleanup
  • Active page protection: Never affects pages currently being processed by concurrent scanning operations
  • Memory reporting: Reports estimated memory freed from closed windows for performance monitoring

Use aggressive cleanup for sites that open many popups or when processing large numbers of URLs. Use conservative cleanup when you want to preserve potentially active content but still free obvious leftovers.


Install Google Chrome (Ubuntu as example). NOTE: Use Chrome, not Chromium, for best compatibility.

Add the Google Chrome signing key:

    wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo gpg --dearmor -o /usr/share/keyrings/googlechrome-linux-keyring.gpg

Add the Google Chrome repository:

    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/googlechrome-linux-keyring.gpg] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list

Install Chrome:

    sudo apt update
    sudo apt install google-chrome-stable

dig & whois (needed for network checks)

sudo apt install bind9-dnsutils whois
  • If both firstParty: 0 and thirdParty: 0 are set for a site, it will be skipped.
  • ignoreDomains applies globally across all sites.
  • ignoreDomains supports wildcards (e.g., *.ads.com matches tracker.ads.com)
  • Blocking (blocked) can match full domains or regex.
  • If a site's blocked field is missing, no extra blocking is applied.
  • --clean-rules with --append will clean existing files first, then append new rules
  • --remove-dupes works with all output modes and removes duplicates from final output
  • Validation tools help ensure rule files are properly formatted before use
  • --remove-tempfiles removes Chrome/Puppeteer temporary files before exiting, which avoids disk space issues
  • For maximum stealth, combine fingerprint_protection: "random" with appropriate referrer_headers modes
  • User agents are automatically updated to latest versions (Chrome 131, Firefox 133, Safari 18.2)
  • Referrer headers work independently of fingerprint protection; use both for best results
