A dataset of files that robots tried to crawl through webfiddle.net.
Mostly HTML files, but other file types too: PDFs, images, binaries. I have no idea what is in here at this stage, but it gives an interesting picture of what crawlers like to visit and could be the basis of interesting SEO or coding LLM research.
Collected as part of my work on web simulators: https://webfiddle.net (JS/CSS editor for the web) and https://websim.netwrck.com (coding editor for the web). https://x.com/leeleepenkman
It's uploading soon/right now, split into separate SQLite .db files. Schema:
CREATE TABLE IF NOT EXISTS mirrored_content (
    key_name TEXT PRIMARY KEY,
    original_address TEXT,
    translated_address TEXT,
    status INTEGER,
    headers TEXT,
    data BLOB,
    base_url TEXT,
    expiry INTEGER
)

Data:
============================================================
📊 SIZE DISTRIBUTION HISTOGRAM
============================================================
Size Range        Count        Percentage   Bar
------------------------------------------------------------
0 B               70,360       1.1 %
< 1 KB            346,265      5.2 %        ███
1-10 KB           799,840      12.1 %       ████████
10-100 KB         2,246,994    34.0 %       ███████████████████████
100 KB - 1 MB     2,903,394    43.9 %       ██████████████████████████████
1-10 MB           230,689      3.5 %        ██
10-100 MB         8,391        0.1 %
100 MB - 1 GB     554          0.0 %

Total Records: 6,606,487
Total Storage: 1.6 TB
Average Size: 265.4 KB

============================================================
📋 TOP CONTENT TYPES BY STORAGE
============================================================
Content Type                          Count        Total Size   Avg Size
---------------------------------------------------------------------------
text/html                             4,714,460    1.0 TB       237.2 KB
application/pdf                       240,623      436.9 GB     1.9 MB
image/jpeg                            1,065,334    35.6 GB      35.0 KB
application/x-msi                     221          32.1 GB      148.8 MB
image/png                             170,819      22.4 GB      137.3 KB
video/mp4                             434          17.2 GB      40.6 MB
application/octet-stream              6,285        15.8 GB      2.6 MB
unknown                               5,994        8.9 GB       1.5 MB
video/webm                            205          7.1 GB       35.3 MB
application/vnd.openxmlformats-o...   762          5.5 GB       7.3 MB
audio/mpeg                            794          3.8 GB       4.9 MB
application/javascript                36,974       3.4 GB       95.4 KB
image/gif                             5,129        3.1 GB       635.0 KB
application/x-gzip                    16           1.6 GB       102.3 MB
text/javascript                       17,294       1.5 GB       91.4 KB
binary/octet-stream                   197          1.4 GB       7.5 MB
video/quicktime                       15           1.0 GB       69.3 MB
application/zip                       104          1.0 GB       10.0 MB
video/x-msvideo                       18           866.6 MB     48.1 MB
text/plain                            235,904      775.3 MB     3.4 KB

============================================================
💾 STORAGE EFFICIENCY ANALYSIS
============================================================
Small files (< 10 KB): 1,146,105 (17.3%)
Large files (> 100 MB): 554 (0.0%)
HTML content storage: 1.0 TB
Image content storage: 61.8 GB
HTML vs Images ratio: 17.24:1

Different mimetypes breakdown:
text/html; charset=utf-8|2875071
image/jpeg|974715
text/html; charset=UTF-8|714039
text/html|556342
text/html;charset=utf-8|429081
text/plain; charset=utf-8|226375
application/pdf|222635
image/png|167336
text/html;charset=UTF-8|34451
application/javascript|28263
application/json|17888
text/html; charset=iso-8859-1|17001
text/fragment+html; charset=utf-8|15953
application/json; charset=utf-8|14553
text/javascript|11680
application/xml|7783
image/svg+xml|6778
application/octet-stream|6118
application/x-javascript|5845
text/css|5601
|5552
image/svg+xml;charset=utf-8|4919
application/javascript; charset=utf-8|4862
application/rss+xml; charset=UTF-8|4759
text/javascript; charset=utf-8|4114
image/gif|4018
text/html; charset="UTF-8"|2856
application/javascript; charset=UTF-8|2835
image/webp|2764
application/PDF|2068
*/*|1849
application/pdf;charset=UTF-8|1617
text/css; charset=utf-8|1381
text/plain|1272
image/svg+xml; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0"|1251
image/gif;charset=utf-8|985
application/vnd.openxmlformats-officedocument.wordprocessingml.document|956
text/html; charset="utf-8"|892
application/json; charset=UTF-8|786
audio/mpeg|776
image/jpg|731
image/pjpeg|711
application/vnd.openxmlformats-officedocument.presentationml.presentation|708
text/plain; charset=UTF-8|671
text/javascript; charset=UTF-8|664
font/woff2|627
application/pgp-encrypted|596
text/css; charset=UTF-8|583
image/svg+xml; charset=utf-8|536
application/atom+xml; charset=UTF-8|485
text/javascript;charset=UTF-8|430
video/mp4|429
application/msword|402
application/vnd.ms-excel|379
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet|371
application/xml; charset=UTF-8|357
text/plain;charset=UTF-8|317
application/pdf;charset=ISO-8859-1|317
text/html; charset=ISO-8859-1|303
image/jpeg;charset=UTF-8|276
image/vnd.microsoft.icon|275
application/rss+xml; charset=utf-8|265
application/xml; charset=utf-8|234
application/x-msi|221
application/javascript;charset=utf-8|218
video/webm|205
application/x-javascript; charset=utf-8|200
binary/octet-stream|187
text/html;charset=ISO-8859-1|177
application/json;charset=utf-8|167
application/pdf; charset=utf-8|165
image/png; qs=0.7|150
text/javascript;charset=utf-8|130
application/vnd.ms-office|116
text/xml|115
text/html; charset=us-ascii|115
text/css;charset=utf-8|115
application/xhtml+xml|110
text/xml; charset=UTF-8|109
application/zip|102
text/plain;charset=utf-8|100
text/html; charset=utf8|92
application/font-woff2|87
application/epub+zip|87
image/pdf;charset=UTF-8|86
image/png;charset=UTF-8|83
application/pdf; charset=UTF-8|80
text/css;charset=UTF-8|73
image/x-icon|72
text/html;charset=iso-8859-1|70
font/woff|65
image/x-ms-bmp|61
application/json;encoding=UTF8;charset=UTF-8|61
application/xml;charset=UTF-8|58
application/pdf;charset=utf-8|56
text/xml;charset=UTF-8|50
text/xml; charset=utf-8|43
application/json;charset=UTF-8|41
application/ogg|40
application/vnd.ms-excel.sheet.macroenabled.12|39
application/font-woff|39
text/javascript; charset="utf-8"|37
application/pdf; qs=0.001|36
application/opensearchdescription+xml|35
application/javascript;charset=UTF-8|33
application/x-javascript;charset=utf-8|32
application/octet-stream, application/octet-stream|31
image/svg+xml; qs=0.85|30
text/csv;charset=iso-8859-1|28
image/tiff|28
font/ttf|28
application/vnd.ms-powerpoint|24
application/x-mpegURL|21
image/avif|20
audio/x-wav|20
application/rss+xml;charset=UTF-8|20
application/vnd.php.serialized; charset=UTF-8|19
video/x-msvideo|18
image/png; charset=utf-8|18
image/jpeg; qs=0.8|18
image/bmp|18
application/x-javascript; charset=UTF-8|18
font/opentype|17
application/x-font-ttf|17
video/youtube|16
image/png; charset=binary|16
application/x-gzip|16
application/rdf+xml; charset=UTF-8|16
application/atom+xml; charset=utf-8|16
video/quicktime|15
image/gif; qs=0.5|15
text/html; charset=US-ASCII|13
text/css; charset="UTF-8"|13
text/n3; charset=UTF-8|12
text/html;|12
model/usd|12
application/x-www-form-urlencoded|12
application/x-shockwave-flash|12
application/octet-stream; charset=utf-8|12
application/javascript; charset=utf8|12
application/font-sfnt|12
application/atom+xml|12
Application/pdf|12
|12
text/turtle; charset=UTF-8|11
text/csv|11
image/svg+xml;charset=UTF-8|11
text/x-python|10
text/markdown; charset=utf-8|10
text/html; charset=windows-1251|10
text/html; charset=utf-8;|10
text/html; Charset=utf-8|10
text/html; Charset=UTF-8|10
application/xml;|10
application/x-javascript;charset=UTF-8|10
application/x-font-woff|10
application/rss+xml;charset=utf-8|10
application/octetstream|10
application/download|10
text/javascript; charset=ISO-8859-1|9
text/html; charset=ISO-8859-2|9
application/rss+xml|9
application/pdf;charset=binary|9
application/ld+json; charset=UTF-8|9
application/json; charset=utf-8;|9
application/vnd.apple.mpegurl|8
application/pdf; charset=|8
application/n-triples; charset=UTF-8|8
application/javascript; charset=ANSI_X3.4-1968|8
application/font-ttf|8
application/font|8
*/*; charset=utf-8|8
text/turtle; charset=utf-8|7
text/html; charset=windows-1252|7
text/html; ISO-8859-1|7
text/html,text/html; charset=utf-8|7
model/gltf-binary|7
image/svg+xml; charset=utf-8; api-version=7.2-preview.1|7
image/svg+xml; charset=UTF-8|7
application/x-perl|7
application/vnd.ms-fontobject|7
application/vnd.apple.mpegURL|7
application/rdf+xml|7
application/pgp-signature|7
.pdf|7
text/css; charset="utf-8"|6
image/png;charset=ISO-8859-1|6
image/jpeg; charset=binary|6
audio/x-mpegurl|6
application/x-chrome-extension|6
application/pgp-keys|6
application/octet-stream, text/html|6
text/prs.fallenstein.rst|5
text/html; charset=EUC_JP|5
text|5
image/svg+xml; charset="UTF-8"|5
application/rtf|5
application/postscript|5
application/pdf; charset=binary|5
text/vtt|4
text/javascript;charset=ISO-8859-1|4
text/html; charset=UTF-8;|4
text/html, text/html|4
application/x-msdos-program|4
application/wasm|4
application/problem+json|4
application/json,text/javascript; charset=utf-8|4
text/xml;charset=utf-8|3
text/x-js|3
text/markdown|3
text/js;charset=UTF-8|3
text/html;charset=UTF8|3
text/html; charset=utf-8,charset=utf-8|3
text/html; charset=UTF-8; charset=utf-8|3
text/HTML|3
image/jpeg; charset=UTF-8|3
content-type: application/pdf|3
audio/webm|3
application/x-yaml|3
application/x-redhat-package-manager|3
application/x-ipynb+json|3
application/x-apple-diskimage|3
application/turtle|3
application/rdf+xml;charset=UTF-8|3
application/octet-stream;charset=UTF-8|3
application/manifest+json|3
application/json;charset=iso-8859-1|3
application/json; charset=ISO-8859-1|3
application/json+oembed; charset=utf-8|3
text/yaml|2
text/xml; charset="utf-8"|2
text/x-python; charset=utf-8|2
text/x-matlab|2
text/vnd.wap.wml|2
text/plain;charset=ISO-8859-1|2
text/html;;charset=utf-8|2
text/html; ver=1.0; charset=utf-8;|2
text/html, text/html;charset=utf-8|2
text/csv;charset=utf-8|2
image/x-icon;charset=UTF-8|2
image/pdf;charset=ISO-8859-1|2
image/Png|2
application/xml;charset=utf-8|2
application/x-sql|2
application/x-rss+xml|2
application/x-msdownload|2
application/x-javascript;charset=ISO-8859-1|2
application/x-font-woff2|2
application/x-eprint|2
application/x-dosexec|2
application/vnd.openxmlformats-officedocument.presentationml.slideshow|2
application/vnd.ms-excel.12|2
application/vnd.maxmind.com-error+json; charset=UTF-8; version=2.1|2
application/vnd.apple.installer+xml|2
application/vnd.api+json; charset=utf-8|2
application/vnd.api+json|2
application/rss+xml; charset="UTF-8"|2
application/rfc+xml; charset=utf-8|2
application/rdf+xml; qs=0.9|2
application/pkix-crl|2
application/pdf; charset=ISO-8859-1|2
application/opensearchdescription+xml; charset=utf-8|2
application/n-triples; charset=utf-8|2
application/javascript; charset="utf-8"|2
application/atom+xml;charset=UTF-8|2
; charset=|2
text/x-javascript; charset=utf-8|1
text/x-go; charset=utf-8|1
text/x-diff|1
text/x-c;charset=ISO-8859-1|1
text/vcard; charset=UTF-8|1
text/plain;charset=UTF-8;|1
text/plain;;charset=UTF-8|1
text/plain; charset=utf-8;|1
text/plain; charset=us-ascii|1
text/plain; charset=ISO-8859-1|1
text/javascript;charset=iso-8859-1|1
text/javascript; charset: UTF-8;charset=UTF-8|1
text/html;charset=utf8|1
text/html;charset=utf-8;|1
text/html;charset=CP1252|1
text/html; encoding=utf-8|1
text/html; charset=windows-1250|1
text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.8.0"|1
text/html; charset=utf-8, application/atom+xml|1
text/html; charset=none|1
text/html; charset=latin1|1
text/html; charset=latin-1|1
text/html; charset=iso8859-1|1
text/html; charset=cp1251|1
text/html; charset=WINDOWS-1251|1
text/html; charset=UTF-8;charset=UTF-8|1
text/html; charset=ISO-8859-1;|1
text/html; charset=ISO-8852|1
text/html; Charset=windows-1252|1
text/html; Charset=ISO-8859-1|1
text/csv; charset=UTF-8; header=present|1
text/css;charset =UTF-8;charset=UTF-8|1
text/calendar;charset=UTF-8|1
text/calendar; charset=utf-8|1
text/calendar|1
text/HTML; charset=utf-8|1
test/plain|1
png|1
image/x-photoshop|1
image/x-icon; charset="UTF-8"|1
image/png; charset="UTF-8"|1
image/jpeg;charset=ISO-8859-1|1
image/jpeg; charset=utf-8|1
image/gif;charset=UTF-8|1
image/gif;charset=ISO-8859-1|1
image/gif; charset=UTF-8|1
image/bmp;charset=UTF-8|1
image/Gif|1
image%2Fpng|1
font/x-woff|1
font/woff2; charset=utf-8|1
audio/x-scpls;charset=ISO-8859-1|1
audio/vnd.dts.hd|1
audio/unknown|1
audio/mp4a-latm|1
audio/mp4|1
audio/mp3|1
application/yaml|1
application/xslt+xml|1
application/xml; charset=utf-8; filename=k2_podcastitunes_309.xml|1
application/xml; charset=ISO-8859-1|1
application/xml-dtd|1
application/xhtml+xml; charset=utf-8|1
application/x-woff|1
application/x-web-app-manifest+json|1
application/x-tar|1
application/x-sh|1
application/x-pdf|1
application/x-javascript; charset=utf-8;|1
application/x-javascript, application/javascript; charset=utf-8|1
application/x-jar|1
application/x-httpd-php|1
application/x-download|1
application/x-compress|1
application/x-bibtex; charset=utf-8|1
application/vnd.rcuk.gtr.json-v1;charset=UTF-8|1
application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=UTF-8|1
application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=ISO-8859-1|1
application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=utf-8|1
application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=UTF-8|1
application/vnd.openxmlformats|1
application/vnd.ms-excel;charset=UTF-8|1
application/vnd.initializr.v2.1+json|1
application/vnd.android.package-archive|1
application/problem+json; charset=utf-8|1
application/pdf;charset=iso-8859-1|1
application/pdf;charset=base64|1
application/pdf; time=20250519152750|1
application/pdf;|1
application/pdf,application/pdf|1
application/pdf, application/pdf|1
application/octet-stream; charset=UTF-8|1
application/octet-stream, application/x-javascript|1
application/manifest+json; charset=utf-8|1
application/ld+json|1
application/json; charset=UTF-8|1
application/json;|1
application/javascript; charset=windows-1251|1
application/javascript; charset=UTF8|1
application/javascript; charset=ISO-8859-1|1
application/javascript, application/x-javascript|1
application/gzip|1
application/feed+json; charset=UTF-8|1
application/atom+xml;profile=opds|1
None|1
.png|1

Example row
sqlite3 cache.db "SELECT * FROM mirrored_content LIMIT 1;"
hash_f0e6a6a97042a4f1f1c87f5f7d44315b2d852c2df5c7991cc66241bf7072d1c4|http://example.com|example.com|200|{"accept-ranges": "bytes", "content-type": "text/html", "etag": "\"84238dfc8092e5d9c0dac8ef93371a07:1736799080.121134\"", "last-modified": "Mon, 13 Jan 2025 20:11:20 GMT", "vary": "Accept-Encoding", "content-encoding": "gzip", "content-length": "648", "date": "Sun, 16 Feb 2025 02:58:01 GMT"}|<!doctype html> <html> <head> <title>Example Domain</title> <meta charset="utf-8" /> <meta http-equiv="Content-type" content="text/html; charset=utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <style type="text/css"> body { background-color: #f0f0f2; margin: 0; padding: 0; font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; } div { width: 600px; margin: 5em auto; padding: 2em; background-color: #fdfdff; border-radius: 0.5em; box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02); } a:link, a:visited { color: #38488f; text-decoration: none; } @media (max-width: 700px) { div { margin: 0 auto; width: auto; } } </style> </head> <body> <div> <h1>Example Domain</h1> <p>This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p> <p><a href="/test-fiddle/www.iana.org/domains/example">More information...</a></p> </div> </body> </html> |test-fiddle/example.com|1742266681
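Reading the data from Python: the sketch below is one possible way to walk the `mirrored_content` table and group rows by a cleaned-up content type, since the raw `content-type` values vary a lot in case, charset parameters, and repeated values. It is not the tooling that produced the stats in this card. The helper names (`normalize_content_type`, `iter_rows`, `count_by_type`, `build_demo_db`) are made up for illustration, and the demo rows are synthetic so the snippet runs without downloading anything. It assumes the `headers` column holds a JSON object, as in the example row.

```python
import json
import sqlite3

# Schema copied from this dataset card.
SCHEMA = """
CREATE TABLE IF NOT EXISTS mirrored_content (
    key_name TEXT PRIMARY KEY,
    original_address TEXT,
    translated_address TEXT,
    status INTEGER,
    headers TEXT,
    data BLOB,
    base_url TEXT,
    expiry INTEGER
)
"""


def normalize_content_type(raw):
    """Reduce a raw Content-Type value to a lowercase type/subtype.

    Drops parameters ("; charset=utf-8") and repeated values after a comma.
    Malformed values such as ".pdf" or "png" are kept as they are.
    """
    if not isinstance(raw, str) or not raw:
        return "unknown"
    base = raw.split(";", 1)[0].split(",", 1)[0].strip().lower()
    return base or "unknown"


def iter_rows(conn):
    """Yield (original_address, status, headers_dict, body_bytes) per row."""
    cur = conn.execute(
        "SELECT original_address, status, headers, data FROM mirrored_content"
    )
    for address, status, headers_json, data in cur:
        headers = json.loads(headers_json) if headers_json else {}
        # Assumption: header names may differ in case, so lowercase them.
        headers = {str(k).lower(): v for k, v in headers.items()}
        # The BLOB column can come back as str if it was inserted as text.
        body = data.encode("utf-8") if isinstance(data, str) else (data or b"")
        yield address, status, headers, body


def count_by_type(conn):
    """Return {normalized_type: (row_count, total_body_bytes)}."""
    counts = {}
    for _, _, headers, body in iter_rows(conn):
        ctype = normalize_content_type(headers.get("content-type"))
        n, size = counts.get(ctype, (0, 0))
        counts[ctype] = (n + 1, size + len(body))
    return counts


def build_demo_db():
    """Build an in-memory database with synthetic rows (not real dataset rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        ("hash_demo_1", "http://a.example/", "a.example", 200,
         json.dumps({"content-type": "text/html"}), b"<p>one</p>",
         "demo/a.example", 0),
        ("hash_demo_2", "http://b.example/", "b.example", 200,
         json.dumps({"Content-Type": "text/html; charset=UTF-8"}),
         b"<p>two!</p>", "demo/b.example", 0),
        ("hash_demo_3", "http://c.example/x.pdf", "c.example/x.pdf", 200,
         json.dumps({"content-type": "Application/pdf"}), b"%PDF-1.4",
         "demo/c.example", 0),
        ("hash_demo_4", "http://d.example/", "d.example", 404,
         json.dumps({}), b"", "demo/d.example", 0),
    ]
    conn.executemany(
        "INSERT INTO mirrored_content VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    for ctype, (n, size) in sorted(count_by_type(demo).items()):
        print(f"{ctype}\t{n} rows\t{size} bytes")
```

To try this on a real shard, swap `build_demo_db()` for `sqlite3.connect("cache.db")`; `cache.db` is just the filename from the example command above, and the uploaded files may be named differently. For a full-size shard it is probably cheaper to select `length(data)` in SQL instead of pulling every blob into Python.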