A specialized pipeline for extracting clean, token-efficient markdown from websites.
Naive HTML -> Markdown conversion produces a ton of garbage that wastes tokens and pollutes LLM workflows. Typical noise includes:
- Navigation panels
- Popups
- Cookie consent banners
- Table of contents
- Headers / footers
This project implements three pipelines:
- "Page preset" generation: HTML -> Preset:
type Preset = {
// anchors to make this preset more fragile on purpose.
// Elements that identify website engine layout go here.
preset_match_detectors: CSSSelector[];
// main content extractors
main_content_selectors: CSSSelector[];
// filter selectors to trim the main content.
// banners, subscription forms, sponsor content
main_content_filters: CSSSelector[];
};
type CSSSelector = string;
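
  For a concrete picture, here is a hypothetical preset for a generic WordPress-style blog. The selectors are illustrative, not real generator output:

  ```ts
  // Hypothetical example: none of these selectors come from the real generator.
  const exampleBlogPreset: Preset = {
    // Fragile on purpose: only pages rendered by this engine should match.
    preset_match_detectors: ['meta[name="generator"][content*="WordPress"]'],
    // Where the article body lives.
    main_content_selectors: ["article .entry-content"],
    // Noise to strip from inside the article body.
    main_content_filters: [".sharedaddy", ".jp-relatedposts", "form.newsletter-signup"],
  };
  ```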

  Preset generation uses a feedback loop that repeatedly refines the preset and re-applies it until the resulting markdown is clean.
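
  A minimal sketch of that loop; all four helpers are hypothetical (the first three would typically be LLM calls, and applyPreset is sketched under the second pipeline, after this list):

  ```ts
  // All helpers are hypothetical placeholders, not the project's actual API.
  declare function generateDraftPreset(html: string): Promise<Preset>;
  declare function refinePreset(preset: Preset, html: string, markdown: string): Promise<Preset>;
  declare function scoreCleanliness(markdown: string): Promise<number>; // 0..1
  declare function applyPreset(preset: Preset, html: string): string;

  async function buildPreset(html: string, maxIterations = 5): Promise<Preset> {
    let preset = await generateDraftPreset(html);
    for (let i = 0; i < maxIterations; i++) {
      const markdown = applyPreset(preset, html);
      // Stop once the judge considers the output clean enough.
      if ((await scoreCleanliness(markdown)) >= 0.95) break;
      preset = await refinePreset(preset, html, markdown);
    }
    return preset;
  }
  ```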

- Applying a page preset: Preset + HTML -> Markdown (see the sketch after this list)
- Programmatic mozilla/readability (a.k.a. "reader mode") as an HTML -> markdown API, included for comparison, to see how far naive heuristics get on the modern web.
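
Here is a sketch of the second pipeline, applying a preset. It assumes cheerio for DOM queries and turndown for the final HTML -> markdown step; the real implementation may differ:

```ts
import * as cheerio from "cheerio";
import TurndownService from "turndown";

// Fragile on purpose: the preset applies only when every detector matches.
function presetMatches(preset: Preset, html: string): boolean {
  const $ = cheerio.load(html);
  return preset.preset_match_detectors.every((sel) => $(sel).length > 0);
}

function applyPreset(preset: Preset, html: string): string {
  const $ = cheerio.load(html);
  // Keep only the main content.
  const main = $(preset.main_content_selectors.join(", "));
  // Trim the noise nested inside it.
  for (const sel of preset.main_content_filters) {
    main.find(sel).remove();
  }
  return new TurndownService().turndown($.html(main));
}
```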

I deployed a demo for you to try: https://readweb.osint.moe/ (temporary; it may run out of Firecrawl credits).
It compares these methods side by side:
- our preset generation flow
- Firecrawl URL -> markdown
- literal HTML -> markdown (similar to Firecrawl, but not exactly the same)
- Mozilla's Readability (reader mode)
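
For reference, the reader-mode pipeline is roughly the sketch below. The Readability and jsdom calls are those libraries' real APIs; using turndown for the markdown step is my assumption:

```ts
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

function readerModeMarkdown(html: string, url: string): string | null {
  // Readability needs a real DOM; jsdom provides one outside the browser.
  const doc = new JSDOM(html, { url }).window.document;
  const article = new Readability(doc).parse();
  if (!article?.content) return null; // Readability gave up on this page
  return new TurndownService().turndown(article.content);
}
```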

To run the demo yourself:
- Populate `.env` (see `.env.example`); Firecrawl is used for HTML fetching.
- `pnpm install`
- `pnpm run start:web`