Readweb: Extract token-efficient Markdown from websites


A specialized pipeline for extracting clean, token-efficient markdown from websites.

Naive HTML -> Markdown conversion produces a ton of garbage that wastes tokens and pollutes LLM workflows. Typical noise includes:

  • Navigation panels
  • Popups
  • Cookie consent banners
  • Table of contents
  • Headers / footers

This project implements three pipelines:

  1. "Page preset" generation: HTML -> Preset:

```ts
type Preset = {
  // Anchors to make this preset more fragile on purpose.
  // Elements that identify website engine layout go here.
  preset_match_detectors: CSSSelector[];
  // Main content extractors.
  main_content_selectors: CSSSelector[];
  // Filter selectors to trim the main content:
  // banners, subscription forms, sponsor content.
  main_content_filters: CSSSelector[];
};

type CSSSelector = string;
```
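For illustration, a preset for a WordPress-style blog might look like the following. The selectors here are invented for the example, not taken from the project; real presets are produced by the generation pipeline.

```typescript
// Hypothetical preset for a WordPress-style blog. All selectors below are
// invented for illustration; real presets are generated by the pipeline.
type CSSSelector = string;

type Preset = {
  preset_match_detectors: CSSSelector[];
  main_content_selectors: CSSSelector[];
  main_content_filters: CSSSelector[];
};

const wordpressPreset: Preset = {
  // If these don't match, the preset doesn't apply to the page.
  preset_match_detectors: ["body.wp-singular", "#wp-custom-header"],
  // Where the article body lives.
  main_content_selectors: ["article .entry-content"],
  // Noise to strip from the matched content.
  main_content_filters: [".sharedaddy", ".jp-relatedposts", "form.subscribe"],
};
```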

Preset generation uses a feedback loop that repeatedly enhances and applies the preset until the resulting markdown is clean.
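That loop can be sketched roughly as below. The function names (`generatePreset`, `applyPreset`, `scoreCleanliness`) and the stopping threshold are stand-ins, not the project's actual API:

```typescript
// Sketch of the preset-generation feedback loop. All names here are
// assumptions; the real pipeline's interfaces may differ.
type Preset = {
  main_content_selectors: string[];
  main_content_filters: string[];
};

type Deps = {
  // LLM call that proposes (or enhances) a preset, given evaluator feedback.
  generatePreset: (html: string, feedback: string) => Preset;
  // Preset + HTML -> Markdown.
  applyPreset: (preset: Preset, html: string) => string;
  // Judges how clean the markdown is and describes the remaining noise.
  scoreCleanliness: (markdown: string) => { score: number; feedback: string };
};

function refinePreset(html: string, deps: Deps, maxRounds = 5): Preset {
  let preset = deps.generatePreset(html, "");
  for (let round = 0; round < maxRounds; round++) {
    const markdown = deps.applyPreset(preset, html);
    const { score, feedback } = deps.scoreCleanliness(markdown);
    if (score >= 0.95) break; // markdown is clean enough; stop iterating
    preset = deps.generatePreset(html, feedback); // enhance with feedback
  }
  return preset;
}
```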

  2. Applying a page preset: Preset + HTML -> Markdown
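Applying a preset can be sketched as a pure function over a minimal DOM abstraction. `Doc` below is a stand-in interface, not the project's actual code; the real pipeline presumably uses an HTML parser and an HTML -> Markdown converter behind it:

```typescript
// Minimal stand-in for a parsed HTML document (illustrative only).
interface Doc {
  select(selector: string): string[]; // outerHTML of every match
  remove(selector: string): void;     // delete every match in place
}

type Preset = {
  main_content_selectors: string[];
  main_content_filters: string[];
};

// Preset + parsed HTML -> content HTML, ready for markdown conversion.
function applyPreset(preset: Preset, doc: Doc): string {
  // First strip the noise so it can't leak into the extracted content.
  for (const filter of preset.main_content_filters) doc.remove(filter);
  // Then collect the main-content fragments.
  const fragments = preset.main_content_selectors.flatMap((sel) => doc.select(sel));
  // A real pipeline would now run this through an HTML -> Markdown converter.
  return fragments.join("\n\n");
}
```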

  3. Programmatic mozilla/readability (a.k.a. "reader mode") as an HTML -> Markdown API, included to compare how far naive heuristics get on the modern web.

I deployed a demo for you to try: https://readweb.osint.moe/ (temporary; it may run out of Firecrawl credits).


It compares these methods side by side:

  • our preset generation flow
  • Firecrawl URL -> markdown
  • literal HTML -> markdown (similar to Firecrawl, but not exactly the same)
  • Mozilla's Readability (reader mode)

To run the demo yourself:

  1. Populate `.env` (see `.env.example`); Firecrawl is used for HTML fetching
  2. `pnpm install`
  3. `pnpm run start:web`