A specialized pipeline for extracting clean, token-efficient markdown from websites.
Naive HTML -> Markdown conversion produces a ton of garbage that wastes tokens and pollutes LLM workflows. Typical noise includes:
- Navigation panels
- Popups
- Cookie consent banners
- Table of contents
- Headers / footers
This project implements three pipelines:
- "Page preset" generation: HTML -> Preset:
type Preset = {
// anchors to make this preset more fragile on purpose.
// Elements that identify website engine layout go here.
preset_match_detectors: CSSSelector[];
// main content extractors
main_content_selectors: CSSSelector[];
// filter selectors to trim the main content.
// banners, subscription forms, sponsor content
main_content_filters: CSSSelector[];
};
type CSSSelector = string;
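
  For a concrete picture, here is a hypothetical preset for a generic WordPress-style blog. The selectors are illustrative, not real generator output:

  ```ts
  // Hypothetical example: none of these selectors come from the real generator.
  const exampleBlogPreset: Preset = {
    // Fragile on purpose: only pages rendered by this engine should match.
    preset_match_detectors: ['meta[name="generator"][content*="WordPress"]'],
    // Where the article body lives.
    main_content_selectors: ["article .entry-content"],
    // Noise to strip from inside the article body.
    main_content_filters: [".sharedaddy", ".jp-relatedposts", "form.newsletter-signup"],
  };
  ```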

  Preset generation uses a feedback loop that repeatedly refines the preset and re-applies it until the resulting markdown is clean.
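
  A minimal sketch of that loop; all four helpers are hypothetical (the first three would typically be LLM calls, and applyPreset is sketched under the second pipeline, after this list):

  ```ts
  // All helpers are hypothetical placeholders, not the project's actual API.
  declare function generateDraftPreset(html: string): Promise<Preset>;
  declare function refinePreset(preset: Preset, html: string, markdown: string): Promise<Preset>;
  declare function scoreCleanliness(markdown: string): Promise<number>; // 0..1
  declare function applyPreset(preset: Preset, html: string): string;

  async function buildPreset(html: string, maxIterations = 5): Promise<Preset> {
    let preset = await generateDraftPreset(html);
    for (let i = 0; i < maxIterations; i++) {
      const markdown = applyPreset(preset, html);
      // Stop once the judge considers the output clean enough.
      if ((await scoreCleanliness(markdown)) >= 0.95) break;
      preset = await refinePreset(preset, html, markdown);
    }
    return preset;
  }
  ```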

- Applying a page preset: Preset + HTML -> Markdown (see the sketch after this list)
- Programmatic mozilla/readability (a.k.a. "reader mode") as an HTML -> markdown API, included for comparison, to see how far naive heuristics get on the modern web.
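
Here is a sketch of the second pipeline, applying a preset. It assumes cheerio for DOM queries and turndown for the final HTML -> markdown step; the real implementation may differ:

```ts
import * as cheerio from "cheerio";
import TurndownService from "turndown";

// Fragile on purpose: the preset applies only when every detector matches.
function presetMatches(preset: Preset, html: string): boolean {
  const $ = cheerio.load(html);
  return preset.preset_match_detectors.every((sel) => $(sel).length > 0);
}

function applyPreset(preset: Preset, html: string): string {
  const $ = cheerio.load(html);
  // Keep only the main content.
  const main = $(preset.main_content_selectors.join(", "));
  // Trim the noise nested inside it.
  for (const sel of preset.main_content_filters) {
    main.find(sel).remove();
  }
  return new TurndownService().turndown($.html(main));
}
```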

I deployed a demo for you to try: https://readweb.osint.moe/ (temporary; it may run out of Firecrawl credits).
It compares these methods side by side:
- our preset generation flow
- Firecrawl URL -> markdown
- literal HTML -> markdown (similar to Firecrawl, but not exactly the same)
- Mozilla's Readability (reader mode)
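
For reference, the reader-mode pipeline is roughly the sketch below. The Readability and jsdom calls are those libraries' real APIs; using turndown for the markdown step is my assumption:

```ts
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

function readerModeMarkdown(html: string, url: string): string | null {
  // Readability needs a real DOM; jsdom provides one outside the browser.
  const doc = new JSDOM(html, { url }).window.document;
  const article = new Readability(doc).parse();
  if (!article?.content) return null; // Readability gave up on this page
  return new TurndownService().turndown(article.content);
}
```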

To run the demo yourself:
- Populate `.env` (see `.env.example`); Firecrawl is used for HTML fetching.
- `pnpm install`
- `pnpm run start:web`