DeepWiki to MdBook Converter

Relevant source files

README.md

Purpose and Scope

This document introduces the DeepWiki-to-mdBook Converter, a containerized system that extracts wiki documentation from DeepWiki.com and transforms it into searchable HTML documentation using mdBook. This page covers the system's purpose, core capabilities, and high-level architecture.

For detailed usage instructions, see Quick Start. For architecture details, see System Architecture. For configuration options, see Configuration Reference.

Sources: README.md:1-3

Problem Statement

DeepWiki.com provides AI-generated documentation for GitHub repositories as a web-based wiki. The converter addresses the following limitations of that platform:

Problem	Solution

Content locked in web platform	HTTP scraping with requests and BeautifulSoup4
Mermaid diagrams rendered client-side only	JavaScript payload extraction with fuzzy matching
No offline access	Self-contained HTML site generation
No searchability	mdBook's built-in search
Platform-specific formatting	Conversion to standard Markdown

Sources: README.md:3-15
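The "JavaScript payload extraction" row can be illustrated with a minimal sketch. It assumes, purely for illustration, that diagrams sit inside a JavaScript string as ```mermaid fences with newlines escaped as a backslash followed by `n`. The real payload layout is not described on this page, and `extract_diagrams` is a hypothetical helper name.

```python
import re

# Assumption: diagrams appear in a JS string as ```mermaid fences, with
# newlines escaped as the two characters backslash + "n".
MERMAID_FENCE = re.compile(r"```mermaid\\n(.*?)\\n```", re.DOTALL)


def extract_diagrams(js_payload: str) -> list:
    """Return the unescaped Mermaid sources found in a JavaScript payload."""
    diagrams = []
    for match in MERMAID_FENCE.finditer(js_payload):
        body = match.group(1)
        # Undo the two JS string escapes this sketch expects to see.
        body = body.replace("\\n", "\n").replace('\\"', '"')
        diagrams.append(body)
    return diagrams


# Hypothetical payload fragment; the raw string keeps the backslashes literal.
payload = r'push("intro\n```mermaid\ngraph TB\n  A[\"x\"] --> B\n```\nend")'
for diagram in extract_diagrams(payload):
    print(diagram)
```

A non-greedy match keeps adjacent diagrams from being merged into one when a payload contains several fences.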

Core Capabilities

The system provides the following capabilities through environment variable configuration:

Generic Repository Support: Works with any GitHub repository indexed by DeepWiki, selected via the REPO environment variable
Auto-Detection: Extracts repository metadata from Git remotes when available
Hierarchy Preservation: Maintains wiki page numbering and section structure
Diagram Intelligence: Extracts ~461 total diagrams and matches ~48 with sufficient context using fuzzy matching
Dual Output Modes: Full mdBook build or markdown-only extraction via the MARKDOWN_ONLY flag
No Authentication: Public HTTP scraping without API keys or credentials
Containerized Deployment: Single Docker image with all dependencies

Sources: README.md:5-15 README.md:42-51
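The fuzzy diagram matching is not specified in detail here, so the following is only one plausible reading of it: take the text that preceded a diagram, normalize whitespace and case, and search the converted page with progressively shorter chunks until one is found. The function names, chunk sizes, and the 20-character minimum are all assumptions.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_anchor(context: str, page: str, sizes=(300, 200, 100, 50, 30)) -> Optional[int]:
    """Return the offset in the normalized page just after the matched context.

    Tries the longest tail of the context first and falls back to shorter
    tails. Returns None when the context is too short or never matches.
    """
    norm_page = normalize(page)
    norm_context = normalize(context)
    for size in sizes:
        chunk = norm_context[-size:]
        if len(chunk) < 20:  # assumed threshold for "sufficient context"
            continue
        pos = norm_page.find(chunk)
        if pos != -1:
            return pos + len(chunk)
    return None


page = "Intro text.\n\nThe parser   builds a tree from tokens.\n\nNext section."
context = "Unrelated heading. The parser builds a tree from tokens."
print(find_anchor(context, page))
```

Trying long chunks first favors specific matches; the shorter fallbacks trade precision for recall, which is one way a large share of diagrams could end up unmatched.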

System Components

The system consists of three primary executables (deepwiki-scraper.py, mdbook, and mdbook-mermaid) coordinated by a shell orchestrator, build-docs.sh:

Main Components

graph TB
    User["docker run"]
    subgraph Container["deepwiki-scraper Container"]
        BuildDocs["build-docs.sh\n(Shell Orchestrator)"]
        Scraper["deepwiki-scraper.py\n(Python)"]
        MdBook["mdbook\n(Rust Binary)"]
        MermaidPlugin["mdbook-mermaid\n(Rust Binary)"]
    end
    subgraph External["External Systems"]
        DeepWiki["deepwiki.com\n(HTTP Scraping)"]
        GitHub["github.com\n(Edit Links)"]
    end
    subgraph Output["Output Directory"]
        MarkdownDir["markdown/\n(.md files)"]
        BookDir["book/\n(HTML site)"]
        ConfigFile["book.toml"]
    end
    User -->|Environment Variables| BuildDocs
    BuildDocs -->|Step 1: Execute| Scraper
    BuildDocs -->|Step 4: Execute| MdBook
    Scraper -->|HTTP GET| DeepWiki
    Scraper -->|Writes| MarkdownDir
    MdBook -->|Preprocessor| MermaidPlugin
    MdBook -->|Generates| BookDir
    BookDir -.->|Edit links| GitHub
    BuildDocs -->|Copies| ConfigFile
    style BuildDocs fill:#fff4e1
    style Scraper fill:#e8f5e9
    style MdBook fill:#f3e5f5

Component	Language	Purpose	Key Functions

build-docs.sh	Shell	Orchestration	Parse env vars, generate configs, call executables
deepwiki-scraper.py	Python 3.12	Content extraction	HTTP scraping, HTML parsing, diagram matching
mdbook	Rust	Site generation	Markdown to HTML, navigation, search
mdbook-mermaid	Rust	Diagram rendering	Inject JavaScript/CSS for Mermaid.js

Sources: README.md:146-157 Diagram 1, Diagram 5

Processing Pipeline

The system executes a three-phase pipeline. Phase 1 always runs; phases 2 and 3 run only when MARKDOWN_ONLY is false or unset. When MARKDOWN_ONLY=true, the pipeline skips directly to copying the markdown output:

Phase Details

stateDiagram-v2
    [*] --> ParseEnvVars
    ParseEnvVars --> ExecuteScraper : build-docs.sh phase 1
    state ExecuteScraper {
        [*] --> FetchHTML
        FetchHTML --> ConvertMarkdown : html2text
        ConvertMarkdown --> ExtractDiagrams : Regex on JS payload
        ExtractDiagrams --> FuzzyMatch : Progressive chunks
        FuzzyMatch --> WriteMarkdown : output/markdown/
        WriteMarkdown --> [*]
    }
    ExecuteScraper --> CheckMode
    state CheckMode <<choice>>
    CheckMode --> GenerateBookToml : MARKDOWN_ONLY=false
    CheckMode --> CopyOutput : MARKDOWN_ONLY=true
    GenerateBookToml --> GenerateSummary : build-docs.sh phase 2
    GenerateSummary --> ExecuteMdbook : build-docs.sh phase 3
    state ExecuteMdbook {
        [*] --> InitBook
        InitBook --> CopyMarkdown : mdbook init
        CopyMarkdown --> InstallMermaid : mdbook-mermaid install
        InstallMermaid --> BuildHTML : mdbook build
        BuildHTML --> [*] : output/book/
    }
    ExecuteMdbook --> CopyOutput
    CopyOutput --> [*]

Phase	Script	Key Operations	Output

1	deepwiki-scraper.py	HTTP fetch, BeautifulSoup4 parse, html2text conversion, fuzzy diagram matching	markdown/*.md
2	build-docs.sh	Generate book.toml, generate SUMMARY.md	Configuration files
3	mdbook + mdbook-mermaid	Markdown processing, Mermaid.js asset injection, HTML generation	book/ directory

Sources: README.md:121-145 Diagram 2
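The conditional branch on MARKDOWN_ONLY can be summarized as a small planning function. This is an illustrative sketch in Python, not the logic of build-docs.sh itself (which is a shell script); the step labels are taken from the table above, and `plan_phases` is a hypothetical name.

```python
def plan_phases(markdown_only: bool) -> list:
    """List the pipeline steps, in order, for the given mode."""
    steps = ["deepwiki-scraper.py"]  # phase 1 always runs
    if not markdown_only:
        steps += [
            "generate book.toml",      # phase 2
            "generate SUMMARY.md",     # phase 2
            "mdbook-mermaid install",  # phase 3
            "mdbook build",            # phase 3
        ]
    steps.append("copy output")  # final step in both modes
    return steps


print(plan_phases(markdown_only=True))
print(plan_phases(markdown_only=False))
```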

Input and Output

Input Requirements

Input	Format	Source	Example

REPO	owner/repo	Environment variable	facebook/react
BOOK_TITLE	String	Environment variable (optional)	React Documentation
BOOK_AUTHORS	String	Environment variable (optional)	Meta Open Source
MARKDOWN_ONLY	true/false	Environment variable (optional)	false

Sources: README.md:42-51
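As a rough illustration of how these four variables might be read and validated, here is a minimal Python sketch. The actual parsing happens in build-docs.sh; the `Settings` class, `load_settings` function, and the `owner/repo` pattern are assumptions, and the Git-remote auto-detection fallback for REPO is deliberately left out.

```python
import os
import re
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed validation rule for the owner/repo format.
REPO_PATTERN = re.compile(r"^[\w.-]+/[\w.-]+$")


@dataclass(frozen=True)
class Settings:
    repo: str
    book_title: Optional[str]
    book_authors: Optional[str]
    markdown_only: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, rejecting a malformed REPO."""
    repo = env.get("REPO", "").strip()
    if not REPO_PATTERN.match(repo):
        raise ValueError(f"REPO must look like owner/repo, got {repo!r}")
    return Settings(
        repo=repo,
        book_title=env.get("BOOK_TITLE") or None,
        book_authors=env.get("BOOK_AUTHORS") or None,
        # Anything other than "true" (case-insensitive) is treated as false.
        markdown_only=env.get("MARKDOWN_ONLY", "false").strip().lower() == "true",
    )


print(load_settings({"REPO": "facebook/react", "MARKDOWN_ONLY": "true"}))
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.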

Output Artifacts

Full Build Mode (MARKDOWN_ONLY=false or unset):

output/
├── markdown/
│   ├── 1-overview.md
│   ├── 2-quick-start.md
│   ├── section-3/
│   │   ├── 3-1-workspace.md
│   │   └── 3-2-parser.md
│   └── ...
├── book/
│   ├── index.html
│   ├── searchindex.json
│   ├── mermaid.min.js
│   └── ...
└── book.toml

Markdown-Only Mode (MARKDOWN_ONLY=true):

output/
└── markdown/
    ├── 1-overview.md
    ├── 2-quick-start.md
    └── ...

Sources: README.md:89-119
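The naming scheme visible in the trees above (top-level pages as `N-slug.md`, subpages under `section-N/`) can be sketched as a mapping from page number to path. The dotted input format such as "3.1" and the helper name `page_path` are assumptions made for this example; only the resulting file names come from the listing.

```python
from pathlib import PurePosixPath


def page_path(number: str, slug: str) -> PurePosixPath:
    """Map a page number like "2" or "3.1" plus a slug to its markdown path."""
    parts = number.split(".")
    filename = "-".join(parts + [slug]) + ".md"
    if len(parts) == 1:
        # Top-level pages sit directly under markdown/.
        return PurePosixPath("markdown") / filename
    # Subpages are grouped under section-<top-level number>/.
    return PurePosixPath("markdown") / f"section-{parts[0]}" / filename


print(page_path("1", "overview"))     # markdown/1-overview.md
print(page_path("3.1", "workspace"))  # markdown/section-3/3-1-workspace.md
```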

Technical Stack

The system combines multiple technology stacks in a single container using Docker multi-stage builds:

Runtime Dependencies

Component	Version	Purpose	Installation Method

Python	3.12-slim	Scraping runtime	Base image
requests	Latest	HTTP client	uv pip install
beautifulsoup4	Latest	HTML parser	uv pip install
html2text	Latest	HTML to Markdown	uv pip install
mdbook	Latest	Documentation builder	Compiled from source (Rust)
mdbook-mermaid	Latest	Diagram preprocessor	Compiled from source (Rust)

Build Architecture

The Dockerfile uses a two-stage build:

Stage 1 (rust:latest): Compiles mdbook and mdbook-mermaid binaries (~1.5 GB, discarded)
Stage 2 (python:3.12-slim): Copies binaries into Python runtime (~300-400 MB final)

Sources: README.md:146-157 Diagram 3

File System Interaction

The system interacts with three key filesystem locations:

Temporary Directory Workflow:

deepwiki-scraper.py writes initial markdown to /tmp/wiki_temp/
After diagram enhancement, files move atomically to /output/markdown/
build-docs.sh copies final HTML to /output/book/

This ensures no partial states exist in the output directory.

Sources: README.md:220-227 README.md:136
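The stage-then-publish idea can be sketched as follows. One caveat worth stating: a rename is atomic on POSIX only within a single filesystem, so this sketch stages the temporary file beside its destination rather than in /tmp. That is a design choice of the example, not a description of how deepwiki-scraper.py moves its files, and `write_atomically` is a hypothetical helper.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(final: Path, text: str) -> None:
    """Write text to a temp file beside `final`, then rename it into place."""
    final.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=final.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # Readers see either the old file or the complete new one.
        os.replace(tmp_name, final)
    except BaseException:
        # Clean up the staged file so no partial output is left behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "markdown" / "1-overview.md"
    write_atomically(target, "# Overview\n")
    print(target.read_text(encoding="utf-8"))
```

If the staging directory and the output directory live on different filesystems (for example a container's /tmp and a mounted volume), a plain rename fails and a move falls back to copy-and-delete, which is no longer atomic.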

Configuration Philosophy

The system operates on three configuration principles:

Environment-Driven: All customization happens via environment variables; no file editing is required
Auto-Detection: Intelligent defaults from Git remotes (repository URL, author name)
Zero-Configuration: Minimal required inputs (REPO, or auto-detection from the current directory)
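Auto-detection from a Git remote could work roughly as sketched below: given a remote URL (for example the output of `git config --get remote.origin.url`), derive the `owner/repo` value that REPO expects. The regular expression and the function name `repo_from_remote` are assumptions for illustration; the detection logic actually used by the system is not shown on this page.

```python
import re
from typing import Optional

# Accepts both SSH (git@github.com:owner/repo.git) and HTTPS remotes.
REMOTE_PATTERN = re.compile(
    r"github\.com[:/](?P<owner>[^/\s]+)/(?P<repo>[^/\s]+?)(?:\.git)?/?$"
)


def repo_from_remote(url: str) -> Optional[str]:
    """Return "owner/repo" for a GitHub remote URL, or None if it is not one."""
    match = REMOTE_PATTERN.search(url.strip())
    if match is None:
        return None
    return f"{match['owner']}/{match['repo']}"


print(repo_from_remote("git@github.com:facebook/react.git"))  # facebook/react
print(repo_from_remote("https://github.com/facebook/react"))  # facebook/react
```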

Minimal Example:

This single command triggers the complete extraction, transformation, and build pipeline.

For complete configuration options, see Configuration Reference. For deployment patterns, see Quick Start.

Sources: README.md:22-51 README.md:220-227
