DeepWiki to MdBook Converter

Purpose and Scope

This document introduces the DeepWiki-to-mdBook Converter, a containerized system that extracts wiki documentation from DeepWiki.com and transforms it into searchable HTML documentation using mdBook. This page covers the system's purpose, core capabilities, and high-level architecture.

For detailed usage instructions, see Quick Start. For architecture details, see System Architecture. For configuration options, see Configuration Reference.

Sources: README.md:1-3

Problem Statement

DeepWiki.com provides AI-generated documentation for GitHub repositories as a web-based wiki. The converter addresses the following limitations of that platform:

| Problem | Solution |
| --- | --- |
| Content locked in web platform | HTTP scraping with requests and BeautifulSoup4 |
| Mermaid diagrams rendered client-side only | JavaScript payload extraction with fuzzy matching |
| No offline access | Self-contained HTML site generation |
| No searchability | mdBook's built-in search |
| Platform-specific formatting | Conversion to standard Markdown |

Sources: README.md:3-15
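
To make the "JavaScript payload extraction" row more concrete, here is a minimal sketch of pulling Mermaid sources out of a page payload with a regular expression. It rests on assumptions not confirmed by the source: that diagrams sit inside mermaid code fences within a JavaScript string, and that newlines there appear as the two characters backslash and `n`. The names `extract_diagrams` and `unescape_js` are hypothetical and are not taken from deepwiki-scraper.py.

```python
import json
import re
from typing import List

FENCE = "`" * 3  # built indirectly so the example nests cleanly in documentation

# Assumption: diagrams are wrapped in mermaid fences and the newline after the
# opening fence is escaped as backslash + "n" inside the JavaScript string.
MERMAID_RE = re.compile(
    re.escape(FENCE) + r"mermaid\\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def unescape_js(text: str) -> str:
    """Decode JavaScript-style escapes, falling back to a plain replace."""
    try:
        return json.loads('"' + text + '"')
    except ValueError:
        return text.replace("\\n", "\n")


def extract_diagrams(payload: str) -> List[str]:
    """Return the unescaped source of every mermaid block found in payload."""
    return [unescape_js(m.group(1)).strip() for m in MERMAID_RE.finditer(payload)]
```

Decoding through `json.loads` handles doubled backslashes and escaped quotes together, which a chain of `str.replace` calls tends to get wrong.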

Core Capabilities

The system provides the following capabilities through environment variable configuration:

  • Generic Repository Support: Works with any GitHub repository indexed by DeepWiki, selected via the REPO environment variable
  • Auto-Detection: Extracts repository metadata from Git remotes when available
  • Hierarchy Preservation: Maintains wiki page numbering and section structure
  • Diagram Intelligence: Extracts ~461 diagrams in total and matches the ~48 that have sufficient context using fuzzy matching
  • Dual Output Modes: Full mdBook build or Markdown-only extraction via the MARKDOWN_ONLY flag
  • No Authentication: Public HTTP scraping without API keys or credentials
  • Containerized Deployment: Single Docker image with all dependencies

Sources: README.md:5-15 README.md:42-51
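
As an illustration of the auto-detection capability, the sketch below derives an `owner/repo` value from a Git remote URL. It is not the project's actual logic: the regular expression, the function names, and the choice of `remote.origin.url` are assumptions made for this example.

```python
import re
import subprocess
from typing import Optional

# Matches both HTTPS and SSH GitHub remotes, with or without a ".git" suffix.
REMOTE_RE = re.compile(
    r"github\.com[:/](?P<owner>[^/\s]+)/(?P<repo>[^/\s]+?)(?:\.git)?/?$"
)


def parse_github_remote(url: str) -> Optional[str]:
    """Return "owner/repo" for a GitHub remote URL, or None if it is not one."""
    match = REMOTE_RE.search(url.strip())
    if match is None:
        return None
    return match.group("owner") + "/" + match.group("repo")


def detect_repo(cwd: str = ".") -> Optional[str]:
    """Ask git for the origin remote of cwd; None if git or the remote is missing."""
    try:
        result = subprocess.run(
            ["git", "config", "--get", "remote.origin.url"],
            cwd=cwd, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_github_remote(result.stdout)
```

An explicit REPO value would normally take precedence, with detection used only as a fallback.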

System Components

The system consists of three primary executable components coordinated by a shell orchestrator:

Main Components

graph TB
    User["docker run"]

    subgraph Container["deepwiki-scraper Container"]
        BuildDocs["build-docs.sh\n(Shell Orchestrator)"]
        Scraper["deepwiki-scraper.py\n(Python)"]
        MdBook["mdbook\n(Rust Binary)"]
        MermaidPlugin["mdbook-mermaid\n(Rust Binary)"]
    end

    subgraph External["External Systems"]
        DeepWiki["deepwiki.com\n(HTTP Scraping)"]
        GitHub["github.com\n(Edit Links)"]
    end

    subgraph Output["Output Directory"]
        MarkdownDir["markdown/\n(.md files)"]
        BookDir["book/\n(HTML site)"]
        ConfigFile["book.toml"]
    end

    User -->|Environment Variables| BuildDocs
    BuildDocs -->|Step 1: Execute| Scraper
    BuildDocs -->|Step 4: Execute| MdBook
    Scraper -->|HTTP GET| DeepWiki
    Scraper -->|Writes| MarkdownDir
    MdBook -->|Preprocessor| MermaidPlugin
    MdBook -->|Generates| BookDir
    BookDir -.->|Edit links| GitHub
    BuildDocs -->|Copies| ConfigFile

    style BuildDocs fill:#fff4e1
    style Scraper fill:#e8f5e9
    style MdBook fill:#f3e5f5

| Component | Language | Purpose | Key Functions |
| --- | --- | --- | --- |
| build-docs.sh | Shell | Orchestration | Parse env vars, generate configs, call executables |
| deepwiki-scraper.py | Python 3.12 | Content extraction | HTTP scraping, HTML parsing, diagram matching |
| mdbook | Rust | Site generation | Markdown to HTML, navigation, search |
| mdbook-mermaid | Rust | Diagram rendering | Inject JavaScript/CSS for Mermaid.js |

Sources: README.md:146-157 Diagram 1, Diagram 5
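
The "generate configs" step of the orchestrator can be pictured with the following sketch, which renders a small book.toml. The real orchestrator is a shell script, so this Python version is only an illustration; `render_book_toml` is a hypothetical name, and the exact set of keys the project writes is an assumption. The keys shown (`[book]`, `[output.html] git-repository-url`, `[preprocessor.mermaid]`) are standard mdBook and mdbook-mermaid settings.

```python
import json


def render_book_toml(title: str, authors: str, repo: str) -> str:
    """Render a minimal book.toml that enables the mermaid preprocessor."""
    # JSON string quoting is also valid TOML basic-string quoting for
    # typical titles and author names.
    q = json.dumps
    return "\n".join([
        "[book]",
        f"title = {q(title)}",
        f"authors = [{q(authors)}]",
        'language = "en"',
        "",
        "[output.html]",
        f"git-repository-url = {q('https://github.com/' + repo)}",
        "",
        "[preprocessor.mermaid]",
        'command = "mdbook-mermaid"',
        "",
    ])
```

Quoting values through a serializer instead of pasting them into the file keeps a title that contains quotes or backslashes from producing invalid TOML.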

Processing Pipeline

The system executes a three-phase pipeline with conditional execution based on the MARKDOWN_ONLY environment variable:

Phase Details

stateDiagram-v2
    [*] --> ParseEnvVars
    ParseEnvVars --> ExecuteScraper : build-docs.sh phase 1

    state ExecuteScraper {
        [*] --> FetchHTML
        FetchHTML --> ConvertMarkdown : html2text
        ConvertMarkdown --> ExtractDiagrams : Regex on JS payload
        ExtractDiagrams --> FuzzyMatch : Progressive chunks
        FuzzyMatch --> WriteMarkdown : output/markdown/
        WriteMarkdown --> [*]
    }

    ExecuteScraper --> CheckMode
    state CheckMode <<choice>>
    CheckMode --> GenerateBookToml : MARKDOWN_ONLY=false
    CheckMode --> CopyOutput : MARKDOWN_ONLY=true
    GenerateBookToml --> GenerateSummary : build-docs.sh phase 2
    GenerateSummary --> ExecuteMdbook : build-docs.sh phase 3

    state ExecuteMdbook {
        [*] --> InitBook
        InitBook --> CopyMarkdown : mdbook init
        CopyMarkdown --> InstallMermaid : mdbook-mermaid install
        InstallMermaid --> BuildHTML : mdbook build
        BuildHTML --> [*] : output/book/
    }

    ExecuteMdbook --> CopyOutput
    CopyOutput --> [*]

| Phase | Script | Key Operations | Output |
| --- | --- | --- | --- |
| 1 | deepwiki-scraper.py | HTTP fetch, BeautifulSoup4 parse, html2text conversion, fuzzy diagram matching | markdown/*.md |
| 2 | build-docs.sh | Generate book.toml, generate SUMMARY.md | Configuration files |
| 3 | mdbook + mdbook-mermaid | Markdown processing, Mermaid.js asset injection, HTML generation | book/ directory |

Sources: README.md:121-145 Diagram 2
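
The "progressive chunks" fuzzy-matching step in phase 1 could work roughly as sketched below: the text that preceded a diagram in the payload is searched for in the converted page, starting with a long trailing chunk and falling back to shorter ones. This is one plausible reading, not the scraper's confirmed algorithm. The chunk sizes, the whitespace normalization, and the names `normalize` and `find_anchor` are all assumptions.

```python
import re
from typing import Optional, Sequence

CHUNK_SIZES = (300, 200, 150, 100, 80)  # assumed values, longest first


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_anchor(page: str, context: str,
                sizes: Sequence[int] = CHUNK_SIZES) -> Optional[int]:
    """Return the index of the page line after which the diagram belongs.

    Tries progressively shorter trailing chunks of the context. Returns None
    when the context is too short or no chunk occurs in the page.
    """
    ctx = normalize(context)
    if len(ctx) < min(sizes):
        return None  # not enough context to place the diagram reliably
    lines = page.splitlines()
    for size in sizes:
        chunk = ctx[-size:]
        seen = ""
        for index, line in enumerate(lines):
            seen = normalize(seen + " " + line)
            if chunk in seen:
                return index  # first line at which the chunk is fully seen
    return None
```

Diagrams for which `find_anchor` returns None would be left out, which is consistent with only a subset of extracted diagrams ending up in the output.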

Input and Output

Input Requirements

| Input | Format | Source | Example |
| --- | --- | --- | --- |
| REPO | owner/repo | Environment variable | facebook/react |
| BOOK_TITLE | String | Environment variable (optional) | React Documentation |
| BOOK_AUTHORS | String | Environment variable (optional) | Meta Open Source |
| MARKDOWN_ONLY | true/false | Environment variable (optional) | false |

Sources: README.md:42-51
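
The inputs above could be read and validated along the following lines. This is a sketch under assumptions: the default title, the fallback of the author to the repository owner, and the strict `owner/repo` pattern are illustrative choices, and `Config` and `load_config` are hypothetical names. It also requires REPO explicitly and leaves out auto-detection.

```python
import os
import re
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Config:
    repo: str
    book_title: str
    book_authors: str
    markdown_only: bool


def load_config(env: Optional[Mapping[str, str]] = None) -> Config:
    """Build a Config from environment variables, validating REPO."""
    env = os.environ if env is None else env
    repo = env.get("REPO", "").strip()
    if not re.fullmatch(r"[\w.-]+/[\w.-]+", repo):
        raise ValueError("REPO must look like owner/repo, got %r" % repo)
    owner = repo.split("/")[0]
    return Config(
        repo=repo,
        book_title=env.get("BOOK_TITLE") or "Documentation",  # assumed default
        book_authors=env.get("BOOK_AUTHORS") or owner,        # assumed default
        markdown_only=env.get("MARKDOWN_ONLY", "false").strip().lower() == "true",
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise without touching the real environment.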

Output Artifacts

Full Build Mode (MARKDOWN_ONLY=false or unset):

output/
├── markdown/
│   ├── 1-overview.md
│   ├── 2-quick-start.md
│   ├── section-3/
│   │   ├── 3-1-workspace.md
│   │   └── 3-2-parser.md
│   └── ...
├── book/
│   ├── index.html
│   ├── searchindex.json
│   ├── mermaid.min.js
│   └── ...
└── book.toml

Markdown-Only Mode (MARKDOWN_ONLY=true):

output/
└── markdown/
    ├── 1-overview.md
    ├── 2-quick-start.md
    └── ...

Sources: README.md:89-119
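
The file names in these trees suggest a simple mapping from a wiki page number and title to an output path. The sketch below reproduces the names shown above. The slug rule and the function name `page_path` are assumptions inferred from the listing, not taken from the scraper.

```python
import re


def page_path(number: str, title: str) -> str:
    """Map a page number such as "3.1" and its title to a relative .md path."""
    parts = number.split(".")
    # Lowercase the title and join runs of non-alphanumerics with hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    name = "-".join(parts + [slug]) + ".md"
    if len(parts) > 1:
        # Subpages are grouped under a directory named after the top-level number.
        return "section-" + parts[0] + "/" + name
    return name
```

Under these assumptions, top-level pages stay in the root of markdown/ while subpages of section 3 land in section-3/.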

Technical Stack

The system combines multiple technology stacks in a single container using Docker multi-stage builds:

Runtime Dependencies

| Component | Version | Purpose | Installation Method |
| --- | --- | --- | --- |
| Python | 3.12-slim | Scraping runtime | Base image |
| requests | Latest | HTTP client | uv pip install |
| beautifulsoup4 | Latest | HTML parser | uv pip install |
| html2text | Latest | HTML to Markdown | uv pip install |
| mdbook | Latest | Documentation builder | Compiled from source (Rust) |
| mdbook-mermaid | Latest | Diagram preprocessor | Compiled from source (Rust) |

Build Architecture

The Dockerfile uses a two-stage build:

  1. Stage 1 (rust:latest): Compiles mdbook and mdbook-mermaid binaries (~1.5 GB, discarded)
  2. Stage 2 (python:3.12-slim): Copies binaries into Python runtime (~300-400 MB final)

Sources: README.md:146-157 Diagram 3

File System Interaction

The system interacts with three key filesystem locations:

Temporary Directory Workflow:

  1. deepwiki-scraper.py writes initial markdown to /tmp/wiki_temp/
  2. After diagram enhancement, files move atomically to /output/markdown/
  3. build-docs.sh copies final HTML to /output/book/

This ensures no partial states exist in the output directory.

Sources: README.md:220-227 README.md:136
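
The staging pattern described above can be sketched as follows: files are written to a temporary directory first and moved into the output directory only once they are complete. The function name `publish` and its dictionary input are hypothetical. Note also that `shutil.move` is a rename (atomic per file) only when source and destination share a filesystem, and falls back to copy-and-delete otherwise.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict


def publish(files: Dict[str, str], output_dir: Path) -> None:
    """Write files to a staging directory, then move them into output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp)
        for relative, text in files.items():
            path = staging / relative
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(text, encoding="utf-8")
        # Any enhancement pass (for example diagram insertion) would run here,
        # while the files are still in the staging directory.
        for path in sorted(staging.rglob("*.md")):
            dest = output_dir / path.relative_to(staging)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(dest))
```

Because the output directory only ever receives finished files, an interrupted run leaves at worst an incomplete set of pages, never a half-written page.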

Configuration Philosophy

The system operates on three configuration principles:

  1. Environment-Driven: All customization via environment variables, no file editing required
  2. Auto-Detection: Intelligent defaults from Git remotes (repository URL, author name)
  3. Zero-Configuration: Minimal required inputs (REPO or auto-detect from current directory)

Minimal Example:

This single command triggers the complete extraction, transformation, and build pipeline.
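
As a hedged illustration of such an invocation, the sketch below assembles a `docker run` argument list in Python. The image name `deepwiki-scraper`, the host output path, and the helper name `docker_run_args` are assumptions for this example. The REPO and MARKDOWN_ONLY variables and the /output mount point come from the descriptions above.

```python
import shlex
from typing import List


def docker_run_args(repo: str, output_dir: str,
                    image: str = "deepwiki-scraper",  # hypothetical image name
                    markdown_only: bool = False) -> List[str]:
    """Build the argument list for a one-shot conversion container."""
    args = ["docker", "run", "--rm", "-e", "REPO=" + repo]
    if markdown_only:
        args += ["-e", "MARKDOWN_ONLY=true"]
    # Mount the host directory at /output so the results outlive the container.
    args += ["-v", output_dir + ":/output", image]
    return args


if __name__ == "__main__":
    # Print the command for inspection; pass the list to subprocess.run to execute it.
    print(shlex.join(docker_run_args("facebook/react", "/abs/path/to/output")))
```

Building the command as a list avoids shell-quoting problems if the output path contains spaces.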

For complete configuration options, see Configuration Reference. For deployment patterns, see Quick Start.

Sources: README.md:22-51 README.md:220-227
