Purpose and Scope
This document introduces the DeepWiki-to-mdBook Converter, a containerized system that extracts wiki documentation from DeepWiki.com and transforms it into searchable HTML documentation using mdBook. This page covers the system's purpose, core capabilities, and high-level architecture.
For detailed usage instructions, see Quick Start. For architecture details, see System Architecture. For configuration options, see Configuration Reference.
Sources: README.md:1-3
Problem Statement
DeepWiki.com provides AI-generated documentation for GitHub repositories as a web-based wiki. The system addresses the following limitations:
| Limitation | Solution |
|---|---|
| Content locked in web platform | HTTP scraping with requests and BeautifulSoup4 |
| Mermaid diagrams rendered client-side only | JavaScript payload extraction with fuzzy matching |
| No offline access | Self-contained HTML site generation |
| No searchability | mdBook's built-in search |
| Platform-specific formatting | Conversion to standard Markdown |
Sources: README.md:3-15
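The client-side diagram limitation deserves illustration: Mermaid sources live in the page's JavaScript payload rather than in the rendered HTML, so they must be recovered with a pattern match. A minimal sketch, assuming diagrams appear inside the payload as escaped ```mermaid fences with `\n` escape sequences (the real payload format may differ):

```python
import re

def extract_mermaid_blocks(js_payload: str) -> list[str]:
    """Pull Mermaid diagram sources out of a JavaScript payload.

    Assumes diagrams are embedded as ```mermaid ... ``` fences with
    newlines escaped as \\n -- a simplification of the real format.
    """
    pattern = re.compile(r"```mermaid\\n(.*?)```", re.DOTALL)
    # Unescape the \n sequences so each block is valid Mermaid source
    return [m.replace("\\n", "\n").strip() for m in pattern.findall(js_payload)]

payload = 'var d="```mermaid\\ngraph TD\\nA-->B\\n```";'
print(extract_mermaid_blocks(payload))  # one recovered diagram
```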
Core Capabilities
The system provides the following capabilities through environment variable configuration:
- Generic Repository Support : Works with any GitHub repository indexed by DeepWiki via REPO environment variable
- Auto-Detection : Extracts repository metadata from Git remotes when available
- Hierarchy Preservation : Maintains wiki page numbering and section structure
- Diagram Intelligence : Extracts ~461 total diagrams, matches ~48 with sufficient context using fuzzy matching
- Dual Output Modes : Full mdBook build or markdown-only extraction via MARKDOWN_ONLY flag
- No Authentication : Public HTTP scraping without API keys or credentials
- Containerized Deployment : Single Docker image with all dependencies
Sources: README.md:5-15 README.md:42-51
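The fuzzy-matching capability can be sketched with the stdlib `difflib`: each extracted diagram carries a snippet of surrounding context, and the matcher looks for the page location whose text is most similar. The real scraper compares progressive chunk sizes; this simplified version (function and variable names are illustrative) scores a context snippet against candidate paragraphs:

```python
from difflib import SequenceMatcher

def best_match(context: str, paragraphs: list[str], threshold: float = 0.5):
    """Return the index of the paragraph most similar to the diagram's
    context snippet, or None if nothing clears the threshold."""
    best_idx, best_ratio = None, threshold
    for i, para in enumerate(paragraphs):
        ratio = SequenceMatcher(None, context.lower(), para.lower()).ratio()
        if ratio > best_ratio:
            best_idx, best_ratio = i, ratio
    return best_idx

paras = ["Installation steps for the CLI.",
         "The request pipeline fetches each wiki page.",
         "License and attribution."]
print(best_match("request pipeline fetches wiki pages", paras))  # index 1
```

Diagrams whose context never clears the threshold are the gap between the ~461 extracted and the ~48 placed.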
System Components
The system consists of three primary executable components coordinated by a shell orchestrator:
Main Components
```mermaid
graph TB
    User["docker run"]
    subgraph Container["deepwiki-scraper Container"]
        BuildDocs["build-docs.sh\n(Shell Orchestrator)"]
        Scraper["deepwiki-scraper.py\n(Python)"]
        MdBook["mdbook\n(Rust Binary)"]
        MermaidPlugin["mdbook-mermaid\n(Rust Binary)"]
    end
    subgraph External["External Systems"]
        DeepWiki["deepwiki.com\n(HTTP Scraping)"]
        GitHub["github.com\n(Edit Links)"]
    end
    subgraph Output["Output Directory"]
        MarkdownDir["markdown/\n(.md files)"]
        BookDir["book/\n(HTML site)"]
        ConfigFile["book.toml"]
    end
    User -->|Environment Variables| BuildDocs
    BuildDocs -->|Step 1: Execute| Scraper
    BuildDocs -->|Step 4: Execute| MdBook
    Scraper -->|HTTP GET| DeepWiki
    Scraper -->|Writes| MarkdownDir
    MdBook -->|Preprocessor| MermaidPlugin
    MdBook -->|Generates| BookDir
    BookDir -.->|Edit links| GitHub
    BuildDocs -->|Copies| ConfigFile
    style BuildDocs fill:#fff4e1
    style Scraper fill:#e8f5e9
    style MdBook fill:#f3e5f5
```

| Component | Language | Role | Responsibilities |
|---|---|---|---|
| build-docs.sh | Shell | Orchestration | Parse env vars, generate configs, call executables |
| deepwiki-scraper.py | Python 3.12 | Content extraction | HTTP scraping, HTML parsing, diagram matching |
| mdbook | Rust | Site generation | Markdown to HTML, navigation, search |
| mdbook-mermaid | Rust | Diagram rendering | Inject JavaScript/CSS for Mermaid.js |
Sources: README.md:146-157 Diagram 1, Diagram 5
Processing Pipeline
The system executes a three-phase pipeline with conditional execution based on the MARKDOWN_ONLY environment variable:
Phase Details
```mermaid
stateDiagram-v2
    [*] --> ParseEnvVars
    ParseEnvVars --> ExecuteScraper : build-docs.sh phase 1
    state ExecuteScraper {
        [*] --> FetchHTML
        FetchHTML --> ConvertMarkdown : html2text
        ConvertMarkdown --> ExtractDiagrams : Regex on JS payload
        ExtractDiagrams --> FuzzyMatch : Progressive chunks
        FuzzyMatch --> WriteMarkdown : output/markdown/
        WriteMarkdown --> [*]
    }
    ExecuteScraper --> CheckMode
    state CheckMode <<choice>>
    CheckMode --> GenerateBookToml : MARKDOWN_ONLY=false
    CheckMode --> CopyOutput : MARKDOWN_ONLY=true
    GenerateBookToml --> GenerateSummary : build-docs.sh phase 2
    GenerateSummary --> ExecuteMdbook : build-docs.sh phase 3
    state ExecuteMdbook {
        [*] --> InitBook
        InitBook --> CopyMarkdown : mdbook init
        CopyMarkdown --> InstallMermaid : mdbook-mermaid install
        InstallMermaid --> BuildHTML : mdbook build
        BuildHTML --> [*] : output/book/
    }
    ExecuteMdbook --> CopyOutput
    CopyOutput --> [*]
```

| Phase | Component | Operations | Output |
|---|---|---|---|
| 1 | deepwiki-scraper.py | HTTP fetch, BeautifulSoup4 parse, html2text conversion, fuzzy diagram matching | markdown/*.md |
| 2 | build-docs.sh | Generate book.toml, generate SUMMARY.md | Configuration files |
| 3 | mdbook + mdbook-mermaid | Markdown processing, Mermaid.js asset injection, HTML generation | book/ directory |
Sources: README.md:121-145 Diagram 2
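Phase 2's SUMMARY.md generation can be sketched from the filename convention visible in the output tree (top-level pages named `N-title.md`, subsections under `section-N/` named `N-M-title.md`). A simplified version assuming that naming scheme; the real script's logic may differ:

```python
import re

def title_from(stem: str) -> str:
    """Strip the numeric prefix ('1-' or '3-1-') and prettify the rest."""
    return re.sub(r"^\d+(?:-\d+)*-", "", stem).replace("-", " ").title()

def generate_summary(md_files: list[str]) -> str:
    """Build an mdBook SUMMARY.md from numbered wiki filenames.

    Assumes lexicographic sort matches page order, which holds for
    single-digit prefixes.
    """
    lines = ["# Summary", ""]
    for path in sorted(md_files):
        directory, _, name = path.rpartition("/")
        indent = "  " if directory.startswith("section-") else ""
        lines.append(f"{indent}- [{title_from(name.removesuffix('.md'))}]({path})")
    return "\n".join(lines)

print(generate_summary(["1-overview.md", "2-quick-start.md",
                        "section-3/3-1-workspace.md"]))
```

mdBook derives the sidebar navigation entirely from this file, so the indentation level here is what produces the nested section structure.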
Input and Output
Input Requirements
| Variable | Format | Source | Example |
|---|---|---|---|
| REPO | owner/repo | Environment variable | facebook/react |
| BOOK_TITLE | String | Environment variable (optional) | React Documentation |
| BOOK_AUTHORS | String | Environment variable (optional) | Meta Open Source |
| MARKDOWN_ONLY | true/false | Environment variable (optional) | false |
Sources: README.md:42-51
Output Artifacts
Full Build Mode (MARKDOWN_ONLY=false or unset):
```
output/
├── markdown/
│   ├── 1-overview.md
│   ├── 2-quick-start.md
│   ├── section-3/
│   │   ├── 3-1-workspace.md
│   │   └── 3-2-parser.md
│   └── ...
├── book/
│   ├── index.html
│   ├── searchindex.json
│   ├── mermaid.min.js
│   └── ...
└── book.toml
```

Markdown-Only Mode (MARKDOWN_ONLY=true):
```
output/
└── markdown/
    ├── 1-overview.md
    ├── 2-quick-start.md
    └── ...
```

Sources: README.md:89-119
Technical Stack
The system combines multiple technology stacks in a single container using Docker multi-stage builds:
Runtime Dependencies
| Dependency | Version | Purpose | Installation |
|---|---|---|---|
| Python | 3.12-slim | Scraping runtime | Base image |
| requests | Latest | HTTP client | uv pip install |
| beautifulsoup4 | Latest | HTML parser | uv pip install |
| html2text | Latest | HTML to Markdown | uv pip install |
| mdbook | Latest | Documentation builder | Compiled from source (Rust) |
| mdbook-mermaid | Latest | Diagram preprocessor | Compiled from source (Rust) |
Build Architecture
The Dockerfile uses a two-stage build:
- Stage 1 (rust:latest): Compiles mdbook and mdbook-mermaid binaries (~1.5 GB, discarded)
- Stage 2 (python:3.12-slim): Copies binaries into Python runtime (~300-400 MB final)
Sources: README.md:146-157 Diagram 3
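The two-stage layout follows the standard Docker multi-stage pattern. A sketch of what such a Dockerfile looks like; the paths, install commands, and entrypoint here are illustrative assumptions, not taken from the actual Dockerfile (which, per the table above, installs the Python packages with uv):

```dockerfile
# Stage 1: build the Rust binaries (large toolchain image, later discarded)
FROM rust:latest AS builder
RUN cargo install mdbook mdbook-mermaid

# Stage 2: slim Python runtime; only the compiled binaries are copied in
FROM python:3.12-slim
COPY --from=builder /usr/local/cargo/bin/mdbook /usr/local/bin/
COPY --from=builder /usr/local/cargo/bin/mdbook-mermaid /usr/local/bin/
RUN pip install --no-cache-dir requests beautifulsoup4 html2text
COPY deepwiki-scraper.py build-docs.sh /usr/local/bin/
CMD ["build-docs.sh"]
```

Because the `builder` stage is never shipped, the final image pays only for the Python base plus two static binaries, which is what keeps it in the ~300-400 MB range.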
File System Interaction
The system interacts with three key filesystem locations:
Temporary Directory Workflow :
- deepwiki-scraper.py writes initial markdown to /tmp/wiki_temp/
- After diagram enhancement, files move atomically to /output/markdown/
- build-docs.sh copies final HTML to /output/book/
This ensures no partial states exist in the output directory.
Sources: README.md:220-227 README.md:136
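The temp-then-move workflow can be sketched with the stdlib: write everything under a scratch directory first, then swap the finished tree into place so readers never observe a half-written markdown/ directory. This is a simplified model (the `publish` helper is illustrative, and a cross-device move degrades to copy-then-delete, so the real script's guarantees may differ):

```python
import shutil
import tempfile
from pathlib import Path

def publish(output_root: str, build) -> None:
    """Run build(work_dir) in a scratch directory, then move the
    finished tree into output_root/markdown in one step."""
    final = Path(output_root) / "markdown"
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "wiki_temp"
        work.mkdir()
        build(work)                         # writes .md files into work/
        if final.exists():
            shutil.rmtree(final)            # clear any previous run
        shutil.move(str(work), str(final))  # single rename on same device

out = tempfile.mkdtemp()
publish(out, lambda d: (d / "1-overview.md").write_text("# Overview\n"))
print((Path(out) / "markdown" / "1-overview.md").exists())
```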
Configuration Philosophy
The system operates on three configuration principles:
- Environment-Driven : All customization via environment variables, no file editing required
- Auto-Detection : Intelligent defaults from Git remotes (repository URL, author name)
- Zero-Configuration : Minimal required inputs (REPO or auto-detect from current directory)
Minimal Example :
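A plausible invocation, assuming the image is tagged `deepwiki-scraper` as in the component diagram and that output is mounted at `/output` (both are assumptions, not taken from the README):

```shell
docker run --rm \
  -e REPO=facebook/react \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper
```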
This single command triggers the complete extraction, transformation, and build pipeline.
For complete configuration options, see Configuration Reference. For deployment patterns, see Quick Start.
Sources: README.md:22-51 README.md:220-227