ocr_md_book.py turns a folder of scanned page images into clean Markdown, merges every page into a single book.md, and finally packages the result as an EPUB (plus optional AZW3/MOBI if Calibre is available). The OCR step relies on Alibaba DashScope (Tongyi) multimodal models and includes light post-processing to remove headers, footers, page numbers, and unwanted hard wraps. The tool is resumable and designed for macOS but remains cross-platform friendly.
- Python 3.10 or newer.
- Python dependencies (install inside a virtual environment if possible):
python3 -m pip install httpx pillow tqdm pyyaml
- External tools:
  - pandoc (required), e.g. brew install pandoc on macOS.
  - ebook-convert from Calibre (optional), needed only for AZW3/MOBI output.
- DashScope API Key – export before running:
export DASHSCOPE_API_KEY="sk-your-key"
If your source is a PDF, convert it to page images first. Install Poppler (provides pdftoppm), e.g. brew install poppler, then run pdftoppm with PNG output, for example (input.pdf and output-prefix are placeholders):
pdftoppm -png input.pdf output-prefix
This creates files such as output-prefix-01.png, output-prefix-02.png, … that you can place in the images directory for the OCR step.
- Place all page images (PNG/JPG) in one directory; natural sorting is handled automatically, but numeric suffixes are recommended.
- Provide a cover image if you want EPUB metadata to include it; the path must exist or pandoc will fail.
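To see why numeric suffixes sort predictably, here is a minimal sketch of natural sorting. It is an illustration only, not necessarily how ocr_md_book.py implements it; natural_key is a hypothetical helper name.

```python
import re


def natural_key(name: str) -> list:
    """Split a file name into text and integer chunks.

    Digit runs become ints, so 'page-10' sorts after 'page-2'
    instead of before it (as plain string sorting would do).
    """
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", name)
    ]


files = ["page-10.png", "page-2.png", "page-1.png"]
print(sorted(files, key=natural_key))
# ['page-1.png', 'page-2.png', 'page-10.png']
```

Zero-padded names such as page-0001.png sort the same way under both plain and natural ordering, which is why they are the safest choice.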
From the project root, run the script with at least --images-dir, for example:
python3 ocr_md_book.py --images-dir book_images
Results land in book_images/_out/:
- pages/page-0001.md, … individual Markdown files
- book.md – merged document
- book.epub – main deliverable (plus book.azw3/book.mobi when Calibre is detected and --to-azw3/--to-mobi are set)
- --images-dir (required): folder containing images.
- --title, --author, --lang: EPUB metadata.
- --max-width: downscale width before upload (never upscale).
- --concurrency: number of concurrent async OCR requests; start with a value between 1 and 4.
- --model: DashScope model name (e.g. qwen3-omni-flash).
- --cover: cover image path for EPUB metadata (must exist).
- --out-name: output file prefix (default book).
- --skip-ocr-existing: skip pages with existing Markdown (resume support).
- --from-list: newline-separated file list to control ordering.
- --pages: process only a subset of pages, e.g. 1-50,120,121-130.
- --dry-run: list pages to process without running OCR.
- --to-azw3, --to-mobi: build Kindle formats if ebook-convert is available.
- --verbose: show detailed logs (default output is concise).
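The --pages syntax accepts single pages and inclusive ranges separated by commas. The sketch below shows one way such a spec could be expanded into page numbers; parse_pages is a hypothetical helper, and the script's actual parser may differ in how it handles duplicates or invalid input.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec like '1-50,120,121-130' into sorted, unique page numbers."""
    pages: set[int] = set()
    for chunk in spec.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate stray commas
        if "-" in chunk:
            start_text, end_text = chunk.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {chunk!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(chunk))
    return sorted(pages)


print(parse_pages("1-3,7,9-10"))
# [1, 2, 3, 7, 9, 10]
```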
- Gather images (or read from --from-list) and sort naturally.
- Auto-rotate with EXIF data, optionally downscale, and forward to DashScope using several payload variants for compatibility.
- Clean the resulting Markdown and write to _out/pages/page-XXXX.md.
- Merge pages into _out/book.md with # 第 N 页 ("Page N") heading separators.
- Build the EPUB via pandoc, and optionally call Calibre to produce AZW3/MOBI.
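Two of the steps above are easy to picture in code: the "never upscale" rule behind --max-width and the merge with per-page headings. The following is a rough sketch under assumptions, not the script's actual code; target_size and merge_pages are hypothetical names, and the exact blank-line layout in book.md may differ.

```python
def target_size(width: int, height: int, max_width: int) -> tuple[int, int]:
    """Return the size to upload: shrink to max_width, but never upscale."""
    if width <= max_width:
        return width, height  # already small enough, keep as-is
    scale = max_width / width
    return max_width, max(1, round(height * scale))  # preserve aspect ratio


def merge_pages(pages: list[str]) -> str:
    """Join cleaned page texts, prefixing each with a '# 第 N 页' heading."""
    sections = [
        f"# 第 {number} 页\n\n{text.strip()}\n"
        for number, text in enumerate(pages, start=1)
    ]
    return "\n".join(sections)


print(target_size(3000, 4000, 1500))  # (1500, 2000)
print(target_size(800, 1000, 1500))   # (800, 1000)
print(merge_pages(["Hello", "World"]))
```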
- Combine --skip-ocr-existing with the default output structure to resume after interruptions.
- Failed pages are logged by index; re-run the command (optionally with --pages) to fill the gaps.
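If you want to know which pages are still missing before re-running, a small check like the one below can list them. It assumes the pages/page-XXXX.md naming shown earlier and treats empty files as missing, which is an assumption of this sketch rather than documented behavior; pending_pages is a hypothetical helper.

```python
from pathlib import Path


def pending_pages(total: int, pages_dir: Path) -> list[int]:
    """Return 1-based page numbers that have no non-empty Markdown file yet."""
    missing = []
    for number in range(1, total + 1):
        page_file = pages_dir / f"page-{number:04d}.md"
        if not page_file.is_file() or page_file.stat().st_size == 0:
            missing.append(number)
    return missing


if __name__ == "__main__":
    # Example: a 200-page book whose output lives in book_images/_out/pages
    todo = pending_pages(200, Path("book_images/_out/pages"))
    # Print in a form that can be pasted after --pages
    print(",".join(str(n) for n in todo))
```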
- HTTP 400 “url error”: ensure the chosen model supports base64 payloads. If it requires public URLs, upload images to accessible HTTPS locations and reference them via --from-list.
- Cover file missing: confirm the path passed to --cover exists or omit the flag.
- Calibre not found: the script logs a warning and skips AZW3/MOBI when ebook-convert is absent.
No specific license is provided. Use internally or personally as needed, and comply with the licenses of DashScope, Calibre, pandoc, and other dependencies.