Extract transcriptions, visual descriptions, and smart summaries from videos. Run 100% locally (Whisper + BLIP + Ollama) or via APIs (Groq + Gemini). Designed for long clips, block-by-block summaries, and a customizable final overview.
- 🎙️ Audio transcription in blocks (FFmpeg + local Whisper or Groq Whisper).
- 🖼️ Visual description of representative frames (local BLIP or Gemini Vision).
- 🧠 Multimodal summarization (combines speech + visuals) with configurable size, language, and persona.
- 🧩 Two execution modes:
  - Local: no API keys required (faster-whisper + BLIP + Ollama).
  - API: Groq (STT + LLM) + Google Gemini (image description).
- 🧱 Block processing (BLOCK_DURATION) with an aggregated final summary (see the sketch after this list).
- 🌐 Accepts local file or URL (via utils/download_url.py).
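As a rough mental model, the per-block flow looks something like the sketch below. It is illustrative only: apart from final_video_summary, which the scripts actually define, the helper names are hypothetical stand-ins for code that lives in api-models/ and local-models/.

```python
# Illustrative sketch of the block pipeline. The helpers are passed in as
# callables because their real implementations live in the mode-specific scripts;
# only final_video_summary is a name used by the actual code.
def summarize_video(blocks, transcribe, describe, summarize_block, final_video_summary):
    results = []
    for block in blocks:                              # one block = BLOCK_DURATION seconds
        transcription = transcribe(block["audio"])    # Whisper (local or Groq)
        frame_description = describe(block["frame"])  # BLIP or Gemini Vision
        results.append({
            "start_time": block["start_time"],
            "end_time": block["end_time"],
            "transcription": transcription,
            "frame_description": frame_description,
            "audio_summary": summarize_block(transcription, frame_description),
        })
    return results, final_video_summary(results)      # per-block results + aggregated overview
```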
Note: folder/file names may vary in your copy. Keep the api-models/, local-models/, and utils/ names used throughout this guide to follow it verbatim.
- Python 3.10+
- FFmpeg (required to extract audio)
- OpenCV, Pillow
- Windows (winget): winget install Gyan.FFmpeg
- Windows (choco): choco install ffmpeg
- macOS (brew): brew install ffmpeg
- Ubuntu/Debian: sudo apt update && sudo apt install -y ffmpeg
Verify: ffmpeg -version
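Optionally, you can also confirm from Python that FFmpeg is visible on the same PATH the pipeline will use (a convenience check, not part of the project):

```python
# Optional sanity check: confirm Python can find FFmpeg on PATH.
import shutil
import subprocess

assert shutil.which("ffmpeg"), "FFmpeg not found on PATH"
version_line = subprocess.run(
    ["ffmpeg", "-version"], capture_output=True, text=True
).stdout.splitlines()[0]
print(version_line)
```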
Install the dependencies specific to API mode (at minimum, the Groq and Google Gemini client libraries).
Create a .env file at the project root with your API keys:
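A minimal sketch of the expected keys, assuming the scripts read them with python-dotenv; the variable names GROQ_API_KEY and GEMINI_API_KEY are assumptions, so match whatever api-models/main.py actually uses:

```python
# .env (example; key names are assumptions):
#   GROQ_API_KEY=your_groq_key
#   GEMINI_API_KEY=your_gemini_key
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory (the repo root)
groq_key = os.getenv("GROQ_API_KEY")
gemini_key = os.getenv("GEMINI_API_KEY")
```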
Set the video:
- Edit VIDEO_PATH in api-models/main.py to a local file or a URL (YouTube/Instagram/etc.).
If it’s a URL, the script downloads it automatically via utils/download_url.py.
(Optional) Tune parameters at the end of api-models/main.py:
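For orientation, the tunable block might look roughly like this; the values and most names are illustrative (only VIDEO_PATH, BLOCK_DURATION, and SIZE_TO_TOKENS appear elsewhere in this guide):

```python
# Illustrative values; check the actual names at the end of api-models/main.py.
VIDEO_PATH = "videos/talk.mp4"   # local file or a YouTube/Instagram URL
BLOCK_DURATION = 60              # seconds per block: shorter = finer captions, longer = faster runs

# Hypothetical mapping from summary size to an LLM token budget.
SIZE_TO_TOKENS = {"small": 150, "medium": 300, "large": 600}

SUMMARY_LANGUAGE = "en"                  # language of the summaries
SUMMARY_SIZE = "large"                   # one of SIZE_TO_TOKENS' keys
PERSONA = "concise technical analyst"    # voice used in the prompts
```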
Run the script from the repo root, e.g. python api-models/main.py.
Important: run from the repo root so from utils.download_url import download resolves correctly.
Install the dependencies specific to local mode (faster-whisper, BLIP, and Ollama support).
Prepare Ollama and an LLM: install Ollama, pull the model you want to use (for example, ollama pull llama3), and make sure the Ollama service is running.
Set the video:
- Edit VIDEO_PATH in local-models/main.py to a local file or a URL.
If it’s a URL, the script downloads it automatically via utils/download_url.py.
(Optional) Tune parameters at the end of local-models/main.py (the same parameters shown for API mode above).
(Optional) Enable GPU for Whisper inside initialize_models():
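Per the GPU tip later in this guide, the change would look something like this (the "base" model size is just an example):

```python
from faster_whisper import WhisperModel

# CPU variant (typical default):
# whisper_model = WhisperModel("base", device="cpu", compute_type="int8")

# GPU variant; requires a CUDA-capable card plus matching CUDA/cuDNN libraries:
whisper_model = WhisperModel("base", device="cuda", compute_type="float16")
```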
Run the script from the repo root, e.g. python local-models/main.py.
Never commit your .env to Git.
- Downloads videos from URLs (YouTube, Instagram, etc.) using yt-dlp.
- Saves to downloads/ and returns the local path to feed the pipeline.
- If you need guaranteed MP4 with AAC audio, adjust yt-dlp/ffmpeg options there.
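For reference, a minimal yt-dlp based helper could look like the sketch below; only the function name download is taken from the import the scripts use, and the real utils/download_url.py may choose different options:

```python
# Minimal sketch of a yt-dlp downloader; the real helper may differ.
from pathlib import Path

import yt_dlp

def download(url: str, out_dir: str = "downloads") -> str:
    """Download a video and return the local file path."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    opts = {
        "outtmpl": f"{out_dir}/%(title)s.%(ext)s",
        "format": "best",  # a single pre-merged file keeps the returned path accurate
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)
```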
For each block:
- start_time, end_time
- transcription (speech for the segment)
- frame_description (visual description of the frame)
- audio_summary (multimodal summary for the block)
Final:
- Final video summary (aggregates all blocks).
Output is currently printed to the terminal; it's straightforward to extend to JSON, SRT, or Markdown exports.
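A hypothetical JSON export, for example, only needs the per-block dicts and the final summary:

```python
# Hypothetical extension: write per-block results and the final summary to JSON
# instead of (or in addition to) printing them.
import json

def export_json(block_results, final_summary, path="summary.json"):
    payload = {"blocks": block_results, "final_summary": final_summary}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)
```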
- Function signatures & param order: ensure calls to final_video_summary(...) match the function signature (API and Local).
- Image MIME for Gemini: if you saved PNG, pass mime_type='image/png'.
- Audio in Opus (Windows): if needed, re-encode to AAC with FFmpeg:
ffmpeg -i input.ext -c:v libx264 -c:a aac -movflags +faststart output.mp4
- ModuleNotFoundError: No module named 'utils': run scripts from the repo root and ensure utils/__init__.py exists.
- GPU recommended: WhisperModel(..., device="cuda", compute_type="float16").
- Adjust BLOCK_DURATION (shorter = finer captions; longer = faster processing).
- Tune SIZE_TO_TOKENS according to your LLM.
- For longer videos, cache per-block results so an interrupted run can safely resume (see the sketch below).
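A simple way to do that, assuming each block result is a JSON-serializable dict, is to cache one file per block index (all names here are hypothetical):

```python
# Hypothetical per-block cache so an interrupted run can resume where it stopped.
import json
import os

CACHE_DIR = "cache"

def process_with_cache(blocks, process_block):
    os.makedirs(CACHE_DIR, exist_ok=True)
    results = []
    for i, block in enumerate(blocks):
        cache_file = os.path.join(CACHE_DIR, f"block_{i:04d}.json")
        if os.path.exists(cache_file):
            with open(cache_file, encoding="utf-8") as f:
                results.append(json.load(f))
            continue  # already processed in a previous run
        result = process_block(block)  # transcribe + describe + summarize
        with open(cache_file, "w", encoding="utf-8") as f:
            json.dump(result, f, ensure_ascii=False)
        results.append(result)
    return results
```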
- Export JSON/SRT/Markdown (per block and final).
- CLI: klipmind --video <path|url> --mode api|local --lang en --size large ...
- Web UI (FastAPI/Streamlit) with upload/URL and progress bar.
- Multi-frame sampling per block.
- Model selection (Whisper tiny/base/…; BLIP variants; different LLMs).
- Unit tests for utils/download_url.py and parsers.
Contributions are welcome!
Open an issue with suggestions/bugs or submit a PR explaining the change.
MIT
- Whisper (faster-whisper), BLIP (Salesforce), Ollama (local models)
- Groq (STT + Chat Completions-compatible LLM)
- Gemini 2.0 Flash-Lite for vision (frame description)
- FFmpeg, OpenCV, Pillow