A developer utility that enables semantic search through Git commit history. This tool extracts commit history, generates vector embeddings, and allows you to search through your repository's history using natural language queries.
This script scans your Git repository's commit history, extracts metadata (author, date, commit message), generates vector embeddings using configurable models (OpenAI or Hugging Face), and stores them in a Chroma vector database. You can then perform semantic searches to find relevant commits based on your queries.
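The extraction step described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tool's actual code: the `Commit` dataclass and the `parse_git_log` and `read_commits` helpers are hypothetical names, and reading history through `git log` with control-character separators is just one plausible approach. Embedding generation and storage in Chroma are left out.

```python
import subprocess
from dataclasses import dataclass

# %x1f / %x1e make git emit ASCII unit/record separator bytes, which are
# very unlikely to appear inside a commit message.
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"
GIT_FORMAT = "%H%x1f%an%x1f%aI%x1f%B%x1e"  # sha, author, ISO date, full message


@dataclass
class Commit:  # hypothetical container for the metadata named above
    sha: str
    author: str
    date: str
    message: str


def parse_git_log(raw: str) -> list[Commit]:
    """Split raw `git log` output (in GIT_FORMAT) into Commit records."""
    commits = []
    for record in raw.split(RECORD_SEP):
        # Strip only newlines: Python treats \x1f as whitespace, so a bare
        # strip() could eat a trailing field separator.
        record = record.strip("\n")
        if not record:
            continue
        sha, author, date, message = record.split(FIELD_SEP, 3)
        commits.append(Commit(sha, author, date, message.strip()))
    return commits


def read_commits(repo_path: str) -> list[Commit]:
    """Run `git log` in repo_path and parse its output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{GIT_FORMAT}"],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if this is not a Git repository
    )
    return parse_git_log(result.stdout)
```

Each `Commit.message` would then be the text passed to the embedding model, with the remaining fields kept as metadata.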
Current Features:
- ✅ Git commit message search (implemented)
- 🔄 File diff search (planned)
- 🔄 Full commit content search (planned)
- Clone this repository
- Install dependencies:
We recommend using OpenAI models for the best search experience. Here's the recommended workflow:
Why OpenAI? OpenAI's embedding models tend to capture the meaning of commit messages better than the default Hugging Face model, and the GPT models can summarize the results, which makes your search results more relevant and easier to understand.
Extracts commits from a Git repository, generates embeddings, and stores them in a vector database.
Options:
- --provider / -p: Embedding provider (hf or openai, default: hf)
- --model / -m: Embedding model name (default: BAAI/bge-small-en-v1.5)
Examples:
Performs semantic search on the latest embeddings and returns relevant commits.
Options:
Note: The provider and model must match the ones used when the embeddings were prepared.
- --provider / -p: Embedding provider for search (hf or openai, default: hf)
- --model / -m: Embedding model for search (default: BAAI/bge-small-en-v1.5)
- --summarize / -s: Use LLM to summarize results (default: False)
- --llm-provider / -lp: LLM provider for summarization (default: openai)
- --llm-model / -lm: LLM model for summarization (default: gpt-4.1-nano)
- --limit / -k: Maximum number of results to return (default: 5)
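To make the flags and defaults listed above concrete, here is a small `argparse` sketch that mirrors them. It is only an illustration: the tool may well use a different CLI library, and the `build_search_parser` name, the `search` program name, and the positional `query` argument are assumptions rather than the tool's real interface.

```python
import argparse


def build_search_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented search options (hypothetical)."""
    parser = argparse.ArgumentParser(prog="search")
    # Assumption: the query is passed as a single positional argument.
    parser.add_argument("query", help="natural-language search query")
    parser.add_argument("-p", "--provider", choices=["hf", "openai"], default="hf",
                        help="embedding provider for search")
    parser.add_argument("-m", "--model", default="BAAI/bge-small-en-v1.5",
                        help="embedding model for search")
    parser.add_argument("-s", "--summarize", action="store_true",
                        help="use an LLM to summarize results")
    parser.add_argument("-lp", "--llm-provider", default="openai",
                        help="LLM provider for summarization")
    parser.add_argument("-lm", "--llm-model", default="gpt-4.1-nano",
                        help="LLM model for summarization")
    parser.add_argument("-k", "--limit", type=int, default=5,
                        help="maximum number of results to return")
    return parser


if __name__ == "__main__":
    args = build_search_parser().parse_args()
    print(args)
```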
Examples:
Lists all available embeddings with their metadata, sorted by creation date (newest first).
Output includes:
- Branch name
- Creation timestamp
- Embedding provider and model
- Document count
- Chroma database directory
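The newest-first ordering can be illustrated with a short sketch. The `EmbeddingRecord` dataclass and its field names are hypothetical stand-ins for the metadata fields listed above; how the tool actually stores this metadata is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class EmbeddingRecord:  # hypothetical shape of one listed entry
    branch: str
    created_at: datetime
    provider: str
    model: str
    document_count: int
    chroma_dir: str


def newest_first(records: list[EmbeddingRecord]) -> list[EmbeddingRecord]:
    """Return records sorted by creation timestamp, most recent first."""
    return sorted(records, key=lambda r: r.created_at, reverse=True)


def latest(records: list[EmbeddingRecord]) -> Optional[EmbeddingRecord]:
    """Return the most recently created record, or None if there are none."""
    ordered = newest_first(records)
    return ordered[0] if ordered else None
```

A helper like `latest` is also one way a search command could pick the most recent embeddings.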
Removes all embedding files and vector database directories created by this tool.
The tool creates the following structure in a .tmp directory:
- Hugging Face: BAAI/bge-small-en-v1.5 (default)
- OpenAI: text-embedding-3-small
- OpenAI: gpt-4.1-nano, gpt-4, gpt-3.5-turbo
- Model Consistency: When searching, you must use the same embedding provider and model that were used during preparation.
- Query Length: Search queries are limited to 200 characters.
- Repository Requirements: The target directory must be a valid Git repository (it must contain a .git folder).
- Storage: Embeddings and vector databases are stored locally and persist until explicitly cleaned up.
- Latest Search: The search command always uses the most recent embeddings file.
- Model Interoperability: Search queries must use the same embedding model that was used during preparation.
- File Management: The tool doesn't automatically clean up old embeddings; use the cleanup command when needed.
- Current Scope: Only commit message search is implemented; file diff and full content search are planned features.
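Two of the constraints above, the 200-character query limit and the requirement that search use the same provider and model as preparation, lend themselves to an up-front check. The sketch below is an assumption about how such a check might look; `validate_search` and its parameters are hypothetical names, not part of the tool.

```python
MAX_QUERY_LENGTH = 200  # limit stated in the notes above


def validate_search(
    query: str,
    provider: str,
    model: str,
    stored_provider: str,
    stored_model: str,
) -> None:
    """Raise ValueError if the query or model choice violates the notes above."""
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(
            f"Query is {len(query)} characters; the limit is {MAX_QUERY_LENGTH}."
        )
    # Vectors from different embedding models are not comparable, so the
    # query must be embedded with the model that produced the stored vectors.
    if (provider, model) != (stored_provider, stored_model):
        raise ValueError(
            f"Embeddings were prepared with {stored_provider}/{stored_model}, "
            f"but search was requested with {provider}/{model}."
        )
```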
Required for OpenAI models: You must set the following environment variable to use OpenAI embedding and LLM models:
Note: This environment variable is mandatory when using -p openai or -lp openai options. The tool will fail if this is not set when attempting to use OpenAI services.
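A fail-fast check for this requirement might look like the sketch below. The variable name is an assumption: `OPENAI_API_KEY` is the name the official OpenAI Python SDK reads by default, but confirm it against the tool's own setup instructions. `ensure_openai_key` is a hypothetical helper.

```python
import os
from typing import Iterable, Mapping, Optional


def ensure_openai_key(
    providers_in_use: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
    var_name: str = "OPENAI_API_KEY",  # assumed name; see the note above
) -> None:
    """Raise RuntimeError if an OpenAI provider is selected without an API key."""
    env = os.environ if env is None else env
    if "openai" in set(providers_in_use) and not env.get(var_name):
        raise RuntimeError(
            f"{var_name} must be set when using -p openai or -lp openai."
        )
```

Calling this before any embedding or summarization request gives a clear error message instead of a failure deep inside an API call.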
- Feature Tracking: "When was the user authentication feature added?"
- Bug Investigation: "When did we fix the login bug?"
- Refactoring History: "When did we refactor the database layer?"
- Dependency Updates: "When did we update React to version 18?"

