Monorepos solved: graph-based search


A sophisticated Retrieval-Augmented Generation (RAG) system that analyzes Python repositories, builds knowledge graphs, and enables natural language querying of codebase structure and relationships.

AST-based Code Analysis: Deep parsing of Python files to extract classes, functions, methods, and their relationships
Knowledge Graph Storage: Uses Memgraph to store codebase structure as an interconnected graph
Natural Language Querying: Ask questions about your codebase in plain English
AI-Powered Cypher Generation: Leverages Google Gemini to translate natural language to Cypher queries
Code Snippet Retrieval: Retrieves actual source code snippets for found functions/methods
Dependency Analysis: Parses pyproject.toml to understand external dependencies

The system consists of two main components:

Repository Parser (repo_parser.py): Analyzes Python codebases and ingests data into Memgraph
RAG System (codebase_rag/): Interactive CLI for querying the stored knowledge graph

Graph Database: Memgraph for storing code structure as nodes and relationships
LLM Integration: Google Gemini for natural language processing
Code Analysis: AST traversal for extracting code elements
Query Tools: Specialized tools for graph querying and code retrieval

Python 3.12+
Docker & Docker Compose (for Memgraph)
Google Gemini API key
uv package manager

Clone the repository:

git clone <repository-url>
cd graph-code

Install dependencies:

Set up environment variables:

cp .env.example .env
# Edit .env with your configuration

Required environment variables:

GEMINI_API_KEY=your-api-key
GEMINI_MODEL_ID=gemini-model-handle
MEMGRAPH_HOST=localhost
MEMGRAPH_PORT=7687

Start Memgraph database:

Step 1: Parse a Repository

Parse and ingest a Python repository into the knowledge graph:

python repo_parser.py /path/to/your/python/repo --clean

Options:

--clean: Clear existing data before parsing
--host: Memgraph host (default: localhost)
--port: Memgraph port (default: 7687)
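
The options above map naturally onto `argparse`. The following is a sketch of how such a command line could be declared, not a copy of repo_parser.py; the positional argument name `repo_path` and the function `build_parser` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options; names other than the flags are assumed."""
    parser = argparse.ArgumentParser(prog="repo_parser.py")
    parser.add_argument("repo_path", help="Path to the Python repository to parse")
    parser.add_argument("--clean", action="store_true",
                        help="Clear existing data before parsing")
    parser.add_argument("--host", default="localhost", help="Memgraph host")
    parser.add_argument("--port", type=int, default=7687, help="Memgraph port")
    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(["/path/to/your/python/repo", "--clean"])
print(args.repo_path, args.clean, args.host, args.port)
```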

Step 2: Query the Codebase

Start the interactive RAG CLI:

python -m codebase_rag.main --repo-path /path/to/your/repo

Example queries:

"Show me all classes that contain 'user' in their name"
"Find functions related to database operations"
"What methods does the User class have?"
"Show me functions that handle authentication"

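For a sense of what the LLM is expected to produce, the sketch below hand-writes Cypher for two of the example questions. The queries are illustrative only: the `name` property and the direction of `DEFINES_METHOD` are assumptions about the schema, and the helper names are hypothetical. Values are passed as query parameters rather than interpolated into the query text.

```python
def classes_matching(fragment: str) -> tuple[str, dict[str, str]]:
    """Cypher for: "Show me all classes that contain 'user' in their name"."""
    query = (
        "MATCH (c:Class) "
        "WHERE toLower(c.name) CONTAINS toLower($fragment) "
        "RETURN c.name AS name"
    )
    return query, {"fragment": fragment}


def methods_of_class(class_name: str) -> tuple[str, dict[str, str]]:
    """Cypher for: "What methods does the User class have?"."""
    query = (
        "MATCH (c:Class {name: $class_name})-[:DEFINES_METHOD]->(m:Method) "
        "RETURN m.name AS name"
    )
    return query, {"class_name": class_name}


for build, argument in ((classes_matching, "user"), (methods_of_class, "User")):
    cypher, params = build(argument)
    print(cypher, params)
```
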
The knowledge graph uses the following node types and relationships:

Project: Root node representing the entire repository
Package: Python packages (directories with __init__.py)
Module: Individual Python files
Class: Class definitions
Function: Module-level functions
Method: Class methods
Folder: Regular directories
File: Non-Python files
ExternalPackage: External dependencies

CONTAINS_PACKAGE/MODULE/FILE/FOLDER: Hierarchical containment
DEFINES: Module defines classes/functions
DEFINES_METHOD: Class defines methods
DEPENDS_ON_EXTERNAL: Project depends on external packages
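
The node and relationship types listed above can be written down as a small lookup table, which is handy for validating edges before they are sent to the database. This is a sketch under assumptions: the endpoints for DEFINES, DEFINES_METHOD and DEPENDS_ON_EXTERNAL follow the descriptions above, while the allowed sources for the CONTAINS_* relationships (Project, Package, Folder) are a guess.

```python
NODE_LABELS = {
    "Project", "Package", "Module", "Class", "Function",
    "Method", "Folder", "File", "ExternalPackage",
}

# relationship type -> (allowed source labels, allowed target labels)
RELATIONSHIPS = {
    "DEFINES": ({"Module"}, {"Class", "Function"}),
    "DEFINES_METHOD": ({"Class"}, {"Method"}),
    "DEPENDS_ON_EXTERNAL": ({"Project"}, {"ExternalPackage"}),
}

# Assumption: any container-like node may hold packages, modules, files, folders.
CONTAINERS = {"Project", "Package", "Folder"}
for target in ("Package", "Module", "File", "Folder"):
    RELATIONSHIPS[f"CONTAINS_{target.upper()}"] = (CONTAINERS, {target})


def is_valid_edge(source: str, relationship: str, target: str) -> bool:
    """Check that an edge uses a known relationship with allowed endpoints."""
    if relationship not in RELATIONSHIPS:
        return False
    sources, targets = RELATIONSHIPS[relationship]
    return source in sources and target in targets


print(is_valid_edge("Class", "DEFINES_METHOD", "Method"))
print(is_valid_edge("Function", "DEFINES_METHOD", "Method"))
```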

Configuration is managed through environment variables and the config.py file:

MEMGRAPH_HOST = "localhost"
MEMGRAPH_PORT = 7687
GEMINI_MODEL_ID = "gemini-2.5-pro-preview-06-05"
TARGET_REPO_PATH = "."
GEMINI_API_KEY = "required"

graph-code/
├── repo_parser.py              # Repository analysis and ingestion
├── codebase_rag/               # RAG system package
│   ├── main.py                 # CLI entry point
│   ├── config.py               # Configuration management
│   ├── prompts.py              # LLM prompts and schemas
│   ├── schemas.py              # Pydantic models
│   ├── services/               # Core services
│   │   ├── graph_db.py         # Memgraph integration
│   │   └── llm.py              # Gemini LLM integration
│   └── tools/                  # RAG tools
│       ├── codebase_query.py   # Graph querying tool
│       └── code_retrieval.py   # Code snippet retrieval
├── docker-compose.yaml         # Memgraph setup
└── pyproject.toml              # Project dependencies

pydantic-ai: AI agent framework
pymgclient: Memgraph Python client
loguru: Advanced logging
python-dotenv: Environment variable management

Check Memgraph connection:
- Ensure Docker containers are running: docker-compose ps
- Verify Memgraph is accessible on port 7687
View database in Memgraph Lab:
- Open http://localhost:3000
- Connect to memgraph:7687
Enable debug logging:
- The RAG orchestrator runs in debug mode by default
- Check logs for detailed execution traces
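
To verify that Memgraph is accessible on port 7687 without any extra tooling, a plain TCP connection attempt is usually enough. The helper below is a minimal sketch (the name `port_is_open` is an assumption); it only confirms that something is listening on the port, not that it speaks the Bolt protocol or accepts queries.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Host and port are the defaults documented above.
    print("Memgraph reachable:", port_is_open("localhost", 7687))
```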