A Retrieval-Augmented Generation (RAG) system that analyzes Python repositories, builds a knowledge graph of their structure, and answers natural language questions about the codebase and its relationships.
- AST-based Code Analysis: Deep parsing of Python files to extract classes, functions, methods, and their relationships
- Knowledge Graph Storage: Uses Memgraph to store codebase structure as an interconnected graph
- Natural Language Querying: Ask questions about your codebase in plain English
- AI-Powered Cypher Generation: Leverages Google Gemini to translate natural language to Cypher queries
- Code Snippet Retrieval: Retrieves actual source code snippets for found functions/methods
- Dependency Analysis: Parses pyproject.toml to understand external dependencies
The system consists of two main components:
- Repository Parser (repo_parser.py): Analyzes Python codebases and ingests data into Memgraph
- RAG System (codebase_rag/): Interactive CLI for querying the stored knowledge graph
- Graph Database: Memgraph for storing code structure as nodes and relationships
- LLM Integration: Google Gemini for natural language processing
- Code Analysis: AST traversal for extracting code elements
- Query Tools: Specialized tools for graph querying and code retrieval
- Python 3.12+
- Docker & Docker Compose (for Memgraph)
- Google Gemini API key
- uv package manager
- Clone the repository:
git clone <repository-url>
cd graph-code
- Install dependencies:
- Set up environment variables:
cp .env.example .env
# Edit .env with your configuration
Required environment variables:
GEMINI_API_KEY=your-api-key
GEMINI_MODEL_ID=gemini-model-handle
MEMGRAPH_HOST=localhost
MEMGRAPH_PORT=7687
- Start Memgraph database:
Parse and ingest a Python repository into the knowledge graph:
python repo_parser.py /path/to/your/python/repo --clean
Options:
- --clean: Clear existing data before parsing
- --host: Memgraph host (default: localhost)
- --port: Memgraph port (default: 7687)
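The options above map naturally onto an `argparse` interface. The following is a sketch of what such a command line could look like, not a copy of repo_parser.py; the positional argument name `repo_path` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented repo_parser.py options."""
    parser = argparse.ArgumentParser(prog="repo_parser.py")
    # `repo_path` is a hypothetical name for the positional repository argument.
    parser.add_argument("repo_path", help="Path to the Python repository to parse")
    parser.add_argument("--clean", action="store_true",
                        help="Clear existing data before parsing")
    parser.add_argument("--host", default="localhost", help="Memgraph host")
    parser.add_argument("--port", type=int, default=7687, help="Memgraph port")
    return parser


args = build_parser().parse_args(["/path/to/your/python/repo", "--clean"])
print(args.repo_path, args.clean, args.host, args.port)
```

Omitting --host and --port falls back to the documented defaults.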
Start the interactive RAG CLI:
python -m codebase_rag.main --repo-path /path/to/your/repo
Example queries:
- "Show me all classes that contain 'user' in their name"
- "Find functions related to database operations"
- "What methods does the User class have?"
- "Show me functions that handle authentication"
The knowledge graph uses the following node types and relationships:
- Project: Root node representing the entire repository
- Package: Python packages (directories with __init__.py)
- Module: Individual Python files
- Class: Class definitions
- Function: Module-level functions
- Method: Class methods
- Folder: Regular directories
- File: Non-Python files
- ExternalPackage: External dependencies
Relationships:
- CONTAINS_PACKAGE, CONTAINS_MODULE, CONTAINS_FILE, CONTAINS_FOLDER: Hierarchical containment
- DEFINES: Module defines classes/functions
- DEFINES_METHOD: Class defines methods
- DEPENDS_ON_EXTERNAL: Project depends on external packages
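With this schema, a question such as "What methods does the User class have?" could translate into a Cypher pattern over Class, DEFINES_METHOD, and Method. The sketch below builds such a query and shows one way to run it with pymgclient. It assumes nodes carry a `name` property, which this README does not confirm, and the helper names are hypothetical.

```python
def methods_of_class_query(class_name: str) -> tuple[str, dict[str, str]]:
    """Build a parameterized Cypher query listing the methods of a class."""
    # Assumption: Class and Method nodes expose a `name` property.
    query = (
        "MATCH (c:Class {name: $class_name})-[:DEFINES_METHOD]->(m:Method) "
        "RETURN m.name AS method ORDER BY method"
    )
    return query, {"class_name": class_name}


def run_query(query: str, params: dict, host: str = "localhost", port: int = 7687):
    """Execute a query against a running Memgraph instance and return all rows."""
    import mgclient  # installed by the pymgclient package

    connection = mgclient.connect(host=host, port=port)
    try:
        cursor = connection.cursor()
        cursor.execute(query, params)
        return cursor.fetchall()
    finally:
        connection.close()


query, params = methods_of_class_query("User")
print(query)
print(params)
```

Passing the class name as a query parameter rather than formatting it into the string avoids quoting problems when names contain special characters. Calling `run_query(query, params)` requires Memgraph to be running and the repository to have been ingested.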
Configuration is managed through environment variables and the config.py file:
MEMGRAPH_HOST = "localhost"
MEMGRAPH_PORT = 7687
GEMINI_MODEL_ID = "gemini-2.5-pro-preview-06-05"
TARGET_REPO_PATH = "."
GEMINI_API_KEY = "required"
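A minimal sketch of how these settings could be loaded from the environment with the defaults listed above. The `Settings` class and `load_settings` function are hypothetical and do not reproduce config.py; the sketch also assumes that "required" means GEMINI_API_KEY has no default and must be provided.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    memgraph_host: str
    memgraph_port: int
    gemini_model_id: str
    target_repo_path: str
    gemini_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from `env`, falling back to the documented defaults."""
    api_key = env.get("GEMINI_API_KEY", "")
    if not api_key:
        # Assumption: the key has no usable default and must be set explicitly.
        raise ValueError("GEMINI_API_KEY must be set")
    return Settings(
        memgraph_host=env.get("MEMGRAPH_HOST", "localhost"),
        memgraph_port=int(env.get("MEMGRAPH_PORT", "7687")),
        gemini_model_id=env.get("GEMINI_MODEL_ID", "gemini-2.5-pro-preview-06-05"),
        target_repo_path=env.get("TARGET_REPO_PATH", "."),
        gemini_api_key=api_key,
    )


settings = load_settings({"GEMINI_API_KEY": "your-api-key", "MEMGRAPH_PORT": "7688"})
print(settings.memgraph_host, settings.memgraph_port)
```

Passing the environment in as a mapping keeps the loader easy to test without touching the real process environment.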
graph-code/
├── repo_parser.py              # Repository analysis and ingestion
├── codebase_rag/               # RAG system package
│   ├── main.py                 # CLI entry point
│   ├── config.py               # Configuration management
│   ├── prompts.py              # LLM prompts and schemas
│   ├── schemas.py              # Pydantic models
│   ├── services/               # Core services
│   │   ├── graph_db.py         # Memgraph integration
│   │   └── llm.py              # Gemini LLM integration
│   └── tools/                  # RAG tools
│       ├── codebase_query.py   # Graph querying tool
│       └── code_retrieval.py   # Code snippet retrieval
├── docker-compose.yaml         # Memgraph setup
└── pyproject.toml              # Project dependencies
- pydantic-ai: AI agent framework
- pymgclient: Memgraph Python client
- loguru: Advanced logging
- python-dotenv: Environment variable management
- Check Memgraph connection:
  - Ensure Docker containers are running: docker-compose ps
  - Verify Memgraph is accessible on port 7687
- View database in Memgraph Lab:
  - Open http://localhost:3000
  - Connect to memgraph:7687
- Enable debug logging:
  - The RAG orchestrator runs in debug mode by default
  - Check logs for detailed execution traces
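To verify that Memgraph is accessible on port 7687, a quick TCP check from Python can help separate networking problems from application problems. This sketch only confirms that something accepts connections on the port; it does not speak the Bolt protocol or confirm that the listener is Memgraph. The function name `memgraph_reachable` is hypothetical.

```python
import socket


def memgraph_reachable(host: str = "localhost", port: int = 7687,
                       timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `memgraph_reachable()` with no arguments checks the default localhost:7687. If it returns False while `docker-compose ps` shows the container as running, the port mapping in docker-compose.yaml is a reasonable next thing to check.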
- Follow the established code structure
- Keep files under 100 lines (as per user rules)
- Use type annotations
- Follow conventional commit messages
- Use DRY principles
For issues or questions:
- Check the logs for error details
- Verify Memgraph connection
- Ensure all environment variables are set
- Review the graph schema matches your expectations