Show HN: Hybrid Knowledge Graph and RAG for Legal Documents (Learning Project)


A hybrid system combining Knowledge Graphs and Retrieval-Augmented Generation (RAG) for intelligent querying of the Indian Income Tax Act.

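Traditional RAG systems struggle with legal documents because they miss the interconnected nature of legal provisions. This system addresses that by pairing content retrieval with a graph of section relationships, so it can answer questions such as: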

"What sections reference Section 80C?"
"Show me all exemptions available for senior citizens"
"What penalties apply if I violate Section 44AD?"
"How does Section 10 relate to Section 80?"

Knowledge Graph Excels Because Tax Law Has:

Sections that reference other sections
Definitions used across multiple places
Conditions and thresholds (income slabs, age limits)
Exemptions with eligibility criteria

    ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
    │  Tax Act Text   │───▶│  Parser Module   │───▶│ Knowledge Graph │
    └─────────────────┘    └──────────────────┘    └─────────────────┘
                                    │                       │
                                    ▼                       │
                           ┌──────────────────┐             │
                           │   Vector Store   │             │
                           │      (RAG)       │             │
                           └──────────────────┘             │
                                    │                       │
                                    ▼                       ▼
                            ┌──────────────────────────────────────┐
                            │         Hybrid Query System          │
                            │  • KG Queries (relationships)        │
                            │  • RAG Queries (content)             │
                            │  • Hybrid Queries (both)             │
                            └──────────────────────────────────────┘

Python 3.8+
Neo4j Database (local or cloud)
OpenAI API Key (optional, for enhanced responses)

Option 1: Using Makefile (Recommended)

    git clone <repository>
    cd ita-kg

    # Setup virtual environment and install dependencies
    make setup

    # Configure environment
    cp .env.example .env
    # Edit .env with your credentials

    # Run with virtual environment
    make run

    # Or run demo
    make demo

Option 2: Using Docker

    git clone <repository>
    cd ita-kg

    # Configure environment
    cp .env.example .env
    # Edit .env with your credentials

    # Start services (Neo4j + app)
    make docker-up

    # View logs
    make docker-logs

Option 3: Manual Setup

    git clone <repository>
    cd ita-kg
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

    # Setup Neo4j manually
    docker run --name neo4j -p7474:7474 -p7687:7687 -d \
        -e NEO4J_AUTH=neo4j/your_password \
        neo4j:latest

    python main.py

    make help         # Show all available commands
    make setup        # Create venv and install dependencies
    make run          # Run main.py using virtual environment
    make demo         # Run demo.py using virtual environment
    make docker-up    # Start Docker services
    make docker-down  # Stop Docker services
    make clean        # Remove venv and Docker volumes

Knowledge Graph Queries (Relationships)

    # What sections reference Section 80C?
    response = query_system.query("What sections reference Section 80C?")
    # Returns: Sections that reference Section 80C:
    # • Section 80TTB: Deduction in respect of interest on deposits...

    # Find related sections
    response = query_system.query("What sections are related to Section 139?")

Hybrid Queries (Structure + Content)

    # Eligibility questions
    response = query_system.query("What exemptions are available for senior citizens?")
    # Combines: KG to find exemption sections + RAG for senior citizen content

    # Category queries
    response = query_system.query("What deductions are available?")
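To make the "KG to find exemption sections + RAG for content" idea concrete, here is a minimal sketch, not the project's actual code. The function name `hybrid_search`, the dictionary layout, and the section ids `X1` to `X3` are hypothetical, and plain keyword overlap stands in for real vector similarity.

```python
def hybrid_search(sections, section_type, keywords):
    """Filter by section type (the KG-style step), then rank by keyword overlap (a stand-in for RAG)."""
    # Structural step: keep only sections of the requested category.
    candidates = [s for s in sections if s["type"] == section_type]

    scored = []
    for section in candidates:
        words = set(section["content"].lower().split())
        # Content step: count how many query keywords appear in the text.
        score = sum(1 for keyword in keywords if keyword in words)
        if score:
            scored.append((score, section["id"]))

    # Highest score first; ties broken by section id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [section_id for _, section_id in scored]


# Made-up placeholder data, not real statutory text.
sections = [
    {"id": "X1", "type": "exemption", "content": "relief for senior citizens on interest income"},
    {"id": "X2", "type": "exemption", "content": "relief for agricultural income"},
    {"id": "X3", "type": "penalty", "content": "penalty applicable to senior officers"},
]

result = hybrid_search(sections, "exemption", ["senior", "citizens"])
print(result)
```

Filtering on structure first keeps the content search from surfacing sections of the wrong category, such as `X3` here, even when they share words with the question.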

RAG Queries (Content-Based)

    # Detailed explanations
    response = query_system.query("Explain Section 44AD for presumptive taxation")

    # Specific information
    response = query_system.query("What is the penalty for not filing returns?")

    ita-kg/
    ├── tax_parser.py              # Income Tax Act text parser
    ├── knowledge_graph.py         # Neo4j Knowledge Graph builder
    ├── hybrid_query_system.py     # Query routing and processing
    ├── main.py                    # Interactive system
    ├── demo.py                    # Capabilities demonstration
    ├── sample_income_tax_act.txt  # Sample tax act data
    ├── requirements.txt           # Python dependencies
    ├── .env.example               # Environment template
    └── README.md                  # This file

1. Tax Parser (tax_parser.py)

Extracts sections, titles, and content
Identifies cross-references between sections
Classifies section types (exemption, deduction, penalty)
Extracts key concepts and definitions
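Cross-reference identification can be approximated with a regular expression. The sketch below is an assumption about how such a step might look, not the contents of tax_parser.py; `extract_references` and the `SECTION_REF` pattern are hypothetical names.

```python
import re

# Hypothetical pattern: "Section" followed by digits and optional uppercase
# letters, e.g. "Section 10", "Section 80C", "Section 80TTB".
SECTION_REF = re.compile(r"\bSection\s+(\d+[A-Z]*)\b")


def extract_references(section_id, text):
    """Return the sorted ids of sections mentioned in text, excluding the section itself."""
    found = set(SECTION_REF.findall(text))
    found.discard(section_id)  # a section mentioning itself is not a cross-reference
    return sorted(found)


# Made-up sentence for illustration, not a quotation from the Act.
sample = "The deduction under Section 80TTB is in addition to Section 80C and Section 80D."
refs = extract_references("80TTB", sample)
print(refs)  # ['80C', '80D']
```

A pattern this simple would miss sub-clause citations and ranges ("Sections 80C to 80U"), which is presumably where the "Enhanced NLP" item on the roadmap comes in.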

2. Knowledge Graph (knowledge_graph.py)

Creates Neo4j nodes for sections
Builds REFERENCES relationships
Adds concept categorization
Provides graph analytics
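One way to build the REFERENCES relationships is to generate parameterized Cypher `MERGE` statements from the parser output. This is a sketch under assumptions: the `Section` label, the `id` property, and the function `build_reference_statements` are illustrative and may not match knowledge_graph.py.

```python
def build_reference_statements(references):
    """Turn {source_id: [target_id, ...]} into a list of (cypher, params) pairs."""
    # MERGE creates each node and relationship only if it does not exist yet,
    # so rebuilding the graph does not duplicate data.
    cypher = (
        "MERGE (a:Section {id: $source}) "
        "MERGE (b:Section {id: $target}) "
        "MERGE (a)-[:REFERENCES]->(b)"
    )
    statements = []
    for source, targets in sorted(references.items()):
        for target in sorted(targets):
            statements.append((cypher, {"source": source, "target": target}))
    return statements


# Made-up edges for illustration only.
statements = build_reference_statements({"80TTB": ["80C"], "139": ["271F", "44AD"]})
for query, params in statements:
    print(params)
```

With the official Neo4j Python driver, each pair could then be executed inside a session, for example with `session.run(query, params)`. Passing values as parameters rather than formatting them into the query string avoids injection problems and lets Neo4j reuse the query plan.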

3. Hybrid Query System (hybrid_query_system.py)

Routes queries based on type:
- KG: Reference/relationship queries
- RAG: Content/explanation queries
- Hybrid: Complex eligibility queries
Combines results for comprehensive answers
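The routing step could be as simple as keyword matching on the question. The sketch below is one possible approach, not necessarily what hybrid_query_system.py does; `route_query` and both pattern lists are hypothetical.

```python
import re

# Hypothetical cue words for each route.
KG_PATTERNS = [r"\breferences?\b", r"\brelated? to\b", r"\baffected\b"]
HYBRID_PATTERNS = [r"\bexemptions?\b", r"\bdeductions?\b", r"\beligib", r"\bavailable\b"]


def route_query(question):
    """Return 'kg', 'hybrid', or 'rag' for a natural-language question."""
    q = question.lower()
    # Relationship wording takes priority over category wording.
    if any(re.search(pattern, q) for pattern in KG_PATTERNS):
        return "kg"
    if any(re.search(pattern, q) for pattern in HYBRID_PATTERNS):
        return "hybrid"
    # Everything else is treated as a content question.
    return "rag"


for question in [
    "What sections reference Section 80C?",
    "What exemptions are available for senior citizens?",
    "Explain Section 44AD for presumptive taxation",
]:
    print(route_query(question), "<-", question)
```

Keyword routing is cheap and easy to debug, but it misroutes paraphrases; a classifier or an LLM call would be the obvious next step if accuracy matters.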

Reference Tracking: "What references Section X?"
Relationship Discovery: "What sections are related to X?"
Category Queries: "Show all deduction sections"
Eligibility Analysis: "What exemptions for senior citizens?"
Content Explanation: "Explain presumptive taxation"
Impact Analysis: "If Section X changes, what's affected?"
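Impact analysis amounts to walking the REFERENCES edges backwards from the changed section. Here is a minimal in-memory sketch using breadth-first search; the function `affected_sections` and the edges in the example are made up for illustration and do not describe the actual Act or the project's implementation.

```python
from collections import deque


def affected_sections(references, changed):
    """Return every section that directly or transitively cites `changed`.

    `references` maps a section id to the list of section ids it cites.
    """
    # Invert the edges: for each section, who cites it?
    cited_by = {}
    for source, targets in references.items():
        for target in targets:
            cited_by.setdefault(target, set()).add(source)

    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for source in cited_by.get(current, ()):
            if source != changed and source not in seen:
                seen.add(source)
                queue.append(source)
    return sorted(seen)


# Hypothetical reference edges, for illustration only.
references = {"80TTB": ["80C"], "80D": ["80C"], "139": ["80D"], "271F": ["139"]}
print(affected_sections(references, "80C"))  # ['139', '271F', '80D', '80TTB']
```

In Neo4j the same question would typically be a variable-length path match over REFERENCES; the `seen` set here plays the role of cycle protection.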

Section count by type
Cross-reference statistics
Concept distribution
Reference network analysis
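These statistics can be computed with `collections.Counter` once sections are loaded. The sketch below works on plain dictionaries; `graph_stats`, the record layout, and the reference edges are assumptions for illustration, with section types borrowed from the sample data list.

```python
from collections import Counter


def graph_stats(sections):
    """Summarize section types and cross-references for a list of section records."""
    type_counts = Counter(section["type"] for section in sections)
    # In-degree: how often each section is cited by others.
    in_degree = Counter(
        target for section in sections for target in section["references"]
    )
    return {
        "sections_by_type": dict(type_counts),
        "total_references": sum(in_degree.values()),
        "most_referenced": in_degree.most_common(1),
    }


# Types follow the sample data categories; the reference edges are made up.
sections = [
    {"id": "10", "type": "exemption", "references": []},
    {"id": "80C", "type": "deduction", "references": []},
    {"id": "80D", "type": "deduction", "references": ["80C"]},
    {"id": "80TTB", "type": "deduction", "references": ["80C"]},
    {"id": "271F", "type": "penalty", "references": ["139"]},
]

stats = graph_stats(sections)
print(stats)
```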

The system comes with sample data covering key sections:

Exemptions: Sections 10, 10A
Deductions: Sections 80C, 80D, 80TTB
Penalties: Sections 271F, 271B
Procedures: Sections 139, 44AD
Definitions: Section 2

Try these queries:

• "What sections reference Section 80C?"
• "What exemptions are available for senior citizens?"
• "Show me all penalty sections"
• "What is Section 44AD about?"
• "Which sections mention agricultural income?"

Cross-Reference Navigation: Navigate the web of legal references
Structured Categorization: Find all exemptions/deductions instantly
Impact Analysis: See what's affected when sections change
Context-Aware Responses: Combine structure with content
Scalable: Add more legal documents to the same graph

Environment Variables (.env)

    NEO4J_URI=bolt://localhost:7687
    NEO4J_USERNAME=neo4j
    NEO4J_PASSWORD=your_password
    OPENAI_API_KEY=your_api_key  # Optional
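A sketch of how these variables might be read at startup, assuming they have already been exported into the process environment (how the project actually loads .env is not shown here). The function `load_settings` and the returned dictionary keys are hypothetical.

```python
import os


def load_settings(env=None):
    """Read connection settings, falling back to the defaults shown above."""
    env = os.environ if env is None else env
    settings = {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD"),
        # Optional: an empty or missing key is normalized to None.
        "openai_api_key": env.get("OPENAI_API_KEY") or None,
    }
    if not settings["password"]:
        raise ValueError("NEO4J_PASSWORD is not set; check your .env file")
    return settings


# Explicit mapping used here so the example runs without a real environment.
settings = load_settings({"NEO4J_PASSWORD": "your_password"})
print(settings["uri"], settings["username"])
```

Failing early on a missing password gives a clearer error than a connection failure later on.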

Add sections to sample_income_tax_act.txt
Follow the format: Section X - Title
The parser will automatically extract references and relationships
Run the system to rebuild the knowledge graph
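For reference, the `Section X - Title` header format can be split into records roughly as follows. This is a sketch, not the real parser: `split_sections`, the `HEADER` pattern, and the sample titles and body text are placeholders.

```python
import re

# A header line looks like "Section 80C - Some title".
HEADER = re.compile(r"^Section[ \t]+(\d+[A-Z]*)[ \t]*-[ \t]*(.+)$", re.MULTILINE)


def split_sections(text):
    """Split raw text into records with id, title, and content."""
    matches = list(HEADER.finditer(text))
    sections = []
    for index, match in enumerate(matches):
        # A section's body runs until the next header or the end of the text.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        sections.append(
            {
                "id": match.group(1),
                "title": match.group(2).strip(),
                "content": text[match.end():end].strip(),
            }
        )
    return sections


# Placeholder text, not the wording of the Act.
sample = """Section 80C - Placeholder title one
Placeholder body text for the first section.

Section 80D - Placeholder title two
Placeholder body text that mentions Section 80C.
"""

parsed = split_sections(sample)
for section in parsed:
    print(section["id"], "|", section["title"])
```

Anchoring the pattern to the start of a line keeps in-body mentions such as "mentions Section 80C" from being mistaken for new headers.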

    # Check if Neo4j is running
    curl http://localhost:7474

    # Verify credentials in .env file
    # Make sure bolt port (7687) is accessible

    # Reinstall dependencies
    pip install -r requirements.txt

    # Check Python version (3.8+ required)
    python --version

Check Neo4j database has data: MATCH (n) RETURN count(n)
Verify section format in source text
Check logs for parsing errors

Graph Build: ~1-2 seconds for sample data
Query Response: ~100-500ms average
Memory Usage: ~50MB for sample dataset
Scalability: Tested up to 1000+ sections

Add more legal documents (Companies Act, GST Act)
Enhanced NLP for better reference extraction
Web interface with graph visualization
Multi-language support
Advanced analytics and insights

MIT License - Feel free to use for educational and commercial purposes.
