Show HN: Hybrid Knowledge Graph and RAG for Legal Documents (Learning Project)


A hybrid system combining Knowledge Graphs and Retrieval-Augmented Generation (RAG) for intelligent querying of the Indian Income Tax Act.

Traditional RAG systems struggle with legal documents because they miss the interconnected nature of legal provisions. This system addresses that by answering relationship-aware questions such as:

  • "What sections reference Section 80C?"
  • "Show me all exemptions available for senior citizens"
  • "What penalties apply if I violate Section 44AD?"
  • "How does Section 10 relate to Section 80?"

Knowledge Graph Excels Because Tax Law Has:

  • Sections that reference other sections
  • Definitions used across multiple places
  • Conditions and thresholds (income slabs, age limits)
  • Exemptions with eligibility criteria

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Tax Act Text   │───▶│  Parser Module   │───▶│ Knowledge Graph │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                       │
                                ▼                       │
                       ┌──────────────────┐             │
                       │   Vector Store   │             │
                       │      (RAG)       │             │
                       └──────────────────┘             │
                                │                       │
                                ▼                       ▼
                    ┌──────────────────────────────────────┐
                    │         Hybrid Query System          │
                    │  • KG Queries (relationships)        │
                    │  • RAG Queries (content)             │
                    │  • Hybrid Queries (both)             │
                    └──────────────────────────────────────┘

Prerequisites

  • Python 3.8+
  • Neo4j Database (local or cloud)
  • OpenAI API Key (optional, for enhanced responses)

Option 1: Using Makefile (Recommended)

    git clone <repository>
    cd ita-kg

    # Setup virtual environment and install dependencies
    make setup

    # Configure environment
    cp .env.example .env
    # Edit .env with your credentials

    # Run with virtual environment
    make run

    # Or run demo
    make demo

Option 2: Using Docker

    git clone <repository>
    cd ita-kg

    # Configure environment
    cp .env.example .env
    # Edit .env with your credentials

    # Start services (Neo4j + app)
    make docker-up

    # View logs
    make docker-logs

Option 3: Manual Setup

    git clone <repository>
    cd ita-kg
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

    # Setup Neo4j manually
    docker run --name neo4j -p7474:7474 -p7687:7687 -d \
        -e NEO4J_AUTH=neo4j/your_password \
        neo4j:latest

    python main.py

Available Make Commands

    make help         # Show all available commands
    make setup        # Create venv and install dependencies
    make run          # Run main.py using virtual environment
    make demo         # Run demo.py using virtual environment
    make docker-up    # Start Docker services
    make docker-down  # Stop Docker services
    make clean        # Remove venv and Docker volumes

Knowledge Graph Queries (Relationships)

    # What sections reference Section 80C?
    response = query_system.query("What sections reference Section 80C?")
    # Returns: Sections that reference Section 80C:
    # • Section 80TTB: Deduction in respect of interest on deposits...

    # Find related sections
    response = query_system.query("What sections are related to Section 139?")

Hybrid Queries (Structure + Content)

    # Eligibility questions
    response = query_system.query("What exemptions are available for senior citizens?")
    # Combines: KG to find exemption sections + RAG for senior citizen content

    # Category queries
    response = query_system.query("What deductions are available?")
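
A minimal sketch of how such a hybrid lookup could combine the two sources. This is not the code in hybrid_query_system.py: the `hybrid_search` function, the record layout, and the toy records are assumptions, a plain type filter stands in for the graph query, and word overlap stands in for vector similarity.

```python
import re

def words(text):
    """Lowercased alphabetic tokens of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))

def hybrid_search(sections, section_type, question, top_k=3):
    # Step 1 (stand-in for the KG): keep only sections of the requested type.
    candidates = [s for s in sections if s["type"] == section_type]

    # Step 2 (stand-in for RAG): rank candidates by word overlap with the question.
    q_words = words(question)

    def score(section):
        return len(q_words & words(section["text"]))

    ranked = sorted(candidates, key=score, reverse=True)
    return [s["number"] for s in ranked[:top_k] if score(s) > 0]

# Toy records with placeholder identifiers and text (not taken from the Act).
toy_sections = [
    {"number": "X1", "type": "exemption", "text": "Toy text about agricultural income."},
    {"number": "X2", "type": "exemption", "text": "Toy text about senior citizens and pensions."},
    {"number": "X3", "type": "penalty", "text": "Toy text about senior citizens filing late."},
]

hits = hybrid_search(toy_sections, "exemption",
                     "What exemptions are available for senior citizens?")
print(hits)
```

The structural filter runs first so the content ranking only sees sections of the right category, which is why X3 is never considered even though its text mentions senior citizens.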

RAG Queries (Content-Based)

    # Detailed explanations
    response = query_system.query("Explain Section 44AD for presumptive taxation")

    # Specific information
    response = query_system.query("What is the penalty for not filing returns?")

Project Structure

    ita-kg/
    ├── tax_parser.py              # Income Tax Act text parser
    ├── knowledge_graph.py         # Neo4j Knowledge Graph builder
    ├── hybrid_query_system.py     # Query routing and processing
    ├── main.py                    # Interactive system
    ├── demo.py                    # Capabilities demonstration
    ├── sample_income_tax_act.txt  # Sample tax act data
    ├── requirements.txt           # Python dependencies
    ├── .env.example               # Environment template
    └── README.md                  # This file

1. Tax Parser (tax_parser.py)

  • Extracts sections, titles, and content
  • Identifies cross-references between sections
  • Classifies section types (exemption, deduction, penalty)
  • Extracts key concepts and definitions
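
The parsing steps above can be sketched roughly as follows. This is an illustration only, not the contents of tax_parser.py: the `Section` dataclass, `parse_sections`, both regular expressions, and the keyword heuristics are assumptions. Only the `Section X - Title` header format comes from this README, and the sample text is placeholder wording rather than the statute.

```python
import re
from dataclasses import dataclass, field

# Header format described in this README: "Section X - Title".
HEADER_RE = re.compile(r"^Section\s+(\d+[A-Z]*)\s*-\s*(.+)$", re.MULTILINE)
# Assumed pattern for in-text cross-references such as "section 80C".
REF_RE = re.compile(r"\b[Ss]ection\s+(\d+[A-Z]*)\b")

# Assumed keyword heuristics for the section types listed above.
TYPE_KEYWORDS = {
    "exemption": ("exempt",),
    "deduction": ("deduction",),
    "penalty": ("penalty",),
}

@dataclass
class Section:
    number: str
    title: str
    content: str
    section_type: str = "other"
    references: list = field(default_factory=list)

def classify(text):
    """Return the first section type whose keyword appears in the text."""
    lowered = text.lower()
    for section_type, keywords in TYPE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return section_type
    return "other"

def parse_sections(text):
    """Split raw text on section headers and collect cross-references."""
    headers = list(HEADER_RE.finditer(text))
    sections = []
    for i, match in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[match.end():end].strip()
        number, title = match.group(1), match.group(2).strip()
        # A section mentioning itself is not treated as a cross-reference.
        refs = sorted({r for r in REF_RE.findall(body) if r != number})
        sections.append(Section(number, title, body, classify(title + " " + body), refs))
    return sections

sample = """Section 80C - Deduction for specified investments
Placeholder text describing a deduction.

Section 80TTB - Deduction in respect of interest on deposits
Placeholder text that mentions section 80C.
"""

parsed = parse_sections(sample)
for s in parsed:
    print(s.number, s.section_type, s.references)
```

Real statutory text uses many more reference styles (sub-sections, clauses, ranges), so a regex like this would only be a starting point.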

2. Knowledge Graph (knowledge_graph.py)

  • Creates Neo4j nodes for sections
  • Builds REFERENCES relationships
  • Adds concept categorization
  • Provides graph analytics
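
A rough sketch of how section nodes and REFERENCES relationships might be written to Neo4j. The `Section` label, the `number`, `title` and `type` property names, the record layout, and both function names are assumptions, not taken from knowledge_graph.py. The Cypher is built as plain data first, so it can be inspected without a running database.

```python
def build_statements(sections):
    """Return (cypher, params) pairs: one per section node, one per reference."""
    statements = []
    for sec in sections:
        statements.append((
            "MERGE (s:Section {number: $number}) "
            "SET s.title = $title, s.type = $type",
            {"number": sec["number"], "title": sec["title"], "type": sec["type"]},
        ))
    for sec in sections:
        for target in sec["references"]:
            # MERGE keeps the rebuild idempotent: re-running creates no duplicates.
            statements.append((
                "MERGE (a:Section {number: $source}) "
                "MERGE (b:Section {number: $target}) "
                "MERGE (a)-[:REFERENCES]->(b)",
                {"source": sec["number"], "target": target},
            ))
    return statements

def write_to_neo4j(statements, uri, user, password):
    """Run the statements against Neo4j (needs the `neo4j` package and a live server)."""
    from neo4j import GraphDatabase
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            for cypher, params in statements:
                session.run(cypher, params)
    finally:
        driver.close()

# Two records shaped like hypothetical parser output.
records = [
    {"number": "80C", "title": "Deduction (placeholder title)",
     "type": "deduction", "references": []},
    {"number": "80TTB", "title": "Deduction in respect of interest on deposits",
     "type": "deduction", "references": ["80C"]},
]

statements = build_statements(records)
for cypher, params in statements:
    print(cypher, params)
```

Using MERGE instead of CREATE is a design choice for this sketch: it lets the graph be rebuilt after the source text changes without first wiping the database.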

3. Hybrid Query System (hybrid_query_system.py)

  • Routes queries based on type:
    • KG: Reference/relationship queries
    • RAG: Content/explanation queries
    • Hybrid: Complex eligibility queries
  • Combines results for comprehensive answers
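
One simple way to implement this kind of routing is keyword matching on the question. The sketch below is an assumption about how that could look, not the logic in hybrid_query_system.py; the pattern lists and the `route_query` name are hypothetical.

```python
import re

# Assumed cue patterns; a real router would likely need a longer list or a classifier.
KG_PATTERNS = (r"\breferenc", r"\brelated to\b", r"\brelate to\b")
HYBRID_PATTERNS = (r"\bexemptions?\b", r"\bdeductions?\b", r"\beligib", r"\bavailable\b")

def route_query(question):
    """Return 'kg', 'hybrid', or 'rag' for a natural-language question."""
    q = question.lower()
    # Relationship cues win first: they need graph traversal, not text search.
    if any(re.search(p, q) for p in KG_PATTERNS):
        return "kg"
    # Category and eligibility cues need structure plus content.
    if any(re.search(p, q) for p in HYBRID_PATTERNS):
        return "hybrid"
    # Everything else falls back to content retrieval.
    return "rag"

for question in (
    "What sections reference Section 80C?",
    "What exemptions are available for senior citizens?",
    "Explain Section 44AD for presumptive taxation",
):
    print(route_query(question), "<-", question)
```

The order of the checks matters: a question such as "What deductions reference Section 80C?" contains both kinds of cue and is sent to the graph.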

Supported Query Types

  1. Reference Tracking: "What references Section X?"
  2. Relationship Discovery: "What sections are related to X?"
  3. Category Queries: "Show all deduction sections"
  4. Eligibility Analysis: "What exemptions for senior citizens?"
  5. Content Explanation: "Explain presumptive taxation"
  6. Impact Analysis: "If Section X changes, what's affected?"
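
Impact analysis amounts to following REFERENCES edges backwards from the changed section. A minimal in-memory sketch, assuming a plain dict of references; the `affected_sections` name and the single-letter section labels are placeholders, not data from the Act.

```python
from collections import deque

def affected_sections(references, changed):
    """Return every section that directly or transitively references `changed`.

    `references` maps a section to the list of sections it references.
    """
    # Invert the edges so we can walk from a section to those that cite it.
    referenced_by = {}
    for source, targets in references.items():
        for target in targets:
            referenced_by.setdefault(target, set()).add(source)

    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependant in sorted(referenced_by.get(current, ())):
            if dependant != changed and dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return sorted(seen)

# Placeholder graph: B and D reference A, C references B, E references nothing.
toy_references = {"B": ["A"], "C": ["B"], "D": ["A"], "E": []}

impacted = affected_sections(toy_references, "A")
print(impacted)
```

In Neo4j the same question could probably be asked with a variable-length pattern such as `MATCH (s:Section)-[:REFERENCES*1..]->(:Section {number: $n}) RETURN DISTINCT s.number`, assuming nodes are labelled `Section` with a `number` property.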

Graph Analytics

  • Section count by type
  • Cross-reference statistics
  • Concept distribution
  • Reference network analysis
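
These statistics are cheap to compute once sections are parsed. The sketch below works on plain dictionaries rather than on Neo4j; the record layout, the `graph_stats` name, and the placeholder records are assumptions.

```python
from collections import Counter

def graph_stats(sections):
    """Summarise section types and cross-reference counts."""
    type_counts = Counter(s["type"] for s in sections)
    # In-degree: how many times each section is referenced by others.
    in_degree = Counter(t for s in sections for t in s["references"])
    total_refs = sum(len(s["references"]) for s in sections)
    return {
        "sections_by_type": dict(type_counts),
        "total_references": total_refs,
        "most_referenced": in_degree.most_common(1),
    }

# Placeholder records, not data from the Act.
toy_records = [
    {"number": "A", "type": "deduction", "references": []},
    {"number": "B", "type": "deduction", "references": ["A"]},
    {"number": "C", "type": "penalty", "references": ["A", "B"]},
]

stats = graph_stats(toy_records)
print(stats)
```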

The system comes with sample data covering key sections:

  • Exemptions: Sections 10, 10A
  • Deductions: Sections 80C, 80D, 80TTB
  • Penalties: Sections 271F, 271B
  • Procedures: Sections 139, 44AD
  • Definitions: Section 2

Try these queries:

  • "What sections reference Section 80C?"
  • "What exemptions are available for senior citizens?"
  • "Show me all penalty sections"
  • "What is Section 44AD about?"
  • "Which sections mention agricultural income?"

Key Benefits

  1. Cross-Reference Navigation: Navigate the web of legal references
  2. Structured Categorization: Find all exemptions/deductions instantly
  3. Impact Analysis: See what's affected when sections change
  4. Context-Aware Responses: Combine structure with content
  5. Scalable: Add more legal documents to the same graph

Environment Variables (.env)

    NEO4J_URI=bolt://localhost:7687
    NEO4J_USERNAME=neo4j
    NEO4J_PASSWORD=your_password
    OPENAI_API_KEY=your_api_key  # Optional
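
A small sketch of how an application might read these settings and fail early when a required one is missing. The `load_config` function is hypothetical and not part of this project. It assumes the variables are already in the process environment, for example exported by the shell or loaded from .env with a helper such as python-dotenv.

```python
import os

REQUIRED_KEYS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")

def load_config(env=None):
    """Read connection settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {
        "uri": env["NEO4J_URI"],
        "username": env["NEO4J_USERNAME"],
        "password": env["NEO4J_PASSWORD"],
        # Optional: None means the OpenAI-enhanced responses are unavailable.
        "openai_api_key": env.get("OPENAI_API_KEY") or None,
    }

example_env = {
    "NEO4J_URI": "bolt://localhost:7687",
    "NEO4J_USERNAME": "neo4j",
    "NEO4J_PASSWORD": "your_password",
}

config = load_config(example_env)
print(config["uri"], config["openai_api_key"])
```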

Adding New Sections

  1. Add sections to sample_income_tax_act.txt
  2. Follow the format: Section X - Title
  3. The parser will automatically extract references and relationships
  4. Run the system to rebuild the knowledge graph

Troubleshooting

Neo4j Connection Issues

    # Check if Neo4j is running
    curl http://localhost:7474

    # Verify credentials in .env file
    # Make sure bolt port (7687) is accessible

Import Errors

    # Reinstall dependencies
    pip install -r requirements.txt

    # Check Python version (3.8+ required)
    python --version

No Query Results

  • Check Neo4j database has data: MATCH (n) RETURN count(n)
  • Verify section format in source text
  • Check logs for parsing errors

Performance

  • Graph Build: ~1-2 seconds for sample data
  • Query Response: ~100-500ms average
  • Memory Usage: ~50MB for sample dataset
  • Scalability: Tested with 1,000+ sections

Future Enhancements

  • Add more legal documents (Companies Act, GST Act)
  • Enhanced NLP for better reference extraction
  • Web interface with graph visualization
  • Multi-language support
  • Advanced analytics and insights

MIT License - Feel free to use for educational and commercial purposes.
