Show HN: I built an LLM that never forgets – persistent user memory with RAG


An intelligent proxy for LLMs with long-term memory, built on PostgreSQL + pgvector. The system lets the LLM itself extract and remember facts about users automatically, with no hardcoded triggers.

🤖 AI-Driven Fact Extraction

  • No hardcoded triggers - AI autonomously decides what's worth remembering
  • Single LLM call generates the response AND extracts facts simultaneously (see the prompt sketch after this list)
  • Automatically determines memory type: preference, personal, skill, goal, opinion, experience, other
  • Intelligent importance scoring (0.5-2.0)
  • Handles various categories:
    • Preferences and likes/dislikes
    • Personal information (name, location, job, hobbies)
    • Skills and areas of expertise
    • Goals and projects
    • Opinions and beliefs
    • Experiences and stories
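
There is no keyword matching behind this - the behaviour comes entirely from the prompt. As a rough illustration (the real prompt lives in generate_response_and_extract_facts and will differ in wording), a combined response-plus-extraction prompt might look like:

# Hypothetical system prompt - illustrative only, not the project's actual wording.
SYSTEM_PROMPT = """You are a helpful assistant with long-term memory.
Reply to the user AND extract any facts worth remembering.

Return ONLY valid JSON in this shape:
{
  "response": "<your reply to the user>",
  "facts": [
    {
      "content": "<fact about the user, in the third person>",
      "memory_type": "preference|personal|skill|goal|opinion|experience|other",
      "importance": 0.5-2.0
    }
  ]
}
If nothing is worth remembering, return an empty "facts" list."""

Because the model returns one JSON object, the proxy gets both the reply and any new memories from a single completion.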

🔍 Semantic Search with pgvector

  • Memory retrieval using vector similarity (see the retrieval sketch after this list)
  • Fast search thanks to IVFFlat indexes
  • Embeddings generated by google/embeddinggemma-300m
  • Memories persist across sessions
  • Historical context in every conversation
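
To make the retrieval step concrete, here is a minimal sketch of how it can be done with psycopg2 and the pgvector adapter. The connection settings mirror the configuration section below; the retrieve_memories helper and its signature are illustrative, not the project's actual code:

# Illustrative sketch of the retrieval step (not the project's actual code).
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("google/embeddinggemma-300m")  # 768-dim embeddings

conn = psycopg2.connect(host="localhost", dbname="vectordb",
                        user="vectoruser", password="vectorpass")
register_vector(conn)  # lets numpy arrays be passed as vector parameters

def retrieve_memories(user_id: str, message: str, limit: int = 5):
    """Return the stored memories most similar to the incoming message."""
    query_embedding = embedder.encode(message)
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, memory_type, importance,
                   1 - (embedding <=> %s) AS similarity  -- <=> is cosine distance
            FROM user_memories
            WHERE user_id = %s
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (query_embedding, user_id, query_embedding, limit),
        )
        return cur.fetchall()

print(retrieve_memories("alice", "What do you know about me?"))

This sketch assumes the IVFFlat index was built with vector_cosine_ops, so the <=> cosine-distance ordering can use it.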
⚡ Efficient by Design

  • Only extracted facts are stored - full conversations are NOT saved
  • Single LLM API call per user message
  • Returns structured JSON: {response, facts}
  • Fast response times
  • Cost-effective operation

1. PostgreSQL with pgvector

# Build Docker image
docker build -t postgres-pgvector .

# Run container
docker run -d --name postgres-vector -p 5432:5432 postgres-pgvector

2. Install Python dependencies

pip install psycopg2-binary pgvector sentence-transformers torch transformers fastapi uvicorn requests

3. Start LLM (if not already running)

The system requires a running LLM API with an OpenAI-compatible interface. Default: http://localhost:8000/v1

# Example: vLLM
python -m vllm.entrypoints.openai.api_server --model smollm-135m-instruct
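
Because the proxy only assumes an OpenAI-compatible interface, any server exposing /v1/chat/completions will work. For reference, the downstream call is a standard chat-completions request, roughly like this (illustrative payload; the URL and model name match the defaults above):

# Illustrative OpenAI-compatible chat completion call against the local LLM.
import requests

payload = {
    "model": "smollm-135m-instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "max_tokens": 600,
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])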

4. Start the proxy

python proxy.py

The server will start at http://localhost:8001

POST /chat

{ "user_id": "john_doe", "message": "Hi! I'm a Python developer and I love FastAPI.", "save_to_memory": true }

Response:

{ "response": "Hello! Nice to meet you. FastAPI is indeed a great framework!", "relevant_memories": [ { "content": "User is a Python developer", "type": "personal", "similarity": 0.87, "created_at": "2025-10-21T10:30:00", "importance": 1.5 } ], "memory_saved": true }

Use the built-in chat tester for interactive testing. It will:

  • Ask for your user ID
  • Start an interactive chat session
  • Show AI responses and used memories
  • Display when new facts are saved

You can also test directly with curl:

# First message - AI will extract facts
curl -X POST http://localhost:8001/chat \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "alice",
    "message": "Hi! My name is Alice and I love hiking in the mountains. I work as a data scientist.",
    "save_to_memory": true
  }'

# Second message - AI will use remembered facts
curl -X POST http://localhost:8001/chat \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "alice",
    "message": "What do you know about me?",
    "save_to_memory": true
  }'
┌─────────────┐
│    User     │
└──────┬──────┘
       │
       ▼
┌────────────────────────────────────────┐
│       FastAPI Proxy (port 8001)        │
│ ┌──────────────────────────────────┐   │
│ │ 1. Retrieve Memories (RAG)       │◄──┼─┐
│ └──────────────────────────────────┘   │ │
│ ┌──────────────────────────────────┐   │ │
│ │ 2. Build Context                 │   │ │
│ └──────────────────────────────────┘   │ │
│ ┌──────────────────────────────────┐   │ │
│ │ 3. Single LLM Call:              │───┼─┼──► LLM API (port 8000)
│ │    - Generate response           │   │ │    ⚡ Only 1 call per message!
│ │    - Extract facts               │   │ │
│ │    (returns JSON)                │   │ │
│ └──────────────────────────────────┘   │ │
│ ┌──────────────────────────────────┐   │ │
│ │ 4. Save Memories                 │───┼─┘
│ └──────────────────────────────────┘   │
└────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────┐
│   PostgreSQL + pgvector     │
│ ┌─────────────────────────┐ │
│ │ user_memories           │ │
│ │ - content (text)        │ │
│ │ - embedding (vector)    │ │
│ │ - memory_type           │ │
│ │ - importance            │ │
│ └─────────────────────────┘ │
└─────────────────────────────┘

Edit variables in proxy.py:

# Embedding model
EMBEDDING_MODEL = "google/embeddinggemma-300m"

# LLM API URL
LLM_API_URL = "http://localhost:8000/v1"
LLM_MODEL = "smollm-135m-instruct"

# PostgreSQL config
POSTGRES_CONFIG = {
    "host": "localhost",
    "database": "vectordb",
    "user": "vectoruser",
    "password": "vectorpass"
}

The system uses a single optimized LLM call that handles both response generation and fact extraction:

  1. Input: User message + retrieved memories (context)
  2. Single LLM call: AI generates structured JSON containing:
    • response - natural reply to user
    • facts - list of extracted memories
  3. Categorization: AI automatically assigns memory type for each fact
  4. Importance scoring: AI determines value (0.5-2.0) for each fact
  5. Validation: System validates the JSON and saves the facts to the database (see the parsing sketch after the example output below)

Example LLM output:

{ "response": "Hello! Nice to meet a fellow Python enthusiast!", "facts": [ {"content": "User is a Python developer", "memory_type": "personal", "importance": 1.5}, {"content": "User works on AI projects", "memory_type": "personal", "importance": 1.4}, {"content": "User loves FastAPI framework", "memory_type": "preference", "importance": 1.3} ] }

⚡ Efficient: Only 1 API call per user message
🎯 Smart: AI decides what's worth remembering
📊 Structured: Returns both conversation response and extracted facts

Database schema - user_memories table:

  • id - SERIAL PRIMARY KEY
  • user_id - VARCHAR(255) - User identifier
  • content - TEXT - Memory content
  • memory_type - VARCHAR(50) - Type (preference/personal/skill/etc.)
  • embedding - VECTOR(768) - Vector embedding
  • created_at - TIMESTAMP - Creation date
  • importance - FLOAT - Importance weight (0.5-2.0)

Note: The conversations table has been removed - the system doesn't store full conversations.
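
Expressed as DDL, the table and its IVFFlat index come out roughly as follows (a sketch assuming cosine distance and lists = 100; the project's actual setup script may differ):

# Illustrative table + index creation (the project's setup/cleanup scripts may differ).
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS user_memories (
    id          SERIAL PRIMARY KEY,
    user_id     VARCHAR(255) NOT NULL,
    content     TEXT NOT NULL,
    memory_type VARCHAR(50),
    embedding   VECTOR(768),
    created_at  TIMESTAMP DEFAULT NOW(),
    importance  FLOAT DEFAULT 1.0
);

-- IVFFlat index for fast approximate nearest-neighbour search (cosine distance assumed)
CREATE INDEX IF NOT EXISTS user_memories_embedding_idx
    ON user_memories USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
"""

conn = psycopg2.connect(host="localhost", dbname="vectordb",
                        user="vectoruser", password="vectorpass")
with conn, conn.cursor() as cur:
    cur.execute(DDL)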

🔧 Troubleshooting

AI not extracting facts correctly:

  1. Check logs - system prints warnings about parsing errors
  2. Increase max_tokens in generate_response_and_extract_facts (currently 600)
  3. Use a stronger LLM model (e.g., llama-3-8b instead of smollm-135m)
  4. Adjust the prompt in generate_response_and_extract_facts

Database dimension errors:

If you get an "expected 384 dimensions, not 768" error:

# Clean up database and recreate with correct dimensions
python cleanup_db.py

# Restart proxy
python proxy.py

This happens when the database was created with wrong embedding dimensions.
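
A quick way to see which side is out of sync is to compare the embedding model's output dimension with the column type (a small diagnostic sketch, not part of the project):

# Diagnostic: compare the embedding model's dimension vs. the database column (illustrative).
import psycopg2
from sentence_transformers import SentenceTransformer

model_dim = SentenceTransformer("google/embeddinggemma-300m").get_sentence_embedding_dimension()

conn = psycopg2.connect(host="localhost", dbname="vectordb",
                        user="vectoruser", password="vectorpass")
with conn.cursor() as cur:
    cur.execute(
        "SELECT format_type(atttypid, atttypmod) FROM pg_attribute "
        "WHERE attrelid = 'user_memories'::regclass AND attname = 'embedding'"
    )
    (column_type,) = cur.fetchone()

print(f"model outputs {model_dim} dimensions; database column is {column_type}")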

License: MIT - use it however you want!

Built with:

  • FastAPI - modern web framework
  • pgvector - PostgreSQL extension for vectors
  • HuggingFace Transformers - embedding models
  • Context7 - API documentation

Enjoy your AI with perfect memory! 🧠✨
