Show HN: I built an LLM that never forgets – persistent user memory with RAG


An intelligent proxy for LLMs with long-term memory, built on PostgreSQL + pgvector. The system lets the LLM itself extract and remember facts about users automatically, with no hardcoded triggers.

🤖 AI-Driven Fact Extraction

  • No hardcoded triggers - AI autonomously decides what's worth remembering
  • Single LLM call generates the response AND extracts facts simultaneously (see the prompt sketch after this list)
  • Automatically determines memory type: preference, personal, skill, goal, opinion, experience, other
  • Intelligent importance scoring (0.5-2.0)
  • Handles various categories:
    • Preferences and likes/dislikes
    • Personal information (name, location, job, hobbies)
    • Skills and areas of expertise
    • Goals and projects
    • Opinions and beliefs
    • Experiences and stories
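
There is no keyword matching behind this - the behaviour comes entirely from the prompt. As a rough illustration (the real prompt lives in generate_response_and_extract_facts and will differ in wording), a combined response-plus-extraction prompt might look like:

# Hypothetical system prompt - illustrative only, not the project's actual wording.
SYSTEM_PROMPT = """You are a helpful assistant with long-term memory.
Reply to the user AND extract any facts worth remembering.

Return ONLY valid JSON in this shape:
{
  "response": "<your reply to the user>",
  "facts": [
    {
      "content": "<fact about the user, in the third person>",
      "memory_type": "preference|personal|skill|goal|opinion|experience|other",
      "importance": 0.5-2.0
    }
  ]
}
If nothing is worth remembering, return an empty "facts" list."""

Because the model returns one JSON object, the proxy gets both the reply and any new memories from a single completion.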

🔍 Semantic Search with pgvector

  • Memory retrieval using vector similarity (see the retrieval sketch after this list)
  • Fast search thanks to IVFFlat indexes
  • Embeddings generated by google/embeddinggemma-300m
  • Memories persist across sessions
  • Historical context in every conversation
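
To make the retrieval step concrete, here is a minimal sketch of how it can be done with psycopg2 and the pgvector adapter. The connection settings mirror the configuration section below; the retrieve_memories helper and its signature are illustrative, not the project's actual code:

# Illustrative sketch of the retrieval step (not the project's actual code).
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("google/embeddinggemma-300m")  # 768-dim embeddings

conn = psycopg2.connect(host="localhost", dbname="vectordb",
                        user="vectoruser", password="vectorpass")
register_vector(conn)  # lets numpy arrays be passed as vector parameters

def retrieve_memories(user_id: str, message: str, limit: int = 5):
    """Return the stored memories most similar to the incoming message."""
    query_embedding = embedder.encode(message)
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, memory_type, importance,
                   1 - (embedding <=> %s) AS similarity  -- <=> is cosine distance
            FROM user_memories
            WHERE user_id = %s
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (query_embedding, user_id, query_embedding, limit),
        )
        return cur.fetchall()

print(retrieve_memories("alice", "What do you know about me?"))

This sketch assumes the IVFFlat index was built with vector_cosine_ops, so the <=> cosine-distance ordering can use it.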
⚡ Efficient by Design

  • Only extracted facts are stored - full conversations are NOT saved
  • Single LLM API call per user message
  • Returns structured JSON: {response, facts}
  • Fast response times
  • Cost-effective operation

1. PostgreSQL with pgvector

# Build Docker image
docker build -t postgres-pgvector .

# Run container
docker run -d --name postgres-vector -p 5432:5432 postgres-pgvector

2. Install Python dependencies

pip install psycopg2-binary pgvector sentence-transformers torch transformers fastapi uvicorn requests

3. Start LLM (if not already running)

The system requires a running LLM API with an OpenAI-compatible interface. Default: http://localhost:8000/v1

# Example: vLLM
python -m vllm.entrypoints.openai.api_server --model smollm-135m-instruct
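
Because the proxy only assumes an OpenAI-compatible interface, any server exposing /v1/chat/completions will work. For reference, the downstream call is a standard chat-completions request, roughly like this (illustrative payload; the URL and model name match the defaults above):

# Illustrative OpenAI-compatible chat completion call against the local LLM.
import requests

payload = {
    "model": "smollm-135m-instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "max_tokens": 600,
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])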

4. Start the proxy

python proxy.py

The server will start at http://localhost:8001

POST /chat

{ "user_id": "john_doe", "message": "Hi! I'm a Python developer and I love FastAPI.", "save_to_memory": true }

Response:

{ "response": "Hello! Nice to meet you. FastAPI is indeed a great framework!", "relevant_memories": [ { "content": "User is a Python developer", "type": "personal", "similarity": 0.87, "created_at": "2025-10-21T10:30:00", "importance": 1.5 } ], "memory_saved": true }

Use the built-in chat tester for interactive testing. It will:

  • Ask for your user ID
  • Start an interactive chat session
  • Show AI responses and used memories
  • Display when new facts are saved

You can also test directly with curl:

# First message - AI will extract facts
curl -X POST http://localhost:8001/chat \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "alice",
    "message": "Hi! My name is Alice and I love hiking in the mountains. I work as a data scientist.",
    "save_to_memory": true
  }'

# Second message - AI will use remembered facts
curl -X POST http://localhost:8001/chat \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "alice",
    "message": "What do you know about me?",
    "save_to_memory": true
  }'
┌─────────────┐
│    User     │
└──────┬──────┘
       │
       ▼
┌────────────────────────────────────────┐
│       FastAPI Proxy (port 8001)        │
│ ┌──────────────────────────────────┐   │
│ │ 1. Retrieve Memories (RAG)       │◄──┼─┐
│ └──────────────────────────────────┘   │ │
│ ┌──────────────────────────────────┐   │ │
│ │ 2. Build Context                 │   │ │
│ └──────────────────────────────────┘   │ │
│ ┌──────────────────────────────────┐   │ │
│ │ 3. Single LLM Call:              │───┼─┼──► LLM API (port 8000)
│ │    - Generate response           │   │ │    ⚡ Only 1 call per message!
│ │    - Extract facts               │   │ │
│ │    (returns JSON)                │   │ │
│ └──────────────────────────────────┘   │ │
│ ┌──────────────────────────────────┐   │ │
│ │ 4. Save Memories                 │───┼─┘
│ └──────────────────────────────────┘   │
└────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────┐
│   PostgreSQL + pgvector     │
│ ┌─────────────────────────┐ │
│ │ user_memories           │ │
│ │ - content (text)        │ │
│ │ - embedding (vector)    │ │
│ │ - memory_type           │ │
│ │ - importance            │ │
│ └─────────────────────────┘ │
└─────────────────────────────┘

Edit variables in proxy.py:

# Embedding model
EMBEDDING_MODEL = "google/embeddinggemma-300m"

# LLM API URL
LLM_API_URL = "http://localhost:8000/v1"
LLM_MODEL = "smollm-135m-instruct"

# PostgreSQL config
POSTGRES_CONFIG = {
    "host": "localhost",
    "database": "vectordb",
    "user": "vectoruser",
    "password": "vectorpass"
}

The system uses a single optimized LLM call that handles both response generation and fact extraction:

  1. Input: User message + retrieved memories (context)
  2. Single LLM call: AI generates structured JSON containing:
    • response - natural reply to user
    • facts - list of extracted memories
  3. Categorization: AI automatically assigns memory type for each fact
  4. Importance scoring: AI determines value (0.5-2.0) for each fact
  5. Validation: System validates the JSON and saves the facts to the database (see the parsing sketch after the example output below)

Example LLM output:

{ "response": "Hello! Nice to meet a fellow Python enthusiast!", "facts": [ {"content": "User is a Python developer", "memory_type": "personal", "importance": 1.5}, {"content": "User works on AI projects", "memory_type": "personal", "importance": 1.4}, {"content": "User loves FastAPI framework", "memory_type": "preference", "importance": 1.3} ] }

⚡ Efficient: Only 1 API call per user message
🎯 Smart: AI decides what's worth remembering
📊 Structured: Returns both conversation response and extracted facts

Database schema - user_memories table:

  • id - SERIAL PRIMARY KEY
  • user_id - VARCHAR(255) - User identifier
  • content - TEXT - Memory content
  • memory_type - VARCHAR(50) - Type (preference/personal/skill/etc.)
  • embedding - VECTOR(768) - Vector embedding
  • created_at - TIMESTAMP - Creation date
  • importance - FLOAT - Importance weight (0.5-2.0)

Note: The conversations table has been removed - the system doesn't store full conversations.
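
Expressed as DDL, the table and its IVFFlat index come out roughly as follows (a sketch assuming cosine distance and lists = 100; the project's actual setup script may differ):

# Illustrative table + index creation (the project's setup/cleanup scripts may differ).
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS user_memories (
    id          SERIAL PRIMARY KEY,
    user_id     VARCHAR(255) NOT NULL,
    content     TEXT NOT NULL,
    memory_type VARCHAR(50),
    embedding   VECTOR(768),
    created_at  TIMESTAMP DEFAULT NOW(),
    importance  FLOAT DEFAULT 1.0
);

-- IVFFlat index for fast approximate nearest-neighbour search (cosine distance assumed)
CREATE INDEX IF NOT EXISTS user_memories_embedding_idx
    ON user_memories USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
"""

conn = psycopg2.connect(host="localhost", dbname="vectordb",
                        user="vectoruser", password="vectorpass")
with conn, conn.cursor() as cur:
    cur.execute(DDL)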

🔧 Troubleshooting

AI not extracting facts correctly:

  1. Check logs - system prints warnings about parsing errors
  2. Increase max_tokens in generate_response_and_extract_facts (currently 600)
  3. Use a stronger LLM model (e.g., llama-3-8b instead of smollm-135m)
  4. Adjust the prompt in generate_response_and_extract_facts

Database dimension errors:

If you get an "expected 384 dimensions, not 768" error:

# Clean up database and recreate with correct dimensions
python cleanup_db.py

# Restart proxy
python proxy.py

This happens when the database was created with wrong embedding dimensions.
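
A quick way to see which side is out of sync is to compare the embedding model's output dimension with the column type (a small diagnostic sketch, not part of the project):

# Diagnostic: compare the embedding model's dimension vs. the database column (illustrative).
import psycopg2
from sentence_transformers import SentenceTransformer

model_dim = SentenceTransformer("google/embeddinggemma-300m").get_sentence_embedding_dimension()

conn = psycopg2.connect(host="localhost", dbname="vectordb",
                        user="vectoruser", password="vectorpass")
with conn.cursor() as cur:
    cur.execute(
        "SELECT format_type(atttypid, atttypmod) FROM pg_attribute "
        "WHERE attrelid = 'user_memories'::regclass AND attname = 'embedding'"
    )
    (column_type,) = cur.fetchone()

print(f"model outputs {model_dim} dimensions; database column is {column_type}")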

License: MIT - use it however you want!

Built with:

  • FastAPI - modern web framework
  • pgvector - PostgreSQL extension for vectors
  • HuggingFace Transformers - embedding models
  • Context7 - API documentation

Enjoy your AI with perfect memory! 🧠✨
