A real-world case study of long-context LLMs outperforming traditional retrieval systems
Viberank.dev is a modern platform for discovering and ranking top technology applications (think of it as a Product Hunt with AI-powered recommendations). We help users explore curated apps across categories like AI tools, developer utilities, productivity software, social media platforms, and more. Users can browse trending applications, submit their own tools, and get personalized recommendations based on their specific needs.
The technical aspect we want to focus on is our AI Tool Finder feature, which allows users to describe what they’re looking for in natural language and receive intelligent recommendations. Instead of browsing categories or using keyword search, users can simply ask: “I need a tool for creating presentations with AI” and get contextually relevant suggestions.
Our AI Tool Finder doesn’t use RAG (Retrieval-Augmented Generation) systems at all. Instead, we load our entire application database directly into Google’s Gemini 2.5 Flash context window and get better results with dramatically less complexity. This isn’t just a clever hack — it’s a glimpse into a future where RAG systems might become obsolete.
When we started building our AI Tool Finder, we faced the same challenge most AI teams encounter: how do we help users find relevant information from our growing database of applications?
Here’s our entire architecture:
1. User submits a natural language query
2. We fetch the apps from our Supabase database
3. Complete dataset gets loaded into Gemini’s context window
4. AI returns the 5 most relevant recommendations
That’s it. No vector embeddings, no similarity calculations, no retrieval ranking algorithms.
// The core of our approach - surprisingly simple
// (assumes an initialized Supabase client)
const { data: apps } = await supabase
.from('apps')
.select('id, name, category, tagline, website_summary, tags');
const combinedPrompt = `
You are an expert app recommender. Based on the user's query below,
please select up to 5 of the most relevant applications from the provided list.
User Query: "${userQuery}"
Available Applications:
${apps.map((app, index) => `
App Reference ${index + 1}:
- ID: ${app.id}
- Name: ${app.name || 'Unknown'}
- Category: ${app.category || 'Unknown'}
- Tagline: ${app.tagline || 'No tagline'}
- Summary: ${app.website_summary || 'No summary'}
- Tags: ${app.tags ? app.tags.join(', ') : 'No tags'}
`).join('\n')}
`;
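For completeness, here is roughly what steps 3 and 4 look like. This is a simplified sketch using the @google/generative-ai SDK; the setup and response handling shown here are illustrative, not our production code verbatim.

// Simplified sketch: send the combined prompt to Gemini 2.5 Flash
import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: 'gemini-2.5-flash' });

// One API call: the full app list and the user query go in together
const result = await model.generateContent(combinedPrompt);
const recommendations = result.response.text();
// The prompt asks for up to 5 apps, so the response is a short ranked list of recommendations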
The results? Our users get contextually aware recommendations that consider the entire landscape of available tools, not just the top-K retrieved fragments that traditional RAG systems provide.
The App Discovery Challenge: Users don’t just want apps that match keywords; they want contextually relevant recommendations. When someone asks for “a tool for creating presentations with AI,” they might benefit from knowing about complementary tools like image generators, voice-over software, or collaboration platforms. RAG systems, by design, would only retrieve the most similar apps, missing these valuable connections.
Our Data Doesn’t Chunk Well: Each app in our database is a complete entity with interconnected attributes — name, category, features, use cases, pricing, integrations. Splitting this into chunks for vector search would break these relationships.
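For illustration, a single app record has roughly this shape (limited here to the fields used in the query above; the additional attributes like features, use cases, pricing, and integrations are omitted):

// Illustrative shape of one app record; field names match the query above
interface AppRecord {
  id: string;
  name: string | null;
  category: string | null;
  tagline: string | null;
  website_summary: string | null;
  tags: string[] | null;
}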
Maintenance Overhead We Couldn’t Afford: As a small team, we couldn’t dedicate resources to constantly tuning embedding models, optimizing chunk strategies, or debugging why certain apps weren’t being retrieved.
Modern LLMs like Gemini 2.5 Pro can process over 1 million tokens, equivalent to about 750,000 words or the entire Harry Potter series. For us, this wasn’t just a bigger context window; it was a fundamentally different way of thinking about information processing.
Holistic Understanding: When our AI can see our entire dataset, it understands relationships that would be impossible to capture through retrieval. Our system can recommend complementary tools, identify alternatives, and understand subtle distinctions between similar applications.
Performance Benefits: Our system makes a single database query and one API call. Traditional RAG systems require multiple database operations, embedding calculations, and complex retrieval logic. The result? Lower latency and more predictable performance.
Our Current Scale: Right now, our complete app database requires approximately 131,285 tokens when formatted for Gemini’s context window. With Gemini 2.5 Flash supporting up to 1 million tokens, we’re currently using only about 13% of the available context capacity. This gives us significant room for growth.
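One simple way to monitor this as the database grows (a sketch, not necessarily how the figure above was produced) is the SDK’s countTokens method:

// Sketch: check how much of the context window the formatted prompt uses
// ('model' is the Gemini model instance from the earlier sketch)
const { totalTokens } = await model.countTokens(combinedPrompt);
const contextLimit = 1_000_000; // Gemini 2.5 Flash context window
const usage = ((totalTokens / contextLimit) * 100).toFixed(1);
console.log(`Prompt uses ${totalTokens} tokens (~${usage}% of the context window)`);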
Betting on Parallel Growth: Our approach is based on a strategic assumption: as Viberank grows and our app database expands, LLM context windows will continue to expand as well. We’re essentially betting that the rate of context window growth in AI models will outpace or at least match our database growth rate. Given the trajectory we’ve seen from 4K tokens (early GPT) to 1M+ tokens (current models) in just a few years, this seems like a reasonable bet.
What’s your experience with RAG vs long-context approaches? We’d love to hear your thoughts in the comments and connect with other teams exploring similar architectures.