Why Build Your Own?
Look, I know what you're thinking. "Why not just use Elasticsearch?" or "What about Algolia?" Those are valid options, but they come with complexity. You need to learn their APIs, manage their infrastructure, and deal with their quirks.
Sometimes you just want something that:
- Works with your existing database
- Doesn't require external services
- Is easy to understand and debug
- Actually finds relevant results
That's what I built. A search engine that uses your existing database, respects your current architecture, and gives you full control over how it works.
The Core Idea
The concept is simple: tokenize everything, store it, then match tokens when searching.
Here's how it works:
- Indexing: When you add or update content, we split it into tokens (words, prefixes, n-grams) and store them with weights
- Searching: When someone searches, we tokenize their query the same way, find matching tokens, and score the results
- Scoring: We use the stored weights to calculate relevance scores
The magic is in the tokenization and weighting. Let me show you what I mean.
Building Block 1: The Database Schema
We need two simple tables: index_tokens and index_entries.
index_tokens
This table stores all unique tokens with their tokenizer weights. Each token name can have multiple records with different weights—one per tokenizer.
```
// index_tokens table structure
id | name   | weight
---|--------|-------
1  | parser | 20   // From WordTokenizer
2  | parser | 5    // From PrefixTokenizer
3  | parser | 1    // From NGramsTokenizer
4  | parser | 10   // From SingularTokenizer
```
Why store separate tokens per weight? Different tokenizers produce the same token with different weights. For example, "parser" from WordTokenizer has weight 20, but "parser" from PrefixTokenizer has weight 5. We need separate records to properly score matches.
The unique constraint is on (name, weight), so the same token name can exist multiple times with different weights.
index_entries
This table links tokens to documents with field-specific weights.
```
// index_entries table structure
id | token_id | document_type | field_id | document_id | weight
---|----------|---------------|----------|-------------|-------
1  | 1        | 1             | 1        | 42          | 2000
2  | 2        | 1             | 1        | 42          | 500
```
The weight here is the final calculated weight: field_weight × tokenizer_weight × ceil(sqrt(token_length)). This encodes everything we need for scoring. We will talk about scoring later in the post.
We add indexes on:
- (document_type, document_id) - for fast document lookups
- token_id - for fast token lookups
- (document_type, field_id) - for field-specific queries
- weight - for filtering by weight
Why this structure? Simple, efficient, and leverages what databases do best.
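For reference, here is a rough sketch of the schema in SQL. The column types, index names, and the foreign key are my assumptions, not taken from the original project; only the columns, the unique constraint, and the four indexes above come from the post:

```sql
CREATE TABLE index_tokens (
    id     INT AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(255) NOT NULL,
    weight INT NOT NULL,
    UNIQUE KEY uniq_token_name_weight (name, weight)
);

CREATE TABLE index_entries (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    token_id      INT NOT NULL,
    document_type INT NOT NULL,
    field_id      INT NOT NULL,
    document_id   INT NOT NULL,
    weight        INT NOT NULL,
    KEY idx_entries_document (document_type, document_id),
    KEY idx_entries_token (token_id),
    KEY idx_entries_field (document_type, field_id),
    KEY idx_entries_weight (weight),
    CONSTRAINT fk_entries_token FOREIGN KEY (token_id) REFERENCES index_tokens (id)
);
```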
Building Block 2: Tokenization
What is tokenization? It's breaking text into searchable pieces. The word "parser" becomes tokens like ["parser"], ["par", "pars", "parse", "parser"], or ["par", "ars", "rse", "ser"] depending on which tokenizer we use.
Why multiple tokenizers? Different strategies for different matching needs. One tokenizer for exact matches, another for partial matches, another for typos.
All tokenizers implement a simple interface:
```php
interface TokenizerInterface
{
    public function tokenize(string $text): array; // Returns array of Token objects
    public function getWeight(): int;              // Returns tokenizer weight
}
```
Simple contract, easy to extend.
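The Token value object itself isn't shown in this post, but based on how it's used below ($token->value, $token->weight), it's presumably something like this:

```php
final class Token
{
    public function __construct(
        public readonly string $value, // the token text, e.g. "parser"
        public readonly int $weight    // the weight of the tokenizer that produced it
    ) {}
}
```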
Word Tokenizer
This one is straightforward—it splits text into individual words. "parser" becomes just ["parser"]. Simple, but powerful for exact matches.
First, we normalize the text. Lowercase everything, remove special characters, normalize whitespace:
```php
class WordTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        // Normalize: lowercase, remove special chars
        $text = mb_strtolower(trim($text));
        $text = preg_replace('/[^a-z0-9]/', ' ', $text);
        $text = preg_replace('/\s+/', ' ', $text);
```
Next, we split into words and filter out short ones:
```php
        // Split into words, filter short ones
        $words = explode(' ', $text);
        $words = array_filter($words, fn($w) => mb_strlen($w) >= 2);
```
Why filter short words? Single-character words are usually too common to be useful. "a", "I", "x" don't help with search.
Finally, we return unique words as Token objects:
```php
        // Return as Token objects with weight
        return array_map(
            fn($word) => new Token($word, $this->weight),
            array_unique($words)
        );
    }
}
```
Weight: 20 (high priority for exact matches)
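For example (assuming the tokenizer is constructed with its default weight of 20; the constructor isn't shown in the excerpt above):

```php
$tokenizer = new WordTokenizer();

// Normalized to "building a php parser"; "a" is dropped (shorter than 2 characters)
$tokens = $tokenizer->tokenize('Building a PHP Parser');
// Result: [Token('building', 20), Token('php', 20), Token('parser', 20)]
```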
Prefix Tokenizer
This generates word prefixes. "parser" becomes ["par", "pars", "parse", "parser"] (with min length 4). This helps with partial matches and autocomplete-like behavior.
First, we extract words (same normalization as WordTokenizer):
```php
class PrefixTokenizer implements TokenizerInterface
{
    public function __construct(
        private int $minPrefixLength = 4,
        private int $weight = 5
    ) {}

    public function tokenize(string $text): array
    {
        // Normalize same as WordTokenizer
        $words = $this->extractWords($text);
```
```php
        $tokens = [];
        foreach ($words as $word) {
            $wordLength = mb_strlen($word);

            // Generate prefixes from min length to full word
            for ($i = $this->minPrefixLength; $i <= $wordLength; $i++) {
                $prefix = mb_substr($word, 0, $i);
                $tokens[$prefix] = true; // Use associative array for uniqueness
            }
        }
```
Why use an associative array? It ensures uniqueness. If "parser" appears twice in the text, we only want one "parser" token.
Finally, we convert the keys to Token objects:
```php
        return array_map(
            fn($prefix) => new Token($prefix, $this->weight),
            array_keys($tokens)
        );
    }
}
```
Weight: 5 (medium priority)
Why min length? Avoid too many tiny tokens. Prefixes shorter than 4 characters are usually too common to be useful.
N-Grams Tokenizer
This creates character sequences of a fixed length (I use 3). "parser" becomes ["par", "ars", "rse", "ser"]. This catches typos and partial word matches.
First, we extract words:
```php
class NGramsTokenizer implements TokenizerInterface
{
    public function __construct(
        private int $ngramLength = 3,
        private int $weight = 1
    ) {}

    public function tokenize(string $text): array
    {
        $words = $this->extractWords($text);
```
```php
        $tokens = [];
        foreach ($words as $word) {
            $wordLength = mb_strlen($word);

            // Sliding window of fixed length
            for ($i = 0; $i <= $wordLength - $this->ngramLength; $i++) {
                $ngram = mb_substr($word, $i, $this->ngramLength);
                $tokens[$ngram] = true;
            }
        }
```
The sliding window: for "parser" with length 3, we get:
- Position 0: "par"
- Position 1: "ars"
- Position 2: "rse"
- Position 3: "ser"
Why does this work? Even if someone types "parsr" (typo), we still get "par" and "ars" tokens, which match the correctly spelled "parser".
Finally, we convert to Token objects:
```php
        return array_map(
            fn($ngram) => new Token($ngram, $this->weight),
            array_keys($tokens)
        );
    }
}
```
Weight: 1 (low priority, but catches edge cases)
Why 3? Balance between coverage and noise. Too short and you get too many matches, too long and you miss typos.
Normalization
All tokenizers do the same normalization:
- Lowercase everything
- Remove special characters (keep only alphanumeric characters)
- Normalize whitespace (multiple spaces to single space)
This ensures consistent matching regardless of input format.
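The Prefix and N-Grams tokenizers above call an extractWords() helper that isn't shown in this post. Presumably it applies exactly this normalization plus the same word splitting as WordTokenizer; a minimal sketch under that assumption:

```php
private function extractWords(string $text): array
{
    // Lowercase, strip special characters, collapse whitespace
    $text = mb_strtolower(trim($text));
    $text = preg_replace('/[^a-z0-9]/', ' ', $text);
    $text = preg_replace('/\s+/', ' ', $text);

    // Split into words and drop single-character ones
    $words = explode(' ', $text);

    return array_filter($words, fn($w) => mb_strlen($w) >= 2);
}
```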
Building Block 3: The Weight System
We have three levels of weights working together:
- Field weights: Title vs content vs keywords
- Tokenizer weights: Word vs prefix vs n-gram (stored in index_tokens)
- Document weights: Stored in index_entries (calculated: field_weight × tokenizer_weight × ceil(sqrt(token_length)))
Final Weight Calculation
When indexing, we calculate the final weight like this:
```php
$finalWeight = $fieldWeight * $tokenizerWeight * ceil(sqrt($tokenLength));
```
For example:
- Title field: weight 10
- Word tokenizer: weight 20
- Token "parser": length 6
- Final weight: 10 × 20 × ceil(sqrt(6)) = 10 × 20 × 3 = 600
Why use ceil(sqrt())? Longer tokens are more specific, but we don't want weights to blow up with very long tokens. "parser" is more specific than "par", but a 100-character token shouldn't have 100x the weight. The square root function gives us diminishing returns—longer tokens still score higher, but not linearly. We use ceil() to round up to the nearest integer, keeping weights as whole numbers.
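To see the diminishing returns concretely:

- Length 3: ceil(sqrt(3)) = 2
- Length 6: ceil(sqrt(6)) = 3
- Length 12: ceil(sqrt(12)) = 4
- Length 25: ceil(sqrt(25)) = 5

A token roughly eight times longer only gets about 2.5× the length multiplier, not 8×.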
Tuning Weights
You can adjust weights for your use case:
- Increase field weights for titles if titles are most important
- Increase tokenizer weights for exact matches if you want to prioritize exact matches
- Adjust the token length function (ceil(sqrt), log, or linear) if you want longer tokens to matter more or less
You can see exactly how weights are calculated and adjust them as needed.
Building Block 4: The Indexing Service
The indexing service takes a document and stores all its tokens in the database.
The Interface
Documents that can be indexed implement IndexableDocumentInterface:
```php
interface IndexableDocumentInterface
{
    public function getDocumentId(): int;
    public function getDocumentType(): DocumentType;
    public function getIndexableFields(): IndexableFields;
}
```
To make a document searchable, you implement these three methods:
```php
class Post implements IndexableDocumentInterface
{
    public function getDocumentId(): int
    {
        return $this->id ?? 0;
    }

    public function getDocumentType(): DocumentType
    {
        return DocumentType::POST;
    }

    public function getIndexableFields(): IndexableFields
    {
        $fields = IndexableFields::create()
            ->addField(FieldId::TITLE, $this->title ?? '', 10)
            ->addField(FieldId::CONTENT, $this->content ?? '', 1);

        // Add keywords if present
        if (!empty($this->keywords)) {
            $fields->addField(FieldId::KEYWORDS, $this->keywords, 20);
        }

        return $fields;
    }
}
```
Three methods to implement:
- getDocumentType(): returns the document type enum
- getDocumentId(): returns the document ID
- getIndexableFields(): builds fields with weights using fluent API
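The IndexableFields builder isn't shown in the post. Judging from how it's used above (create(), addField(), getFields(), getWeights()), a minimal sketch might look like this; the internal storage is my assumption:

```php
final class IndexableFields
{
    /** @var array<int, string> field id value => content */
    private array $fields = [];

    /** @var array<int, int> field id value => field weight */
    private array $weights = [];

    public static function create(): self
    {
        return new self();
    }

    public function addField(FieldId $fieldId, string $content, int $weight): self
    {
        $this->fields[$fieldId->value] = $content;
        $this->weights[$fieldId->value] = $weight;

        return $this;
    }

    public function getFields(): array
    {
        return $this->fields;
    }

    public function getWeights(): array
    {
        return $this->weights;
    }
}
```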
You can index documents:
- On create/update (via event listeners)
- Via commands: app:index-document, app:reindex-documents
- Via cron (for batch reindexing)
How It Works
Here's the indexing process, step by step.
First, we get the document information:
```php
class SearchIndexingService
{
    public function indexDocument(IndexableDocumentInterface $document): void
    {
        // 1. Get document info
        $documentType = $document->getDocumentType();
        $documentId = $document->getDocumentId();
        $indexableFields = $document->getIndexableFields();
        $fields = $indexableFields->getFields();
        $weights = $indexableFields->getWeights();
```
The document provides its fields and weights via the IndexableFields builder.
Next, we remove the existing index for this document. This handles updates—if the document changed, we need to reindex it:
```php
        // 2. Remove existing index for this document
        $this->removeDocumentIndex($documentType, $documentId);

        // 3. Prepare batch insert data
        $insertData = [];
```
Why remove first? If we just add new tokens, we'll have duplicates. Better to start fresh.
Now, we process each field. For each field, we run all tokenizers:
```php
        // 4. Process each field
        foreach ($fields as $fieldIdValue => $content) {
            if (empty($content)) {
                continue;
            }

            $fieldId = FieldId::from($fieldIdValue);
            $fieldWeight = $weights[$fieldIdValue] ?? 0;

            // 5. Run all tokenizers on this field
            foreach ($this->tokenizers as $tokenizer) {
                $tokens = $tokenizer->tokenize($content);
```
For each tokenizer, we get tokens. Then, for each token, we find or create it in the database and calculate the final weight:
```php
                foreach ($tokens as $token) {
                    $tokenValue = $token->value;
                    $tokenWeight = $token->weight;

                    // 6. Find or create token in index_tokens
                    $tokenId = $this->findOrCreateToken($tokenValue, $tokenWeight);

                    // 7. Calculate final weight
                    $tokenLength = mb_strlen($tokenValue);
                    $finalWeight = (int) ($fieldWeight * $tokenWeight * ceil(sqrt($tokenLength)));

                    // 8. Add to batch insert
                    $insertData[] = [
                        'token_id' => $tokenId,
                        'document_type' => $documentType->value,
                        'field_id' => $fieldId->value,
                        'document_id' => $documentId,
                        'weight' => $finalWeight,
                    ];
                }
            }
        }
```
Why batch insert? Performance. Instead of inserting one row at a time, we collect all rows and insert them in one query.
Finally, we batch insert everything:
```php
        // 9. Batch insert for performance
        if (!empty($insertData)) {
            $this->batchInsertSearchDocuments($insertData);
        }
    }
```
The findOrCreateToken method is straightforward:
```php
    private function findOrCreateToken(string $name, int $weight): int
    {
        // Try to find existing token with same name and weight
        $sql = "SELECT id FROM index_tokens WHERE name = ? AND weight = ?";
        $result = $this->connection->executeQuery($sql, [$name, $weight])->fetchAssociative();

        if ($result) {
            return (int) $result['id'];
        }

        // Create new token
        $insertSql = "INSERT INTO index_tokens (name, weight) VALUES (?, ?)";
        $this->connection->executeStatement($insertSql, [$name, $weight]);

        return (int) $this->connection->lastInsertId();
    }
}
```
Why find or create? Tokens are shared across documents. If "parser" already exists with weight 20, we reuse it. No need to create duplicates.
The key points:
- We remove old index first (handles updates)
- We batch insert for performance (one query instead of many)
- We find or create tokens (avoids duplicates)
- We calculate final weight on the fly
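The batchInsertSearchDocuments() method isn't shown in the post. A minimal sketch, assuming Doctrine DBAL and a single multi-row INSERT:

```php
private function batchInsertSearchDocuments(array $insertData): void
{
    // One placeholder group per row: (?, ?, ?, ?, ?)
    $placeholders = implode(', ', array_fill(0, count($insertData), '(?, ?, ?, ?, ?)'));

    $sql = "INSERT INTO index_entries (token_id, document_type, field_id, document_id, weight)
            VALUES {$placeholders}";

    // Flatten rows into a single parameter list, in column order
    $params = [];
    foreach ($insertData as $row) {
        $params[] = $row['token_id'];
        $params[] = $row['document_type'];
        $params[] = $row['field_id'];
        $params[] = $row['document_id'];
        $params[] = $row['weight'];
    }

    $this->connection->executeStatement($sql, $params);
}
```

In practice you might also chunk very large batches to stay under the database's placeholder limit.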
Building Block 5: The Search Service
The search service takes a query string and finds relevant documents. It tokenizes the query the same way we tokenized documents during indexing, then matches those tokens against the indexed tokens in the database. The results are scored by relevance and returned as document IDs with scores.
How It Works
Here's the search process, step by step.
First, we tokenize the query using all tokenizers:
```php
class SearchService
{
    public function search(DocumentType $documentType, string $query, ?int $limit = null): array
    {
        // 1. Tokenize query using all tokenizers
        $queryTokens = $this->tokenizeQuery($query);

        if (empty($queryTokens)) {
            return [];
        }
```
If the query produces no tokens (e.g., only special characters), we return empty results.
Why Tokenize the Query Using the Same Tokenizers?
Different tokenizers produce different token values. If we index with one set and search with another, we'll miss matches.
Example:
- Indexing with PrefixTokenizer creates tokens: "par", "pars", "parse", "parser"
- Searching with only WordTokenizer creates token: "parser"
- We'll find "parser", but we won't find documents that only have "par" or "pars" tokens
- Result: Incomplete matches, missing relevant documents!
The solution: Use the same tokenizers for both indexing and searching. Same tokenization strategy = same token values = complete matches.
This is why the SearchService and SearchIndexingService both receive the same set of tokenizers.
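The tokenizeQuery() helper isn't shown in the post. Presumably it just runs every injected tokenizer over the query and merges the results; a sketch under that assumption:

```php
/** @param iterable<TokenizerInterface> $tokenizers */
public function __construct(
    private iterable $tokenizers // the same tagged tokenizers the indexing service gets
) {}

private function tokenizeQuery(string $query): array
{
    $tokens = [];

    // Run every tokenizer over the query and merge the resulting tokens
    foreach ($this->tokenizers as $tokenizer) {
        foreach ($tokenizer->tokenize($query) as $token) {
            $tokens[] = $token;
        }
    }

    return $tokens;
}
```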
Next, we extract unique token values. Multiple tokenizers might produce the same token value, so we deduplicate:
```php
        // 2. Extract unique token values
        $tokenValues = array_unique(array_map(
            fn($token) => $token instanceof Token ? $token->value : $token,
            $queryTokens
        ));
```
Why extract values? We search by token name, not by weight. We need the unique token names to search for.
Then, we sort tokens by length (longest first). This prioritizes specific matches:
```php
        // 3. Sort tokens (longest first - prioritize specific matches)
        usort($tokenValues, fn($a, $b) => mb_strlen($b) <=> mb_strlen($a));
```
Why sort? Longer tokens are more specific. "parser" is more specific than "par", so we want to search for "parser" first.
We also limit the token count to prevent DoS attacks with huge queries:
```php
        // 4. Limit token count (prevent DoS with huge queries)
        if (count($tokenValues) > 300) {
            $tokenValues = array_slice($tokenValues, 0, 300);
        }
```
Why limit? A malicious user could send a query that produces thousands of tokens, causing performance issues. We keep the longest 300 tokens (already sorted).
Now, we execute the optimized SQL query. The executeSearch() method builds the SQL query and executes it:
```php
        // 5. Execute optimized SQL query
        // (token count and a minimum token weight of 10 passed to match the signature below;
        //  see the fallback note at the end of this section)
        $results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 10);
```
Inside executeSearch(), we build the SQL query with parameter placeholders, execute it, filter low-scoring results, and convert to SearchResult objects:
```php
    private function executeSearch(
        DocumentType $documentType,
        array $tokenValues,
        int $tokenCount,
        ?int $limit,
        int $minTokenWeight
    ): array {
        // Build parameter placeholders for token values
        $tokenPlaceholders = implode(',', array_fill(0, $tokenCount, '?'));

        // Build the SQL query (shown in full in "The Scoring Algorithm" section below)
        $sql = "SELECT sd.document_id, ... FROM index_entries sd ...";

        // Build parameters array
        $params = [
            $documentType->value,   // document_type
            ...$tokenValues,        // token values for IN clause
            $documentType->value,   // for subquery
            ...$tokenValues,        // token values for subquery
            $minTokenWeight,        // minimum token weight
            // ... more parameters
        ];

        // Execute query with parameter binding
        $results = $this->connection->executeQuery($sql, $params)->fetchAllAssociative();

        // Filter out results with low normalized scores (below threshold)
        $results = array_filter($results, fn($r) => (float) $r['score'] >= 0.05);

        // Convert to SearchResult objects
        return array_map(
            fn($result) => new SearchResult(
                documentId: (int) $result['document_id'],
                score: (float) $result['score']
            ),
            $results
        );
    }
```
The SQL query does the heavy lifting: it finds matching documents, calculates scores, and sorts by relevance. We use raw SQL for performance and full control—we can optimize the query exactly how we need it.
The query uses JOINs to connect tokens and documents, subqueries for normalization, aggregation for scoring, and indexes on token name, document type, and weight. We use parameter binding for security (prevents SQL injection).
We'll see the full query in the next section.
The main search() method then returns the results:
```php
        // 6. Return results
        return $results;
    }
}
```
The Scoring Algorithm
The scoring algorithm balances multiple factors. Let's break it down step by step.
The base score is the sum of all matched token weights:
```sql
SELECT sd.document_id, SUM(sd.weight) as base_score
FROM index_entries sd
INNER JOIN index_tokens st ON sd.token_id = st.id
WHERE sd.document_type = ?
  AND st.name IN (?, ?, ?) -- Query tokens
GROUP BY sd.document_id
```
- sd.weight: from index_entries (field_weight × tokenizer_weight × ceil(sqrt(token_length)))
Why not multiply by st.weight? The tokenizer weight is already included in sd.weight during indexing. The st.weight from index_tokens is used only in the full SQL query's WHERE clause for filtering (ensures at least one token with weight >= minTokenWeight).
This gives us the raw score. But we need more than that.
We add a token diversity boost. Documents matching more unique tokens score higher:
```sql
(1.0 + LOG(1.0 + COUNT(DISTINCT sd.token_id))) * base_score
```
Why? A document matching 5 different tokens is more relevant than one matching the same token 5 times. The LOG function makes this boost logarithmic—matching 10 tokens doesn't give 10x the boost.
We also add an average weight quality boost. Documents with higher quality matches score higher:
```sql
(1.0 + LOG(1.0 + AVG(sd.weight))) * base_score
```
Why? A document with high-weight matches (e.g., title matches) is more relevant than one with low-weight matches (e.g., content matches). Again, LOG makes this logarithmic.
We apply a document length penalty. Prevents long documents from dominating:
```sql
base_score / (1.0 + LOG(1.0 + doc_token_count.token_count))
```
Why? A 1000-word document doesn't automatically beat a 100-word document just because it has more tokens. The LOG function makes this penalty logarithmic—a 10x longer document doesn't get 10x the penalty.
Finally, we normalize by dividing by the maximum score:
```sql
score / GREATEST(1.0, max_score) as normalized_score
```
This gives us a 0-1 range, making scores comparable across different queries.
The full formula looks like this:
```sql
SELECT
    sd.document_id,
    (
        SUM(sd.weight) *                                  -- Base score
        (1.0 + LOG(1.0 + COUNT(DISTINCT sd.token_id))) *  -- Token diversity boost
        (1.0 + LOG(1.0 + AVG(sd.weight))) /               -- Average weight quality boost
        (1.0 + LOG(1.0 + doc_token_count.token_count))    -- Document length penalty
    ) / GREATEST(1.0, max_score) as score                 -- Normalization
FROM index_entries sd
INNER JOIN index_tokens st ON sd.token_id = st.id
INNER JOIN (
    SELECT document_id, COUNT(*) as token_count
    FROM index_entries
    WHERE document_type = ?
    GROUP BY document_id
) doc_token_count ON sd.document_id = doc_token_count.document_id
WHERE sd.document_type = ?
  AND st.name IN (?, ?, ?) -- Query tokens
  AND sd.document_id IN (
      SELECT DISTINCT document_id
      FROM index_entries sd2
      INNER JOIN index_tokens st2 ON sd2.token_id = st2.id
      WHERE sd2.document_type = ?
        AND st2.name IN (?, ?, ?)
        AND st2.weight >= ? -- Ensure at least one token with meaningful weight
  )
GROUP BY sd.document_id
ORDER BY score DESC
LIMIT ?
```
Why the subquery with st2.weight >= ?? This ensures we only include documents that have at least one matching token with a meaningful tokenizer weight. Without this filter, a document matching only low-priority tokens (like n-grams with weight 1) would be included even if it doesn't match any high-priority tokens (like words with weight 20). This subquery filters out documents that only match noise. We want documents that match at least one meaningful token.
Why this formula? It balances multiple factors for relevance. Exact matches score high, but so do documents matching many tokens. Long documents don't dominate, but high-quality matches do.
If the first pass (minimum token weight 10) returns no results, we retry with a minimum weight of 1 as a fallback for edge cases.
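In code, that fallback might look something like this at the call site (a sketch; the parameters are assumed from the executeSearch() signature above):

```php
// First pass: require at least one matching token with tokenizer weight >= 10
$results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 10);

if (empty($results)) {
    // Fallback: accept low-priority tokens too (e.g., n-grams with weight 1)
    $results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 1);
}
```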
Converting IDs to Documents
The search service returns SearchResult objects with document IDs and scores:
```php
class SearchResult
{
    public function __construct(
        public readonly int $documentId,
        public readonly float $score
    ) {}
}
```
But we need actual documents, not just IDs. We convert them using repositories:
```php
// Perform search
$searchResults = $this->searchService->search(
    DocumentType::POST,
    $query,
    $limit
);

// Get document IDs from search results (preserving order)
$documentIds = array_map(fn($result) => $result->documentId, $searchResults);

// Get documents by IDs (preserving order from search results)
$documents = $this->documentRepository->findByIds($documentIds);
```
Why preserve order? The search results are sorted by relevance score. We want to keep that order when displaying results.
The repository method handles the conversion:
```php
public function findByIds(array $ids): array
{
    if (empty($ids)) {
        return [];
    }

    return $this->createQueryBuilder('d')
        ->where('d.id IN (:ids)')
        ->setParameter('ids', $ids)
        ->orderBy('FIELD(d.id, :ids)') // Preserve order from IDs array
        ->getQuery()
        ->getResult();
}
```
The FIELD() function preserves the order from the IDs array, so documents appear in the same order as search results.
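One caveat: FIELD() is MySQL-specific, and Doctrine only understands it if you register it as a custom DQL function. If that's not an option, you could fetch the documents unordered and reorder them in PHP instead. A sketch (getId() is assumed to be the entity's id getter, not something from the original post):

```php
$documents = $this->createQueryBuilder('d')
    ->where('d.id IN (:ids)')
    ->setParameter('ids', $ids)
    ->getQuery()
    ->getResult();

// Reorder to match the relevance ranking from the search results
$position = array_flip($ids); // document id => rank
usort($documents, fn($a, $b) => $position[$a->getId()] <=> $position[$b->getId()]);

return $documents;
```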
The Result: What You Get
What you get is a search engine that:
- Finds relevant results quickly (leverages database indexes)
- Handles typos (n-grams catch partial matches)
- Handles partial words (prefix tokenizer)
- Prioritizes exact matches (word tokenizer has highest weight)
- Works with your existing database (no external services)
- Is easy to understand and debug (everything is transparent)
- Gives you full control over behavior (adjust weights, add tokenizers, modify scoring)
Extending the System
Want to add a new tokenizer? Implement TokenizerInterface:
```php
class StemmingTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        // Your stemming logic here
        // Return array of Token objects
    }

    public function getWeight(): int
    {
        return 15; // Your weight
    }
}
```
Register it in your services configuration, and it's automatically used for both indexing and searching.
Want to add a new document type? Implement IndexableDocumentInterface:
```php
class Comment implements IndexableDocumentInterface
{
    // getDocumentId() and getDocumentType() are omitted for brevity;
    // implement them the same way as in the Post example above.

    public function getIndexableFields(): IndexableFields
    {
        return IndexableFields::create()
            ->addField(FieldId::CONTENT, $this->content ?? '', 5);
    }
}
```
Want to adjust weights? Change the configuration. Want to modify scoring? Edit the SQL query. Everything is under your control.
Conclusion
So there you have it. A simple search engine that actually works. It's not fancy, and it doesn't need a lot of infrastructure, but for most use cases, it's perfect.
The key insight? Sometimes the best solution is the one you understand. No magic, no black boxes, just straightforward code that does what it says.
You own it, you control it, you can debug it. And that's worth a lot.