Why Build Your Own?
Look, I know what you're thinking. "Why not just use Elasticsearch?" or "What about Algolia?" Those are valid options, but they come with complexity. You need to learn their APIs, manage their infrastructure, and deal with their quirks.
Sometimes you just want something that:
- Works with your existing database
- Doesn't require external services
- Is easy to understand and debug
- Actually finds relevant results
That's what I built. A search engine that uses your existing database, respects your current architecture, and gives you full control over how it works.
The Core Idea
The concept is simple: tokenize everything, store it, then match tokens when searching.
Here's how it works:
- Indexing: When you add or update content, we split it into tokens (words, prefixes, n-grams) and store them with weights
- Searching: When someone searches, we tokenize their query the same way, find matching tokens, and score the results
- Scoring: We use the stored weights to calculate relevance scores
The magic is in the tokenization and weighting. Let me show you what I mean.
Building Block 1: The Database Schema
We need two simple tables: index_tokens and index_entries.
index_tokens
This table stores all unique tokens with their tokenizer weights. Each token name can have multiple records with different weights—one per tokenizer.
```
// index_tokens table structure
id | name   | weight
---|--------|-------
1  | parser | 20   // From WordTokenizer
2  | parser | 5    // From PrefixTokenizer
3  | parser | 1    // From NGramsTokenizer
4  | parser | 10   // From SingularTokenizer
```
Why store separate tokens per weight? Different tokenizers produce the same token with different weights. For example, "parser" from WordTokenizer has weight 20, but "parser" from PrefixTokenizer has weight 5. We need separate records to properly score matches.
The unique constraint is on (name, weight), so the same token name can exist multiple times with different weights.
index_entries
This table links tokens to documents with field-specific weights.
```
// index_entries table structure
id | token_id | document_type | field_id | document_id | weight
---|----------|---------------|----------|-------------|-------
1  | 1        | 1             | 1        | 42          | 2000
2  | 2        | 1             | 1        | 42          | 500
```
The weight here is the final calculated weight: field_weight × tokenizer_weight × ceil(sqrt(token_length)). This encodes everything we need for scoring. We will talk about scoring later in the post.
We add indexes on:
- (document_type, document_id) - for fast document lookups
- token_id - for fast token lookups
- (document_type, field_id) - for field-specific queries
- weight - for filtering by weight
Why this structure? Simple, efficient, and leverages what databases do best.
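For reference, here is a rough sketch of the schema in SQL. The column types, index names, and the foreign key are my assumptions, not taken from the original project; only the columns, the unique constraint, and the four indexes above come from the post:

```sql
CREATE TABLE index_tokens (
    id     INT AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(255) NOT NULL,
    weight INT NOT NULL,
    UNIQUE KEY uniq_token_name_weight (name, weight)
);

CREATE TABLE index_entries (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    token_id      INT NOT NULL,
    document_type INT NOT NULL,
    field_id      INT NOT NULL,
    document_id   INT NOT NULL,
    weight        INT NOT NULL,
    KEY idx_entries_document (document_type, document_id),
    KEY idx_entries_token (token_id),
    KEY idx_entries_field (document_type, field_id),
    KEY idx_entries_weight (weight),
    CONSTRAINT fk_entries_token FOREIGN KEY (token_id) REFERENCES index_tokens (id)
);
```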
Building Block 2: Tokenization
What is tokenization? It's breaking text into searchable pieces. The word "parser" becomes tokens like ["parser"], ["par", "pars", "parse", "parser"], or ["par", "ars", "rse", "ser"] depending on which tokenizer we use.
Why multiple tokenizers? Different strategies for different matching needs. One tokenizer for exact matches, another for partial matches, another for typos.
All tokenizers implement a simple interface:
```php
interface TokenizerInterface
{
    public function tokenize(string $text): array; // Returns array of Token objects
    public function getWeight(): int;              // Returns tokenizer weight
}
```
Simple contract, easy to extend.
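The Token value object itself isn't shown in this post, but based on how it's used below ($token->value, $token->weight), it's presumably something like this:

```php
final class Token
{
    public function __construct(
        public readonly string $value, // the token text, e.g. "parser"
        public readonly int $weight    // the weight of the tokenizer that produced it
    ) {}
}
```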
Word Tokenizer
This one is straightforward—it splits text into individual words. "parser" becomes just ["parser"]. Simple, but powerful for exact matches.
First, we normalize the text. Lowercase everything, remove special characters, normalize whitespace:
```php
class WordTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        // Normalize: lowercase, remove special chars
        $text = mb_strtolower(trim($text));
        $text = preg_replace('/[^a-z0-9]/', ' ', $text);
        $text = preg_replace('/\s+/', ' ', $text);
```
Next, we split into words and filter out short ones:
```php
        // Split into words, filter short ones
        $words = explode(' ', $text);
        $words = array_filter($words, fn($w) => mb_strlen($w) >= 2);
```
Why filter short words? Single-character words are usually too common to be useful. "a", "I", "x" don't help with search.
Finally, we return unique words as Token objects:
```php
        // Return as Token objects with weight
        return array_map(
            fn($word) => new Token($word, $this->weight),
            array_unique($words)
        );
    }
}
```
Weight: 20 (high priority for exact matches)
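For example (assuming the tokenizer is constructed with its default weight of 20; the constructor isn't shown in the excerpt above):

```php
$tokenizer = new WordTokenizer();

// Normalized to "building a php parser"; "a" is dropped (shorter than 2 characters)
$tokens = $tokenizer->tokenize('Building a PHP Parser');
// Result: [Token('building', 20), Token('php', 20), Token('parser', 20)]
```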
Prefix Tokenizer
This generates word prefixes. "parser" becomes ["par", "pars", "parse", "parser"] (with min length 4). This helps with partial matches and autocomplete-like behavior.
First, we extract words (same normalization as WordTokenizer):
```php
class PrefixTokenizer implements TokenizerInterface
{
    public function __construct(
        private int $minPrefixLength = 4,
        private int $weight = 5
    ) {}

    public function tokenize(string $text): array
    {
        // Normalize same as WordTokenizer
        $words = $this->extractWords($text);
```
```php
        $tokens = [];
        foreach ($words as $word) {
            $wordLength = mb_strlen($word);

            // Generate prefixes from min length to full word
            for ($i = $this->minPrefixLength; $i <= $wordLength; $i++) {
                $prefix = mb_substr($word, 0, $i);
                $tokens[$prefix] = true; // Use associative array for uniqueness
            }
        }
```
Why use an associative array? It ensures uniqueness. If "parser" appears twice in the text, we only want one "parser" token.
Finally, we convert the keys to Token objects:
```php
        return array_map(
            fn($prefix) => new Token($prefix, $this->weight),
            array_keys($tokens)
        );
    }
}
```
Weight: 5 (medium priority)
Why min length? Avoid too many tiny tokens. Prefixes shorter than 4 characters are usually too common to be useful.
N-Grams Tokenizer
This creates character sequences of a fixed length (I use 3). "parser" becomes ["par", "ars", "rse", "ser"]. This catches typos and partial word matches.
First, we extract words:
```php
class NGramsTokenizer implements TokenizerInterface
{
    public function __construct(
        private int $ngramLength = 3,
        private int $weight = 1
    ) {}

    public function tokenize(string $text): array
    {
        $words = $this->extractWords($text);
```
```php
        $tokens = [];
        foreach ($words as $word) {
            $wordLength = mb_strlen($word);

            // Sliding window of fixed length
            for ($i = 0; $i <= $wordLength - $this->ngramLength; $i++) {
                $ngram = mb_substr($word, $i, $this->ngramLength);
                $tokens[$ngram] = true;
            }
        }
```
The sliding window: for "parser" with length 3, we get:
- Position 0: "par"
- Position 1: "ars"
- Position 2: "rse"
- Position 3: "ser"
Why does this work? Even if someone types "parsr" (typo), we still get "par" and "ars" tokens, which match the correctly spelled "parser".
Finally, we convert to Token objects:
```php
        return array_map(
            fn($ngram) => new Token($ngram, $this->weight),
            array_keys($tokens)
        );
    }
}
```
Weight: 1 (low priority, but catches edge cases)
Why 3? Balance between coverage and noise. Too short and you get too many matches, too long and you miss typos.
Normalization
All tokenizers do the same normalization:
- Lowercase everything
- Remove special characters (keep only alphanumeric characters)
- Normalize whitespace (multiple spaces to single space)
This ensures consistent matching regardless of input format.
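The Prefix and N-Grams tokenizers above call an extractWords() helper that isn't shown in this post. Presumably it applies exactly this normalization plus the same word splitting as WordTokenizer; a minimal sketch under that assumption:

```php
private function extractWords(string $text): array
{
    // Lowercase, strip special characters, collapse whitespace
    $text = mb_strtolower(trim($text));
    $text = preg_replace('/[^a-z0-9]/', ' ', $text);
    $text = preg_replace('/\s+/', ' ', $text);

    // Split into words and drop single-character ones
    $words = explode(' ', $text);

    return array_filter($words, fn($w) => mb_strlen($w) >= 2);
}
```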
Building Block 3: The Weight System
We have three levels of weights working together:
- Field weights: Title vs content vs keywords
- Tokenizer weights: Word vs prefix vs n-gram (stored in index_tokens)
- Document weights: Stored in index_entries (calculated: field_weight × tokenizer_weight × ceil(sqrt(token_length)))
Final Weight Calculation
When indexing, we calculate the final weight like this:
```php
$finalWeight = $fieldWeight * $tokenizerWeight * ceil(sqrt($tokenLength));
```
For example:
- Title field: weight 10
- Word tokenizer: weight 20
- Token "parser": length 6
- Final weight: 10 × 20 × ceil(sqrt(6)) = 10 × 20 × 3 = 600
Why use ceil(sqrt())? Longer tokens are more specific, but we don't want weights to blow up with very long tokens. "parser" is more specific than "par", but a 100-character token shouldn't have 100x the weight. The square root function gives us diminishing returns—longer tokens still score higher, but not linearly. We use ceil() to round up to the nearest integer, keeping weights as whole numbers.
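To see the diminishing returns concretely:

- Length 3: ceil(sqrt(3)) = 2
- Length 6: ceil(sqrt(6)) = 3
- Length 12: ceil(sqrt(12)) = 4
- Length 25: ceil(sqrt(25)) = 5

A token roughly eight times longer only gets about 2.5× the length multiplier, not 8×.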
Tuning Weights
You can adjust weights for your use case:
- Increase field weights for titles if titles are most important
- Increase tokenizer weights for exact matches if you want to prioritize exact matches
- Adjust the token length function (ceil(sqrt), log, or linear) if you want longer tokens to matter more or less
You can see exactly how weights are calculated and adjust them as needed.
Building Block 4: The Indexing Service
The indexing service takes a document and stores all its tokens in the database.
The Interface
Documents that can be indexed implement IndexableDocumentInterface:
```php
interface IndexableDocumentInterface
{
    public function getDocumentId(): int;
    public function getDocumentType(): DocumentType;
    public function getIndexableFields(): IndexableFields;
}
```
To make a document searchable, you implement these three methods:
```php
class Post implements IndexableDocumentInterface
{
    public function getDocumentId(): int
    {
        return $this->id ?? 0;
    }

    public function getDocumentType(): DocumentType
    {
        return DocumentType::POST;
    }

    public function getIndexableFields(): IndexableFields
    {
        $fields = IndexableFields::create()
            ->addField(FieldId::TITLE, $this->title ?? '', 10)
            ->addField(FieldId::CONTENT, $this->content ?? '', 1);

        // Add keywords if present
        if (!empty($this->keywords)) {
            $fields->addField(FieldId::KEYWORDS, $this->keywords, 20);
        }

        return $fields;
    }
}
```
Three methods to implement:
- getDocumentType(): returns the document type enum
- getDocumentId(): returns the document ID
- getIndexableFields(): builds fields with weights using fluent API
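The IndexableFields builder isn't shown in the post. Judging from how it's used above (create(), addField(), getFields(), getWeights()), a minimal sketch might look like this; the internal storage is my assumption:

```php
final class IndexableFields
{
    /** @var array<int, string> field id value => content */
    private array $fields = [];

    /** @var array<int, int> field id value => field weight */
    private array $weights = [];

    public static function create(): self
    {
        return new self();
    }

    public function addField(FieldId $fieldId, string $content, int $weight): self
    {
        $this->fields[$fieldId->value] = $content;
        $this->weights[$fieldId->value] = $weight;

        return $this;
    }

    public function getFields(): array
    {
        return $this->fields;
    }

    public function getWeights(): array
    {
        return $this->weights;
    }
}
```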
You can index documents:
- On create/update (via event listeners)
- Via commands: app:index-document, app:reindex-documents
- Via cron (for batch reindexing)
How It Works
Here's the indexing process, step by step.
First, we get the document information:
```php
class SearchIndexingService
{
    public function indexDocument(IndexableDocumentInterface $document): void
    {
        // 1. Get document info
        $documentType = $document->getDocumentType();
        $documentId = $document->getDocumentId();
        $indexableFields = $document->getIndexableFields();
        $fields = $indexableFields->getFields();
        $weights = $indexableFields->getWeights();
```
The document provides its fields and weights via the IndexableFields builder.
Next, we remove the existing index for this document. This handles updates—if the document changed, we need to reindex it:
```php
        // 2. Remove existing index for this document
        $this->removeDocumentIndex($documentType, $documentId);

        // 3. Prepare batch insert data
        $insertData = [];
```
Why remove first? If we just add new tokens, we'll have duplicates. Better to start fresh.
Now, we process each field. For each field, we run all tokenizers:
```php
        // 4. Process each field
        foreach ($fields as $fieldIdValue => $content) {
            if (empty($content)) {
                continue;
            }

            $fieldId = FieldId::from($fieldIdValue);
            $fieldWeight = $weights[$fieldIdValue] ?? 0;

            // 5. Run all tokenizers on this field
            foreach ($this->tokenizers as $tokenizer) {
                $tokens = $tokenizer->tokenize($content);
```
For each tokenizer, we get tokens. Then, for each token, we find or create it in the database and calculate the final weight:
```php
                foreach ($tokens as $token) {
                    $tokenValue = $token->value;
                    $tokenWeight = $token->weight;

                    // 6. Find or create token in index_tokens
                    $tokenId = $this->findOrCreateToken($tokenValue, $tokenWeight);

                    // 7. Calculate final weight
                    $tokenLength = mb_strlen($tokenValue);
                    $finalWeight = (int) ($fieldWeight * $tokenWeight * ceil(sqrt($tokenLength)));

                    // 8. Add to batch insert
                    $insertData[] = [
                        'token_id' => $tokenId,
                        'document_type' => $documentType->value,
                        'field_id' => $fieldId->value,
                        'document_id' => $documentId,
                        'weight' => $finalWeight,
                    ];
                }
            }
        }
```
Why batch insert? Performance. Instead of inserting one row at a time, we collect all rows and insert them in one query.
Finally, we batch insert everything:
```php
        // 9. Batch insert for performance
        if (!empty($insertData)) {
            $this->batchInsertSearchDocuments($insertData);
        }
    }
```
The findOrCreateToken method is straightforward:
```php
    private function findOrCreateToken(string $name, int $weight): int
    {
        // Try to find existing token with same name and weight
        $sql = "SELECT id FROM index_tokens WHERE name = ? AND weight = ?";
        $result = $this->connection->executeQuery($sql, [$name, $weight])->fetchAssociative();

        if ($result) {
            return (int) $result['id'];
        }

        // Create new token
        $insertSql = "INSERT INTO index_tokens (name, weight) VALUES (?, ?)";
        $this->connection->executeStatement($insertSql, [$name, $weight]);

        return (int) $this->connection->lastInsertId();
    }
}
```
Why find or create? Tokens are shared across documents. If "parser" already exists with weight 20, we reuse it. No need to create duplicates.
The key points:
- We remove old index first (handles updates)
- We batch insert for performance (one query instead of many)
- We find or create tokens (avoids duplicates)
- We calculate final weight on the fly
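The batchInsertSearchDocuments() method isn't shown in the post. A minimal sketch, assuming Doctrine DBAL and a single multi-row INSERT:

```php
private function batchInsertSearchDocuments(array $insertData): void
{
    // One placeholder group per row: (?, ?, ?, ?, ?)
    $placeholders = implode(', ', array_fill(0, count($insertData), '(?, ?, ?, ?, ?)'));

    $sql = "INSERT INTO index_entries (token_id, document_type, field_id, document_id, weight)
            VALUES {$placeholders}";

    // Flatten rows into a single parameter list, in column order
    $params = [];
    foreach ($insertData as $row) {
        $params[] = $row['token_id'];
        $params[] = $row['document_type'];
        $params[] = $row['field_id'];
        $params[] = $row['document_id'];
        $params[] = $row['weight'];
    }

    $this->connection->executeStatement($sql, $params);
}
```

In practice you might also chunk very large batches to stay under the database's placeholder limit.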
Building Block 5: The Search Service
The search service takes a query string and finds relevant documents. It tokenizes the query the same way we tokenized documents during indexing, then matches those tokens against the indexed tokens in the database. The results are scored by relevance and returned as document IDs with scores.
How It Works
Here's the search process, step by step.
First, we tokenize the query using all tokenizers:
```php
class SearchService
{
    public function search(DocumentType $documentType, string $query, ?int $limit = null): array
    {
        // 1. Tokenize query using all tokenizers
        $queryTokens = $this->tokenizeQuery($query);

        if (empty($queryTokens)) {
            return [];
        }
```
If the query produces no tokens (e.g., only special characters), we return empty results.
Why Tokenize the Query Using the Same Tokenizers?
Different tokenizers produce different token values. If we index with one set and search with another, we'll miss matches.
Example:
- Indexing with PrefixTokenizer creates tokens: "par", "pars", "parse", "parser"
- Searching with only WordTokenizer creates token: "parser"
- We'll find "parser", but we won't find documents that only have "par" or "pars" tokens
- Result: Incomplete matches, missing relevant documents!
The solution: Use the same tokenizers for both indexing and searching. Same tokenization strategy = same token values = complete matches.
This is why the SearchService and SearchIndexingService both receive the same set of tokenizers.
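The tokenizeQuery() helper isn't shown in the post. Presumably it just runs every injected tokenizer over the query and merges the results; a sketch under that assumption:

```php
/** @param iterable<TokenizerInterface> $tokenizers */
public function __construct(
    private iterable $tokenizers // the same tagged tokenizers the indexing service gets
) {}

private function tokenizeQuery(string $query): array
{
    $tokens = [];

    // Run every tokenizer over the query and merge the resulting tokens
    foreach ($this->tokenizers as $tokenizer) {
        foreach ($tokenizer->tokenize($query) as $token) {
            $tokens[] = $token;
        }
    }

    return $tokens;
}
```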
Next, we extract unique token values. Multiple tokenizers might produce the same token value, so we deduplicate:
```php
        // 2. Extract unique token values
        $tokenValues = array_unique(array_map(
            fn($token) => $token instanceof Token ? $token->value : $token,
            $queryTokens
        ));
```
Why extract values? We search by token name, not by weight. We need the unique token names to search for.
Then, we sort tokens by length (longest first). This prioritizes specific matches:
```php
        // 3. Sort tokens (longest first - prioritize specific matches)
        usort($tokenValues, fn($a, $b) => mb_strlen($b) <=> mb_strlen($a));
```
Why sort? Longer tokens are more specific. "parser" is more specific than "par", so we want to search for "parser" first.
We also limit the token count to prevent DoS attacks with huge queries:
```php
        // 4. Limit token count (prevent DoS with huge queries)
        if (count($tokenValues) > 300) {
            $tokenValues = array_slice($tokenValues, 0, 300);
        }
```
Why limit? A malicious user could send a query that produces thousands of tokens, causing performance issues. We keep the longest 300 tokens (already sorted).
Now, we execute the optimized SQL query. The executeSearch() method builds the SQL query and executes it:
```php
        // 5. Execute optimized SQL query
        // (token count and a minimum token weight of 10 passed to match the signature below;
        //  see the fallback note at the end of this section)
        $results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 10);
```
Inside executeSearch(), we build the SQL query with parameter placeholders, execute it, filter low-scoring results, and convert to SearchResult objects:
```php
    private function executeSearch(
        DocumentType $documentType,
        array $tokenValues,
        int $tokenCount,
        ?int $limit,
        int $minTokenWeight
    ): array {
        // Build parameter placeholders for token values
        $tokenPlaceholders = implode(',', array_fill(0, $tokenCount, '?'));

        // Build the SQL query (shown in full in "The Scoring Algorithm" section below)
        $sql = "SELECT sd.document_id, ... FROM index_entries sd ...";

        // Build parameters array
        $params = [
            $documentType->value,   // document_type
            ...$tokenValues,        // token values for IN clause
            $documentType->value,   // for subquery
            ...$tokenValues,        // token values for subquery
            $minTokenWeight,        // minimum token weight
            // ... more parameters
        ];

        // Execute query with parameter binding
        $results = $this->connection->executeQuery($sql, $params)->fetchAllAssociative();

        // Filter out results with low normalized scores (below threshold)
        $results = array_filter($results, fn($r) => (float) $r['score'] >= 0.05);

        // Convert to SearchResult objects
        return array_map(
            fn($result) => new SearchResult(
                documentId: (int) $result['document_id'],
                score: (float) $result['score']
            ),
            $results
        );
    }
```
The SQL query does the heavy lifting: it finds matching documents, calculates scores, and sorts by relevance. We use raw SQL for performance and full control—we can optimize the query exactly how we need it.
The query uses JOINs to connect tokens and documents, subqueries for normalization, aggregation for scoring, and indexes on token name, document type, and weight. We use parameter binding for security (prevents SQL injection).
We'll see the full query in the next section.
The main search() method then returns the results:
```php
        // 6. Return results
        return $results;
    }
}
```
The Scoring Algorithm
The scoring algorithm balances multiple factors. Let's break it down step by step.
The base score is the sum of all matched token weights:
```sql
SELECT sd.document_id, SUM(sd.weight) as base_score
FROM index_entries sd
INNER JOIN index_tokens st ON sd.token_id = st.id
WHERE sd.document_type = ?
  AND st.name IN (?, ?, ?) -- Query tokens
GROUP BY sd.document_id
```
- sd.weight: from index_entries (field_weight × tokenizer_weight × ceil(sqrt(token_length)))
Why not multiply by st.weight? The tokenizer weight is already included in sd.weight during indexing. The st.weight from index_tokens is used only in the full SQL query's WHERE clause for filtering (ensures at least one token with weight >= minTokenWeight).
This gives us the raw score. But we need more than that.
We add a token diversity boost. Documents matching more unique tokens score higher:
```sql
(1.0 + LOG(1.0 + COUNT(DISTINCT sd.token_id))) * base_score
```
Why? A document matching 5 different tokens is more relevant than one matching the same token 5 times. The LOG function makes this boost logarithmic—matching 10 tokens doesn't give 10x the boost.
We also add an average weight quality boost. Documents with higher quality matches score higher:
```sql
(1.0 + LOG(1.0 + AVG(sd.weight))) * base_score
```
Why? A document with high-weight matches (e.g., title matches) is more relevant than one with low-weight matches (e.g., content matches). Again, LOG makes this logarithmic.
We apply a document length penalty. Prevents long documents from dominating:
```sql
base_score / (1.0 + LOG(1.0 + doc_token_count.token_count))
```
Why? A 1000-word document doesn't automatically beat a 100-word document just because it has more tokens. The LOG function makes this penalty logarithmic—a 10x longer document doesn't get 10x the penalty.
Finally, we normalize by dividing by the maximum score:
```sql
score / GREATEST(1.0, max_score) as normalized_score
```
This gives us a 0-1 range, making scores comparable across different queries.
The full formula looks like this:
```sql
SELECT
    sd.document_id,
    (
        SUM(sd.weight) *                                  -- Base score
        (1.0 + LOG(1.0 + COUNT(DISTINCT sd.token_id))) *  -- Token diversity boost
        (1.0 + LOG(1.0 + AVG(sd.weight))) /               -- Average weight quality boost
        (1.0 + LOG(1.0 + doc_token_count.token_count))    -- Document length penalty
    ) / GREATEST(1.0, max_score) as score                 -- Normalization
FROM index_entries sd
INNER JOIN index_tokens st ON sd.token_id = st.id
INNER JOIN (
    SELECT document_id, COUNT(*) as token_count
    FROM index_entries
    WHERE document_type = ?
    GROUP BY document_id
) doc_token_count ON sd.document_id = doc_token_count.document_id
WHERE sd.document_type = ?
  AND st.name IN (?, ?, ?) -- Query tokens
  AND sd.document_id IN (
      SELECT DISTINCT document_id
      FROM index_entries sd2
      INNER JOIN index_tokens st2 ON sd2.token_id = st2.id
      WHERE sd2.document_type = ?
        AND st2.name IN (?, ?, ?)
        AND st2.weight >= ? -- Ensure at least one token with meaningful weight
  )
GROUP BY sd.document_id
ORDER BY score DESC
LIMIT ?
```
Why the subquery with st2.weight >= ?? This ensures we only include documents that have at least one matching token with a meaningful tokenizer weight. Without this filter, a document matching only low-priority tokens (like n-grams with weight 1) would be included even if it doesn't match any high-priority tokens (like words with weight 20). This subquery filters out documents that only match noise. We want documents that match at least one meaningful token.
Why this formula? It balances multiple factors for relevance. Exact matches score high, but so do documents matching many tokens. Long documents don't dominate, but high-quality matches do.
If the first pass (minimum token weight 10) returns no results, we retry with a minimum weight of 1 as a fallback for edge cases.
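In code, that fallback might look something like this at the call site (a sketch; the parameters are assumed from the executeSearch() signature above):

```php
// First pass: require at least one matching token with tokenizer weight >= 10
$results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 10);

if (empty($results)) {
    // Fallback: accept low-priority tokens too (e.g., n-grams with weight 1)
    $results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 1);
}
```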
Converting IDs to Documents
The search service returns SearchResult objects with document IDs and scores:
```php
class SearchResult
{
    public function __construct(
        public readonly int $documentId,
        public readonly float $score
    ) {}
}
```
But we need actual documents, not just IDs. We convert them using repositories:
```php
// Perform search
$searchResults = $this->searchService->search(
    DocumentType::POST,
    $query,
    $limit
);

// Get document IDs from search results (preserving order)
$documentIds = array_map(fn($result) => $result->documentId, $searchResults);

// Get documents by IDs (preserving order from search results)
$documents = $this->documentRepository->findByIds($documentIds);
```
Why preserve order? The search results are sorted by relevance score. We want to keep that order when displaying results.
The repository method handles the conversion:
```php
public function findByIds(array $ids): array
{
    if (empty($ids)) {
        return [];
    }

    return $this->createQueryBuilder('d')
        ->where('d.id IN (:ids)')
        ->setParameter('ids', $ids)
        ->orderBy('FIELD(d.id, :ids)') // Preserve order from IDs array
        ->getQuery()
        ->getResult();
}
```
The FIELD() function preserves the order from the IDs array, so documents appear in the same order as search results.
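One caveat: FIELD() is MySQL-specific, and Doctrine only understands it if you register it as a custom DQL function. If that's not an option, you could fetch the documents unordered and reorder them in PHP instead. A sketch (getId() is assumed to be the entity's id getter, not something from the original post):

```php
$documents = $this->createQueryBuilder('d')
    ->where('d.id IN (:ids)')
    ->setParameter('ids', $ids)
    ->getQuery()
    ->getResult();

// Reorder to match the relevance ranking from the search results
$position = array_flip($ids); // document id => rank
usort($documents, fn($a, $b) => $position[$a->getId()] <=> $position[$b->getId()]);

return $documents;
```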
The Result: What You Get
What you get is a search engine that:
- Finds relevant results quickly (leverages database indexes)
- Handles typos (n-grams catch partial matches)
- Handles partial words (prefix tokenizer)
- Prioritizes exact matches (word tokenizer has highest weight)
- Works with your existing database (no external services)
- Is easy to understand and debug (everything is transparent)
- Gives you full control over behavior (adjust weights, add tokenizers, modify scoring)
Extending the System
Want to add a new tokenizer? Implement TokenizerInterface:
```php
class StemmingTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        // Your stemming logic here
        // Return array of Token objects
    }

    public function getWeight(): int
    {
        return 15; // Your weight
    }
}
```
Register it in your services configuration, and it's automatically used for both indexing and searching.
Want to add a new document type? Implement IndexableDocumentInterface:
```php
class Comment implements IndexableDocumentInterface
{
    // getDocumentId() and getDocumentType() are omitted for brevity;
    // implement them the same way as in the Post example above.

    public function getIndexableFields(): IndexableFields
    {
        return IndexableFields::create()
            ->addField(FieldId::CONTENT, $this->content ?? '', 5);
    }
}
```
Want to adjust weights? Change the configuration. Want to modify scoring? Edit the SQL query. Everything is under your control.
Conclusion
So there you have it. A simple search engine that actually works. It's not fancy, and it doesn't need a lot of infrastructure, but for most use cases, it's perfect.
The key insight? Sometimes the best solution is the one you understand. No magic, no black boxes, just straightforward code that does what it says.
You own it, you control it, you can debug it. And that's worth a lot.