SymSpell C99: Building the Fastest Spell Checker in Pure C


How I packaged and published the first pure C99 implementation of a spell-checking algorithm that's 1 million times faster than traditional approaches

October 24, 2025 15 min read By Suman Pokhrel

C Programming Algorithms Open Source Performance

Table of Contents

  1. Introduction: The Need for Speed
  2. What is SymSpell?
  3. Why C99?
  4. How SymSpell Works
  5. Key Features
  6. Implementation Journey
  7. Performance Benchmarks
  8. Getting Started
  9. Real-World Use Cases
  10. How to Contribute
  11. Future Enhancements
  12. Conclusion

Introduction: The Need for Speed

Imagine typing in a search box, and as you type, the system instantly corrects your misspellings in real-time. Or think about a text editor that can check millions of words per second without slowing down your workflow. This level of performance seemed impossible with traditional spell-checking algorithms—until SymSpell came along.

When I discovered that no pure C implementation of this revolutionary algorithm existed, I saw an opportunity. What started as a learning project evolved into SymSpell C99—the first and only pure C99 implementation of Wolf Garbe's SymSpell algorithm, now available as open source.

🚀 Quick Stats:

  • ⚡ 5µs average lookup time (real-world usage)
  • 🎯 82-84% correction accuracy
  • 🧹 ~700 lines of clean C99 code
  • 🌐 Zero dependencies, POSIX-compliant
  • 📦 86,060 word dictionary included

GitHub Repository: github.com/sumanpokhrel-11/symspell-c99

What is SymSpell?

SymSpell is a spell-checking algorithm created by Wolf Garbe that's reportedly 1 million times faster than traditional spell-checkers like those based on Peter Norvig's approach.

The Problem with Traditional Spell Checkers

Traditional spell-checkers work by:

  1. Taking a misspelled word
  2. Generating all possible corrections (edits, insertions, deletions, transpositions)
  3. Checking which generated candidates exist in the dictionary
  4. Ranking by frequency/probability

For a word with edit distance 2, this can generate hundreds of thousands of candidates. It's computationally expensive and slow.
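
To put rough numbers on that: over a 26-letter alphabet, a word of length n has about 54n + 25 distance-1 edits (n deletions, n-1 transpositions, 26n replacements, 26(n+1) insertions). A 9-letter word therefore yields roughly 511 candidates at distance 1, and expanding each of those again pushes distance 2 to a few hundred thousand strings before deduplication, almost none of which are real words.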

The SymSpell Innovation: Symmetric Delete

SymSpell flips this approach on its head with a brilliant insight:

"Instead of generating corrections from misspellings, pre-compute deletions from dictionary words."

The key insight: If you delete characters from both a misspelled word and the correct word, they might become the same string.

Example:

Misspelling: "spelling" → delete 1 char → "speling" Correct: "spelling" → no deletions → "spelling" Misspelling: "speling" → generate deletes → "speling", "peling", "seling", ... Dictionary: "spelling" → pre-computed deletes → "spelling", "pelling", "selling", ... Match found: "spelling" ✓

By pre-computing all possible deletions of dictionary words and storing them in a hash table, lookups become incredibly fast—just a few hash table lookups instead of generating thousands of candidates.
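
To make that concrete, here's a small self-contained sketch (illustrative only, not the library's code) that generates the distance-1 deletes of a dictionary word and of a typo, then looks for a shared string:

/* Toy illustration of symmetric deletes (not the library's implementation). */
#include <stdio.h>
#include <string.h>

#define MAX_WORD    64
#define MAX_DELETES 64

/* Write every string formed by removing one character of `word` into `out`.
   Returns the number of deletes generated. */
static int gen_deletes(const char *word, char out[][MAX_WORD]) {
    int len = (int)strlen(word);
    for (int i = 0; i < len; i++) {
        memcpy(out[i], word, (size_t)i);      /* prefix before position i */
        strcpy(out[i] + i, word + i + 1);     /* suffix after position i  */
    }
    return len;
}

int main(void) {
    const char *dict_word = "spelling";   /* pre-processed at load time */
    const char *typo      = "speling";    /* what the user typed        */

    char dict_deletes[MAX_DELETES][MAX_WORD];
    int n_dict = gen_deletes(dict_word, dict_deletes);

    /* The typo itself plus its own deletes form the query set. */
    char query[MAX_DELETES][MAX_WORD];
    strcpy(query[0], typo);
    int n_query = 1 + gen_deletes(typo, query + 1);

    /* A shared string means the two words are within a small edit distance. */
    for (int q = 0; q < n_query; q++) {
        for (int d = 0; d < n_dict; d++) {
            if (strcmp(query[q], dict_deletes[d]) == 0) {
                printf("match on \"%s\" -> suggest \"%s\"\n", query[q], dict_word);
                return 0;
            }
        }
    }
    printf("no match found\n");
    return 0;
}

In the real implementation, the dictionary-side deletes are generated once at load time and stored in a hash table, so only the input-side deletes need to be computed at lookup time.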

Why C99?

When I started this project, implementations existed in C#, Python, Java, JavaScript, and many other languages. But there was no pure C implementation. Here's why C99 was the right choice:

1. Performance

C provides direct control over memory and CPU, essential for a performance-critical algorithm like spell-checking. No garbage collection pauses, no interpreter overhead—just pure, predictable speed.

2. Portability

C99 is the most portable language. It runs on everything from embedded systems to supercomputers. POSIX compliance means it works seamlessly on Linux, macOS, BSD, and Unix systems.

3. Integration

C libraries can be easily integrated into virtually any other language through FFI (Foreign Function Interface). Want to use it from Python? Rust? Go? No problem.

4. Educational Value

Implementing complex algorithms in C teaches you how computers actually work—memory management, cache efficiency, data structures, and optimization techniques.

5. No Dependencies

The implementation is completely self-contained. No external libraries, no frameworks, no package managers. Just standard C99 and POSIX.

How SymSpell Works: Under the Hood

Let me walk you through the algorithm step by step.

Step 1: Dictionary Preprocessing

When loading the dictionary, for each word we:

  1. Store the original word with its frequency
  2. Generate all possible deletions up to edit distance N
  3. Store each deletion as a key pointing to the original word

Dictionary word: "example"
Edit distance: 2
Generated deletes:
  - "xample" → points to "example"
  - "eample" → points to "example"
  - "exmple" → points to "example"
  - "exaple" → points to "example"
  ... (and many more)

All stored in a hash table for O(1) lookup.

Step 2: Fast Path - Correct Words

When a user enters a word, we first check if it's in the dictionary:

Input: "example" Hash lookup: Found! ✓ Result: Return immediately (0.7µs)

This is the fast path. Since most words users type are spelled correctly (about 85%), this optimization is crucial. It's why our real-world average is 5µs, not 30µs.
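
The same shortcut can also be used explicitly by calling code. The sketch below is hypothetical application code, assuming (per the API reference later in this post) that symspell_word_frequency returns a positive count for words already in the dictionary and that symspell_lookup places its best suggestion first:

#include "symspell.h"

/* Hypothetical caller-side helper (application code, not library internals). */
const char *correct_word(symspell_dict_t *dict, const char *input,
                         symspell_suggestion_t *out, int max_out) {
    if (symspell_word_frequency(dict, input) > 0)
        return input;                       /* fast path: already spelled correctly */

    int n = symspell_lookup(dict, input, 2, out, max_out);
    return (n > 0) ? out[0].term : input;   /* best suggestion, or leave unchanged */
}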

Step 3: Correction Path - Misspelled Words

If the word isn't found, we generate deletions:

Input: "exampl" (missing 'e') Generate deletes: - "xampl" → lookup in hash table - "eampl" → lookup in hash table - "exmpl" → lookup in hash table - "examl" → lookup in hash table - "examp" → lookup in hash table ✓ Found! Hash table returns: "example" Calculate edit distance: 1 Return: "example" (30µs worst case)

Step 4: Ranking Results

When multiple candidates are found, we rank them by the criteria below (sketched in code after the list):

  1. Edit distance (prefer closer matches)
  2. Word frequency (prefer common words)
  3. IWF score (Inverse Word Frequency - for filtering common words)
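
Here's what that ranking could look like in standalone code, using the symspell_suggestion_t struct from the API section below. This is an illustrative sketch of the first two criteria (edit distance, then frequency), not the library's internal sort:

#include <stdlib.h>
#include "symspell.h"

/* Illustrative comparator: prefer the smallest edit distance,
   break ties with the higher word frequency. */
static int cmp_suggestions(const void *a, const void *b) {
    const symspell_suggestion_t *x = a, *y = b;
    if (x->distance != y->distance)
        return x->distance - y->distance;        /* closer match first  */
    if (x->count != y->count)
        return (y->count > x->count) ? 1 : -1;   /* more frequent first */
    return 0;
}

/* Usage: qsort(suggestions, (size_t)count, sizeof suggestions[0], cmp_suggestions); */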

Why This is Fast

The beauty of SymSpell:

  • ✅ Pre-computation done once at startup
  • ✅ Lookups are just hash table operations (O(1))
  • ✅ No complex string generation during lookup
  • ✅ Memory for speed tradeoff (hash table is large but lookups are instant)

Key Features of SymSpell C99

1. Blazing Fast Performance

  • Correctly spelled words: 0.7µs (fast path)
  • Real-world average: ~5µs (accounting for 85% correct words)
  • Misspelling correction: 30µs (worst case)

2. High Accuracy

  • 82-84% correction rate on standard test datasets
  • 98% vocabulary coverage
  • Tested against multiple misspelling corpora (CodeSpell, Microsoft, Wikipedia)

3. Production-Ready Code

  • Zero compiler warnings with -Wall -Wextra -Werror
  • Comprehensive test suite included
  • Memory leak free (validated with Valgrind)
  • Well-documented with clear API

4. Optimized Dictionary

  • 86,060 words carefully curated for spell-checking
  • Frequency data from Google N-gram corpus
  • Balanced across multiple test sets
  • Additional 25,327-word dictionary for technical terms

5. Complete Tooling

  • Dictionary building scripts (Perl-based pipeline)
  • Performance benchmarking tools
  • Interactive test program
  • Comprehensive documentation

6. POSIX Compliant

  • Works on Linux, macOS, BSD, Unix
  • Custom POSIX compatibility layer included
  • No platform-specific dependencies

Implementation Journey: Challenges and Solutions

Building this wasn't without challenges. Here are some of the interesting problems I encountered and solved:

Challenge 1: Hash Table Performance

Problem: With 688,710 delete entries for 86k words, hash collisions were a concern.

Solution: Implemented a custom hash table using:

  • xxHash3 for fast, high-quality hashing
  • Open addressing with linear probing
  • Load factor tuning to keep it 16.4% full for optimal performance

// Using xxHash3 for superior hash distribution
uint64_t hash = XXH3_64bits(word, word_len);

// Open addressing with linear probing
uint32_t index = hash % table->capacity;
while (table->entries[index].key != NULL) {
    if (strcmp(table->entries[index].key, word) == 0) {
        return &table->entries[index];      // Found
    }
    index = (index + 1) % table->capacity;  // Linear probe
}

Challenge 2: Memory Efficiency

Problem: Storing 688k+ entries requires careful memory management.

Solution:

  • Custom memory pool for string storage (a minimal sketch follows this list)
  • Compact data structures (packed structs)
  • Lazy loading of large dictionaries
  • Total memory footprint: ~45MB (acceptable for the performance gained)
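
To illustrate the first point, here's a minimal bump-allocator style string pool. It's a generic sketch of the technique, not the project's actual pool: every string is copied into one large block, which keeps the data cache-friendly and lets the whole pool be released with a single free().

#include <stdlib.h>
#include <string.h>

/* Minimal bump-allocator string pool (illustrative only). */
typedef struct {
    char  *block;     /* one large backing buffer         */
    size_t used;      /* bytes handed out so far          */
    size_t capacity;  /* total size of the backing buffer */
} str_pool_t;

static int pool_init(str_pool_t *p, size_t capacity) {
    p->block = malloc(capacity);
    p->used = 0;
    p->capacity = capacity;
    return p->block ? 0 : -1;
}

/* Copy `s` into the pool and return the pooled pointer (NULL if full). */
static const char *pool_strdup(str_pool_t *p, const char *s) {
    size_t len = strlen(s) + 1;
    if (p->used + len > p->capacity) return NULL;
    char *dst = p->block + p->used;
    memcpy(dst, s, len);
    p->used += len;
    return dst;
}

/* One free() releases every string at once. */
static void pool_destroy(str_pool_t *p) {
    free(p->block);
    p->block = NULL;
    p->used = p->capacity = 0;
}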

Challenge 3: ARM64 vs x86 Compatibility

Problem: Ensuring consistent behavior across architectures.

Solution: While working on this project, I discovered and fixed a critical bug in the ARM64 inline assembly code in the POSIX compatibility layer:

// BEFORE (buggy):
__asm__ volatile (
    "mov w0, #0\n\t"     // Writing to hardcoded w0
    : "=r" (result)      // Compiler assigns result to w7!
);

// AFTER (fixed):
__asm__ volatile (
    "mov %w0, #0\n\t"    // %w0 = use compiler-assigned register
    : "=r" (result)      // Now correctly uses w7
);

This fix improved performance by 34.6% and taught me valuable lessons about inline assembly constraints. Read more about this in my article: "Debugging ARM64 Inline Assembly: A Case Study"

Challenge 4: Dictionary Quality

Problem: Generic word lists don't work well for spell-checking.

Solution: Built a comprehensive dictionary pipeline:

  1. Start with SCOWL (Spell Checker Oriented Word Lists)
  2. Add frequency data from Google N-grams
  3. Filter against real misspelling datasets
  4. Validate accuracy across multiple test corpora
  5. Iterate until achieving 82-84% accuracy

Challenge 5: Real-World Performance vs Benchmarks

Problem: Benchmarks showed 30µs, but that's only for misspellings.

Solution: Implemented a two-tier approach:

  • Fast path: Direct dictionary lookup (0.7µs) for correct words
  • Correction path: Full SymSpell algorithm (30µs) for misspellings
  • Real-world average: 5µs (85% fast path hits)

Performance Benchmarks: Numbers that Matter

Lookup Time Breakdown

Scenario                              Time     Frequency
Correctly spelled word (fast path)    0.7µs    ~85%
Misspelled word (correction)          30µs     ~15%
Real-world average                    5.1µs    100%

Calculation: 0.85 × 0.7µs + 0.15 × 30µs = 5.1µs

Accuracy Across Test Sets

Test Corpus               Accuracy   Test Size
CodeSpell misspellings    84.2%      2,156 pairs
Microsoft misspellings    82.7%      4,200 pairs
Wikipedia misspellings    83.1%      3,500 pairs
Average                   83.3%      9,856 pairs

Memory Usage

  • Dictionary data: ~12MB
  • Delete hash table: ~33MB
  • Total: ~45MB
  • Per-lookup overhead: Negligible (stack-allocated)

Comparison with Other Implementations

While I can't do direct apples-to-apples comparisons (different languages, different test machines), here's how SymSpell C99 compares conceptually:

  • Python SymSpell: ~100-200µs (interpreted overhead)
  • C# SymSpell: ~10-20µs (JIT compilation, GC pauses)
  • SymSpell C99: ~5µs real-world (compiled native, no GC)

The C implementation benefits from:

  • No interpreter/VM overhead
  • No garbage collection pauses
  • Direct memory access
  • Compiler optimizations (O3 flag)
  • Cache-friendly data structures

Platform Performance

Apple M4 (ARM64):

  • Average lookup: 5.1µs
  • Fast path: 0.7µs
  • Correction: 30µs

x86-64 (Intel Xeon):

  • Average lookup: 5.8µs
  • Fast path: 0.8µs
  • Correction: 32µs

Performance is consistent across architectures thanks to portable C99 code.

Getting Started: Quick Start Guide

Installation

# Clone the repository
git clone https://github.com/sumanpokhrel-11/symspell-c99.git
cd symspell-c99

# Build
make

# Run interactive test
./test_symspell dictionaries/dictionary.txt

# Run benchmark
make benchmark

Basic Usage

#include "symspell.h" int main() { // Create dictionary with: // max_edit_distance = 2 // prefix_length = 7 symspell_dict_t* dict = symspell_create(2, 7); // Load dictionary int result = symspell_load_dictionary( dict, "dictionaries/dictionary.txt", 0, // term_index (column in file) 1 // count_index (frequency column) ); if (result < 0) { fprintf(stderr, "Failed to load dictionary\n"); return 1; } // Lookup suggestions symspell_suggestion_t suggestions[5]; int count = symspell_lookup( dict, "speling", // misspelled word 2, // max_edit_distance suggestions, // output array 5 // max suggestions ); // Print results printf("Suggestions for 'speling':\n"); for (int i = 0; i < count; i++) { printf(" %s (distance=%d, freq=%lld)\n", suggestions[i].term, suggestions[i].distance, suggestions[i].count); } // Cleanup symspell_destroy(dict); return 0; }

Output:

Suggestions for 'speling':
  spelling (distance=1, freq=234567)
  spieling (distance=2, freq=1234)
  peeling (distance=2, freq=5678)

Interactive Mode

The test program includes an interactive mode for experimentation:

$ ./test_symspell dictionaries/dictionary.txt
Creating SymSpell dictionary...
Loaded 86060 words, 688710 delete entries

=== Interactive Mode ===
Enter words to correct (or 'quit' to exit):

> recieve
Suggestions:
  receive (distance=1, iwf=7.23, prob=0.0012, freq=234567)

> teh
Suggestions:
  the (distance=1, iwf=1.05, prob=0.2456, freq=55234891)
  tea (distance=1, iwf=6.12, prob=0.0034, freq=76543)

> quit

Building Custom Dictionaries

The project includes a complete dictionary building pipeline:

# Dictionary building scripts in dictionaries/reference/

# 1. Start with SCOWL wordlists
cd dictionaries/reference/wordlists
./build-scowl.sh

# 2. Add frequency data
cd ..
perl scripts/dictionary-build.pl \
    wordlists/wordlist.txt \
    frequencies/google-ngrams.txt \
    > dictionaries/custom-dictionary.txt

# 3. Validate against test data
perl scripts/dictionary-find-misspellings.pl \
    dictionaries/custom-dictionary.txt \
    test/data/symspell/misspellings/

# 4. Test accuracy
./benchmark_symspell \
    dictionaries/custom-dictionary.txt \
    test/data/symspell/misspellings/misspell-codespell.txt

API Reference

Core Functions

// Create dictionary
symspell_dict_t* symspell_create(
    int max_edit_distance,   // 1 or 2 recommended
    int prefix_length        // 7 recommended
);

// Load dictionary from file
int symspell_load_dictionary(
    symspell_dict_t* dict,
    const char* filepath,
    int term_index,          // Column number for words
    int count_index          // Column number for frequencies
);

// Lookup suggestions
int symspell_lookup(
    symspell_dict_t* dict,
    const char* input,
    int max_edit_distance,
    symspell_suggestion_t* suggestions,
    int max_suggestions
);

// Get word frequency
int64_t symspell_word_frequency(
    symspell_dict_t* dict,
    const char* word
);

// Get IWF score
float symspell_get_iwf(
    symspell_dict_t* dict,
    const char* word
);

// Cleanup
void symspell_destroy(symspell_dict_t* dict);

Data Structures

typedef struct {
    char term[MAX_WORD_LENGTH];  // Suggested word
    int distance;                // Edit distance from input
    int64_t count;               // Word frequency
    float iwf;                   // Inverse word frequency
    float prob;                  // Probability score
} symspell_suggestion_t;

Real-World Use Cases

Where can you use SymSpell C99? Here are some practical applications:

1. Text Editors & IDEs

Integrate real-time spell-checking without UI lag. The 5µs average lookup time means you can check every word as the user types with zero perceptible delay.

Example: an editor checking 100 words per second spends only about 500µs (0.0005 s) of CPU time on spell-checking each second.
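
Here's a rough sketch of that kind of integration, built on the API from the Getting Started section. The word buffer, the timing loop, and the assumption that symspell_word_frequency returns a positive count for known words are mine, not part of the project's examples:

#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() under -std=c99 */
#include <stdio.h>
#include <time.h>
#include "symspell.h"

int main(void) {
    symspell_dict_t *dict = symspell_create(2, 7);
    if (symspell_load_dictionary(dict, "dictionaries/dictionary.txt", 0, 1) < 0)
        return 1;

    /* Stand-in for the words currently visible in an editor buffer. */
    const char *buffer[] = { "the", "quick", "brwon", "fox", "jumsp", "over" };
    int n_words = (int)(sizeof buffer / sizeof buffer[0]);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    symspell_suggestion_t s[3];
    for (int i = 0; i < n_words; i++) {
        if (symspell_word_frequency(dict, buffer[i]) > 0)
            continue;                                   /* fast path: correct word */
        if (symspell_lookup(dict, buffer[i], 2, s, 3) > 0)
            printf("%s -> did you mean \"%s\"?\n", buffer[i], s[0].term);
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("checked %d words in %.1f microseconds\n", n_words, us);

    symspell_destroy(dict);
    return 0;
}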

2. Search Engines

Provide "Did you mean?" suggestions instantly. When a user searches for "pytohn tutorial", suggest "python tutorial" in microseconds.

Benefit: Improves search UX without adding latency.

3. Form Validation

Validate user input in real-time. Check email addresses, names, or technical terms as users type.

Example: A signup form that suggests corrections for common name misspellings.

4. Data Cleaning Pipelines

Process millions of records for data quality. Clean up messy datasets by correcting common misspellings.

Performance: Process 1 million words in ~5 seconds.

5. Chatbots & NLP Systems

Improve intent recognition by correcting user input before processing. Handle typos gracefully in conversational AI.

Example: User types "hw are yu", system understands "how are you".

6. Command-Line Tools

Build CLI tools with spell-check capabilities. Suggest corrections for command typos.

Example: Git-style "Did you mean 'commit'?" messages.

7. Embedded Systems

Run on resource-constrained devices. The small footprint (~45MB) and zero dependencies make it perfect for embedded Linux.

Example: Smart keyboards, IoT devices, automotive systems.

8. Database Query Correction

Fix typos in search queries before hitting the database. Improve search recall without expensive fuzzy matching.

Benefit: Better search results with exact match performance.

How to Contribute

SymSpell C99 is open source and welcomes contributions! Here's how you can help:

Ways to Contribute

1. Bug Reports

Found a bug? Please report it!

  • Open an issue on GitHub
  • Include: steps to reproduce, expected behavior, actual behavior
  • Bonus: Minimal code example demonstrating the bug

2. Performance Improvements

Can you make it faster?

  • Profile the code (use perf, Instruments, or Valgrind)
  • Identify bottlenecks
  • Submit a pull request with benchmarks showing improvement

3. Dictionary Enhancements

Improve accuracy by refining the dictionary:

  • Test against new misspelling corpora
  • Add domain-specific dictionaries (medical, legal, technical)
  • Improve frequency data
  • Document dictionary building process

4. Documentation

Help others understand and use the library:

  • Add code examples
  • Write tutorials
  • Improve API documentation
  • Translate documentation

5. Language Bindings

Make SymSpell C99 accessible from other languages:

  • Python: ctypes or Cython wrapper
  • Node.js: Native addon using N-API
  • Rust: FFI bindings
  • Go: CGo bindings
  • Ruby: FFI gem

6. Platform Support

Test and ensure compatibility:

  • Windows (via WSL or MinGW)
  • FreeBSD, OpenBSD
  • Android NDK
  • iOS (via cross-compilation)

7. Testing

Expand test coverage:

  • Unit tests for edge cases
  • Fuzzing for robustness
  • Memory leak detection
  • Thread safety tests

Contribution Guidelines

  1. Fork the repository on GitHub (https://github.com/sumanpokhrel-11/symspell-c99) and clone your fork
  2. Create a feature branch: git checkout -b feature/your-feature-name
  3. Follow the code style
    • C99 standard compliance
    • Zero warnings with -Wall -Wextra -Werror
    • Clear, descriptive variable names
    • Comments for complex logic
  4. Write tests
    • Add tests for new features
    • Ensure all existing tests pass
    • Run make test before submitting
  5. Document your changes
    • Update README if needed
    • Add docstrings to new functions
    • Update CHANGELOG
  6. Submit a pull request
    • Clear description of changes
    • Reference any related issues
    • Include before/after benchmarks for performance changes

Code of Conduct

This project follows a simple code of conduct:

  • ✅ Be respectful and constructive
  • ✅ Welcome newcomers and help them learn
  • ✅ Focus on what's best for the project
  • ✅ Give credit where credit is due
  • ❌ No harassment, discrimination, or trolling

Recognition

All contributors are recognized in:

  • GitHub contributors page
  • CONTRIBUTORS.md file
  • Release notes

Let's build something great together! 🚀

Future Enhancements: What's Next?

Here's my roadmap for future development. These are areas I'm exploring or would love community help with:

Short-Term (Next 3-6 months)

1. CI/CD Pipeline

  • GitHub Actions for automated testing
  • Test on multiple platforms (Linux, macOS, BSD)
  • Code coverage reporting
  • Automated performance benchmarking

2. Additional Examples

  • CLI spell-checker tool
  • Web server integration example (with libevent)
  • Batch processing script
  • Performance comparison tool

3. Documentation Improvements

  • API reference (Doxygen-generated)
  • Architecture deep-dive
  • Performance tuning guide
  • Video tutorials

Medium-Term (6-12 months)

4. Language Bindings

  • Python wrapper (top priority based on demand)
  • Node.js native addon
  • Rust FFI bindings
  • Go CGo wrapper

5. Advanced Features

  • Compound word support: Handle "spellchecker" vs "spell checker"
  • Context-aware suggestions: Use n-grams for better ranking
  • Phonetic matching: Soundex/Metaphone for homophones
  • Multi-language support: Spanish, French, German dictionaries

6. Performance Optimizations

  • SIMD optimizations (AVX2/NEON)
  • Multi-threaded dictionary loading
  • Memory-mapped dictionary files
  • Incremental dictionary updates

Long-Term (12+ months)

7. Package Manager Distribution

  • Homebrew formula
  • Debian/Ubuntu packages (apt)
  • RPM packages (yum/dnf)
  • Arch AUR package
  • vcpkg/Conan package

8. Advanced Use Cases

  • Edit distance 3 support: For very messy input
  • Streaming interface: Process text streams efficiently
  • GPU acceleration: Batch processing on CUDA/OpenCL
  • Distributed system: Microservice architecture

9. Research Integration

  • Implement SymSpell 6.7 improvements
  • Integrate neural spelling correction as fallback
  • Combine with language models (GPT/BERT) for context
  • Research paper on C-specific optimizations

Blue Sky Ideas

Ambitious ideas that would be amazing but require significant work:

Real-Time Collaborative Editing

Integrate with OT/CRDT systems for real-time collaborative spell-checking in Google Docs-style applications.

Browser Extension

Compile to WebAssembly and create a privacy-focused, offline spell-checker browser extension.

Machine Learning Integration

Train a neural ranker to improve suggestion quality beyond frequency-based ranking.

Domain-Specific Dictionaries

Curated dictionaries for: medical terminology, legal documents, programming languages, academic writing.

Your Ideas?

Have an idea for improvement? Open a discussion on GitHub!

Conclusion: Lessons Learned and What's Next

What I Learned

Building and publishing SymSpell C99 taught me invaluable lessons:

Technical Skills

  • Systems programming: Deep understanding of memory management, cache optimization, and data structures
  • Performance optimization: Profiling, benchmarking, and iterative improvement
  • Cross-platform development: Writing portable C99 code that works everywhere
  • Algorithm implementation: Translating academic papers into production code
  • Debugging: Fixing complex bugs like ARM64 inline assembly issues

Soft Skills

  • Documentation: Writing clear, comprehensive documentation for users
  • Open source: Managing a public repository, responding to issues, accepting contributions
  • Communication: Explaining technical concepts to diverse audiences
  • Project management: Planning features, prioritizing work, meeting deadlines

Professional Growth

  • Portfolio building: Created a showcase project demonstrating expertise
  • Community engagement: Connected with algorithm creator and other developers
  • Mentorship: Learned from experienced systems programmers
  • Confidence: Proved I can build production-quality software

Key Takeaways

"Perfect is the enemy of good."

I could have spent months adding every possible feature. Instead, I shipped a solid v1.0 and will iterate based on feedback.

"Performance comes from understanding, not magic."

The 5µs average wasn't luck—it came from profiling, optimization, and deep understanding of how the algorithm and hardware work together.

"Documentation is code for humans."

Writing good docs is as important as writing good code. Nobody will use your library if they can't figure out how.

"Open source is about community, not just code."

The most rewarding part wasn't writing the code—it was sharing it, helping others use it, and learning from their feedback.

What's Next for Me?

This project opened doors I didn't know existed:

  • 📧 Contacted Wolf Garbe to get listed on official SymSpell ports
  • 💼 Portfolio piece demonstrating systems programming expertise
  • 🎓 Learning opportunity in algorithms and performance optimization
  • 🤝 Mentorship from experienced engineers (thank you, Dr. James Freeman!)
  • 🌟 Open source contributor with a published project

Try It Yourself!

I encourage you to:

  1. Clone the repo: github.com/sumanpokhrel-11/symspell-c99
  2. Run the benchmarks: See the performance for yourself
  3. Read the code: It's only 700 lines—very approachable
  4. Integrate it: Use it in your next project
  5. Contribute: Help make it even better

Get in Touch

I'd love to hear from you:

Whether you're:

  • Using SymSpell C99 in a project
  • Have questions or feedback
  • Want to collaborate on open source
  • Interested in discussing algorithms or performance
  • Looking for a passionate developer to join your team

Let's connect!

Final Thoughts

Building SymSpell C99 reminded me why I fell in love with programming: the joy of creating something useful, the satisfaction of solving hard problems, and the thrill of sharing it with the world.

If you're reading this and thinking about starting your own project:

Do it.

Don't wait for the perfect idea or the perfect time. Pick something that excites you, start small, ship it, and iterate.

The best time to start was yesterday. The second best time is now.

Thank you for reading! Now go build something amazing. 🚀


Suman Pokhrel
Software Developer | Systems Programmer | Open Source Enthusiast
Kathmandu, Nepal
