SymSpell C99: Building the Fastest Spell Checker in Pure C


How I packaged and published the first pure C99 implementation of a spell-checking algorithm that's 1 million times faster than traditional approaches

October 24, 2025 15 min read By Suman Pokhrel

C Programming Algorithms Open Source Performance

Table of Contents

  1. Introduction: The Need for Speed
  2. What is SymSpell?
  3. Why C99?
  4. How SymSpell Works
  5. Key Features
  6. Implementation Journey
  7. Performance Benchmarks
  8. Getting Started
  9. Real-World Use Cases
  10. How to Contribute
  11. Future Enhancements
  12. Conclusion

Introduction: The Need for Speed

Imagine typing in a search box, and as you type, the system instantly corrects your misspellings in real-time. Or think about a text editor that can check millions of words per second without slowing down your workflow. This level of performance seemed impossible with traditional spell-checking algorithms—until SymSpell came along.

When I discovered that no pure C implementation of this revolutionary algorithm existed, I saw an opportunity. What started as a learning project evolved into SymSpell C99—the first and only pure C99 implementation of Wolf Garbe's SymSpell algorithm, now available as open source.

🚀 Quick Stats:

  • ⚡ 5µs average lookup time (real-world usage)
  • 🎯 82-84% correction accuracy
  • 🧹 ~700 lines of clean C99 code
  • 🌐 Zero dependencies, POSIX-compliant
  • 📦 86,060 word dictionary included

GitHub Repository: github.com/sumanpokhrel-11/symspell-c99

What is SymSpell?

SymSpell is a spell-checking algorithm created by Wolf Garbe that's reportedly 1 million times faster than traditional spell-checkers like those based on Peter Norvig's approach.

The Problem with Traditional Spell Checkers

Traditional spell-checkers work by:

  1. Taking a misspelled word
  2. Generating all possible corrections (edits, insertions, deletions, transpositions)
  3. Checking which generated candidates exist in the dictionary
  4. Ranking by frequency/probability

For a word with edit distance 2, this can generate hundreds of thousands of candidates. It's computationally expensive and slow.
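
To put rough numbers on that: over a 26-letter alphabet, a word of length n has about 54n + 25 distance-1 edits (n deletions, n-1 transpositions, 26n replacements, 26(n+1) insertions). A 9-letter word therefore yields roughly 511 candidates at distance 1, and expanding each of those again pushes distance 2 to a few hundred thousand strings before deduplication, almost none of which are real words.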

The SymSpell Innovation: Symmetric Delete

SymSpell flips this approach on its head with a brilliant insight:

"Instead of generating corrections from misspellings, pre-compute deletions from dictionary words."

The key insight: If you delete characters from both a misspelled word and the correct word, they might become the same string.

Example:

Misspelling: "spelling" → delete 1 char → "speling" Correct: "spelling" → no deletions → "spelling" Misspelling: "speling" → generate deletes → "speling", "peling", "seling", ... Dictionary: "spelling" → pre-computed deletes → "spelling", "pelling", "selling", ... Match found: "spelling" ✓

By pre-computing all possible deletions of dictionary words and storing them in a hash table, lookups become incredibly fast—just a few hash table lookups instead of generating thousands of candidates.
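
To make that concrete, here's a small self-contained sketch (illustrative only, not the library's code) that generates the distance-1 deletes of a dictionary word and of a typo, then looks for a shared string:

/* Toy illustration of symmetric deletes (not the library's implementation). */
#include <stdio.h>
#include <string.h>

#define MAX_WORD    64
#define MAX_DELETES 64

/* Write every string formed by removing one character of `word` into `out`.
   Returns the number of deletes generated. */
static int gen_deletes(const char *word, char out[][MAX_WORD]) {
    int len = (int)strlen(word);
    for (int i = 0; i < len; i++) {
        memcpy(out[i], word, (size_t)i);      /* prefix before position i */
        strcpy(out[i] + i, word + i + 1);     /* suffix after position i  */
    }
    return len;
}

int main(void) {
    const char *dict_word = "spelling";   /* pre-processed at load time */
    const char *typo      = "speling";    /* what the user typed        */

    char dict_deletes[MAX_DELETES][MAX_WORD];
    int n_dict = gen_deletes(dict_word, dict_deletes);

    /* The typo itself plus its own deletes form the query set. */
    char query[MAX_DELETES][MAX_WORD];
    strcpy(query[0], typo);
    int n_query = 1 + gen_deletes(typo, query + 1);

    /* A shared string means the two words are within a small edit distance. */
    for (int q = 0; q < n_query; q++) {
        for (int d = 0; d < n_dict; d++) {
            if (strcmp(query[q], dict_deletes[d]) == 0) {
                printf("match on \"%s\" -> suggest \"%s\"\n", query[q], dict_word);
                return 0;
            }
        }
    }
    printf("no match found\n");
    return 0;
}

In the real implementation, the dictionary-side deletes are generated once at load time and stored in a hash table, so only the input-side deletes need to be computed at lookup time.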

Why C99?

When I started this project, implementations existed in C#, Python, Java, JavaScript, and many other languages. But there was no pure C implementation. Here's why C99 was the right choice:

1. Performance

C provides direct control over memory and CPU, essential for a performance-critical algorithm like spell-checking. No garbage collection pauses, no interpreter overhead—just pure, predictable speed.

2. Portability

C99 is the most portable language. It runs on everything from embedded systems to supercomputers. POSIX compliance means it works seamlessly on Linux, macOS, BSD, and Unix systems.

3. Integration

C libraries can be easily integrated into virtually any other language through FFI (Foreign Function Interface). Want to use it from Python? Rust? Go? No problem.

4. Educational Value

Implementing complex algorithms in C teaches you how computers actually work—memory management, cache efficiency, data structures, and optimization techniques.

5. No Dependencies

The implementation is completely self-contained. No external libraries, no frameworks, no package managers. Just standard C99 and POSIX.

How SymSpell Works: Under the Hood

Let me walk you through the algorithm step by step.

Step 1: Dictionary Preprocessing

When loading the dictionary, for each word we:

  1. Store the original word with its frequency
  2. Generate all possible deletions up to edit distance N
  3. Store each deletion as a key pointing to the original word

Dictionary word: "example"
Edit distance: 2
Generated deletes:
  - "xample" → points to "example"
  - "eample" → points to "example"
  - "exmple" → points to "example"
  - "exaple" → points to "example"
  ... (and many more)

All stored in a hash table for O(1) lookup.

Step 2: Fast Path - Correct Words

When a user enters a word, we first check if it's in the dictionary:

Input: "example" Hash lookup: Found! ✓ Result: Return immediately (0.7µs)

This is the fast path. Since most words users type are spelled correctly (about 85%), this optimization is crucial. It's why our real-world average is 5µs, not 30µs.
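
The same shortcut can also be used explicitly by calling code. The sketch below is hypothetical application code, assuming (per the API reference later in this post) that symspell_word_frequency returns a positive count for words already in the dictionary and that symspell_lookup places its best suggestion first:

#include "symspell.h"

/* Hypothetical caller-side helper (application code, not library internals). */
const char *correct_word(symspell_dict_t *dict, const char *input,
                         symspell_suggestion_t *out, int max_out) {
    if (symspell_word_frequency(dict, input) > 0)
        return input;                       /* fast path: already spelled correctly */

    int n = symspell_lookup(dict, input, 2, out, max_out);
    return (n > 0) ? out[0].term : input;   /* best suggestion, or leave unchanged */
}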

Step 3: Correction Path - Misspelled Words

If the word isn't found, we generate deletions:

Input: "exampl" (missing 'e') Generate deletes: - "xampl" → lookup in hash table - "eampl" → lookup in hash table - "exmpl" → lookup in hash table - "examl" → lookup in hash table - "examp" → lookup in hash table ✓ Found! Hash table returns: "example" Calculate edit distance: 1 Return: "example" (30µs worst case)

Step 4: Ranking Results

When multiple candidates are found, we rank them by the criteria below (sketched in code after the list):

  1. Edit distance (prefer closer matches)
  2. Word frequency (prefer common words)
  3. IWF score (Inverse Word Frequency - for filtering common words)
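
Here's what that ranking could look like in standalone code, using the symspell_suggestion_t struct from the API section below. This is an illustrative sketch of the first two criteria (edit distance, then frequency), not the library's internal sort:

#include <stdlib.h>
#include "symspell.h"

/* Illustrative comparator: prefer the smallest edit distance,
   break ties with the higher word frequency. */
static int cmp_suggestions(const void *a, const void *b) {
    const symspell_suggestion_t *x = a, *y = b;
    if (x->distance != y->distance)
        return x->distance - y->distance;        /* closer match first  */
    if (x->count != y->count)
        return (y->count > x->count) ? 1 : -1;   /* more frequent first */
    return 0;
}

/* Usage: qsort(suggestions, (size_t)count, sizeof suggestions[0], cmp_suggestions); */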

Why This is Fast

The beauty of SymSpell:

  • ✅ Pre-computation done once at startup
  • ✅ Lookups are just hash table operations (O(1))
  • ✅ No complex string generation during lookup
  • ✅ Memory for speed tradeoff (hash table is large but lookups are instant)

Key Features of SymSpell C99

1. Blazing Fast Performance

  • Correctly spelled words: 0.7µs (fast path)
  • Real-world average: ~5µs (accounting for 85% correct words)
  • Misspelling correction: 30µs (worst case)

2. High Accuracy

  • 82-84% correction rate on standard test datasets
  • 98% vocabulary coverage
  • Tested against multiple misspelling corpora (CodeSpell, Microsoft, Wikipedia)

3. Production-Ready Code

  • Zero compiler warnings with -Wall -Wextra -Werror
  • Comprehensive test suite included
  • Memory leak free (validated with Valgrind)
  • Well-documented with clear API

4. Optimized Dictionary

  • 86,060 words carefully curated for spell-checking
  • Frequency data from Google N-gram corpus
  • Balanced across multiple test sets
  • Additional 25,327-word dictionary for technical terms

5. Complete Tooling

  • Dictionary building scripts (Perl-based pipeline)
  • Performance benchmarking tools
  • Interactive test program
  • Comprehensive documentation

6. POSIX Compliant

  • Works on Linux, macOS, BSD, Unix
  • Custom POSIX compatibility layer included
  • No platform-specific dependencies

Implementation Journey: Challenges and Solutions

Building this wasn't without challenges. Here are some of the interesting problems I encountered and solved:

Challenge 1: Hash Table Performance

Problem: With 688,710 delete entries for 86k words, hash collisions were a concern.

Solution: Implemented a custom hash table using:

  • xxHash3 for fast, high-quality hashing
  • Open addressing with linear probing
  • Load factor tuning to keep it 16.4% full for optimal performance

// Using xxHash3 for superior hash distribution
uint64_t hash = XXH3_64bits(word, word_len);

// Open addressing with linear probing
uint32_t index = hash % table->capacity;
while (table->entries[index].key != NULL) {
    if (strcmp(table->entries[index].key, word) == 0) {
        return &table->entries[index];      // Found
    }
    index = (index + 1) % table->capacity;  // Linear probe
}

Challenge 2: Memory Efficiency

Problem: Storing 688k+ entries requires careful memory management.

Solution:

  • Custom memory pool for string storage (a minimal sketch follows this list)
  • Compact data structures (packed structs)
  • Lazy loading of large dictionaries
  • Total memory footprint: ~45MB (acceptable for the performance gained)
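
To illustrate the first point, here's a minimal bump-allocator style string pool. It's a generic sketch of the technique, not the project's actual pool: every string is copied into one large block, which keeps the data cache-friendly and lets the whole pool be released with a single free().

#include <stdlib.h>
#include <string.h>

/* Minimal bump-allocator string pool (illustrative only). */
typedef struct {
    char  *block;     /* one large backing buffer         */
    size_t used;      /* bytes handed out so far          */
    size_t capacity;  /* total size of the backing buffer */
} str_pool_t;

static int pool_init(str_pool_t *p, size_t capacity) {
    p->block = malloc(capacity);
    p->used = 0;
    p->capacity = capacity;
    return p->block ? 0 : -1;
}

/* Copy `s` into the pool and return the pooled pointer (NULL if full). */
static const char *pool_strdup(str_pool_t *p, const char *s) {
    size_t len = strlen(s) + 1;
    if (p->used + len > p->capacity) return NULL;
    char *dst = p->block + p->used;
    memcpy(dst, s, len);
    p->used += len;
    return dst;
}

/* One free() releases every string at once. */
static void pool_destroy(str_pool_t *p) {
    free(p->block);
    p->block = NULL;
    p->used = p->capacity = 0;
}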

Challenge 3: ARM64 vs x86 Compatibility

Problem: Ensuring consistent behavior across architectures.

Solution: While working on this project, I discovered and fixed a critical bug in the ARM64 inline assembly code in the POSIX compatibility layer:

// BEFORE (buggy):
__asm__ volatile (
    "mov w0, #0\n\t"     // Writing to hardcoded w0
    : "=r" (result)      // Compiler assigns result to w7!
);

// AFTER (fixed):
__asm__ volatile (
    "mov %w0, #0\n\t"    // %w0 = use compiler-assigned register
    : "=r" (result)      // Now correctly uses w7
);

This fix improved performance by 34.6% and taught me valuable lessons about inline assembly constraints. Read more about this in my article: "Debugging ARM64 Inline Assembly: A Case Study"

Challenge 4: Dictionary Quality

Problem: Generic word lists don't work well for spell-checking.

Solution: Built a comprehensive dictionary pipeline:

  1. Start with SCOWL (Spell Checker Oriented Word Lists)
  2. Add frequency data from Google N-grams
  3. Filter against real misspelling datasets
  4. Validate accuracy across multiple test corpora
  5. Iterate until achieving 82-84% accuracy

Challenge 5: Real-World Performance vs Benchmarks

Problem: Benchmarks showed 30µs, but that's only for misspellings.

Solution: Implemented a two-tier approach:

  • Fast path: Direct dictionary lookup (0.7µs) for correct words
  • Correction path: Full SymSpell algorithm (30µs) for misspellings
  • Real-world average: 5µs (85% fast path hits)

Performance Benchmarks: Numbers that Matter

Lookup Time Breakdown

Scenario                              Time     Frequency
Correctly spelled word (fast path)    0.7µs    ~85%
Misspelled word (correction)          30µs     ~15%
Real-world average                    5.1µs    100%

Calculation: 0.85 × 0.7µs + 0.15 × 30µs = 5.1µs

Accuracy Across Test Sets

Test Corpus               Accuracy   Test Size
CodeSpell misspellings    84.2%      2,156 pairs
Microsoft misspellings    82.7%      4,200 pairs
Wikipedia misspellings    83.1%      3,500 pairs
Average                   83.3%      9,856 pairs

Memory Usage

  • Dictionary data: ~12MB
  • Delete hash table: ~33MB
  • Total: ~45MB
  • Per-lookup overhead: Negligible (stack-allocated)

Comparison with Other Implementations

While I can't do direct apples-to-apples comparisons (different languages, different test machines), here's how SymSpell C99 compares conceptually:

  • Python SymSpell: ~100-200µs (interpreted overhead)
  • C# SymSpell: ~10-20µs (JIT compilation, GC pauses)
  • SymSpell C99: ~5µs real-world (compiled native, no GC)

The C implementation benefits from:

  • No interpreter/VM overhead
  • No garbage collection pauses
  • Direct memory access
  • Compiler optimizations (O3 flag)
  • Cache-friendly data structures

Platform Performance

Apple M4 (ARM64):

  • Average lookup: 5.1µs
  • Fast path: 0.7µs
  • Correction: 30µs

x86-64 (Intel Xeon):

  • Average lookup: 5.8µs
  • Fast path: 0.8µs
  • Correction: 32µs

Performance is consistent across architectures thanks to portable C99 code.

Getting Started: Quick Start Guide

Installation

# Clone the repository
git clone https://github.com/sumanpokhrel-11/symspell-c99.git
cd symspell-c99

# Build
make

# Run interactive test
./test_symspell dictionaries/dictionary.txt

# Run benchmark
make benchmark

Basic Usage

#include "symspell.h" int main() { // Create dictionary with: // max_edit_distance = 2 // prefix_length = 7 symspell_dict_t* dict = symspell_create(2, 7); // Load dictionary int result = symspell_load_dictionary( dict, "dictionaries/dictionary.txt", 0, // term_index (column in file) 1 // count_index (frequency column) ); if (result < 0) { fprintf(stderr, "Failed to load dictionary\n"); return 1; } // Lookup suggestions symspell_suggestion_t suggestions[5]; int count = symspell_lookup( dict, "speling", // misspelled word 2, // max_edit_distance suggestions, // output array 5 // max suggestions ); // Print results printf("Suggestions for 'speling':\n"); for (int i = 0; i < count; i++) { printf(" %s (distance=%d, freq=%lld)\n", suggestions[i].term, suggestions[i].distance, suggestions[i].count); } // Cleanup symspell_destroy(dict); return 0; }

Output:

Suggestions for 'speling':
  spelling (distance=1, freq=234567)
  spieling (distance=2, freq=1234)
  peeling (distance=2, freq=5678)

Interactive Mode

The test program includes an interactive mode for experimentation:

$ ./test_symspell dictionaries/dictionary.txt
Creating SymSpell dictionary...
Loaded 86060 words, 688710 delete entries

=== Interactive Mode ===
Enter words to correct (or 'quit' to exit):

> recieve
Suggestions:
  receive (distance=1, iwf=7.23, prob=0.0012, freq=234567)

> teh
Suggestions:
  the (distance=1, iwf=1.05, prob=0.2456, freq=55234891)
  tea (distance=1, iwf=6.12, prob=0.0034, freq=76543)

> quit

Building Custom Dictionaries

The project includes a complete dictionary building pipeline:

# Dictionary building scripts in dictionaries/reference/

# 1. Start with SCOWL wordlists
cd dictionaries/reference/wordlists
./build-scowl.sh

# 2. Add frequency data
cd ..
perl scripts/dictionary-build.pl \
    wordlists/wordlist.txt \
    frequencies/google-ngrams.txt \
    > dictionaries/custom-dictionary.txt

# 3. Validate against test data
perl scripts/dictionary-find-misspellings.pl \
    dictionaries/custom-dictionary.txt \
    test/data/symspell/misspellings/

# 4. Test accuracy
./benchmark_symspell \
    dictionaries/custom-dictionary.txt \
    test/data/symspell/misspellings/misspell-codespell.txt

API Reference

Core Functions

// Create dictionary
symspell_dict_t* symspell_create(
    int max_edit_distance,   // 1 or 2 recommended
    int prefix_length        // 7 recommended
);

// Load dictionary from file
int symspell_load_dictionary(
    symspell_dict_t* dict,
    const char* filepath,
    int term_index,          // Column number for words
    int count_index          // Column number for frequencies
);

// Lookup suggestions
int symspell_lookup(
    symspell_dict_t* dict,
    const char* input,
    int max_edit_distance,
    symspell_suggestion_t* suggestions,
    int max_suggestions
);

// Get word frequency
int64_t symspell_word_frequency(
    symspell_dict_t* dict,
    const char* word
);

// Get IWF score
float symspell_get_iwf(
    symspell_dict_t* dict,
    const char* word
);

// Cleanup
void symspell_destroy(symspell_dict_t* dict);

Data Structures

typedef struct {
    char term[MAX_WORD_LENGTH];  // Suggested word
    int distance;                // Edit distance from input
    int64_t count;               // Word frequency
    float iwf;                   // Inverse word frequency
    float prob;                  // Probability score
} symspell_suggestion_t;

Real-World Use Cases

Where can you use SymSpell C99? Here are some practical applications:

1. Text Editors & IDEs

Integrate real-time spell-checking without UI lag. The 5µs average lookup time means you can check every word as the user types with zero perceptible delay.

Example: an editor checking 100 words per second spends only about 500µs (0.0005 s) of CPU time on spell-checking each second.
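
Here's a rough sketch of that kind of integration, built on the API from the Getting Started section. The word buffer, the timing loop, and the assumption that symspell_word_frequency returns a positive count for known words are mine, not part of the project's examples:

#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() under -std=c99 */
#include <stdio.h>
#include <time.h>
#include "symspell.h"

int main(void) {
    symspell_dict_t *dict = symspell_create(2, 7);
    if (symspell_load_dictionary(dict, "dictionaries/dictionary.txt", 0, 1) < 0)
        return 1;

    /* Stand-in for the words currently visible in an editor buffer. */
    const char *buffer[] = { "the", "quick", "brwon", "fox", "jumsp", "over" };
    int n_words = (int)(sizeof buffer / sizeof buffer[0]);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    symspell_suggestion_t s[3];
    for (int i = 0; i < n_words; i++) {
        if (symspell_word_frequency(dict, buffer[i]) > 0)
            continue;                                   /* fast path: correct word */
        if (symspell_lookup(dict, buffer[i], 2, s, 3) > 0)
            printf("%s -> did you mean \"%s\"?\n", buffer[i], s[0].term);
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("checked %d words in %.1f microseconds\n", n_words, us);

    symspell_destroy(dict);
    return 0;
}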

2. Search Engines

Provide "Did you mean?" suggestions instantly. When a user searches for "pytohn tutorial", suggest "python tutorial" in microseconds.

Benefit: Improves search UX without adding latency.

3. Form Validation

Validate user input in real-time. Check email addresses, names, or technical terms as users type.

Example: A signup form that suggests corrections for common name misspellings.

4. Data Cleaning Pipelines

Process millions of records for data quality. Clean up messy datasets by correcting common misspellings.

Performance: Process 1 million words in ~5 seconds.

5. Chatbots & NLP Systems

Improve intent recognition by correcting user input before processing. Handle typos gracefully in conversational AI.

Example: User types "hw are yu", system understands "how are you".

6. Command-Line Tools

Build CLI tools with spell-check capabilities. Suggest corrections for command typos.

Example: Git-style "Did you mean 'commit'?" messages.

7. Embedded Systems

Run on resource-constrained devices. The small footprint (~45MB) and zero dependencies make it perfect for embedded Linux.

Example: Smart keyboards, IoT devices, automotive systems.

8. Database Query Correction

Fix typos in search queries before hitting the database. Improve search recall without expensive fuzzy matching.

Benefit: Better search results with exact match performance.

How to Contribute

SymSpell C99 is open source and welcomes contributions! Here's how you can help:

Ways to Contribute

1. Bug Reports

Found a bug? Please report it!

  • Open an issue on GitHub
  • Include: steps to reproduce, expected behavior, actual behavior
  • Bonus: Minimal code example demonstrating the bug

2. Performance Improvements

Can you make it faster?

  • Profile the code (use perf, Instruments, or Valgrind)
  • Identify bottlenecks
  • Submit a pull request with benchmarks showing improvement

3. Dictionary Enhancements

Improve accuracy by refining the dictionary:

  • Test against new misspelling corpora
  • Add domain-specific dictionaries (medical, legal, technical)
  • Improve frequency data
  • Document dictionary building process

4. Documentation

Help others understand and use the library:

  • Add code examples
  • Write tutorials
  • Improve API documentation
  • Translate documentation

5. Language Bindings

Make SymSpell C99 accessible from other languages:

  • Python: ctypes or Cython wrapper
  • Node.js: Native addon using N-API
  • Rust: FFI bindings
  • Go: CGo bindings
  • Ruby: FFI gem

6. Platform Support

Test and ensure compatibility:

  • Windows (via WSL or MinGW)
  • FreeBSD, OpenBSD
  • Android NDK
  • iOS (via cross-compilation)

7. Testing

Expand test coverage:

  • Unit tests for edge cases
  • Fuzzing for robustness
  • Memory leak detection
  • Thread safety tests

Contribution Guidelines

  1. Fork the repository on GitHub (https://github.com/sumanpokhrel-11/symspell-c99) and clone your fork
  2. Create a feature branch: git checkout -b feature/your-feature-name
  3. Follow the code style
    • C99 standard compliance
    • Zero warnings with -Wall -Wextra -Werror
    • Clear, descriptive variable names
    • Comments for complex logic
  4. Write tests
    • Add tests for new features
    • Ensure all existing tests pass
    • Run make test before submitting
  5. Document your changes
    • Update README if needed
    • Add docstrings to new functions
    • Update CHANGELOG
  6. Submit a pull request
    • Clear description of changes
    • Reference any related issues
    • Include before/after benchmarks for performance changes

Code of Conduct

This project follows a simple code of conduct:

  • ✅ Be respectful and constructive
  • ✅ Welcome newcomers and help them learn
  • ✅ Focus on what's best for the project
  • ✅ Give credit where credit is due
  • ❌ No harassment, discrimination, or trolling

Recognition

All contributors are recognized in:

  • GitHub contributors page
  • CONTRIBUTORS.md file
  • Release notes

Let's build something great together! 🚀

Future Enhancements: What's Next?

Here's my roadmap for future development. These are areas I'm exploring or would love community help with:

Short-Term (Next 3-6 months)

1. CI/CD Pipeline

  • GitHub Actions for automated testing
  • Test on multiple platforms (Linux, macOS, BSD)
  • Code coverage reporting
  • Automated performance benchmarking

2. Additional Examples

  • CLI spell-checker tool
  • Web server integration example (with libevent)
  • Batch processing script
  • Performance comparison tool

3. Documentation Improvements

  • API reference (Doxygen-generated)
  • Architecture deep-dive
  • Performance tuning guide
  • Video tutorials

Medium-Term (6-12 months)

4. Language Bindings

  • Python wrapper (top priority based on demand)
  • Node.js native addon
  • Rust FFI bindings
  • Go CGo wrapper

5. Advanced Features

  • Compound word support: Handle "spellchecker" vs "spell checker"
  • Context-aware suggestions: Use n-grams for better ranking
  • Phonetic matching: Soundex/Metaphone for homophones
  • Multi-language support: Spanish, French, German dictionaries

6. Performance Optimizations

  • SIMD optimizations (AVX2/NEON)
  • Multi-threaded dictionary loading
  • Memory-mapped dictionary files
  • Incremental dictionary updates

Long-Term (12+ months)

7. Package Manager Distribution

  • Homebrew formula
  • Debian/Ubuntu packages (apt)
  • RPM packages (yum/dnf)
  • Arch AUR package
  • vcpkg/Conan package

8. Advanced Use Cases

  • Edit distance 3 support: For very messy input
  • Streaming interface: Process text streams efficiently
  • GPU acceleration: Batch processing on CUDA/OpenCL
  • Distributed system: Microservice architecture

9. Research Integration

  • Implement SymSpell 6.7 improvements
  • Integrate neural spelling correction as fallback
  • Combine with language models (GPT/BERT) for context
  • Research paper on C-specific optimizations

Blue Sky Ideas

Ambitious ideas that would be amazing but require significant work:

Real-Time Collaborative Editing

Integrate with OT/CRDT systems for real-time collaborative spell-checking in Google Docs-style applications.

Browser Extension

Compile to WebAssembly and create a privacy-focused, offline spell-checker browser extension.

Machine Learning Integration

Train a neural ranker to improve suggestion quality beyond frequency-based ranking.

Domain-Specific Dictionaries

Curated dictionaries for: medical terminology, legal documents, programming languages, academic writing.

Your Ideas?

Have an idea for improvement? Open a discussion on GitHub!

Conclusion: Lessons Learned and What's Next

What I Learned

Building and publishing SymSpell C99 taught me invaluable lessons:

Technical Skills

  • Systems programming: Deep understanding of memory management, cache optimization, and data structures
  • Performance optimization: Profiling, benchmarking, and iterative improvement
  • Cross-platform development: Writing portable C99 code that works everywhere
  • Algorithm implementation: Translating academic papers into production code
  • Debugging: Fixing complex bugs like ARM64 inline assembly issues

Soft Skills

  • Documentation: Writing clear, comprehensive documentation for users
  • Open source: Managing a public repository, responding to issues, accepting contributions
  • Communication: Explaining technical concepts to diverse audiences
  • Project management: Planning features, prioritizing work, meeting deadlines

Professional Growth

  • Portfolio building: Created a showcase project demonstrating expertise
  • Community engagement: Connected with algorithm creator and other developers
  • Mentorship: Learned from experienced systems programmers
  • Confidence: Proved I can build production-quality software

Key Takeaways

"Perfect is the enemy of good."

I could have spent months adding every possible feature. Instead, I shipped a solid v1.0 and will iterate based on feedback.

"Performance comes from understanding, not magic."

The 5µs average wasn't luck—it came from profiling, optimization, and deep understanding of how the algorithm and hardware work together.

"Documentation is code for humans."

Writing good docs is as important as writing good code. Nobody will use your library if they can't figure out how.

"Open source is about community, not just code."

The most rewarding part wasn't writing the code—it was sharing it, helping others use it, and learning from their feedback.

What's Next for Me?

This project opened doors I didn't know existed:

  • 📧 Contacted Wolf Garbe to get listed on official SymSpell ports
  • 💼 Portfolio piece demonstrating systems programming expertise
  • 🎓 Learning opportunity in algorithms and performance optimization
  • 🤝 Mentorship from experienced engineers (thank you, Dr. James Freeman!)
  • 🌟 Open source contributor with a published project

Try It Yourself!

I encourage you to:

  1. Clone the repo: github.com/sumanpokhrel-11/symspell-c99
  2. Run the benchmarks: See the performance for yourself
  3. Read the code: It's only 700 lines—very approachable
  4. Integrate it: Use it in your next project
  5. Contribute: Help make it even better

Get in Touch

I'd love to hear from you:

Whether you're:

  • Using SymSpell C99 in a project
  • Have questions or feedback
  • Want to collaborate on open source
  • Interested in discussing algorithms or performance
  • Looking for a passionate developer to join your team

Let's connect!

Final Thoughts

Building SymSpell C99 reminded me why I fell in love with programming: the joy of creating something useful, the satisfaction of solving hard problems, and the thrill of sharing it with the world.

If you're reading this and thinking about starting your own project:

Do it.

Don't wait for the perfect idea or the perfect time. Pick something that excites you, start small, ship it, and iterate.

The best time to start was yesterday. The second best time is now.

Thank you for reading! Now go build something amazing. 🚀


Suman Pokhrel
Software Developer | Systems Programmer | Open Source Enthusiast
Kathmandu, Nepal
