High-performance and memory-efficient native C++ text similarity for Node.js


High-performance, memory-efficient native C++ text similarity algorithms for Node.js with full Unicode support. text-similarity-node provides a suite of production-ready algorithms that, in the author's benchmarks, outperform pure JavaScript alternatives, particularly in memory usage and on long strings. It is aimed at comparing large documents, where pure JavaScript libraries tend to slow down.

High Performance: Native C++ implementation that is faster and more efficient than pure JavaScript libraries
Memory Efficient: Optimized for low memory usage and high throughput
Asynchronous API: Non-blocking operations using worker threads
Unicode Support: Full UTF-8 support including emoji and international characters
Multiple Algorithms: 7+ algorithms for different similarity needs
Production Ready: Memory safety, comprehensive testing, and error handling
Easy Integration: Simple API compatible with existing workflows

npm install text-similarity-node

    const textSimilarity = require('text-similarity-node');

    // Levenshtein Similarity (edit distance)
    textSimilarity.similarity.levenshtein('hello', 'hallo'); // 0.8

    // Jaccard Similarity (set intersection)
    textSimilarity.similarity.jaccard('hello world', 'hello universe', true); // 0.33

    // Cosine Similarity with different options
    textSimilarity.similarity.cosine('hello', 'hallo'); // 0.5 (character n-grams)
    textSimilarity.similarity.cosine('hello world', 'hello universe', true); // 0.49 (word-based)

    // Additional algorithms
    textSimilarity.similarity.jaro('hello', 'hallo'); // 0.86
    textSimilarity.similarity.jaroWinkler('hello', 'hallo'); // 0.88
    textSimilarity.similarity.dice('hello', 'hallo'); // 0.5

    // Distance measurements
    textSimilarity.distance.levenshtein('hello', 'hallo'); // 1
    textSimilarity.distance.hamming('hello', 'hallo'); // 1

    // Unicode Support
    textSimilarity.similarity.levenshtein('café', 'cafe'); // 0.75
    textSimilarity.similarity.jaccard('Hello 👋 World 🌍', 'Hello 👋 World 🌎'); // 0.86 (different globe emoji)

    // Case-insensitive comparison
    textSimilarity.similarity.levenshtein('Hello', 'hello', false); // 1.0

The text-similarity-node library is based on the algorithm implementations in the TextDistance Python library. Its results match the reference Python version in 95% of cases. The remaining differences come from the different tokenization method used for cosine similarity calculations.

Levenshtein Distance: Classic edit distance for spell checking and typo detection
Hamming Distance: Fixed-length string comparison for error detection
Jaro Similarity: Optimized for short strings and proper names
Jaro-Winkler: Enhanced Jaro with prefix matching bonus
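
To make the edit-based measures concrete, here is a minimal pure-JavaScript sketch of Levenshtein distance and a normalized similarity. It is illustrative only and is not the library's native C++ code. The helper names `levenshteinDistance` and `levenshteinSimilarity` are hypothetical, and the normalization (one minus distance divided by the longer length) is an assumption that is consistent with the 0.8 reported for 'hello' and 'hallo' in the quick-start example.

```javascript
// Illustrative sketch only; not the library's native implementation.
function levenshteinDistance(a, b) {
  const s = Array.from(a); // split by code point, not UTF-16 code unit
  const t = Array.from(b);
  // prev is the previous row of the dynamic-programming table.
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

// Assumed normalization: 1 - distance / length of the longer string.
function levenshteinSimilarity(a, b) {
  const longest = Math.max(Array.from(a).length, Array.from(b).length);
  return longest === 0 ? 1 : 1 - levenshteinDistance(a, b) / longest;
}

console.log(levenshteinDistance('hello', 'hallo'));              // 1
console.log(levenshteinSimilarity('hello', 'hallo').toFixed(2)); // 0.80
```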

Jaccard Similarity: Set intersection for document similarity
Sorensen-Dice: Harmonic mean of precision and recall
Overlap Coefficient: Measures subset relationships
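
The three set-based measures differ only in their denominator, which a short sketch can show. This is an assumption-laden illustration, not the library's code: `wordSet` and `setMeasures` are hypothetical helpers, tokenization is a plain whitespace split, and empty inputs are scored 0 here although conventions differ.

```javascript
// Illustrative sketch only; tokenization is a simplifying assumption.
const wordSet = (text) => new Set(text.toLowerCase().split(/\s+/).filter(Boolean));

// Empty denominators yield 0 in this sketch; other conventions exist.
const ratio = (numerator, denominator) => (denominator === 0 ? 0 : numerator / denominator);

function setMeasures(a, b) {
  const A = wordSet(a);
  const B = wordSet(b);
  let shared = 0;
  for (const token of A) if (B.has(token)) shared++;
  return {
    jaccard: ratio(shared, A.size + B.size - shared),  // shared / union
    dice: ratio(2 * shared, A.size + B.size),          // 2 * shared / total size
    overlap: ratio(shared, Math.min(A.size, B.size)),  // shared / smaller set
  };
}

const m = setMeasures('hello world', 'hello universe');
console.log(m.jaccard.toFixed(2)); // 0.33
console.log(m.dice.toFixed(2));    // 0.50
console.log(m.overlap.toFixed(2)); // 0.50
```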

Cosine Similarity: Cosine of the angle between two frequency vectors
Character Vectorization: Optimized frequency-based comparison
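
As a rough illustration of how tokenization feeds into cosine similarity, the sketch below builds frequency vectors from either character bigrams or whole words and compares them. The helpers `ngrams`, `counts`, and `cosine` are hypothetical and written for this example; the library's own tokenizer and rounding may differ, so the sketch is not guaranteed to reproduce its reported values digit for digit.

```javascript
// Illustrative sketch only; not the library's tokenizer.
function ngrams(text, n = 2) {
  const chars = Array.from(text);
  const out = [];
  for (let i = 0; i + n <= chars.length; i++) {
    out.push(chars.slice(i, i + n).join(''));
  }
  return out;
}

// Map each token to the number of times it occurs.
function counts(tokens) {
  const freq = new Map();
  for (const t of tokens) freq.set(t, (freq.get(t) || 0) + 1);
  return freq;
}

// Dot product of the two frequency vectors divided by their lengths.
function cosine(tokensA, tokensB) {
  const a = counts(tokensA);
  const b = counts(tokensB);
  let dot = 0;
  for (const [token, n] of a) dot += n * (b.get(token) || 0);
  const norm = (m) => Math.sqrt([...m.values()].reduce((sum, n) => sum + n * n, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Character bigrams: he, el, ll, lo versus ha, al, ll, lo.
console.log(cosine(ngrams('hello'), ngrams('hallo')).toFixed(2)); // 0.50

// Word tokens: one shared word out of two on each side.
console.log(cosine('hello world'.split(' '), 'hello universe'.split(' ')));
```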

The Modern API provides comprehensive configuration options and consistent return formats:

    // Available algorithm types
    textSimilarity.AlgorithmType = {
      LEVENSHTEIN: 0,          // Edit distance
      DAMERAU_LEVENSHTEIN: 1,  // Edit distance with transpositions
      HAMMING: 2,              // Equal-length string distance
      JARO: 3,                 // Fuzzy string matching
      JARO_WINKLER: 4,         // Jaro with prefix weighting
      JACCARD: 5,              // Set similarity coefficient
      SORENSEN_DICE: 6,        // Dice coefficient
      OVERLAP: 7,              // Overlap coefficient
      TVERSKY: 8,              // Asymmetric similarity with weights
      COSINE: 9,               // Vector space cosine similarity
      EUCLIDEAN: 10,           // Euclidean distance
      MANHATTAN: 11,           // Manhattan distance
      CHEBYSHEV: 12            // Chebyshev distance
    };

    // Preprocessing modes
    textSimilarity.PreprocessingMode = {
      NONE: 0,       // No preprocessing
      CHARACTER: 1,  // Character-level comparison
      WORD: 2,       // Word-level tokenization
      NGRAM: 3       // N-gram based tokenization
    };

    // Case sensitivity options
    textSimilarity.CaseSensitivity = {
      SENSITIVE: 0,   // Case-sensitive comparison
      INSENSITIVE: 1  // Case-insensitive with Unicode support
    };

    // Full configuration object structure
    const fullConfig = {
      algorithm: textSimilarity.AlgorithmType.COSINE,               // Algorithm to use
      preprocessing: textSimilarity.PreprocessingMode.WORD,         // Text processing mode
      caseSensitivity: textSimilarity.CaseSensitivity.INSENSITIVE,  // Case handling
      ngramSize: 2,       // N-gram size (default: 2)
      threshold: 0.5,     // Early termination threshold
      alpha: 0.5,         // Tversky alpha parameter
      beta: 0.5,          // Tversky beta parameter
      prefixWeight: 0.1,  // Jaro-Winkler prefix weight (0.0-0.25)
      prefixLength: 4     // Jaro-Winkler max prefix length
    };
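
The `alpha` and `beta` options weight the two one-sided differences in the Tversky index. The sketch below uses the standard textbook formula on plain `Set` objects to show their effect; the function `tversky` is a hypothetical stand-in written for this example, and the library's tokenization and edge-case handling may differ.

```javascript
// Illustrative sketch only; uses the textbook Tversky index on sets.
function tversky(setA, setB, alpha = 0.5, beta = 0.5) {
  let shared = 0;
  for (const x of setA) if (setB.has(x)) shared++;
  const onlyA = setA.size - shared; // elements in A but not in B
  const onlyB = setB.size - shared; // elements in B but not in A
  const denom = shared + alpha * onlyA + beta * onlyB;
  return denom === 0 ? 0 : shared / denom;
}

const a = new Set(['red', 'green', 'blue']);
const b = new Set(['red', 'green', 'cyan', 'pink']);

console.log(tversky(a, b, 1, 1).toFixed(2));     // 0.40 (equals Jaccard: 2 / 5)
console.log(tversky(a, b, 0.5, 0.5).toFixed(2)); // 0.57 (equals Dice: 4 / 7)
console.log(tversky(a, b, 1, 0).toFixed(2));     // 0.67 (share of a that appears in b)
```

Setting both weights to 1 reduces the index to Jaccard, and setting both to 0.5 reduces it to Sorensen-Dice. Unequal weights make the measure asymmetric.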

    // Get supported algorithms
    const algorithms = textSimilarity.getSupportedAlgorithms();
    console.log(algorithms); // [{ type: 0, name: 'LEVENSHTEIN' }, { type: 5, name: 'JACCARD' }, ...]

    // Memory management
    const memoryUsage = textSimilarity.getMemoryUsage();
    console.log(`Memory usage: ${memoryUsage} bytes`);
    textSimilarity.clearCaches(); // Clear internal caches

    // Get current global configuration
    const currentConfig = textSimilarity.getGlobalConfiguration();
    console.log(currentConfig);

    // Edit-based algorithms
    textSimilarity.similarity.levenshtein(s1, s2, caseSensitive = true)
    textSimilarity.similarity.damerauLevenshtein(s1, s2, caseSensitive = true)
    textSimilarity.similarity.hamming(s1, s2, caseSensitive = true)

    // Phonetic algorithms
    textSimilarity.similarity.jaro(s1, s2, caseSensitive = true)
    textSimilarity.similarity.jaroWinkler(s1, s2, caseSensitive = true, prefixWeight = 0.1)

    // Token-based algorithms
    textSimilarity.similarity.jaccard(s1, s2, useWords = false, caseSensitive = true, ngramSize = 2)
    textSimilarity.similarity.dice(s1, s2, useWords = false, caseSensitive = true, ngramSize = 2)
    textSimilarity.similarity.cosine(s1, s2, useWords = false, caseSensitive = true, ngramSize = 2)
    textSimilarity.similarity.tversky(s1, s2, alpha, beta, useWords = false, caseSensitive = true, ngramSize = 2)
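
To show what the `prefixWeight` argument of `jaroWinkler` controls, here is a pure-JavaScript sketch of the textbook Jaro and Jaro-Winkler formulas. The functions `jaro` and `jaroWinkler` below are hypothetical re-implementations for illustration, not the library's native code, and their parameter order is an assumption that does not mirror the library's signature.

```javascript
// Illustrative sketch only; textbook Jaro and Jaro-Winkler.
function jaro(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  if (s.length === 0 && t.length === 0) return 1;
  if (s.length === 0 || t.length === 0) return 0;
  // Characters match only if they are equal and within this many positions.
  const range = Math.max(0, Math.floor(Math.max(s.length, t.length) / 2) - 1);
  const sMatched = new Array(s.length).fill(false);
  const tMatched = new Array(t.length).fill(false);
  let matches = 0;
  for (let i = 0; i < s.length; i++) {
    const hi = Math.min(t.length - 1, i + range);
    for (let j = Math.max(0, i - range); j <= hi; j++) {
      if (!tMatched[j] && s[i] === t[j]) {
        sMatched[i] = tMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Count matched characters that line up in a different order.
  let k = 0;
  let outOfOrder = 0;
  for (let i = 0; i < s.length; i++) {
    if (!sMatched[i]) continue;
    while (!tMatched[k]) k++;
    if (s[i] !== t[k]) outOfOrder++;
    k++;
  }
  const transpositions = outOfOrder / 2;
  return (matches / s.length + matches / t.length + (matches - transpositions) / matches) / 3;
}

function jaroWinkler(a, b, prefixWeight = 0.1, prefixLength = 4) {
  const base = jaro(a, b);
  const s = Array.from(a);
  const t = Array.from(b);
  const limit = Math.min(prefixLength, s.length, t.length);
  let prefix = 0;
  while (prefix < limit && s[prefix] === t[prefix]) prefix++;
  // Boost the Jaro score in proportion to the shared prefix.
  return base + prefix * prefixWeight * (1 - base);
}

console.log(jaro('MARTHA', 'MARHTA').toFixed(3));            // 0.944
console.log(jaroWinkler('MARTHA', 'MARHTA').toFixed(3));     // 0.961
console.log(jaroWinkler('hello', 'hallo').toFixed(3));       // 0.880
console.log(jaroWinkler('hello', 'hallo', 0.25).toFixed(3)); // 0.900
```

A larger prefix weight rewards strings that agree at the start, which is why the last call scores higher than the one before it.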

    textSimilarity.distance.levenshtein(s1, s2, caseSensitive = true)
    textSimilarity.distance.damerauLevenshtein(s1, s2, caseSensitive = true)
    textSimilarity.distance.hamming(s1, s2, caseSensitive = true)
    textSimilarity.distance.euclidean(s1, s2, useWords = false, caseSensitive = true, ngramSize = 2)
    textSimilarity.distance.manhattan(s1, s2, useWords = false, caseSensitive = true, ngramSize = 2)
    textSimilarity.distance.chebyshev(s1, s2, useWords = false, caseSensitive = true, ngramSize = 2)
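
The Euclidean, Manhattan, and Chebyshev distances all operate on the same pair of frequency vectors and differ only in how the per-token differences are combined. The sketch below assumes character-bigram counts as the vectors, matching the default `ngramSize = 2`. The helpers `bigramCounts` and `vectorDistances` are hypothetical, and the library may build its vectors differently.

```javascript
// Illustrative sketch only; assumes character-bigram count vectors.
function bigramCounts(text) {
  const chars = Array.from(text);
  const freq = new Map();
  for (let i = 0; i + 2 <= chars.length; i++) {
    const gram = chars[i] + chars[i + 1];
    freq.set(gram, (freq.get(gram) || 0) + 1);
  }
  return freq;
}

function vectorDistances(a, b) {
  const fa = bigramCounts(a);
  const fb = bigramCounts(b);
  const keys = new Set([...fa.keys(), ...fb.keys()]);
  let sumSquares = 0;
  let sumAbs = 0;
  let maxAbs = 0;
  for (const key of keys) {
    const diff = Math.abs((fa.get(key) || 0) - (fb.get(key) || 0));
    sumSquares += diff * diff;
    sumAbs += diff;
    maxAbs = Math.max(maxAbs, diff);
  }
  return {
    euclidean: Math.sqrt(sumSquares), // square root of summed squared differences
    manhattan: sumAbs,                // sum of absolute differences
    chebyshev: maxAbs,                // largest single difference
  };
}

console.log(vectorDistances('hello', 'hallo'));
// { euclidean: 2, manhattan: 4, chebyshev: 1 }
```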

All algorithms support async execution with worker threads:

    // All similarity algorithms available in async form
    await textSimilarity.async.levenshtein(s1, s2, caseSensitive)
    await textSimilarity.async.jaccard(s1, s2, useWords, caseSensitive, ngramSize)
    await textSimilarity.async.cosine(s1, s2, useWords, caseSensitive, ngramSize)
    await textSimilarity.async.jaro(s1, s2, caseSensitive)
    await textSimilarity.async.jaroWinkler(s1, s2, caseSensitive, prefixWeight)
    // ... and more
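
When many pairs need scoring, a promise-returning comparison can be driven with bounded concurrency so that only a few calls are in flight at once. The helper `compareAll` below is hypothetical and not part of the library. It accepts any function that returns a promise, and `fakeCompare` is a stand-in used only so the sketch runs without the native addon.

```javascript
// Hypothetical helper: runs `compare` over many pairs, `limit` at a time.
async function compareAll(pairs, compare, limit = 4) {
  const results = new Array(pairs.length);
  let next = 0;
  async function worker() {
    while (next < pairs.length) {
      const i = next++; // no await between check and increment, so no race
      const [a, b] = pairs[i];
      results[i] = await compare(a, b);
    }
  }
  const workers = Array.from({ length: Math.min(limit, pairs.length) }, () => worker());
  await Promise.all(workers);
  return results; // same order as `pairs`
}

// Stand-in for a library call such as textSimilarity.async.levenshtein.
const fakeCompare = async (a, b) => (a === b ? 1 : 0);

compareAll([['a', 'a'], ['a', 'b'], ['c', 'c']], fakeCompare, 2)
  .then((scores) => console.log(scores)); // [ 1, 0, 1 ]
```

With the library installed, `compare` could be `(a, b) => textSimilarity.async.levenshtein(a, b)`, assuming that call returns a promise as the listing above suggests.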

Algorithm Category	text-similarity-node	string-comparison	similarity
Edit-Based Algorithms
Levenshtein Distance	✅	✅	❌
Levenshtein Similarity	✅	✅	✅
Damerau-Levenshtein	✅	❌	❌
Hamming Distance	✅	❌	❌
Jaro Similarity	✅	✅	❌
Jaro-Winkler	✅	✅	❌
Token-Based Algorithms
Jaccard Similarity	✅	✅	❌
Sorensen-Dice	✅	❌	❌
Tversky Index	✅	❌	❌
Overlap Coefficient	✅	❌	❌
Cosine Similarity	✅	✅	❌
Vector-Based Algorithms
Euclidean Distance	✅	❌	❌
Manhattan Distance	✅	❌	❌
Chebyshev Distance	✅	❌	❌
Sequence-Based Algorithms
LCS (Longest Common Subsequence)	❌	✅	❌
Ratcliff-Obershelp	❌	❌	❌
Configuration & Features
Case-insensitive comparison	✅	✅	✅
Configurable n-gram sizes	✅	❌	❌
Word vs character tokenization	✅	❌	❌
Unicode normalization	✅	Partial	❌
Emoji support	✅	✅	✅
Performance & API
Native implementation (C++)	✅	❌	❌
Asynchronous API	✅	❌	❌
Worker thread support	✅	❌	❌
TypeScript definitions	✅	✅	✅
Memory optimization	✅	❌	❌

In the author's benchmarks, text-similarity-node shows its largest advantages in memory usage, throughput on long strings, and set-based algorithms such as Jaccard.

Unmatched Memory Efficiency

Built with a native C++ core, text-similarity-node keeps its memory footprint small, which suits memory-sensitive applications and large-scale data processing.

Jaccard Similarity: Uses just 392 bytes of heap memory, compared to over 35 KB for competitors like string-comparison (nearly 90× more).
Dice Coefficient: Allocates only 392 bytes, while alternatives require over 3 KB.

Exceptional Performance on Long Texts

text-similarity-node is optimized for long strings, outperforming JavaScript-based libraries:

For strings of 70 or more characters, it is nearly 6× faster than the popular similarity library.
For very long strings (1000 or more characters), it is over 1000× faster, processing hundreds of thousands of operations per second while the alternatives slow down dramatically.
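
Figures like these depend on hardware and Node.js version, so it can be worth measuring on your own data. The sketch below is a minimal micro-benchmark helper. The name `benchmark` and its return shape are hypothetical, and the workload passed in is a placeholder rather than a call into the library.

```javascript
// Hypothetical micro-benchmark helper; results vary by machine and Node version.
function benchmark(fn, iterations = 10000) {
  for (let i = 0; i < 1000; i++) fn(); // warm-up so the JIT can optimize fn
  const heapBefore = process.memoryUsage().heapUsed;
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn();
  const elapsedNs = Number(process.hrtime.bigint() - start);
  const heapAfter = process.memoryUsage().heapUsed;
  return {
    opsPerSecond: iterations / (elapsedNs / 1e9),
    heapDeltaBytes: heapAfter - heapBefore, // rough: GC may run mid-measurement
  };
}

// Placeholder workload on a string of roughly 1,000 characters.
const longText = 'lorem ipsum dolor sit amet '.repeat(40);
const result = benchmark(() => longText.split(' ').length);
console.log(result);
```

To measure the library itself, pass a closure such as `() => textSimilarity.similarity.jaccard(a, b, true)`. Note that `heapUsed` covers only the V8 heap, so memory allocated inside a native addon would not show up in `heapDeltaBytes`.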

Dominant Speed in Key Algorithms

The library leads in performance for modern similarity use cases:

Jaccard Similarity: Over 5× faster than string-comparison — ideal for tag or keyword analysis.
Flexible Analysis Modes: Built-in character and word modes for Jaccard, Cosine, and Dice algorithms provide greater control over results.

Comprehensive Unicode support with proper handling of:

International Characters: Latin, Cyrillic, Greek, Chinese, Japanese, Arabic
Diacritics: Proper case-insensitive matching (café ↔ CAFÉ)
Emoji: Full emoji support including complex emoji sequences
Mixed Scripts: Seamless handling of multilingual text
Normalization: Automatic Unicode normalization for accurate comparisons

    // International text examples
    textSimilarity.similarity.levenshtein('Москва', 'москва', false); // 1.0
    textSimilarity.similarity.jaccard('你好世界', '你好世间'); // 0.5

    // Emoji support
    textSimilarity.similarity.cosine('Hello 👋🌍', 'Hello 👋🌎'); // 0.86
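
For background on why Unicode handling matters for similarity scores, the sketch below uses only built-in JavaScript features to show how the same text can be counted and encoded in different ways. It says nothing about how the library handles these cases internally; `graphemes` is a hypothetical helper built on the standard `Intl.Segmenter`.

```javascript
// Illustrative only: three ways to count the "characters" in a string.
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = (text) => Array.from(segmenter.segment(text), (s) => s.segment);

const wave = '\u{1F44B}\u{1F3FD}'; // waving hand followed by a skin-tone modifier
console.log(wave.length);             // 4 (UTF-16 code units)
console.log(Array.from(wave).length); // 2 (code points)
console.log(graphemes(wave).length);  // 1 (user-perceived character)

// The same visible text can be encoded two ways; NFC makes them comparable.
const composed = 'caf\u00e9';    // e-acute as a single code point
const decomposed = 'cafe\u0301'; // e followed by a combining acute accent
console.log(composed === decomposed);                                   // false
console.log(composed.normalize('NFC') === decomposed.normalize('NFC')); // true
```

An edit distance computed over code units, code points, or graphemes can give different answers for the same pair of strings, which is why normalization and consistent segmentation matter before comparison.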

    # Install dependencies
    npm install

    # Build native addon
    npm run build

    # Run tests
    npm test

Runtime: Node.js 16.0.0+
Build Tools:
- Windows: Visual Studio Build Tools or Visual Studio
- macOS: Xcode Command Line Tools (xcode-select --install)
- Linux: build-essential package (sudo apt-get install build-essential)
Architectures: x64, ARM64
Platforms: Windows, macOS, Linux

Quick Start for Contributors

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Make your changes with tests
Run the test suite: npm test
Submit a pull request

Don't forget to exclude the prebuilds directory from your pull request!

MIT License - see LICENSE file for details.

This library was created using the TextDistance Python library as a reference implementation, which provided a solid foundation for the algorithms and features included here.
