Splits large English text corpora into meaningful sentences while preserving narrative flow and dialog structure.
SEAMS excels at preserving narrative structure where other tools break dialog incorrectly:
Input text:
SEAMS output (3 sentences):
- "Well, you can see him easily enough," said Mr. Hoad.
- "He's staying in your village, I believe. He's a nephew of Squire Broderick's."
- "What! Captain Forrester?" cried I.
Other tools break this into 6+ fragments:
- pysbd: Breaks mid-dialog ("He's staying in" + "your village, I believe.")
- nupunkt: Splits attribution ("What!" + "Captain Forrester?" + "cried I.")
- Both fragment quotes and lose dialog structure across paragraph breaks
See examples for more cases demonstrating:
- Dialog spanning paragraph separators: Dialog attribution stays connected across paragraph breaks
- Paragraph separators indicating end of never-closed quote: Letter format with implicit quote boundaries
Found a mis-split in English narrative? Show us!
No mis-splits are currently known in the roughly 20K Project Gutenberg English texts tested. The Python scripts in exploration/ help search for potential examples.
If you discover a counter-example:
- Grab the smallest passage that triggers the error
- Paste it into a new GitHub issue
- We'll reproduce it, fix it, and add the case to the public test corpus
Dialog-heavy text is where other sentence splitters fail - show us where SEAMS does too.
Test Corpus: 20,440 Project Gutenberg files (7.4 billion characters, 56 million sentences)
Test System: Intel i9-13900KF (16 cores, 32 threads) running Linux 5.15 WSL2, 32GB RAM
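| Tool | Cores | Time | Speed-up | Sentences/s | Sentence detection throughput | End-to-end throughput | Notes |
|---|---|---|---|---|---|---|---|
| seams | 32 | 6 s | 59 × | 8.6 M | 105.4 MB/s | 1176.2 MB/s | line offsets included |
| seams-single-cpu | 1 | 1 m 31 s | 4 × | 611 k | 450.6 MB/s | 90.4 MB/s | single-CPU baseline |
| nupunkt (0.5.1) | 1 | 6 m 23 s | 1 × | 179 k | 19.7 MB/s | 19.3 MB/s | pure-Python |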
Additional Results: See benchmarks/performance-results.md for results across different systems including macOS ARM64.
For complete benchmark methodology and comparison tools, see benchmarks/ and run python run_comparison.py.
From crates.io:
From source:
Process all Project Gutenberg texts in a directory:
The tool will:
- Find all *-0.txt files recursively
- Extract sentences with boundary detection
- Write results to *_seams2.txt files alongside originals
- Generate processing statistics in run_stats.json
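The discovery and naming rules above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the tool's own (Rust) implementation; the function name `pending_inputs` is hypothetical, and skipping inputs whose output already exists is assumed to match the incremental behaviour described elsewhere in this README.

```python
from pathlib import Path


def pending_inputs(root):
    """Yield (input, output) path pairs whose output file does not exist yet.

    Hypothetical helper: it mirrors the documented naming convention only.
    """
    # rglob walks the tree recursively; the pattern matches book-0.txt
    # but not book-0_seams2.txt, so outputs are never re-read as inputs.
    for src in sorted(Path(root).rglob("*-0.txt")):
        # book-0.txt -> book-0_seams2.txt, placed alongside the original
        dst = src.with_name(src.stem + "_seams2.txt")
        if not dst.exists():
            yield src, dst
```

Calling `list(pending_inputs("/path/to/mirror"))` on a directory tree returns only the files that would still need processing under these assumptions.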
Process a Project Gutenberg mirror:
Reprocess all files (ignore existing _seams2.txt outputs):
Run benchmark comparison:
Debug sentence detection with state transitions:
Output shows internal state machine transitions:
For each input file book-0.txt, seams creates book-0_seams2.txt with:
Format: index<TAB>sentence<TAB>(start_line,start_col,end_line,end_col)
- Line and column numbers are 1-based
- Sentences are normalized (line breaks removed, whitespace collapsed)
- Span coordinates refer to the original text
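For downstream processing, a record in this format can be parsed with a short Python helper. This is a minimal sketch based only on the format line above; `SentenceRecord` and `parse_record` are hypothetical names, and the sample line is made up for illustration rather than copied from real output.

```python
import re
from typing import NamedTuple


class SentenceRecord(NamedTuple):
    index: int
    text: str
    start_line: int  # 1-based, refers to the original file
    start_col: int   # 1-based
    end_line: int
    end_col: int


# Matches the trailing "(start_line,start_col,end_line,end_col)" field.
_SPAN = re.compile(r"\((\d+),(\d+),(\d+),(\d+)\)")


def parse_record(line):
    """Parse one 'index<TAB>sentence<TAB>(span)' line into a SentenceRecord."""
    # Split off the index from the left and the span from the right, so a
    # stray tab inside the sentence (not expected) would not break parsing.
    index, rest = line.rstrip("\n").split("\t", 1)
    text, span = rest.rsplit("\t", 1)
    match = _SPAN.fullmatch(span.strip())
    if match is None:
        raise ValueError(f"malformed span field: {span!r}")
    return SentenceRecord(int(index), text, *map(int, match.groups()))


# Illustrative line only; the index and coordinates are invented.
example = '0\t"What! Captain Forrester?" cried I.\t(12,1,12,34)'
record = parse_record(example)
print(record.text, record.start_line, record.end_col)
```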
- End-to-end throughput: 1176 MB/s multi-threaded (complete pipeline: file discovery, reading, boundary detection, span tracking, normalization, and writing output)
- Sentence detection: 451 MB/s single-threaded (pure boundary detection + line coordinate tracking)
- Single-threaded end-to-end: 90 MB/s (baseline for fair comparison)
- Parallel processing: Uses available CPU cores for file enumeration and sentence splitting
- Memory efficiency: Memory-mapped files for large corpora
- Incremental: Skip already-processed files automatically
Performance scales with available CPU cores and I/O bandwidth. Actual throughput varies by:
- Hardware: CPU cores, storage speed, memory bandwidth
- File characteristics: Size distribution, text complexity
- Workload: Complete pipeline vs. raw boundary detection only
Algorithm: DFA-based boundary detection using regex-automata with narrative-aware heuristics for dialog coalescing.
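To give a feel for what dialog coalescing means, here is a deliberately tiny Python toy. It is not the SEAMS algorithm: it uses a character loop instead of a DFA, knows only straight double quotes, and relies on a hand-picked abbreviation list. Every name in it is hypothetical. The sample text is the three output sentences from the example at the top of this README joined with spaces, since the original input passage is not reproduced here.

```python
TERMINATORS = ".!?"
ABBREVIATIONS = {"Mr", "Mrs", "Ms", "Dr", "St"}  # toy list, assumption


def split_sentences(text):
    """Split text at sentence ends, but never inside a double-quoted span."""
    sentences, start, in_quote = [], 0, False
    for i, ch in enumerate(text):
        cut = False
        if ch == '"':
            in_quote = not in_quote
            # A closing quote ends the sentence only when the quoted text
            # ended with terminal punctuation AND the narration does not
            # continue in lowercase (as in: "..." cried I.).
            if not in_quote and i > 0 and text[i - 1] in TERMINATORS:
                rest = text[i + 1:].lstrip()
                cut = not rest or rest[0].isupper() or rest[0] == '"'
        elif ch in TERMINATORS and not in_quote:
            # Outside quotes: cut at punctuation followed by whitespace or
            # end of text, unless the preceding word is an abbreviation.
            at_gap = i + 1 == len(text) or text[i + 1].isspace()
            words = text[start:i].split()
            cut = at_gap and (not words or words[-1] not in ABBREVIATIONS)
        if cut:
            piece = text[start:i + 1].strip()
            if piece:
                sentences.append(piece)
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = (
    '"Well, you can see him easily enough," said Mr. Hoad. '
    '"He\'s staying in your village, I believe. '
    'He\'s a nephew of Squire Broderick\'s." '
    '"What! Captain Forrester?" cried I.'
)
for sentence in split_sentences(sample):
    print(sentence)
```

On this sample the toy yields the same three sentences as the SEAMS output shown earlier. It fails on many real cases (for example, a closing quote followed by a capitalized attribution such as `"What!" I cried.`), which is exactly the kind of situation the narrative-aware heuristics are meant to handle.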
Performance: 23× faster than nupunkt single-threaded (451 MB/s vs 20 MB/s sentence detection). Multi-threaded end-to-end throughput reaches 1176 MB/s on test system.
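The headline ratio can be re-derived from the benchmark table. The short Python check below uses only figures reported in this README; the helper name `speedup` is hypothetical.

```python
def speedup(fast_mb_s, slow_mb_s):
    """Ratio of two throughputs given in the same unit (MB/s)."""
    return fast_mb_s / slow_mb_s


# Sentence detection, single-threaded: seams-single-cpu vs nupunkt rows
print(round(speedup(450.6, 19.7)))      # 23

# End-to-end: multi-threaded seams vs the single-CPU baseline
print(round(speedup(1176.2, 90.4), 1))  # 13.0
```

The second figure suggests that end-to-end throughput on the test system grows by roughly 13× when moving from one thread to all 32, which fits the note that performance depends on I/O bandwidth as well as core count.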
Architecture: Two-stage pipeline with bounded parallelism (file enumeration + sentence splitting), async I/O with memory-mapped files.
For detailed design documentation, see SEAMS-Design.md.
For the dialog state machine implementation details, see docs/dialog-state-machine.md.
MIT License - see LICENSE file for details.
Thanks to Project Gutenberg for providing the freely available corpus used for testing and benchmarking.

