Show HN: Chonky – a neural text semantic chunking goes multilingual

20 hours ago 1

Chonky is a transformer model that intelligently segments text into meaningful semantic chunks. This model can be used in the RAG systems.

🆕 Now multilingual!

Model Description

The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.

⚠️This model was fine-tuned on sequence of length 1024 (by default mmBERT supports sequence length up to 8192).

How to use

I've made a small python library for this model: chonky

Here is the usage:

from src.chonky import ParagraphSplitter # on the first run it will download the transformer model splitter = ParagraphSplitter( model_id="mirth/chonky_mmbert_small_multilingual_1", device="cpu" ) text = ( "Before college the two main things I worked on, outside of school, were writing and programming. " "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. " "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. " "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' " "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, " "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — " "CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights." ) for chunk in splitter(text): print(chunk) print("--")

Sample Output:

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep -- . The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights. --

But you can use this model using standart NER pipeline:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline model_name = "mirth/chonky_mmbert_small_multilingual_1" tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024) id2label = { 0: "O", 1: "separator", } label2id = { "O": 0, "separator": 1, } model = AutoModelForTokenClassification.from_pretrained( model_name, num_labels=2, id2label=id2label, label2id=label2id, ) pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") text = ( "Before college the two main things I worked on, outside of school, were writing and programming. " "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. " "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. " "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' " "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, " "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — " "CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights." ) pipe(text)

Sample output

[{'entity_group': 'separator', 'score': np.float32(0.66304857), 'word': ' deep', 'start': 332, 'end': 337}]

Training Data

The model was trained to split paragraphs from minipile, bookcorpus and Project Gutenberg datasets.

Metrics

Token based F1-score.

Project Gutenberg validation:

Model de en es fr it nl pl pt ru sv zh

chonky_mmbert_small_multi_1 🆕	0.88	0.78	0.91	0.93	0.86	0.81	0.81	0.88	0.97	0.91	0.11
chonky_modernbert_large_1	0.53	0.43	0.48	0.51	0.56	0.21	0.65	0.53	0.87	0.51	0.33
chonky_modernbert_base_1	0.42	0.38	0.34	0.4	0.33	0.22	0.41	0.35	0.27	0.31	0.26
chonky_distilbert_base_uncased_1	0.19	0.3	0.17	0.2	0.18	0.04	0.27	0.21	0.22	0.19	0.15
Number of val tokens	1m	1m	1m	1m	1m	1m	38k	1m	24k	1m	132k

Various english datasets:

Model bookcorpus en_judgements paul_graham 20_newsgroups

chonkY_modernbert_large_1	0.79	0.29	0.69	0.17
chonkY_modernbert_base_1	0.72	0.08	0.63	0.15
chonkY_distilbert_base_uncased_1	0.69	0.05	0.52	0.15
chonky_mmbert_small_multilingual_1 🆕	0.72	0.2	0.56	0.13

Hardware

Model was fine-tuned on a single H100 for a several hours

Read Entire Article

Show HN: Chonky – a neural text semantic chunking goes multilingual

Model Description

How to use

Sample Output:

Sample output

Training Data

Metrics

Hardware

Related

Show HN: MicroDiagram Prototype

Show HN: Raise Animals – Codes, Calculators and Wiki Guide 2...

The Anti-Tail of 3I/Atlas Turned to a Tail