Automating Cantonese Romanization

1. Why is Jyutping Automation Hard?

In Cantonese, one Chinese character can often be read in up to ten ways. Converting Chinese characters to their Jyutping requires picking out the correct reading.

A short sentence like 曾太曾婆婆媽媽地愛 explodes to (2 x 2 x 2 x 2 x 3 x 3 x 4 x 4 x 4 x 2) = 18,432 possible combinations. Given a string of glyphs, assigning each the correct Jyutping is called the “grapheme to phoneme” (G2P) problem. While seemingly trivial (all native speakers do this naturally), it is hard to get right. Solving G2P is so important to other high-value tasks that non-linguists are surprised it wasn’t solved long ago. I was surprised.

The first hint (that I was too foolish to heed) that this is a hard problem is that linguists just don’t even try, or don’t believe that this could be solved very well. A second hint (that I also didn’t heed) is that when solutions were provided, they were never provided with any benchmark; none of the solutions announced their accuracy.

In hindsight, poor accuracy presents a chicken-and-egg dilemma: benchmarking requires comparing the output of the solution against a large, diverse set of Golden Answers, but without a good solution, preparing the Golden Answers is prohibitively difficult. No one volunteers to work with 50,000 characters’ worth of eye-sore.

The size of the task, and the likelihood of missing something, means you need several people doing multiple passes to get this right. Preparing a Golden Answer is not glamorous and no one is funding six-months of salary / > HKD 120,000 for… “what? Can’t you just write some code for that?” (No, we can’t know how good the code is until we have the Golden Answer!)

Over the past three years, I developed the Cantonese Font and, as part of its development, solved this hard but fundamental Jyutping assignment problem. Version 3.4 benchmarked at

  • 99.2% for non-standardized spoken Cantonese and Written Chinese text (3rd party benchmark),
  • 99.72% for Written Chinese text following a style guide, and
  • 99.85% for spoken Cantonese text following a style guide

The Cantonese Font Jyutping assignments are more accurate than other solutions by 2-3 orders of magnitude. Exceptional accuracy, paired with easy readability, made it possible to prepare hundred-thousand-character Golden Answers. A character error rate of 0.2% means 50,000 characters come with 100 errors to fix; the same assignment with Claude Sonnet 3.5 (8.5% error rate) would suffer 4,250 errors.

Linguistically, the most surprising thing this non-linguist learnt was that G2P is not one problem, but a nested, entangled mess of problems, each carrying exceptions all the way down. Engineering-wise, the most surprising (and satisfying) discovery was that this intractable problem has a simple, direct solution.

This article condenses hard-won insight into what Jyutping automation means, which qualities beyond accuracy are valuable in Jyutping automation, and how the Cantonese Font solves the problem on a conceptual and technical level. Let’s start by looking at why this is probably a harder problem than you thought.

2. The Entangled Mess

The Chinese-Japanese-Korean (CJK) script is a set of ideographic glyphs shared across East Asia. It is used by different oral language systems, over wide geographic expanses and comes with a two-thousand year history. This means a Cantonese speaker can read aloud “in Cantonese”:

  • the Hanzi in a run of Japanese text
  • a poem engraved two thousand years ago
  • a Chinese Maoist edict from the 1970s, written in the Simplified Chinese script
  • modern Taiwanese text written with “Standard Written Chinese” grammar and vocabulary
  • English-mixed colloquial transcript of a phone call in Hong Kong

For pedagogical and practical usage, the main text-type of interest is a continuum of spoken Cantonese and written Chinese, where fragments of spoken Cantonese, written Chinese, and cultural (literary / historical) text may be interspersed at the sentence and sub-sentence level.

While any Cantonese speaker can read these texts aloud, what they end up pronouncing may not be the same. Their interpretations are influenced by their locale (e.g., Hong Kong, Malaysian, and Guangzhou speakers), age, education, and, in cases where the run of glyphs is truly ambiguous (提醒下 haa5 / haa6 位顧客), personal experience. What is popular is not necessarily what is right, and what learnt men proclaim to be orthodox may be alienating to the masses.

To solve the Hong Kong Cantonese G2P problem is to

  1. unearth this shared but hidden cultural knowledge that a reasonable, modern (1950s-2020s) HK audience would accept; and
  2. apply this highly irregular heuristic to
    • every (who-knows-how-many) readable glyph (including Traditional HK variant, Traditional Taiwan standard, Simplified Chinese, Japanese Hanzi)
    • spoken Cantonese and written Chinese
  3. take into account special usage across eras and domains, including
    • literary uses
    • historical / religious terms
    • specialist knowledge (e.g., science, finance)
  4. fail gracefully.

Failing gracefully (and the related virtue that failures can be handled gracefully) is extremely important but often neglected. We will devote an entire section to this.

Glyphs & Codepoints

Chinese has many problems, one of which is that there are just so many of them (glyphs and people). But what makes this a thorny issue is that their identities are intertwined (also true for both glyphs and people).

Unicode has been the standard encoding for “alphabets” for the past 25 years. The Unicode Consortium works with the Ideographic Research Group (IRG), with members from different nation-states, to assign a codepoint (think of a digital ID number) to every unique character in every language. There are 97,680 IDs issued (as of Unicode 16.0, 2025) for CJK glyphs.

We need to define some terms more clearly:

  • a codepoint is a numeric ID;
  • a glyph is a kind of idealized, standardized two-dimensional drawing;
  • a character is a mushy idea of a package of glyph-reading-meaning.

Glyphs are often localized and there are five main variants:

  1. Traditional Taiwan
  2. Traditional Hong Kong
  3. Simplified
  4. Japanese
  5. Korean

Due to localization and historical / political reasons, one character can be represented by multiple glyphs (rows 1-3 in the next picture show how 户 啟 説 are characters with three glyphs / three codepoints). One codepoint usually represents one glyph (but not always, see row 4 骨; in the PRC this is written the opposite way). Then there are the other cases.

If you are confused, well, it is confusing. Yasuoka & Yasuoka 安岡 wrote an entire treatise on the inconsistencies (from which the above was excerpted).

Not every character can be pronounced in Cantonese. Over the years, I combed and combined all the dictionary / dictionary-like sources, and found only 30,000 codepoints that can be pronounced in Cantonese. They are all in the Cantonese Font.

Of the 30k codepoints, more than 7,500 (25%) are polyphonic (can plausibly be read in more than one way). While only 292 glyphs have five or more readings, these are often common characters (e.g., 行), and assigning them correctly is absolutely critical. Cantonese Font 3.5 includes a total of 39,984 glyph-phone combinations.

Beyond the magnitude of the problem, it is the entangled identities that truly confound the issue. While native readers consider 靜靜鷄搵下、靜靜雞揾下、静静鸡揾吓 as equivalent, they contain nine distinct codepoints and 2 x 3 x 2 x 2 = 24 possible glyph combinations. 鷄、雞、鸡 are variant glyphs of the same character, as are 搵 (TW standard) and 揾 (HK standard). 下 and 吓 are not the same character, but many users treat them as if they were interchangeable: many people believe that 下 should be read as haa6, and in the cases of the 一下 contraction where it is pronounced as haa5, they prefer to substitute 吓-haa2 to represent the inflection.
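The counting in this example can be checked with a short Python sketch. It is only an illustration: grouping variants by position, and assuming a writer uses the same variant for both halves of the doubled 靜, are my assumptions to match the 2 x 3 x 2 x 2 arithmetic, not something the Cantonese Font is known to do.

```python
from math import prod

# The three spellings that native readers treat as equivalent.
spellings = ["靜靜鷄搵下", "靜靜雞揾下", "静静鸡揾吓"]

# Distinct codepoints across all three spellings.
codepoints = {ch for s in spellings for ch in s}

# Collect the glyphs seen at each position; identical groups
# (the doubled 靜) collapse into a single group.
variant_groups = {frozenset(column) for column in zip(*spellings)}

# Assumption: one variant is chosen per group and used consistently.
combinations = prod(len(group) for group in variant_groups)

print(len(codepoints), combinations)
```

The two printed numbers should correspond to the nine codepoints and 24 combinations quoted above.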

By the way, if you skipped over the rest of the Yasuoka & Yasuoka image, my above example might give you the false impression that glyph locale variants map one-to-one. That is not true and can get pretty hairy. As an example, 干 is a Traditional glyph that in the Simplified Chinese system maps to 3 characters (干 itself; 幹 tree branch / to do / “to do” (naughty, naughty); 乾). 乾 has two meanings: (1) dry and (2) prosperous. When 乾 means dry, it is read as gon1; when it means prosperous, it is pronounced kin4. Now for the twist: when 乾 is read as kin4, you cannot represent it with 干. To assign kin4 lung4 to 干隆 (call-sign of an Emperor) is an error.

Average polyphonicity, Plausibility Scaling, and Difficulty of G2P Problems

With 29,150 glyphs plausibly read in 39,992 ways, Cantonese has a naive “polyphonicity” of 1.37. This is not frequency-weighted and gives us a lower bound on the difficulty of the G2P problem.

A string of 100 random characters carries 1.37 x 1.37 x … x 1.37 = 1.37^100 = 46,995,547,730,037 plausible combinations. A G2P system has to pick out the right needle from the massive haystack.

Characters, of course, don’t occur randomly. A better way to estimate the polyphonicity is to weight each polyphonic character by how frequently it shows up. The most commonly used characters are all polyphonic; my hunch is that polyphonicity, thus measured, would be close to 2. The number of plausible combinations then blows up far more quickly: for one hundred characters, plausible combinations would go from 10^13 to 10^30.
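As a quick sanity check on these magnitudes, here is a minimal Python sketch using only the figures quoted in this section. The value 2 for the frequency-weighted case is the hunch stated above, not a measured number.

```python
# Figures quoted in this section.
glyphs = 29_150
readings = 39_992

# Unweighted ("naive") polyphonicity: average readings per glyph.
naive = readings / glyphs
print(f"naive polyphonicity: {naive:.2f}")

# Plausible combinations for a string of 100 characters.
naive_combinations = 1.37 ** 100     # on the order of 10^13
weighted_combinations = 2 ** 100     # on the order of 10^30 (hunch, not measured)
print(f"{naive_combinations:.2e}  {weighted_combinations:.2e}")
```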

The truly good way to estimate the narrowness of the right solution is to weight by the glyphon (glyph-plus-reading) frequencies, and to ask what happens if we simply pull out the most likely glyphon. For example, given that 長 appears 75% of the time as coeng4 and 25% as zoeng2, we could just assign every 長 as coeng4. This would let us assess how difficult a G2P problem is, and get to a different way of benchmarking.
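A "most likely glyphon" baseline could be sketched as follows. The counts are hypothetical and simply mirror the 75% / 25% split used in the 長 example; real counts would have to come from an annotated corpus, and the function names are mine.

```python
from collections import Counter

# Hypothetical reading counts for one character (75 / 25 split from the example).
counts = {"長": Counter({"coeng4": 75, "zoeng2": 25})}

def most_likely(char):
    """Return the single most frequent reading of a character."""
    return counts[char].most_common(1)[0][0]

def baseline_accuracy(char):
    """Accuracy obtained by always assigning the most frequent reading."""
    c = counts[char]
    return max(c.values()) / sum(c.values())

print(most_likely("長"), baseline_accuracy("長"))
```

Averaging such per-character baselines over a corpus, weighted by how often each character occurs, would give a floor that any context-aware G2P method has to beat.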

Language Systems

The 鷄、雞、鸡 / 搵、揾 / 干、幹、乾 example showed how characters are confounded at the glyph and codepoint level. A second entanglement happens between modern Cantonese and other language systems in space and time.

The most pronounced issue is the divergence between “vernacular reading” and “literary reading” (文白異讀). As an example, the verb to sit 坐 will be indicated as zo6 in the dictionary (literary reading; “high register”) but co5 in daily usage. The literary reading is used in most (but not all!) classical idioms, and for reading aloud standard written Chinese text; annotating written Cantonese with the literary reading Jyutping, however, is an error. The opposite case, of annotating standard written Chinese with vernacular Cantonese sounds, is also an error.

However, classifying whether a piece of text is standard written Chinese or written Cantonese is not straight-forward. The Facebook/Meta “No Language Left Behind” (NLLB) project in 2021 included a classifier for identifying the language of a text. Given a piece of text written in 漢字, its success rate for discerning whether it is in Mandarin (standard written Chinese) or Cantonese was 50%. That is as good as a coin flip.

The real world has text more challenging than that in the NLLB test cases. We frequently have text where standard written Chinese and written Cantonese are woven into one another within a paragraph, or even within the same sentence.

Geographically, Cantonese also receives influences from Min 閩, Hakka 客家, and 壯侗語 (languages spoken in Thailand / Myanmar), and these result in a special lexicon that needs to be accounted for. Cantonese opera, for example, due to its descent, reads the characters in 中州話 (Zung Zau language), and 合尺 (in tune) is read not as hap6 cek3, as would normally be expected, but as ho4 ce1.

European loan-words and transliterations often attach new, one-of-a-kind sounds to an existing glyph; 朱古力, instead of gu5 lik6, is read as gu1 lik1. All of these are irregular and follow no rules.

Historically, Cantonese descends from ancient Chinese some 2,500 years old. Glyphs change and morph, but there is an accepted body of scholarly work that traces a “correct reading” for each remnant piece of text over each era. For example, 丁 is solely read as ding1 in the modern world; however, in the context of 伐木丁丁, a sentence from a poem collection dating from 1100-700 BC, it must be read as zang1.

A specific subset of these historical readings is the idioms 成語. These are usually four characters long, and there are thousands of them.

A good G2P system needs to correctly classify text into these historical / geographical language systems and handle exceptions that are often singletons.

Sound as Meaning-Setter

Common characters, especially those with a long history, often have different readings that are linked to the meaning. I am using “linked” here because the causality goes both ways: the context changes the sound, but the sound chosen often also determines the meaning.

An example is the character 樂. When pronounced as lok it is the adjective happy; ngok is describing music (noun) or musical (adj); ngaau is the verb to enjoy, but only in ancient classical Chinese usage.

In some cases the same character has multiple sounds for the same part of speech. 彈, for example, has a reading of taan4, which means to play a string instrument, and a reading of daan6, which means to bounce.

A G2P system needs to interact with the context to provide a most-likely usage.

Language Use Standards

Cantonese was, like most languages, primarily spoken throughout history. In modern times, the places where it was spoken (Canton, Hong Kong) were situated at the interface of Great Empires. Colonizers had little interest in promoting the language, and it received no state-sponsored standardization. In Hong Kong, a metropolis where over 90% of the population speaks Cantonese as a mother tongue, not only is there no school instruction on Cantonese or Jyutping, but written Cantonese is actively penalized as “bad Chinese”.

This means that we often agree on a sound-sequence having a specific meaning, but not on the glyphs used to represent it. These can be compounds like ham6 baang6 laang6 (“absolutely everything”) and lat1 kak1 (“stutter”), or single characters. Some concepts like beu6 (to bump) or he3 (to slack off) or kem5 (to be embarrassingly nerdy) we generally don’t even try to represent with a glyph; others, like the sentence-final particles, receive confusing, conflicting, and ambiguous mappings. It is never very clear whether 係呀 should be hai6 aa4 (yea, you think so?) or hai6 aa3 (yes indeed).

Learnt men often believe that we must adopt a standard, and rightfully it should be their standard. Sadly, other learnt men do not agree on whose standard we ignorant masses should follow.

Since written Cantonese has no guidelines, standards, or examinations, it is written with great diversity. For example, many writers do not distinguish between gam3 (to this extent) and gam2 (in this manner) and simply represent both as 咁 (as opposed to using 噉 for gam2); some argue for distinguishing between the use of 俾-bei2 and 畀-bei2; yet others advocate for 吓-haa2 for the contraction of 一下, conventionally represented as 下-haa5. A G2P method needs to take some stance; no stance is very much a stance.

Back to the learnt men: one of the biggest splits, which has caused endless arguments since its ascendance in the late 90s, is the Right Pronunciation 正音 Movement by a Professor Ho of CUHK. He argued for restoring the Right Pronunciation, often citing as evidence his reading of a rhyme dictionary from 1080 AD. The pronunciation proposed is often absurd to native speakers and never really caught on with the populace; broadcasting institutions, however, have been enamoured with it. A G2P algorithm simply has to choose whether 時間 (time) is si4 gaan3 (how normal people read it) or si4 gaan1 (what Prof Ho advocates).

The Curious Case of the Cockroach

My favorite demonstration of the lack of standardization is the pair of characters 曱 and 甴, which together mean cockroach. Some people write 曱甴, others 甴曱, but they both expect the reader to pronounce gaat-zaat.

Prof Abraham Chan (UoW) solved this by tracing how gaat-zaat was written through 150 years. This had historically been written as 甴曱; some time in 1956, in a student magazine, the reversed 曱甴 was first written. 曱 looks like 甲-gaap, and it seems 曱甴 might be plausible and this orthography took root.

What does the Cantonese Font do? I came across this before Prof Chan’s lecture; spent hours on this case and went nowhere. You get gaat-zaat no matter how you write the pair.

Tone Changes

Cantonese additionally has extensive tone changes. Most tone changes are irregular (and sometimes dependent on the individual). As an example, 人 (people) is normally jan4, but in 男人 (man) it receives a tone-shift to jan2. However (!!), in the extended fragment 男人老狗 it must revert to jan4. None of this is systematic.
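One common way for a rule-based system to cope with nested exceptions like this is a longest-match lookup, where a longer entry overrides the shorter ones it contains. The sketch below is only an illustration of that general technique and is not a description of how the Cantonese Font works. The rule table is hypothetical; the readings for 人 come from the paragraph above, and the readings for 男, 老, and 狗 are filled in by me for completeness.

```python
# Hypothetical rule table: longer entries override shorter ones.
RULES = {
    "人": ["jan4"],
    "男人": ["naam4", "jan2"],
    "男人老狗": ["naam4", "jan4", "lou5", "gau2"],
}
MAX_LEN = max(len(key) for key in RULES)

def annotate(text):
    """Greedy longest-match lookup; characters without a rule get None."""
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in RULES:
                out.extend(zip(chunk, RULES[chunk]))
                i += n
                break
        else:
            out.append((text[i], None))
            i += 1
    return out

print(annotate("男人"))
print(annotate("男人老狗"))
```

Every exception becomes one more entry in the table, which hints at why such tables grow into tens of thousands of rules.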

These combine with loan words and mixed scripts. For example, the last name 王 wong4 receives a tone-change to wong2 when it is used with 伯-baak3 (uncle), but it also receives the tone-change when suffixed with English sir.

Domains

Language captures lived experiences, and people live very different lives. Each specialty domain has its own parlance, and sometimes simple characters take on unexpected readings. For example, 奇 is normally kei4, but when used in the mathematics context of an odd number it is gei1.

Between natural sciences, economics / finance, history, arts, religions, medicine — many domains have entire Chinese and western complements — there’s lots of esoteric grounds to cover.

What needs to be covered is not just domain-specific terms, but also proper nouns such as 巴甫洛夫, where 甫 could be pou5 or fu2. (It’s fu2; 甫 is the first V in Pavlov, of Pavlov’s salivating dogs fame.)

One area (excuse my pun) that people really expect a G2P system to get right is place-names. Someone growing up in 上山鷄乙 finds it inexcusable that your G2P assigns 乙 as jyut6 when it is read as fat1. And there are lots of place-names.

Pseudo-domains

By now you’d get the sense that “knowing what is the right reading” is not always straight-forward even for native speakers. When there is a knowledge gap, cottage industries naturally spring to fill the gap.

In Hong Kong, there are Right Reading 正音 (mentioned above) and Right Writing 正字 movements, some headed by scholars, others headed by charlatans. You look from one to the other, and often it is hard to tell who is who. People feel good when they feel superior to others; and in a culture prizing literacy, you can feel really superior over your neighbours when you know how 齮齕 and 㪐㩿 and 虢礫緙嘞 should be read and they don’t.

This presents a dilemma for people working on G2P (me) and input methods (not me). I personally agree with KT Shek (my personal hero that digitized twenty historical dictionaries into Jyut.net) that: 「任何貼文、文章或影片,只要出現「廣東話/粵語正字」字眼,你就知道,這貼文、文章或影片必會變為業餘、不認真,和像馬戲團。」(“If a post, article, or video contains the words Cantonese Right Writing, you will know this post, article, or video is amateurish, unserious, and like a clown show.”) However, if someone insists on 㪐㩿 being annotated as lat1 kak1, why should I correct them?

3. Systems: Rule-based vs LLM

Two broad categories of approaches exist.

  • Rule-based approaches state that we, humans with puny feeble minds, can nonetheless understand the problem and write the solution down. Examples include PyCantonese (Jackson Lee) and ToJyutping (CanCLID). These tend to be fast, at around 10^4 tokens / second.
  • Machine learning / artificial intelligence / large language models throw lots of data at computers, and let the expensive graphics cards spend expensive electricity to figure out a black-box model. Once the model exists, we can throw future problems at it and, provided we still have our expensive graphics cards and expensive electricity, get to solutions. These tend to be slow, with throughput of 10s of tokens / second.

LLMs didn’t exist in 2021, and while other machine learning techniques existed, no one had used them to attack the Cantonese G2P problem. The largest class of proprietary LLMs (Claude, GPT4o, Grok) tends not to be too absurd with its Jyutping assignments. Training AIs needs large volumes of good training data; what circulated on the internet was usually at a one-error-per-sentence level (1-in-15), so they couldn’t get better than that. Decontaminating influences from Mandarin pinyin and Yale / Canton Pinyin / … was also an issue.

Rule-based solutions depend on rules, and most of these rules came as character:Jyutping pair listings exported from dictionaries and keyboard input methods. (PyCantonese leans on the Rime-Cantonese list, and Wing Font uses the TypeDuck list.)

The Cantonese Font is a rule-based system. In hindsight, its success is 20% innovation and 80% execution. Writing and maintaining tens of thousands of rules is tedious and error-prone. Most lists were “fragilely crowd-sourced”, with 3 contributors contributing 80% of the material; large blocs were copied from older lists (inheriting decades-old errors), others were procedurally generated (new errors en masse), and the length of the flat, unordered, untested list just grows and grows. Two things happen in a chicken-and-egg relationship: the main contributors retire, and the list collapses under its own weight and inconsistency.

Then someone with a new project sees the now-abandoned list, and without knowing its history, they base their project off a compilation of this and other abandoned lists, and the cycle of life continues…

The Cantonese Font is, from Day 1, a commercial project. My vision is for “Cantonese infrastructure building”. That is sustainable only when one can provide the pay and culture to keep a good team growing together.

4. Goodness of Solutions

Originally this section (indeed, the entire article) was on the accuracy of G2P. Writing helps you think, and I realized that (within the same ball-park) there are far graver sins than a few % lower accuracies. This arrangement is ordered in my ranked importance, most important first.

Failure Mode

I would probably not have appreciated this before, but I now consider how a G2P system fails to be of paramount importance. (Failures inevitably happen, and how/whether they can be handled gracefully is also important; that is split off into its own section.)

Failures must be deterministic. Non-deterministic failures are unacceptable. Certain G2P methods produce results with some randomness injected; their output is non-deterministic (every run of the same input could produce different output). Machine learning (“AI”, large language models) usually produces output like this. 三 being saam1 in one place is no guarantee that it will not be sam1, or god forbid, zot3 elsewhere. Correcting these means poring over every Jyutping, which is very mentally taxing.

Deterministic errors can often be corrected in batch. Consider the case of 馬蹄, which could be 馬蹄-tai2 (water chestnut) or 馬蹄-tai4 (horse hooves). The Cantonese Font assigns 馬蹄-tai2 (water chestnut) because, well, I thought people eat and cook more often than they own or ride horses. When working on Animal Farm, there were lots of horses and lots of horse hooves. A single search-and-replace fixed a quarter of all errors.
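A batch correction of this kind could look like the sketch below. The data layout (a list of one-character, one-reading pairs) and the function name `batch_correct` are assumptions made for illustration; the actual workflow may well have been a plain search-and-replace in an editor.

```python
def batch_correct(pairs, word, readings):
    """Replace the readings of every occurrence of `word`.

    `pairs` is a list of (character, jyutping) tuples, one character per
    tuple, so string offsets line up with list indices.
    """
    chars = "".join(ch for ch, _ in pairs)
    fixed = list(pairs)                      # leave the input untouched
    start = chars.find(word)
    while start != -1:
        for offset, reading in enumerate(readings):
            fixed[start + offset] = (word[offset], reading)
        start = chars.find(word, start + len(word))
    return fixed

annotated = [("馬", "maa5"), ("蹄", "tai2"), ("、", None),
             ("馬", "maa5"), ("蹄", "tai2")]
corrected = batch_correct(annotated, "馬蹄", ["maa5", "tai4"])
print(corrected)
```

This only works because the error is deterministic: every 馬蹄 was wrong in the same way, so one rule fixes all of them.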

Failures should be non-slipping. N characters should produce N Jyutping. The system should not insert extra Jyutping for imaginary characters, nor should it silently fail to emit a Jyutping. Slippage is problematic when the Jyutping is not linked to the characters by co-located display or a data structure, because when one is aligning the characters to the Jyutping (manually or automatically) it is often not clear where the slippage occurred, and it generates a cascade of errors.

Failing to emit a Jyutping (and doing so silently) is common with rule-based systems when they encounter a character that they fail to handle (an encoding issue, or a character not included in the rules). Inserting extra Jyutping requires hallucinating non-existent characters, and really only happens with LLMs.
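When the output does keep each Jyutping paired with its character, slippage is cheap to detect. The following is a minimal sketch under that assumption; `find_slippage` is a hypothetical helper, not part of any of the tools named here.

```python
def find_slippage(text, pairs):
    """Return the index where the annotation first drifts from the input.

    `pairs` is a list of (character, jyutping) tuples. Returns None when
    every input character has exactly one aligned entry.
    """
    for i, ch in enumerate(text):
        # A missing or mismatched entry means a character was skipped
        # or an imaginary one was inserted before this point.
        if i >= len(pairs) or pairs[i][0] != ch:
            return i
    if len(pairs) > len(text):
        return len(text)   # extra entries after the end of the input
    return None

good = [("馬", "maa5"), ("蹄", "tai2")]
print(find_slippage("馬蹄", good))        # aligned
print(find_slippage("馬蹄", good[1:]))    # first character silently dropped
```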

Failures should not startle the user. Frequently G2P is done by a producer (e.g., author), who is herself not the consumer of the Jyutping (the reader). It is the producer’s responsibility to make sure the Jyutping is correct; however, Jyutping is not an examinable skill in school, and there are currently very few Jyutping-fluent readers. In the wild, I estimate that at least half of the Jyutping, once it is generated from a G2P, is not further proof-read.

The reader who leans on the Jyutping will have empathy when 下 in 一下車 is annotated as haa5; they will not extend the same empathy when it’s labelled as xiem1. Hallucinating is obviously the expertise of LLMs; I shall not flog the issue. Surprisingly, this also happens with rule-based systems. Rule-based systems often include thousands of rules; they are not only maintained by several people, but were imported from other crowd-sourced projects en masse. A xiem1 input by a whisky-sipping contributor in 2004 can have repercussions far in time.

This can be quantified with some kind of Levenshtein distance, or piecewise Jyutping component proximity.
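As a rough illustration of the first option, here is a plain character-level Levenshtein distance in Python. Treating a Jyutping syllable as a bare string is a simplification of mine; a component-wise comparison (onset, final, tone) would likely be more meaningful.

```python
def levenshtein(a, b):
    """Classic edit distance: insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("haa5", "haa6"))   # a forgivable slip
print(levenshtein("haa5", "xiem1"))  # a startling one
```

A tone slip such as haa5 for haa6 sits one edit away from the truth, while xiem1 shares nothing with it.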

Failures should be concentrated. Ideally, 90% of failures are concentrated on 10 characters. The task of proof-reading then involves 10 passes over these characters instead of paying deep attention to each character. This can reduce the time required to proof-read to a third.
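Concentration can be measured once the errors of a proof-read text have been logged. The sketch below assumes the errors are available as a flat list of the characters that were mis-assigned; the sample list is invented purely to exercise the function.

```python
from collections import Counter

def top_k_share(error_chars, k=10):
    """Fraction of all errors that fall on the k most error-prone characters."""
    if not error_chars:
        return 0.0
    counts = Counter(error_chars)
    top = sum(n for _, n in counts.most_common(k))
    return top / len(error_chars)

# Invented sample: ten errors spread over four characters.
errors = list("下下下下長長長行行人")
print(top_k_share(errors, k=2))
```

A share close to 0.9 at k=10 would correspond to the ideal described above.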

Failures should be stable for the user. This is related to the first point, that failures should be deterministic, but on a longer time horizon. Professional users (and we should hope for more and more people doing Cantonese work professionally) work on similar projects month after month, year after year. Over time, for the text type they work with, the user develops a certain intuition about where the technology is likely to fail (provided the Failures are Concentrated). Thoughtful technology ought not push changes onto the user unless they ask for it, even if it is objectively a minor upgrade. (“Yes, dear, my desk is a mess to you, but I know exactly where everything is in that mess.”)

Expressiveness

Can the G2P solution express the character and Jyutping? This may not always be true, and the most acute cases are in the font-based solution.

In the font-based solutions (Cantonese Font and Wing Font), if a character+Jyutping combination was not included as a glyph, the user cannot access that combination. Wing Font contains the ~10,000 most common Traditional characters and their 13,000 most common Jyutping combinations. If you need the Simplified Chinese 条, you need to build your own font; if you need to say 皺【皺-caau2】地 or 甜【甜-tim2】地, you need to build your own font; if you need the character 嚱 (都幾好睇【嚱-he3】)… you need to build your own font. And to build your own font, you’ll need to make sense of a 180,000-entry CSV merge that literally no one has made sense of…

(This is always on my mind. It is relatively trivial to include every character that can be pronounced in Cantonese — there’s about 29000 of them — but not possible to guarantee exhaustiveness about every tone-change / rare application of common characters. Cantonese Font 3.4 includes 39,981 character-Jyutping combinations, and I take documented requests to patch missing characters including specialized cases like 乙 as ji1 in Cantonese opera.)

Other notable cases include data models founded upon incorrect beliefs, such as that Jyutping does not contain spaces, or that -em is not a valid final. In these cases, it would not be possible to express polysyllabic characters (the most famous of which is 卅 saa1 aa6) or 舔-lem. These beliefs can be wired in at a very deep level, and cause slippage (aa6 aligned on the next character) or strangeness if the “Jyutping of a word” is glued together and separated by spaces.
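The slippage from the "no spaces" belief is easy to reproduce. In this sketch the word 卅年 and the pair-based layout are my own illustration (年 nin4 is supplied by me); it simply contrasts a per-character structure with a space-glued string.

```python
# Each character keeps its own reading; a reading may hold several syllables.
structured = [("卅", "saa1 aa6"), ("年", "nin4")]

# Gluing the word's Jyutping with spaces and splitting it again
# yields three tokens for two characters.
glued = " ".join(reading for _, reading in structured)
naive_tokens = glued.split()

print(len(structured), len(naive_tokens))
```

With the glued form, aa6 ends up aligned with 年 and nin4 is left without a character, which is exactly the cascade described under non-slipping failures.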

Accuracy

Ah, and you thought the whole article was about this! If you jumped directly to this section, go back up to read Failure Modes and Expressiveness.

Accuracy is a straight-forward number, and technologists and reductionists love numbers. Alas, this is a nuanced number with some common misconceptions.

First, we tend to present accuracy as a percentage, and treat the number as if they were scores on a highschool test. “92%? Awesome!” “98%? A little more awesome!” This is the wrong interpretation. What we ought to care about is how frequently an error occurs, and this is the ratio of correct vs wrong.

Thus 92% says that out of a hundred characters, 92 will be right, 8 will be wrong, for a ratio of 1-in-12. 98% is 98:2 for a ratio of 1-in-50. This is not a 6% but four-fold (400%) improvement. Disparity between error ratio and “school score” grows dramatically at the near-100 end:

Accuracy as % | Average characters to find an error | Number of errors in a 50,000-character text | Comment | G2P example
75% | 3 | 12,500 | | Llama 3.1 405B
92.5% | 12 | 3,750 | Error every sentence | Claude Sonnet 3.5
95% | 19 | 2,500 | | PyCantonese
97.5% | 39 | 1,250 | Error every paragraph |
99.0% | 99 | 500 | G2P generally not limiting |
99.5% | 199 | 250 | | Cantonese Font 2.8
99.85% | 665 | 75 | | Cantonese Font 3.4
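The two numeric columns follow directly from the accuracy. A minimal Python sketch, using the correct-to-wrong ratio described above (92% is 92:8, roughly 1-in-12) and exact fractions to avoid floating-point rounding at the near-100 end; the function name is mine:

```python
from fractions import Fraction

def error_profile(accuracy_percent, text_length=50_000):
    """Return (correct characters per error, expected errors in the text).

    Pass the percentage as a string, e.g. "99.85", so it is parsed exactly.
    """
    acc = Fraction(accuracy_percent)
    wrong = 100 - acc
    per_error = acc // wrong                  # whole correct characters per error
    errors = int(wrong / 100 * text_length)
    return per_error, errors

for pct in ["75", "92.5", "95", "97.5", "99.0", "99.5", "99.85"]:
    print(pct, *error_profile(pct))
```

Going from 99.5% to 99.85% looks like a rounding error as a school score, yet it more than triples the distance between errors.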

If this looks dramatic, it actually understates the difference. I annotated a Gospel (43k characters) with PyCantonese before Cantonese Font v2 was available, and in practice, the human effort needed scales non-linearly with errors. Managing 2,500 errors is more than thirty times as tedious as managing 75. You end up building systems to manage this essay’s worth of errors, and soon you’re in “your errors have errors” territory right out of Kafka…

What about Humans?

Awful.

How awful depends on if the humans are trained in Jyutping, and allowed use of assisting tools. Native speakers provided with a list of the rules will probably generate an error for every correct Jyutping. It is a learnt skill.

Trained native speakers who cannot look up dictionaries will probably be in the 75% category. Many native speakers cannot reliably distinguish aa / ngaa, -k / -t, or the 2-5, 3-6, and 3-5 tone pairs. This is on top of mixing up characters. (Several times a teacher-user has reported missing glyphons in the Cantonese Font, but in the end it was them mixing up two somewhat related characters…)

An expert native speaker with Jyutping as primary input method, and having proof-read 200,000 characters’ worth of Jyutping, scores around 95% (90% without the tone-marks). Supporting evidence:

A G2P method tends to be tuned for spoken Cantonese or written Chinese, but this doesn’t show up until the system is sufficiently good (est. 95% or so). In a rule-based system, the tuning shows in what the method spits out when the input matches none of the rules.

  • PyCantonese (in its 2022 version) inherits from the Rime-Cantonese dataset, and defaults to citation readings and thus to written Chinese;
  • InjectJyutping / ToJyutping includes an unknown compressed trie;
  • Wing Font uses an unknown mix of TypeDuck, 粵語審音配詞字庫, and Cantonese Romanization Converter. My limited tests suggest it also defaults to written Chinese.

In building the Cantonese Font, I picked the default for every character. The criterion was a balance between which standalone sound gives an overall better result and what remains acceptable to a range of users (more on this later). Overall there is a spoken Cantonese bias to the common characters, but this is not general; it results in a higher error rate in pure written Chinese texts.

(Where things really fall apart is with ancient Chinese literature 文言文. Think Shakespeare but twice as old. Accuracy of the Cantonese Font falls to 97% or so over 9,000 characters, and the failure modes become very unusual.)

“Four nines” 99.99% was my aspiration / obsession for years. That goal turned out to be misguided. Once past 99% or so, the bottleneck was in the writing and editing. This is especially true for spoken Cantonese text.

Spoken Cantonese did not undergo state-enforced standardization. As a result, there are lots of ways to write what diverges from Mandarin / written Chinese. Writers, for example,

  1. use 阿 啊 吖 呀 啞 interchangeably to cover aa1 aa2 aa3 aa4 aa5 (and their ng- analogues),
  2. use 嘅 for both ge3 and ge2, and 喎 for wo3, wo4, and wo5,
  3. write 既 instead of 嘅 (this chains with Example 2!), and use 系 係 喺 interchangeably,
  4. do not distinguish between 咁 and 噉,
  5. write 食吓 instead of 食下, 我地 instead of 我哋, but 紅紅哋 instead of 紅紅地,
  6. make typos like mixing up 晒 and 哂, 口 and 囗,
  7. …and many many others

Modern speakers in HK also vary in their pronunciation of 黑 as hak-haak, 測 as cak-caak, in whether to read 成日 as seng or keep it as sing, in whether they blend tones 2-5, 3-5, and 3-6 or merge the -k and -t finals, or…

Most writers (and speakers, for audio transcripts) have some special non-standard habits. These are often influenced by their input methods, and occur systematically. They vary from what I would consider an unambiguous style guide (compiled from Jyutping.org and Resonate editorial style). Careful professionals are off by about 1-in-200. This means that in practice, raw “received as-is” text cannot get past a 0.5–1.0% phoneme error rate because the grapheme is wrong / ambiguous.

A group of researchers investigated in Q1-2025 how good Large Language Models are for Cantonese. One of the tests was for G2P. This group benchmarked Cantonese Font 3.3 at 99.2%, with 20 errors over 2600 characters of mixed spoken Cantonese and written Chinese. Of these, 6 were stylistic 呀-啊 嘅 usage differences.
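
As a quick check on the arithmetic, the quoted figures can be recomputed. This is only a sketch: the function name is mine, the inputs are the numbers quoted above, and treating the 6 stylistic differences as "not really errors" is my own what-if rather than something the benchmark reported.

```python
def accuracy(errors: int, total: int) -> float:
    """Percentage of characters that received the expected Jyutping."""
    return 100.0 * (total - errors) / total


# Figures quoted above: 20 errors over 2600 characters.
overall = accuracy(20, 2600)

# What-if: set aside the 6 stylistic differences, leaving 14 errors.
non_stylistic = accuracy(14, 2600)

print(f"{overall:.1f}% overall")
print(f"{non_stylistic:.1f}% if stylistic differences are set aside")
```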

At the near-perfect edge, a class of ambiguous outputs appears. These contribute about 0.04% to the error rate (1/5 of the Font errors). In these cases, the combination of glyph-plus-reading informs the meaning: it is fundamentally not possible to judge which is correct. Examples include

  • 上(5/6?)堂有分(1/6?)數
  • 成(sing/seng?)份錯晒
  • 姐姐(1 1 / 4 1?)去買餸
  • 提醒下(5/6?)位顧客
  • 等到(2/3?)你嚟

Cases where experts cannot agree, even when they follow the same “describe modern HK usage” stance, also come into play.

For the Cantonese Font, the majority of errors are now

  1. misclassification of colloquial vs literal reading (聽 ting-teng, 名 ming-meng)
  2. where line-breaks cut up a group
  3. segmentation errors (characters grouped in wrong combination). With 2.8, segmentation reduces the error rate from 0.5% to 0.15%; effect of segmentation on 3.4 is untested.

With high readability, well-defined failure modes, and exceptional hit rate, the Font is very easy to work with.

Speed & Cost

Speed and cost are simple: faster is better, cheaper is better.

Rule-based systems are much faster than machine-learning systems, and don’t demand an H100, A100, 4090, or even the existence of a graphics card. You generally don’t need to pay by the token. The Cantonese Font works on a Kindle.

Rule-based systems all work by transforming a string, then passing the result on to the next transformation. Every rule-based system works 1000x faster than real time, be it typing, dictating, reading, or scrolling a page as fast as you can go. The Cantonese Font works in real time on a Kindle. Typesetting 46,000 characters (plus translation on two synchronized columns) on 250 pages takes 1.2 seconds.

Speed only matters if you do bulk data-mining work. Since rule-based systems are straightforward substitutions, speed is probably determined by the runtime. I would expect the order to be PyCantonese (Python; interpreted) << ExCantonese (Elixir) < ToJyutping (Node.js) <<< font shapers (C / C++ / Rust). Font shapers are some of the most precise, elegant, performant software written.

We can estimate how fast the shaping happens by modifying a single line of consecutive characters; adding or deleting characters forces shaping across all the characters. On an M1 MacBook in Pages, shaping and re-rendering 182,700 characters takes one second.
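
To put those numbers side by side, here is a back-of-envelope calculation. It is a sketch, not a benchmark: the inputs are the figures quoted in this article, and using a proof-reading pace of 600 Jyutping per minute (mentioned later, under State of Flow) as the stand-in for "real time" is my own assumption.

```python
def chars_per_second(characters: int, seconds: float) -> float:
    """Throughput in characters per second."""
    return characters / seconds


typesetting = chars_per_second(46_000, 1.2)  # 250-page typesetting run
reshaping = chars_per_second(182_700, 1.0)   # Pages on an M1, one long line
reading = chars_per_second(600, 60.0)        # assumed human pace: 600 per minute

print(f"typesetting: {typesetting:,.0f} chars/s, {typesetting / reading:,.0f}x the reading pace")
print(f"reshaping:   {reshaping:,.0f} chars/s, {reshaping / reading:,.0f}x the reading pace")
```

Under these assumptions both measurements sit comfortably above the 1000x mark.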

ExCantonese is an Elixir (programming language) library I wrote, which does all the Cantonese smarts. ExCantonese includes a poor man’s font shaper, which applies the patterns from the font to transform a string of Chinese codepoints into (ASCII) Jyutping. This is part of my Grand Plan, where you write and do Jyutping editing with the Cantonese Font using all the | and .jyutping, and then take this raw text for other downstream applications. The separate article on the Cantonese Markup describes this in more detail.

Handling Failures

Chinese ideogram sequences sometimes intrinsically contain ambiguity.

  • 喎 can be read as wo3, wo4, or wo5 and the meaning of the entire sentence completely differs
  • 睇相-soeng2 is “looking at a photograph” and 睇相-soeng3 is “reading fortune”
  • 牛奶皮蛋撻 can be parsed as 牛奶皮-pei2・蛋-daan6撻 (cream puff pastry egg tart) or 牛奶・皮-pei4蛋-daan2撻 (“preserved egg blended with milk” tart).

It is thus essential to be able to make modifications. Working at production scale, it means the G2P system should

  1. support making corrections to a Jyutping using stable, sensible mechanisms
  2. forbid illegal Jyutping changes
  3. support batch changes to Jyutping, and
  4. support change tracking

Most G2P systems provide a Jyutping dump as a text-string. Plain text runs are easy to change and can be version controlled, but they really need supporting tooling before they are usable at scale. You can’t make corrections until you detect errors: separating the Jyutping from the runs of glyphs means flitting your eye across two disconnected spaces, and detecting errors is difficult.

Plain-text based G2P systems become much more usable when they output to some data structure that aligns the glyph with the Jyutping. Chaak built tooling like this for working with the Hambaanglaang graded readers.
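
A minimal sketch of what "aligning the glyph with the Jyutping" could look like. This is not Chaak's tooling, which I am not describing here; the `Glyphon` class and the `align` and `render` helpers are hypothetical names invented for illustration, and the sketch assumes one space-separated syllable per character.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyphon:
    glyph: str     # one Chinese character
    jyutping: str  # the reading assigned to it, e.g. "soeng2"


def align(text: str, readings: str) -> list[Glyphon]:
    """Pair each character with its syllable; fail loudly on a length mismatch."""
    syllables = readings.split()
    if len(text) != len(syllables):
        raise ValueError(f"{len(text)} glyphs but {len(syllables)} syllables")
    return [Glyphon(g, j) for g, j in zip(text, syllables)]


def render(glyphons: list[Glyphon]) -> str:
    """Show every reading next to its glyph so the eye stays in one place."""
    return " ".join(f"{g.glyph}({g.jyutping})" for g in glyphons)


print(render(align("睇相", "tai2 soeng2")))
```

The length check is the useful part: a dropped or extra syllable is caught at the line where it happens instead of shifting every reading after it.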

Forbidding illegal changes seems a strange requirement, but editors need to be protected from themselves. Correcting Jyutping runs is a surprisingly error-prone task; it isn’t hard to mistype a tone 3 as a tone 4 (especially when there are no tone marks to assist). Tooling should make illegal Jyutpings out-of-bounds, and users need to explicitly confirm before they are allowed there.
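
One way such a guard could work is sketched below, under stated assumptions: `KNOWN_READINGS` is a tiny illustrative table limited to readings mentioned in this article (a real tool would load a full dictionary), and `correct` and `IllegalReading` are hypothetical names, not part of any existing library.

```python
# Illustrative subset only; real characters may have more readings than listed.
KNOWN_READINGS = {
    "喎": {"wo3", "wo4", "wo5"},
    "相": {"soeng2", "soeng3"},
}


class IllegalReading(Exception):
    """Raised when a correction is not a known reading of the glyph."""


def correct(glyph: str, new_reading: str, *, confirm: bool = False) -> str:
    """Accept a correction only if it is known, or explicitly confirmed."""
    allowed = KNOWN_READINGS.get(glyph, set())
    if new_reading not in allowed and not confirm:
        raise IllegalReading(
            f"{new_reading!r} is not a known reading of {glyph}; "
            "pass confirm=True to override"
        )
    return new_reading


print(correct("喎", "wo5"))                # known reading, accepted
print(correct("喎", "wo6", confirm=True))  # unknown, but explicitly confirmed
```

A slip such as `correct("喎", "wo6")` without the confirmation raises `IllegalReading` instead of silently entering the text.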

At production scale, you often need to run the same correction repeatedly. 馬蹄-tai2 is water chestnut and 馬蹄-tai4 is horse-hooves; the Cantonese Font picks water chestnut as the default (how many people from Hong Kong have horses?). During the annotation of Animal Farm, however, horse-hooves were everywhere. I just did a search and replace, and poof went 25% of the Jyutping errors. A G2P that wraps the Jyutping into a data structure may make bulk changes like this no longer possible.
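
Bulk corrections need not be lost when readings live in a data structure, provided the structure stays searchable by glyph sequence. The sketch below is one hypothetical way to do it, not how the Cantonese Font does it: text is a list of (glyph, reading) pairs, each glyph is assumed to be a single character, and `batch_correct` is a name I made up.

```python
def batch_correct(pairs, word, position, new_reading):
    """Set the reading at `position` inside every occurrence of `word`.

    Returns the corrected pairs and the number of readings changed.
    """
    glyphs = "".join(glyph for glyph, _ in pairs)
    result = list(pairs)
    changed = 0
    start = glyphs.find(word)
    while start != -1:
        i = start + position
        if result[i][1] != new_reading:
            result[i] = (result[i][0], new_reading)
            changed += 1
        start = glyphs.find(word, start + 1)
    return result, changed


line = [("隻", "zek3"), ("馬", "maa5"), ("蹄", "tai2")]
fixed, count = batch_correct(line, "馬蹄", 1, "tai4")
print(fixed, count)
```

Returning the change count gives the editor something to sanity-check against the number of hits they expected.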

The Cantonese Font fulfils all four criteria but I think we can do better. Two of my ongoing development projects are a tree-sitter grammar, so there is syntax highlighting even as plain-text markup, and a language server for the Cantonese Markup, so data smarts can be integrated into IDEs. (People who have seen me working think it’s an insane, grotesque idea, but I use the Cantonese Font mostly in Visual Studio Code ☠️)

State of Flow

Speaking of proof-reading, I readily confess that I am not a fluent Jyutping reader, despite having proof-read around 200,000 characters’ worth of Jyutping. When there are Chinese glyphs, I tend to read those and let my mental sound override the Jyutping on top.

The Large Jyutping variants of the Cantonese Font have a really odd quality to them, at least for me. I can choose to focus on the color and tune out the Chinese character; the tone-marks and placement help me get the tones right. It’s the one environment that lets me speed through 600 Jyutping / minute while retaining the capacity to peek at the Chinese.

I think this is a learnable state, but you can only learn to get into the flow when long, correct runs of Jyutping are available. Probably helps if you start young 😭 It will be extremely interesting to see when we make lots of Jyutping books whether there would be a new generation of Jyutping-native readers.

5. Untangling the Mess

Chapter 2, The Entangled Mess, gives the direct conceptual path to how to untangle the mess. Alas, it could only be written by someone who untangled the mess.

The path to untangling the mess is to resolve the issues one at a time. We know that 文白異讀 literal-vernacular readings exist in mixes of micro-environments, so we must first classify text into “literal chunks” and “vernacular chunks”. Historical, geographical, and domain-specific sections go into their own chunks. These chunks will then be assigned Jyutping in their own mode.

We know that correct segmentation is the key to making 人 correct in 一個人、一個男人、男人老狗; so we need to apply some segmentation somewhere.

If you are a Cantonese linguist you would be grimacing. Classification of standard written Chinese and written Cantonese is non-trivial, and Cantonese segmentation is non-trivial. The trick here is that you don’t need to get each of these steps 100% right, and the accuracy does not scale linearly with effort / number of rules. I have no firm data on this, but my intuition is that the most commonly encountered 10% of situations account for 90% of the cases.

A veritable stroke of genius in the Cantonese Font is the introduction of the | symbol as a word segmentation marker. This means that the Font’s implicit segmentation can be augmented by an external, muscular segmentation tool, as long as that tool can insert a | to indicate word boundaries. I have built exactly such a segmentation tool; an earlier version is available as the (awkwardly named) Text Pre-processor on the website. On Cantonese Font v2, pre-segmenting usually removed half the errors.
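
To make the idea of an external tool inserting | concrete, here is a toy segmenter. It is a generic forward maximum-matching sketch, not the Text Pre-processor: the lexicon is a three-word illustration built from the 人 examples above, and `segment` is a hypothetical name.

```python
def segment(text: str, lexicon: set[str], max_len: int = 4) -> str:
    """Greedy longest-match segmentation; words are joined with '|'."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return "|".join(words)


LEXICON = {"一個", "男人", "男人老狗"}
for sample in ("一個人", "一個男人", "一個男人老狗"):
    print(segment(sample, LEXICON))
```

Because the longest match wins, 男人老狗 stays together as one word instead of being split into 男人 plus two stray characters.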

6. The Cantonese Font G2P

The Cantonese Font G2P technical implementation looks nothing like the conceptual flow. It cannot.

The earliest product decision was that it must be a font. The smarts must then be constrained to whatever can be expressed in the subset of OpenType features that can be interpreted by every font shaper released since 2010. This means I have one and only one tool to implement everything: pattern-matched string substitution.

The general gist is best represented by an example, with piecewise changes bolded:

  1. Each codepoint receives a default reading. (How this is chosen is an art that takes its own article.)
  2. There are 1st-order rules that look at the local environment (sequence of 2-8 characters); when they match, they perform reading substitution of a single glyphon.
    • starting dictionary set came from 粵典 (words.hk)
    • this is supplemented by specialty lists and other dictionaries, and
    • non-dictionary sets (classification, segmentation)
  3. Then we have 2nd-order rules that look at the environment extended out from 1st-order rules. When these match (e.g., 隻馬太) we override the first-order changes.
  4. Then we have 3rd-order rules that override the 2nd-order rules
  5. Transformation rules describe the relationship between codepoints; these were used to prepare parallel copies of the above rules for different locale usage
    • a notable deficiency of this approach is that it does not handle mixed locales in one pattern. Mixed scripts fan out to an explosive number of rules. The Font handles 靜靜雞, 靜靜鷄, or 静静鸡, but not 静靜鸡、靜静鸡、静静雞、静静鷄…
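
The layering in steps 1 to 4 can be mimicked in a few lines. This is a Python sketch of the idea only, not the Font's OpenType implementation; the defaults, the two rules, and the `assign` function are illustrative assumptions built around the 開下 example discussed below.

```python
# Step 1: every codepoint gets a default reading (illustrative values).
DEFAULTS = {"開": "hoi1", "下": "haa6", "面": "min6"}

# Steps 2-4: (order, pattern, index of the glyph to change, new reading).
RULES = [
    (1, "開下", 1, "haa5"),    # 1st order: 開下 as a verb particle
    (2, "開下面", 1, "haa6"),  # 2nd order: wider context overrides the 1st
]


def assign(text: str) -> list[str]:
    """Apply defaults, then each order of rules; later orders win."""
    readings = [DEFAULTS.get(ch, "?") for ch in text]
    for _order, pattern, offset, reading in sorted(RULES, key=lambda r: r[0]):
        start = text.find(pattern)
        while start != -1:
            readings[start + offset] = reading
            start = text.find(pattern, start + 1)
    return readings


print(assign("開下"))
print(assign("開下面"))
```

Each rule changes exactly one glyph's reading, and a higher-order rule simply overwrites what a lower-order rule wrote, which is all the "override" in steps 3 and 4 needs.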

In version 3.4 there are 29,145 defaults (one for every character, chosen to match the lines), 36,000 lines for Jyutping assignments, and 160,000 lines in total. The split between 1st, 2nd, and 3rd order rules is about 80 : 20 : 0.1.

I expect, over the years, the 2nd-order rules to grow to twice the size of the 1st-order rules. These cannot be derived from tri-grams; a few thousand can be anticipated a priori but most just have to be encountered in the wild a posteriori. To use a vulgar but funny example, assignment of (1st order) 開下-haa5 seems to be OK once you have (2nd order) accounted for 開下面、開下方、開下一個 […]; practically no one would have anticipated that 難開下道 (word play on 蘭開夏道 Lancashire Road) is also confounded.

Where’s the special sauce?

Sorry to disappoint if you expected clever complicated algorithms. There is no special sauce.

Pattern matching, carefully layered and arranged, could express the conceptual flow needed to solve an intractable problem. That is the central discovery, whose simplicity I find quite elegant and poetic.

7. What is next?

Having an accurate, flexible, fast G2P unlocks entire tech trees. Here are three things I did in the last six weeks:

  1. We can now prepare Jyutping-related material at a scale that was prohibitively expensive. In 2022, despite being a language with 86 million first-language speakers, we had one Jyutping-annotated novel, The Little Prince (translated into Cantonese by Thomas Tsoi). Annotating the book took six months, involving six people with mastery of Jyutping (headed by Chaak). Last month, annotating Animal Farm (twice the length) took me three days and it was easy. We, as a community, can attempt far longer works.
  2. We can prepare assistive technologies. Cantonese Braille runs on a phonetic system. In the past sixty years, preparing Braille texts has essentially involved converting 漢字 -> jyut6ping3 -> j(‘&0@ -> ⠚⠷…. The last set of symbols are control-codes for Braille. As one can imagine, this is a tedious process with limited capacities for non-citation readings. (Colloquial work is practically undoable.) Cantonese Font can be used by sighted users to confirm the readings; then, by simply switching to the Braille variant, it outputs the Braille dots in one click.
  3. We can explore data-based pedagogy. If we know that 行 has six readings, then we need to ask which one we ought to teach first. The frequency of a glyphon (glyph- + -phone, combination of ideogram and a specific sound) should inform its priority for learning. We could not know the frequency of a glyphon until we could assign Jyutping to each glyph in large, diverse corpuses; for that we need an accurate, flexible, fast G2P. I have now counted novels (46,000 chars), oral transcripts (170,000 chars), and Yue Wikipedia (23 million chars). The number of glyphons you need to understand 80% of written characters is surprisingly few.
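
The glyphon counting in the last item can be sketched as follows. The function name is hypothetical and the ten-token sample is toy data invented for illustration, not one of the corpora I counted; the input is assumed to be (glyph, Jyutping) pairs such as a G2P would produce.

```python
from collections import Counter


def glyphons_for_coverage(pairs, target=0.8):
    """Return the most frequent glyphons that together reach `target` coverage."""
    counts = Counter(pairs)
    total = sum(counts.values())
    covered = 0
    needed = []
    for glyphon, n in counts.most_common():
        if covered >= target * total:
            break
        needed.append(glyphon)
        covered += n
    return needed


# Toy data: ten tokens of 行 spread over three of its readings.
sample = [("行", "haang4")] * 5 + [("行", "hong4")] * 3 + [("行", "hang4")] * 2
print(glyphons_for_coverage(sample))
```

On a real corpus the same loop yields a teaching order: the glyphons at the front of the list are the ones a learner meets most often.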

Very readable Jyutping tones, the ability to prepare them in all media flexibly, the concept of the glyphon, and the (font-based) capacity to dim out Jyutping on sets of glyphons while preserving Jyutping on unlearnt glyphs open up different paths to learning and teaching material that previously could not be imagined.

canto.hk is a tiny, mighty workshop. We

  1. build Cantonese infrastructure and systems for teachers, creators, and researchers; plus
  2. create learning material for serious / intermediate+ learners

The Cantonese Font is our flagship invention which lets you get accurate Jyutping in beautiful layouts with easy-to-read tone marks on documents, webpages, and video subtitles.
