
Commit 4b6f645

clarify stuff about pinyin model
1 parent 47b24a7 commit 4b6f645

File tree

1 file changed: +46 -5 lines changed
  • content/posts/chinese_app_release


content/posts/chinese_app_release/index.md

Lines changed: 46 additions & 5 deletions
@@ -145,12 +145,53 @@ so that word lists can include jargon that might not be in a standard dictionary
{{< /gallery >}}

-**Better accuracy in word mappings**

**Re-trained Word Mapping Model**

It turned out there was a lot wrong with the data I was using to train the pinyin model. The dataset I used was a large corpus of Chinese sentences from the web, and I didn't realize it was parsed using [python-pinyin-jyutping-sentence](https://github.com/Vocab-Apps/python-pinyin-jyutping-sentence). I was essentially training on the output of a mostly rules-based system and learning its inaccuracies.

I took that same corpus, combined with candidate mappings from a dictionary, and ran it through an LLM (gpt-5-mini and gemini-3 preview). The LLM did a much better job of segmenting words and selecting the correct pinyin for each word in context.
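
The relabeling step is essentially a constrained-choice prompt: hand the model a sentence plus the dictionary's candidate readings and ask it to segment and pick. Here's a minimal sketch of that idea, assuming the OpenAI Python client; the helper name, prompt wording, and example candidates are illustrative, not the exact pipeline.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def label_sentence(sentence: str, candidates: dict[str, list[str]]) -> list[dict]:
    """Ask the model to segment a sentence and pick each word's in-context pinyin,
    constrained to the dictionary's candidate readings."""
    prompt = (
        "Segment this Chinese sentence into words and give the pinyin for each word "
        "as pronounced in this context. Choose only from the candidate readings. "
        'Reply with a JSON array of {"word": ..., "pinyin": ...} objects.\n\n'
        f"Sentence: {sentence}\n"
        f"Candidates: {json.dumps(candidates, ensure_ascii=False)}"
    )
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# 地 here is the adverbial particle 'de', which rules-based converters often render as 'dì'.
print(label_sentence("他慢慢地走", {"他": ["tā"], "慢慢": ["màn màn"], "地": ["de", "dì"], "走": ["zǒu"]}))
```
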
```
Total sentences in validation set: 528
Sentences with differences: 452 (85.6%)
Total characters: 14201
Total matches: 12508
Overall library vs LLM accuracy: 88.08%
```

The library failed on very common cases, both on tones and pronunciations:

```
'地' (114x): llm='de' vs lib='dì' (112x) +2 more patterns
'都' (66x): llm='dōu' vs lib='dū' (64x) +2 more patterns
'来' (48x): llm='lai' vs lib='lái' (35x) +10 more patterns
'还' (47x): llm='hái' vs lib='huán' (44x) +3 more patterns
'个' (46x): llm='ge' vs lib='gè' (41x) +5 more patterns
'得' (44x): llm='de' vs lib='dé' (44x)
'的' (43x): llm='de' vs lib='yī' (4x) +34 more patterns
'奶' (43x): llm='nai' vs lib='nǎi' (40x) +3 more patterns
'更' (39x): llm='gèng' vs lib='gēng' (34x) +5 more patterns
'什' (36x): llm='shén' vs lib='shí' (36x)
```
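
Both reports come from a character-level diff of the library's pinyin against the LLM's. Roughly something like the sketch below, assuming the two outputs have already been aligned to one syllable per character; the function and input shape are illustrative, not the actual validation script.

```python
from collections import Counter, defaultdict

def compare(per_char: list[tuple[str, str, str]]) -> None:
    """per_char: (hanzi, library_pinyin, llm_pinyin) triples, one per character."""
    total = matches = 0
    mismatches = Counter()            # how many times each character disagreed
    patterns = defaultdict(Counter)   # which (llm, lib) readings it disagreed on
    for hanzi, lib, llm in per_char:
        total += 1
        if lib == llm:
            matches += 1
        else:
            mismatches[hanzi] += 1
            patterns[hanzi][(llm, lib)] += 1

    print(f"Total characters: {total}")
    print(f"Total matches: {matches}")
    print(f"Overall library vs LLM accuracy: {matches / total:.2%}")
    for hanzi, count in mismatches.most_common(10):
        (llm, lib), top = patterns[hanzi].most_common(1)[0]
        extra = len(patterns[hanzi]) - 1
        more = f" +{extra} more patterns" if extra else ""
        print(f"'{hanzi}' ({count}x): llm='{llm}' vs lib='{lib}' ({top}x){more}")
```
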

I'm much more confident in the new model, which is trained on more data and cleaner data.

I also realized [ckip-transformers](https://github.com/ckiplab/ckip-transformers) has a GPL license. I may open source the project eventually, but I don't want to just throw my amalgam of LLM-generated helper scripts and hacks on GitHub. For now, I've switched to [UER-py](https://github.com/dbiir/UER-py/). The `uer/albert-base-chinese-cluecorpussmall` model has the exact same architecture and vocab as the CKIP model, so it was a drop-in replacement.
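
Swapping the backbone is basically a model-name change. A minimal sketch of loading the UER checkpoint with Hugging Face transformers, assuming a token-classification head on top; the label count and example sentence are placeholders.

```python
from transformers import BertTokenizer, AlbertForTokenClassification

MODEL = "uer/albert-base-chinese-cluecorpussmall"

# The checkpoint ships a BERT-style Chinese vocab with an ALBERT encoder,
# the same combination the CKIP models use.
tokenizer = BertTokenizer.from_pretrained(MODEL)
model = AlbertForTokenClassification.from_pretrained(MODEL, num_labels=4)  # placeholder tag set

inputs = tokenizer("我喜欢学中文", return_tensors="pt")
logits = model(**inputs).logits  # (1, sequence_length, num_labels); the new head still needs fine-tuning
```
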

-As I detailed in the [last post](/posts/chinese_app_og/), the training data
-wasn't perfect. It's still not perfect, but after a round of training with
-cleaner data I can confidently say it's better. I don't want to provide a number
-here yet, but I anticipate getting this to several 9's of accuracy soon.

**Optimized data sync**
