content/posts/chinese_app_release/index.md (+46 −5)

@@ -145,12 +145,53 @@ so that word lists can include jargon that might not be in a standard dictionary

{{< /gallery >}}

**Re-trained Word Mapping Model**

It turned out there was a lot wrong with the data I was using to train the
pinyin model. The dataset I used was a large corpus of Chinese sentences
from the web, and I didn't realize it had been parsed with [python-pinyin-jyutping-sentence](https://github.com/Vocab-Apps/python-pinyin-jyutping-sentence).
I was essentially training on the output of a mostly rules-based system and
learning its inaccuracies.
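
The failure mode is easy to see with particles like 地 and 得 from the mismatch
list below. A rough sketch of how those noisy labels get produced, assuming the
library's top-level `pinyin()` helper (the import and call are assumptions, not
the app's actual pipeline):

```python
# Generate "silver" pinyin labels for training sentences with the rules-based
# converter. Context-free rules tend to give particles like 地/得 a full-tone
# reading (dì / dé) where the correct contextual reading is the neutral de.
import pinyin_jyutping_sentence  # assumed import name for the linked library

sentences = ["我们慢慢地走", "他说得很好"]

for sentence in sentences:
    noisy_label = pinyin_jyutping_sentence.pinyin(sentence)  # assumed helper
    print(sentence, "->", noisy_label)
```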

I took that same corpus, combined it with candidate mappings from a dictionary,
and ran it through LLMs (gpt-5-mini and gemini-3 preview). They did a much
better job of both word segmentation and selecting the correct pinyin for each
word in context.
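
A minimal sketch of what that relabeling step looks like, assuming one call per
sentence with the dictionary candidates embedded in the prompt; `call_llm` is a
hypothetical wrapper, and the real prompt and output schema aren't shown here:

```python
# Relabel a sentence by asking the LLM to segment it and pick contextual pinyin,
# constrained to candidate readings pulled from a dictionary.
import json


def call_llm(prompt: str) -> str:
    """Placeholder for the actual model call (gpt-5-mini / gemini-3 preview)."""
    raise NotImplementedError


def build_prompt(sentence: str, candidates: dict[str, list[str]]) -> str:
    return (
        "Segment this Chinese sentence into words and choose the correct pinyin "
        "for each word in context, using only the candidate readings.\n"
        f"Sentence: {sentence}\n"
        f"Candidates: {json.dumps(candidates, ensure_ascii=False)}\n"
        'Answer as JSON: [{"word": "...", "pinyin": "..."}]'
    )


def relabel(sentence: str, candidates: dict[str, list[str]]) -> list[dict]:
    return json.loads(call_llm(build_prompt(sentence, candidates)))
```

Comparing the library's original labels against the LLM relabeling on a held-out
validation set:
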
```
Total sentences in validation set: 528
Sentences with differences: 452 (85.6%)
Total characters: 14201
Total matches: 12508
Overall library vs LLM accuracy: 88.08%
```

The library failed on very common cases, on both tones and pronunciations:

```
'地' (114x): llm='de' vs lib='dì' (112x) +2 more patterns
'都' (66x): llm='dōu' vs lib='dū' (64x) +2 more patterns
'来' (48x): llm='lai' vs lib='lái' (35x) +10 more patterns
'还' (47x): llm='hái' vs lib='huán' (44x) +3 more patterns
'个' (46x): llm='ge' vs lib='gè' (41x) +5 more patterns
'得' (44x): llm='de' vs lib='dé' (44x)
'的' (43x): llm='de' vs lib='yī' (4x) +34 more patterns
'奶' (43x): llm='nai' vs lib='nǎi' (40x) +3 more patterns
'更' (39x): llm='gèng' vs lib='gēng' (34x) +5 more patterns
'什' (36x): llm='shén' vs lib='shí' (36x)
```
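
The per-character tallies above come from lining the two label sets up character
by character; roughly how that comparison can be computed (a sketch with
hypothetical inputs, not the actual evaluation script):

```python
# Tally agreement between the library's labels and the LLM's labels.
# Inputs: per-sentence lists of (character, pinyin) pairs from each source.
from collections import Counter


def compare(lib_sents, llm_sents):
    total_chars = matches = diff_sentences = 0
    mismatch_patterns = Counter()
    for lib_sent, llm_sent in zip(lib_sents, llm_sents):
        if lib_sent != llm_sent:
            diff_sentences += 1
        for (char, lib_py), (_, llm_py) in zip(lib_sent, llm_sent):
            total_chars += 1
            if lib_py == llm_py:
                matches += 1
            else:
                mismatch_patterns[(char, llm_py, lib_py)] += 1
    print(f"Sentences with differences: {diff_sentences} / {len(lib_sents)}")
    print(f"Overall library vs LLM accuracy: {matches / total_chars:.2%}")
    return mismatch_patterns
```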

I'm much more confident in the new model, which is trained on more data, and
cleaner data.

I also realized [ckip-transformers](https://github.com/ckiplab/ckip-transformers) has a GPL license. I may open
source this eventually, but I don't want to just throw my amalgam of LLM-generated
helper scripts and hacks on GitHub. For now, I've switched to
[UER-py](https://github.com/dbiir/UER-py/). Its
`uer/albert-base-chinese-cluecorpussmall` model has the exact same architecture and
vocab as the CKIP model, so it was a drop-in replacement.
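
Because the architecture and vocab match, the swap is just a different checkpoint
name; a minimal sketch, assuming the model is pulled from the Hugging Face Hub as
the backbone for a per-character (token-classification) pinyin head, with the
label count as a hypothetical placeholder:

```python
# Load the UER ALBERT checkpoint as the backbone for per-character pinyin tagging.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "uer/albert-base-chinese-cluecorpussmall"
NUM_PINYIN_LABELS = 1600  # hypothetical size of the pinyin (syllable + tone) label set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_PINYIN_LABELS
)

# Chinese text tokenizes to roughly one piece per character, so labels align 1:1.
inputs = tokenizer("我们慢慢地走", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, seq_len, NUM_PINYIN_LABELS)
```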