This repository is a collection of almost all publicly available Thai tokenisers. Having this collection allows us to try each algorithm with ease via Docker.
Technically, each project (called a vendor) has its own Docker image with an entry script and auxiliary scripts.
These scripts provide a unified interface, allowing us to run all of the algorithms in the same way.
| Vendor | Alias | Available Methods | Container Profile |
|---|---|---|---|
| PyThaiNLP | pythainlp | newmm, longest | |
| DeepCut | deepcut | deepcut | |
| CutKum | cutkum | cutkum | |
| Sertis | sertis | sertis | |
| Thai Language Toolkit | tltk | mm, ngram, colloc | |
| Smart Word Analysis for Thai (SWATH) | swath | max, long | |
| Chrome's v8Breakiterator | chrome | v8breakiterator | |
Please see Usages for more details.
- Pull the necessary Docker images. Please check Docker Hub for the available images.
$ docker pull pythainlp/word-tokenizers:<vendor-alias>
- Put the text files that you want to tokenise into ./data.
- Run the following command:
$ ./scripts/tokenise.sh <vendor-alias>:<method> <filename>
Please check the Vendors section for the vendors and methods included here.
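To compare several tokenisers on the same file, the command above can be wrapped in a small loop. The following is a minimal sketch, not part of this repository: `tokenise_all` and the `TOKENISE_CMD` override are hypothetical names, and the `vendor:method` argument form is assumed from the `pythainlp:newmm` example in this README.

```shell
#!/bin/sh
# Sketch: run several vendor:method targets over one input file.
# TOKENISE_CMD is a hypothetical override (handy for dry runs);
# it defaults to the script shipped with this repository.
TOKENISE_CMD="${TOKENISE_CMD:-./scripts/tokenise.sh}"

# Usage: tokenise_all <filename> <vendor:method>...
tokenise_all() {
  file="$1"
  shift
  failed=0
  for target in "$@"; do
    echo "==> $target"
    # Keep going if one vendor fails, but remember that it did.
    if ! "$TOKENISE_CMD" "$target" "$file"; then
      echo "failed: $target" >&2
      failed=$((failed + 1))
    fi
  done
  # Succeed only if every target succeeded.
  [ "$failed" -eq 0 ]
}
```

For instance, `tokenise_all example.text pythainlp:newmm pythainlp:longest deepcut:deepcut` would run three methods over ./data/example.text, provided the corresponding images have been pulled.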
Let's say you want to tokenise text in ./data/example.text using PyThaiNLP's newmm algorithm. You can use the following command:
$ cat ./data/example.text
อันนี้คือตัวอย่าง
$ ./scripts/tokenise.sh pythainlp:newmm example.text
# Please be aware that you don't need to have ./data in front of the filename.
# Command Output
Tokenising example.text using vendor=pythainlp and method=newmm
CMD: docker run -v /Users/heytitle/projects/tokenisers-for-thai/data:/data thai-tokeniser:pythainlp newmm example.text
100%|██████████| 1/1 [00:00<00:00, 151.70it/s]
Tokenising /data/example.text with newmm
Tokenised text is written to /data/example_tokenised-pythainlp-newmm.text
$ cat ./data/example_tokenised-pythainlp-newmm.text
อันนี้|คือ|ตัวอย่าง
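When scripting around the tokeniser, it can help to predict where the output lands. The helper below is a sketch only: `tokenised_name` is a hypothetical function, and the naming pattern is inferred from the single example above (`example.text` becomes `example_tokenised-pythainlp-newmm.text`). It assumes the input filename has an extension.

```shell
#!/bin/sh
# Sketch: derive the output filename for a given input, vendor, and method.
# The pattern is inferred from the example in this README, not from the scripts.
# Usage: tokenised_name <filename> <vendor> <method>
tokenised_name() {
  file="$1"
  vendor="$2"
  method="$3"
  base="${file%.*}"   # filename without its last extension
  ext="${file##*.}"   # last extension (assumes one is present)
  printf '%s_tokenised-%s-%s.%s\n' "$base" "$vendor" "$method" "$ext"
}
```

Calling `tokenised_name example.text pythainlp newmm` prints `example_tokenised-pythainlp-newmm.text`, which can then be read from ./data.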
TBD.
$ ./scripts/build <vendor>
$ ./scripts/push <vendor>
- This repository was initially created by Pattarawat Chormai while interning at Dr. Attapol Thamrongrattanarit's NLP Lab, Chulalongkorn University, Bangkok, Thailand.