A text mining model that uses N-gram models (trigrams, in this instance) to detect whether a piece of text is more likely from Taiwan or China. Originally called 共匪測試機 ("Communist Bandit Testing Machine"), the project was renamed because the old name was not very friendly to our overseas neighbours and potential overlords.
There are two overall goals for this project:
- Calculate the probability of whether a sequence of strings is more likely to be from Taiwan or China
- Predict the next character / string from a given string
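The two goals above can be sketched with a toy character-trigram model. This is an illustrative sketch only: the one-sentence "corpora", the `~~` padding symbol, and the `alpha`/`vocab_size` values are assumptions for demonstration, not the project's actual data or parameters.

```python
import math
from collections import Counter, defaultdict

def trigrams(text):
    """Yield (two-character context, next character) pairs."""
    padded = "~~" + text  # "~~" is an arbitrary padding symbol (an assumption)
    for i in range(len(padded) - 2):
        yield padded[i:i + 2], padded[i + 2]

def train(text):
    """Count next-character frequencies for every two-character context."""
    model = defaultdict(Counter)
    for context, nxt in trigrams(text):
        model[context][nxt] += 1
    return model

def log_likelihood(model, text, alpha=0.1, vocab_size=5000):
    """Score a sentence; Lidstone smoothing keeps unseen trigrams finite."""
    score = 0.0
    for context, nxt in trigrams(text):
        counts = model[context]
        total = sum(counts.values())
        score += math.log((counts[nxt] + alpha) / (total + alpha * vocab_size))
    return score

def predict_next(model, context):
    """Goal 2: most likely next character given the last two characters."""
    counts = model[context[-2:]]
    return counts.most_common(1)[0][0] if counts else None

# Goal 1: train one model per region and compare log-likelihoods.
taiwan_model = train("我是從火星來的")  # stand-in for the Taiwan corpus
china_model = train("我是从火星来的")   # stand-in for the China corpus
sentence = "我是從火星來的"
verdict = ("Taiwan" if log_likelihood(taiwan_model, sentence)
           > log_likelihood(china_model, sentence) else "China")
```

With real folders of documents, the same comparison of per-model log-likelihoods decides the label, and `predict_next` drives the next-character suggestion.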
Install Python 3, pip, and (optionally) venv on your computer, along with the packages from the requirements.txt file. As of writing, nothing beyond Python 3 is needed for basic functionality. However, if you wish to use the features of
- translating Simplified Chinese (簡體華文) to Traditional Chinese (繁體華文)
- checking the F1 score for the predictions
install the required Python packages:

```
pip install -r requirements.txt
```

This has been tested on Python 3.8.6 on Ubuntu 20.10 and Windows, but you are unlikely to run into problems on any Python 3.x release or on macOS.
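If you want to keep the dependencies isolated, the optional venv mentioned above can be used like this (a sketch assuming a Unix-like shell; on Windows the activation script is `.venv\Scripts\activate` instead):

```shell
# create and activate a virtual environment (optional but recommended)
python3 -m venv .venv
source .venv/bin/activate

# install the optional dependencies inside it
pip install -r requirements.txt
```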
As of now, we do not use the nltk package; instead, we wrote our own implementations of
- tokenization (removing all unicode characters that are not Traditional Chinese, 繁體華文)
- building the n-gram
- smoothing technique (Lidstone's Law)
- some kind of classifier
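The tokenization step above can be approximated with a simple regular-expression filter. Note the assumption: this sketch keeps the whole CJK Unified Ideographs block, and a Unicode range alone cannot distinguish Traditional from Simplified characters, so the project's real filter may well be stricter.

```python
import re

def tokenize(text):
    """Keep only CJK ideographs (U+4E00-U+9FFF), dropping everything else.

    This range covers both Traditional and Simplified forms; it is an
    illustrative stand-in for the project's non-繁體華文 removal step.
    """
    return re.findall(r"[\u4e00-\u9fff]", text)
```

For example, `tokenize("Hello, 世界 123!")` keeps only the two ideographs and discards the Latin letters, digits, and punctuation.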
We wish to pivot towards using more standard packages (such as nltk) in the future.
To use this, prepare a set of documents (.txt files) from China and Taiwan and separate them into two folders (the defaults are ChinaDataset/ and TaiwanDataset/). You can change the folder paths in ngram.py:
```python
# change the directories if you wish
china_dataset = files_to_list('./ChinaDataset/')
taiwan_dataset = files_to_list('./TaiwanDataset/')
```

Then open your terminal/command line and run
```
python3 ngram.py
```

Let it train, then type in the sentence you wish to check:
```
Sentence: 我是從火星來的
```

Parts of my code come from articles I have read online, and I may have missed some credits. If you see your code used here without credit, please do tell.
- A Comprehensive Guide to Build your own Language Model in Python! - Mohd Sanad Zaki Rizvi
- Building Language Models for Text with Named Entities - Md Rizwan Parvez, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang
- Building language models - bogdani
- 自然語言處理 — 使用 N-gram 實現輸入文字預測 - Airwaves
- 结巴 (jieba): Chinese word segmentation component
Also, my greatest thanks to my teammates for helping me. Even though there are no commit messages written by them, most of commit #34dd2d83 and all of the web scraping for the datasets are not my work.