This repository was archived by the owner on Aug 6, 2025. It is now read-only.
4 changes: 3 additions & 1 deletion README.md
@@ -37,7 +37,7 @@ be found in [6], together with an extensive experimental evaluation.
* [Faiss](https://github.com/facebookresearch/faiss), for fast similarity search and bitext mining
* [transliterate 1.10.2](https://pypi.org/project/transliterate), only used for Greek (`pip install transliterate`)
* [jieba 0.39](https://pypi.org/project/jieba/), Chinese segmenter (`pip install jieba`)
-* [mecab 0.996](https://pypi.org/project/JapaneseTokenizer/), Japanese segmenter
+* [mecab 0.996](https://pypi.org/project/JapaneseTokenizer/), segmenter only used for Japanese.
* tokenization from the Moses encoder (installed automatically)
* [FastBPE](https://github.com/glample/fastBPE), fast C++ implementation of byte-pair encoding (installed automatically)

@@ -46,6 +46,8 @@ be found in [6], together with an extensive experimental evaluation.
`export LASER="${HOME}/projects/laser"`
* download encoders from Amazon s3 by `bash ./install_models.sh`
* download third party software by `bash ./install_external_tools.sh`
+* If you do not have `unzip` on your system, run `sudo apt install unzip` beforehand.
+* If your task involves Japanese text, install mecab manually by following the instructions that appear at the end.
* download the data used in the example tasks (see description for each task)

## Applications
7 changes: 6 additions & 1 deletion install_external_tools.sh
@@ -166,6 +166,11 @@ InstallFastBPE
echo ""
echo "automatic installation of the Japanese tokenizer mecab may be tricky"
echo "Please install it manually from https://github.com/taku910/mecab"
echo ""
echo "The installation directory should be ${LASER}/tools-external/mecab"
echo ""
echo "When configuring mecab prior to installation, please enable utf8 output by running the following."
echo "for mecab: "
echo "./configure --enable-utf8-only"
echo "for mecab-ipadic: "
echo "./configure --with-charset=utf8"
echo ""
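
The messages above describe a manual mecab build. What follows is a minimal sketch of what those steps might look like, not part of the change itself. It assumes the upstream repository keeps `mecab/` and `mecab-ipadic/` as subdirectories, that `git`, `make`, and a C++ compiler are available, and that passing `--prefix` and `--with-mecab-config` is an acceptable way to target `${LASER}/tools-external/mecab`. The `DRY_RUN` switch, the `run` helper, the `build_mecab` function, and the `mecab-src` checkout directory are hypothetical names introduced for illustration. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of a manual mecab build for LASER. Prints commands by default;
# set DRY_RUN=0 to actually execute them.
set -eu

LASER="${LASER:-${HOME}/projects/laser}"
MECAB_PREFIX="${LASER}/tools-external/mecab"  # directory named in the installer message
SRC_DIR="${SRC_DIR:-mecab-src}"               # hypothetical checkout location
DRY_RUN="${DRY_RUN:-1}"                       # hypothetical switch: 1 = print only

# Print a command, and execute it only when DRY_RUN is 0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN}" = "0" ]; then
    "$@"
  fi
}

build_mecab() {
  run git clone https://github.com/taku910/mecab "${SRC_DIR}"

  # mecab itself, configured for UTF-8 only (flag taken from the installer message).
  run cd "${SRC_DIR}/mecab"
  run ./configure --prefix="${MECAB_PREFIX}" --enable-utf8-only
  run make
  run make install

  # IPA dictionary, built with a UTF-8 charset (flag taken from the installer message).
  # --with-mecab-config points the dictionary build at the mecab installed above (assumption).
  run cd ../mecab-ipadic
  run ./configure --with-charset=utf8 --with-mecab-config="${MECAB_PREFIX}/bin/mecab-config"
  run make
  run make install
}

build_mecab
```

Running the script as is lists the commands for review; running it with `DRY_RUN=0` would perform the clone and build in the current directory.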