scripts for training a transliterator using a list of transliteration pairs.
- m2m-aligner
- python v2.7 (+ modules: argparse)
- cdec decoder
- ducttape v2.1 https://github.com/jhclark/ducttape
- kenlm
an example configuration file is provided: ruen-config.tape. The following variables are mandatory:
- `ducttape_output`: output directory
- `transliterator_home`: root of the transliterator's repository
- `all_oovs`: source-language words which need to be transliterated (e.g. a test set)
- `char_lm`: kenlm-compiled language model of target-language characters. An English character language model is provided
- `transliteration_pairs`: src-tgt transliterations, one per line, formatted as `SOURCE LANGUAGE ||| CEURSE LAUNJE`
- `m2m_maxX`: maximum length of a source-language character sequence which corresponds to one character in the target language
- `m2m_maxY`: maximum length of a target-language character sequence which corresponds to one character in the source language
- `nprocs`: number of processors to use for training
- `wammar_utils_dir`: root of this repository
- `m2m_aligner`: path to m2m aligner
- `cdec_dir`: path to cdec decoder
- `DelX`: `yes` means that some characters in the source language may be deleted
- `DelY`: `yes` means that some characters in the target language may be deleted
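
a malformed `transliteration_pairs` file is an easy thing to get wrong before starting a long training run. the sketch below is not part of this repository: `read_transliteration_pairs` is a hypothetical helper that checks that every non-empty line has the `SOURCE ||| TARGET` shape described above. it assumes the file is UTF-8 encoded and that blank lines can be skipped; neither is stated by the scripts themselves.

```python
# -*- coding: utf-8 -*-
import io


def read_transliteration_pairs(path):
    """Return a list of (source, target) tuples read from `path`.

    Hypothetical helper, not part of the transliterator scripts.
    Raises ValueError on the first line that is not 'SOURCE ||| TARGET'.
    """
    pairs = []
    # assumption: the pairs file is UTF-8 encoded
    with io.open(path, encoding="utf-8") as pairs_file:
        for line_number, line in enumerate(pairs_file, 1):
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are ignored
            fields = [field.strip() for field in line.split("|||")]
            # exactly one separator, and both sides must be non-empty
            if len(fields) != 2 or not all(fields):
                raise ValueError(
                    "line %d: expected 'SOURCE ||| TARGET', got %r"
                    % (line_number, line)
                )
            pairs.append((fields[0], fields[1]))
    return pairs
```

calling `read_transliteration_pairs` on the file you point `transliteration_pairs` at either returns the parsed pairs or stops at the first bad line with its line number.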
ducttape translit.tape -C ruen-config.tape -p Full -y
- use `mpi_adagrad_optimize` instead of `mpi_flex_optimize`
- rewrite `convert-alignments-to-cdec-format.py`
## disclaimer:
scripts are still under development and may be unstable. please do contact me if anything does not work.
if you use this software, consider citing our ACL 2012 workshop paper: http://www.cs.cmu.edu/~wammar/pubs/translit-acl12.pdf