GitHub - wammar/transliterator: a transliterator based on http://www.cs.cmu.edu/~wammar/pubs/translit-acl12.pdf

scripts for training a transliterator using a list of transliteration pairs.

dependencies:

m2m-aligner
python v2.7 (+ modules: argparse)
cdec decoder
ducttape v2.1 https://github.com/jhclark/ducttape
kenlm

configurations:

an example configuration file, ruen-config.tape, is provided. The following variables are mandatory:

ducttape_output output directory
transliterator_home root of the transliterator's repository
all_oovs source-language words which need to be transliterated (e.g. a test set)
char_lm kenlm-compiled language model of target language characters. An English character language model is provided
transliteration_pairs src-tgt transliterations, one per line, formatted as SOURCE LANGUAGE ||| CEURSE LAUNJE
m2m_maxX maximum length of a source-language character sequence which corresponds to one character in the target language
m2m_maxY maximum length of a target-language character sequence which corresponds to one character in the source language
nprocs number of processors to use for training
wammar_utils_dir root of this repository
m2m_aligner path to m2m aligner
cdec_dir path to cdec decoder
DelX: yes means that some characters in the source language may be deleted
DelY: yes means that some characters in the target language may be deleted
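
As a rough illustration of the transliteration_pairs format described above, the sketch below reads `SOURCE ||| TARGET` lines and rejects malformed ones before training. The function name `parse_pairs` is hypothetical and not part of this repository; it only assumes one pair per line with a single `|||` separator.

```python
def parse_pairs(lines):
    """Return a list of (source, target) tuples from 'SRC ||| TGT' lines.

    Hypothetical helper, not part of the repository.
    """
    pairs = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split("|||")
        if len(parts) != 2:
            raise ValueError("line %d: expected exactly one '|||' separator" % lineno)
        src, tgt = parts[0].strip(), parts[1].strip()
        if not src or not tgt:
            raise ValueError("line %d: empty source or target" % lineno)
        pairs.append((src, tgt))
    return pairs


if __name__ == "__main__":
    example = ["SOURCE LANGUAGE ||| CEURSE LAUNJE"]
    for src, tgt in parse_pairs(example):
        print(src, "->", tgt)
```

Running a check like this over the pairs file first may catch formatting problems earlier than a failed training run would.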

example usage:

ducttape translit.tape -C ruen-config.tape -p Full -y
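
Before launching the command above, it may help to confirm that the config file defines every mandatory variable. The sketch below is a minimal check, assuming the config assigns each variable as `name=value` at the start of a line (one assignment per line); the helper `missing_variables` is hypothetical and not part of this repository.

```python
import re

# Mandatory variables as listed in the configurations section.
MANDATORY = [
    "ducttape_output", "transliterator_home", "all_oovs", "char_lm",
    "transliteration_pairs", "m2m_maxX", "m2m_maxY", "nprocs",
    "wammar_utils_dir", "m2m_aligner", "cdec_dir", "DelX", "DelY",
]


def missing_variables(config_text, required=MANDATORY):
    """Return the required names that have no 'name=' assignment.

    Assumption: one assignment per line, optionally indented.
    Lines starting with '#' are not matched, so commented-out
    assignments count as missing.
    """
    defined = set(
        re.findall(r"^[ \t]*([A-Za-z_]\w*)[ \t]*=", config_text, flags=re.MULTILINE)
    )
    return [name for name in required if name not in defined]


if __name__ == "__main__":
    with open("ruen-config.tape") as handle:
        missing = missing_variables(handle.read())
    if missing:
        print("missing variables: " + ", ".join(missing))
```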

todos:

use mpi_adagrad_optimize instead of mpi_flex_optimize
rewrite convert-alignments-to-cdec-format.py

disclaimer:

scripts are still under development and may be unstable. please do contact me if anything does not work.

if you use this software, consider citing our ACL 2012 workshop paper: http://www.cs.cmu.edu/~wammar/pubs/translit-acl12.pdf
