Model File
Some projects need to model different combinations of datasets and modeling methodologies. The model file uses a shorthand notation to tell Map Gloss which datasets to use for training and testing, which classifiers to use, and how to weight them if necessary.
This is the example model.json provided for Map Gloss:
[
  {
    "name": "main",
    "train": ["dev1", "dev2"],
    "test": ["test"],
    "classifiers": {
      "tbl": 1.0
    }
  },
  {
    "name": "russian",
    "train": ["dev1!rus", "dev2", "test"],
    "test": ["dev1-rus"],
    "classifiers": {
      "tbl": 1.0
    }
  }
]
Note that these examples use only the Transformation-Based Learning (TBL) classifier. So far it is the only one that has proven effective on a unique gloss based gloss mapping methodology.
The model.json file a user creates must be a list at the top level, and each model in that list is an object of key-value pairs. Each model is defined by exactly four attributes: "name", "train", "test", and "classifiers". The following paragraphs explain each attribute so that users can better understand how to create their own model.json file.
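The structural rules above can be checked before running anything. The following is a minimal sketch, not part of Map Gloss itself: the function name check_models and the wording of its messages are hypothetical, and it verifies only that the top level is a list and that each model carries exactly the four attributes.

```python
import json

# The four attributes every model object is expected to carry.
REQUIRED_KEYS = {"name", "train", "test", "classifiers"}


def check_models(models):
    """Return a list of problems; an empty list means the structure looks fine.

    Hypothetical helper for illustration; Map Gloss may validate differently.
    """
    if not isinstance(models, list):
        return ["top level must be a list of model objects"]
    problems = []
    for i, model in enumerate(models):
        if not isinstance(model, dict):
            problems.append(f"model {i}: expected an object")
            continue
        missing = REQUIRED_KEYS - set(model)
        extra = set(model) - REQUIRED_KEYS
        if missing:
            problems.append(f"model {i}: missing {sorted(missing)}")
        if extra:
            problems.append(f"model {i}: unexpected {sorted(extra)}")
    return problems


example = json.loads(
    '[{"name": "main", "train": ["dev1", "dev2"],'
    ' "test": ["test"], "classifiers": {"tbl": 1.0}}]'
)
print(check_models(example))
```

In practice the list would come from json.load on the user's own model.json file rather than from an inline string.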
The user may select any name they wish for the model. All outputs will reflect this name.
Training and test sets both denote which datasets they draw from using lists. The lists simply contain strings of the dataset names.
Sometimes a user may want to include or exclude certain languages within a dataset. Dataset names accept the optional notation marks - and ! for this purpose. The - mark delimits ISO 639-3 language codes that follow the name of the dataset; the proper notation is dataset-ISO-ISO-ISO. Leaving the dataset name without any - delimited languages causes all languages in the dataset to load, while listing languages causes only those listed to load. The ! mark, on the other hand, causes the dataset to load all of its languages except those following the !. The ! can only appear directly after the name of the dataset and must precede any ISO 639-3 language codes. The "russian" model above is an example: it excludes rus from dev1 during training ("dev1!rus") and tests only on rus from dev1 ("dev1-rus").
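To make the notation concrete, here is a small sketch of how such a dataset string could be interpreted. It is not Map Gloss's own parser: the function names are hypothetical, and it assumes that dataset names contain neither - nor ! and that multiple codes after ! are also separated by -.

```python
def parse_dataset_spec(spec):
    """Split a dataset string into (name, mode, codes).

    mode is "all" (no codes given), "include" (dataset-ISO-ISO),
    or "exclude" (dataset!ISO-ISO). Illustrative only.
    """
    if "!" in spec:
        name, _, rest = spec.partition("!")
        codes = [code for code in rest.split("-") if code]
        return name, "exclude", codes
    name, *codes = spec.split("-")
    if codes:
        return name, "include", codes
    return name, "all", []


def keep_language(spec, iso):
    """Decide whether a language with the given ISO 639-3 code would be loaded."""
    _, mode, codes = parse_dataset_spec(spec)
    if mode == "include":
        return iso in codes
    if mode == "exclude":
        return iso not in codes
    return True


for spec in ["dev2", "dev1-rus", "dev1!rus"]:
    print(spec, parse_dataset_spec(spec), keep_language(spec, "rus"))
```

Under these assumptions, "dev1!rus" and "dev1-rus" split dev1 into two complementary parts, which is what lets the "russian" model train and test on the same dataset without overlap.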
Optionally, the user may set "train" equal to none. Map Gloss will then load the AGGREGATION files it has already processed, including their gold standard, and use these files for training.
Classifiers are defined by an object in which each key is a classifier name and each value is that classifier's weight. The classifier weights must sum to exactly 1.0. However, under the current methodology only TBL is allowed, so every model should assign "tbl" a weight of 1.0, as in the examples.
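The weight rule can be sketched as a simple check. This is an illustration rather than Map Gloss code: check_classifiers and its allowed parameter are hypothetical names, and the small floating-point tolerance is an assumption added so that weights such as 0.1 + 0.2 + 0.7 are not rejected because of rounding.

```python
import math


def check_classifiers(classifiers, allowed=("tbl",)):
    """Return a list of problems with a "classifiers" object.

    Hypothetical helper: checks that every classifier name is allowed
    and that the weights sum to 1.0 within a small tolerance.
    """
    problems = []
    unknown = sorted(set(classifiers) - set(allowed))
    if unknown:
        problems.append(f"unsupported classifiers: {unknown}")
    total = sum(classifiers.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {total}, expected 1.0")
    return problems


print(check_classifiers({"tbl": 1.0}))
print(check_classifiers({"tbl": 0.5}))
```

The first call reports no problems; the second reports a weight sum other than 1.0.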