
Dataset File

mlockwood edited this page Dec 27, 2016 · 8 revisions

Purpose

The dataset file provides a shorthand syntax for discovering and loading XIGT and choices files. All IGT is loaded from XIGT-based XML files; Map Gloss does not load IGT by any other method. Choices files are used only for evaluating the inference section of Map Gloss. They follow the conventions of the LinGO Grammar Matrix Customization System (the matrix can be accessed here). The Grammar Matrix and XIGT are components of the AGGREGATION project (visit AGGREGATION's website).

Example

This is the example dataset.json provided for Map Gloss:

[
  {
    "find_path_key": "map_gloss",
    "method": "agg",
    "name": "dev1",
    "path": "example/data/dev1"
  },
  {
    "find_path_key": "map_gloss",
    "method": "agg",
    "name": "dev2",
    "path": "example/data/dev2"
  },
  {
    "find_path_key": "map_gloss",
    "method": "agg",
    "name": "test",
    "path": "example/data/test"
  }
]

Dataset Attributes

The dataset.json file a user creates must be a list at the top level, and each dataset in that list is an object of key-value pairs. During processing, Map Gloss reconfigures this into an object in which each dataset name is a key and the dataset itself is the value, but the user does not need to be aware of this. Each dataset is defined by up to four attributes. The sections below explain them so that the user can better understand how to create a dataset.json file.
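The list-to-object step described above can be pictured with a minimal Python sketch. This is an illustration only, not Map Gloss's actual code: the function name `index_datasets` is hypothetical, and the sketch assumes every dataset object carries a "name" attribute.

```python
import json


def index_datasets(raw_json):
    """Re-key the top-level list of datasets by each dataset's name."""
    datasets = json.loads(raw_json)
    if not isinstance(datasets, list):
        raise ValueError("dataset.json must be a list at the top level")
    # Each list item is an object; its "name" value becomes the dictionary key.
    return {dataset["name"]: dataset for dataset in datasets}


raw = """
[
  {"find_path_key": "map_gloss", "method": "agg", "name": "dev1", "path": "example/data/dev1"},
  {"find_path_key": "map_gloss", "method": "agg", "name": "test", "path": "example/data/test"}
]
"""
indexed = index_datasets(raw)
print(sorted(indexed))
print(indexed["dev1"]["path"])
```

One consequence of keying by name in this way is that two datasets sharing a name would overwrite each other, so distinct names are the safer choice.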

Name

The name attribute is the name provided for the dataset. This can be any string value.

Path and Find_Path_Key

The path attribute is an absolute path that tells Map Gloss where to look for XIGT and choices files. If the user does not want to write out the absolute path, they can use find_path_key to name a directory, and Map Gloss will work out the absolute path up to that directory. The path attribute can then specify the location further from that point, or it can be left blank. A key caveat is that find_path_key only works on directories that map_gloss resides within. If the user places map_gloss within a project/src/ directory structure, then find_path_key can be set to project, src, or map_gloss. But if project has another subdirectory called data, find_path_key will not find that location.
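One way to picture this lookup is as a walk upward from the directory that contains map_gloss. The sketch below is an assumption about the behavior described here, not the Map Gloss implementation; the helper `resolve_dataset_path`, its `start` parameter, and the `/home/user/...` location are all hypothetical.

```python
from pathlib import PurePosixPath


def resolve_dataset_path(start, find_path_key="", path=""):
    """Combine find_path_key and path into a single location.

    start is the directory where map_gloss lives (assumed search origin).
    """
    if not find_path_key:
        # Without find_path_key, path is expected to be absolute already.
        return PurePosixPath(path)
    current = PurePosixPath(start)
    # Only the start directory and its ancestors are candidates, which is
    # why a sibling directory such as project/data can never match.
    for candidate in [current, *current.parents]:
        if candidate.name == find_path_key:
            return candidate / path if path else candidate
    raise ValueError(f"{find_path_key!r} is not a directory containing {start}")


start = "/home/user/project/src/map_gloss"  # hypothetical install location
resolved = resolve_dataset_path(start, "map_gloss", "example/data/dev1")
print(resolved)
```

Under these assumptions, a data directory next to src is still reachable by setting find_path_key to project and path to data, even though find_path_key alone cannot name it.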

Method

The method attribute tells Map Gloss which discovery method it will employ.

For "agg", Map Gloss uses the combined result of find_path_key and path as the directory whose subdirectories it searches. Any subdirectory whose name is three characters or fewer is treated as a language for discovery, and each such subdirectory is then explored. Files with "testsuite-enriched.xml" in their name are extracted as XIGT, and files with "choices.up" are extracted as choices.
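The "agg" rules can be sketched in a few lines of Python. This is a hedged illustration rather than the project's code: `discover_agg` is a hypothetical name, the length check is assumed to apply to the directory name, and exploring each language directory recursively with `os.walk` is also an assumption. The demo builds a throwaway directory tree so the sketch can run on its own.

```python
import os
import tempfile


def discover_agg(root):
    """Map each language directory to the XIGT and choices files inside it."""
    found = {}
    for entry in sorted(os.listdir(root)):
        lang_dir = os.path.join(root, entry)
        # Assumption: "length" refers to the length of the directory name.
        if not os.path.isdir(lang_dir) or len(entry) > 3:
            continue
        files = {"xigt": [], "choices": []}
        # Assumption: language directories are explored recursively.
        for dirpath, _dirnames, filenames in os.walk(lang_dir):
            for filename in sorted(filenames):
                full_path = os.path.join(dirpath, filename)
                if "testsuite-enriched.xml" in filename:
                    files["xigt"].append(full_path)
                elif "choices.up" in filename:
                    files["choices"].append(full_path)
        found[entry] = files
    return found


# Demo on a temporary tree: "eng" qualifies as a language, "notes" does not.
with tempfile.TemporaryDirectory() as root:
    for rel in ("eng/testsuite-enriched.xml", "eng/choices.up", "notes/readme.txt"):
        target = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        open(target, "w").close()
    agg_result = discover_agg(root)

print(sorted(agg_result))
```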

For "odin", Map Gloss uses the combined result of find_path_key and path to search for XML files. Any XML file it finds is treated as a XIGT file, and the file name without the .xml extension is taken as the ISO 639-3 code of the language.

The example dataset.json above shows only the "agg" option; "odin" datasets would look similar, except that their file discovery strategy differs.
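For comparison, the "odin" strategy can be sketched as follows. As with any illustration of this kind, it is a guess at the behavior described rather than the actual implementation: `discover_odin` is a hypothetical name, and searching only the top level of the directory (not its subdirectories) is an assumption.

```python
import tempfile
from pathlib import Path


def discover_odin(root):
    """Map each file name (minus .xml) to its XML file, treated as XIGT."""
    # Assumption: only the top level of the directory is searched.
    return {xml_file.stem: xml_file for xml_file in sorted(Path(root).glob("*.xml"))}


# Demo on a temporary directory: the two .xml files are picked up,
# and the text file is ignored.
with tempfile.TemporaryDirectory() as root:
    for filename in ("deu.xml", "fra.xml", "README.txt"):
        (Path(root) / filename).touch()
    odin_result = discover_odin(root)
    odin_codes = sorted(odin_result)

print(odin_codes)
```

Because the file stem is used directly as the language code, files under this method need to be named after the ISO 639-3 code of the language they contain.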
