
Commit c5828ef

Scripts and updated data for tagger training
1 parent 1fea699 commit c5828ef

20 files changed: 224 additions and 324 deletions

.gitattributes

Lines changed: 2 additions & 197 deletions
@@ -1,202 +1,7 @@
 * text=auto !eol
-/AUTHORS -text
-/COPYING -text
-/ChangeLog -text
-/Makefile.am -text
-/NEWS -text
-/README -text
-/apertium-eng.eng.acx -text svneol=unset#application/xml
-/apertium-eng.eng.dix -text
-/apertium-eng.eng.mtx -text
-/apertium-eng.eng.rlx -text
-/apertium-eng.eng.tsx -text
-/apertium-eng.pc.in -text
-/apertium-eng.post-eng.dix -text
-/autogen.sh -text
-/commondefns.mtx -text
-/configure.ac -text
-dev/eng.freq.txt -text
-/eng.prob -text svneol=unset#unset
-/modes.xml -text svneol=unset#application/xml
-/tagger.make -text
-texts/MANIFEST -text
-texts/TRAINING -text
-texts/afghanistan1.handtagged.ilie.txt -text
-texts/afghanistan1.handtagged.yongloong.txt -text
-texts/afghanistan1.raw.txt -text
-texts/afghanistan1.tagged.txt -text
-texts/aizkolaritza1.handtagged.andrei.txt -text
-texts/aizkolaritza1.raw.txt -text
-texts/aizkolaritza1.tagged.txt -text
-texts/alabama1.handtagged.darkgaia.txt -text
-texts/alabama1.handtagged.ilie.txt -text
-texts/alabama1.raw.txt -text
-texts/alabama1.tagged.txt -text
-texts/albedo1.handtagged.andrei.txt -text
-texts/albedo1.handtagged.niroulapradeep.txt -text
-texts/albedo1.raw.txt -text
-texts/albedo1.tagged.txt -text
-texts/alfred1.handtagged.darkgaia.txt -text
-texts/alfred1.handtagged.yongloong.txt -text
-texts/alfred1.raw.txt -text
-texts/alfred1.tagged.txt -text
-texts/algeria1.handtagged.quicksilver.txt -text
-texts/algeria1.handtagged.rap.txt -text
-texts/algeria1.handtagged.yugin.txt -text
-texts/algeria1.raw.txt -text
-texts/algeria1.tagged.txt -text
-texts/aluminium1.handtagged.andrei.txt -text
-texts/aluminium1.raw.txt -text
-texts/aluminium1.tagged.txt -text
-texts/ambiorix1.handtagged.darkgaia1.txt -text
-texts/ambiorix1.handtagged.quicksilver.txt -text
-texts/ambiorix1.raw.txt -text
-texts/ambiorix1.tagged.txt -text
-texts/anarchism1.handtagged.darkgaia.txt -text
-texts/anarchism1.handtagged.ilie.txt -text
-texts/anarchism1.raw.txt -text
-texts/anarchism1.tagged.txt -text
-texts/anarchism1.tagged.txt.1 -text
-texts/animism1.handtagged.andrei.txt -text
-texts/animism1.handtagged.cmy.txt -text
-texts/animism1.raw.txt -text
-texts/animism1.tagged.txt -text
-texts/atoms1.handtagged.andrei.txt -text
-texts/atoms1.handtagged.kristian-hansen.txt -text
-texts/atoms1.raw.txt -text
-texts/atoms1.tagged.txt -text
-texts/belize1.handtagged._VN_.txt -text
-texts/belize1.handtagged.darkgaia.txt -text
-texts/belize1.raw.txt -text
-texts/belize1.tagged.txt -text
-texts/belize2.handtagged.andrei.txt -text
-texts/belize2.handtagged.darkgaia.txt -text
-texts/belize2.raw.txt -text
-texts/belize2.tagged.txt -text
-texts/bessarabia1.handtagged.andrei.txt -text
-texts/bessarabia1.handtagged.darkgaia.txt -text
-texts/bessarabia1.raw.txt -text
-texts/bessarabia1.tagged.txt -text
-texts/caucasus1.handtagged.bobsan.txt -text
-texts/caucasus1.handtagged.leongxuhua.txt -text
-texts/caucasus1.handtagged.rap.txt -text
-texts/caucasus1.raw.txt -text
-texts/caucasus1.tagged.txt -text
-texts/coil1.handtagged.andrei.txt -text
-texts/coil1.handtagged.kranzer.txt -text
-texts/coil1.raw.txt -text
-texts/coil1.tagged.txt -text
-texts/colombia1.handtagged.ilie.txt -text
-texts/colombia1.handtagged.quicksilver.txt -text
-texts/colombia1.raw.txt -text
-texts/colombia1.tagged.txt -text
-texts/consensus/afghanistan1.consensus.txt -text
-texts/consensus/aizkolaritza1.consensus.txt -text
-texts/consensus/alabama1.consensus.txt -text
-texts/consensus/albedo1.consensus.txt -text
-texts/consensus/alfred1.consensus.txt -text
-texts/consensus/algeria1.consensus.txt -text
-texts/consensus/aluminium1.consensus.txt -text
-texts/consensus/ambiorix1.consensus.txt -text
-texts/consensus/anarchism1.consensus.txt -text
-texts/consensus/animism1.consensus.txt -text
-texts/consensus/atoms1.consensus.txt -text
-texts/consensus/belize1.consensus.txt -text
-texts/consensus/belize2.consensus.txt -text
-texts/consensus/bessarabia2.consensus.txt -text
-texts/consensus/caucasus1.consensus.txt -text
-texts/consensus/coil1.consensus.txt -text
-texts/consensus/colombia1.consensus.txt -text
-texts/consensus/dakhma1.consensus.txt -text
-texts/consensus/derbent1.consensus.txt -text
-texts/consensus/eng.tagged -text
-texts/consensus/faust1.consensus.txt -text
-texts/consensus/karabakh1.consensus.txt -text
-texts/consensus/kazakhstan1.consensus.txt -text
-texts/consensus/khazars1.consensus.txt -text
-texts/consensus/kyrgyzstan.consensus.txt -text
-texts/consensus/monarchy1.consensus.txt -text
-texts/consensus/partisans1.consensus.txt -text
-texts/consensus/peru1.consensus.txt -text
-texts/consensus/pirates.consensus.txt -text
-texts/consensus/sahara1.consensus.txt -text
-texts/consensus/sunday1.consensus.txt -text
-texts/consensus/tokamak1.consensus.txt -text
-texts/consensus/transmission1.consensus.txt -text
-texts/consensus/turing1.consensus.txt -text
-texts/consensus/vietnam1.consensus.txt -text
-texts/dakhma1.handtagged.spectie.txt -text
-texts/dakhma1.raw.txt -text
-texts/dakhma1.tagged.txt -text
-texts/derbent1.handtagged.andrei.txt -text
-texts/derbent1.handtagged.ilie.txt -text
-texts/derbent1.raw.txt -text
-texts/derbent1.tagged.txt -text
-texts/faust1.handtagged.spectie.txt -text
-texts/faust1.raw.txt -text
-texts/faust1.tagged.txt -text
-texts/generate-manifest.sh -text
-texts/karabakh1.handtagged.darkgaia1.txt -text
-texts/karabakh1.handtagged.ilie.txt -text
-texts/karabakh1.raw.txt -text
-texts/karabakh1.tagged.txt -text
-texts/kazakhstan1.handtagged.andrei.txt -text
-texts/kazakhstan1.handtagged.ilie.txt -text
-texts/kazakhstan1.handtagged.kranzer.txt -text
-texts/kazakhstan1.raw.txt -text
-texts/kazakhstan1.tagged.txt -text
-texts/khazars1.handtagged.andrei.txt -text
-texts/khazars1.handtagged.kranzer.txt -text
-texts/khazars1.raw.txt -text
-texts/khazars1.tagged.txt -text
-texts/kurchatov1.raw.txt -text
-texts/kyrgyzstan1.handtagged.andrei.txt -text
-texts/kyrgyzstan1.handtagged.jxthng.txt -text
-texts/kyrgyzstan1.handtagged.kiril-kostadinov.txt -text
-texts/kyrgyzstan1.raw.txt -text
-texts/kyrgyzstan1.tagged.txt -text
-texts/monarchy1.handtagged.andrei.txt -text
-texts/monarchy1.handtagged.ivanstamboliev.txt -text
-texts/monarchy1.raw.txt -text
-texts/monarchy1.tagged.txt -text
-texts/partisans1.handtagged.andrei.txt -text
-texts/partisans1.handtagged.quicksilver.txt -text
-texts/partisans1.raw.txt -text
-texts/partisans1.tagged.txt -text
-texts/peru1.handtagged.andrei.txt -text
-texts/peru1.handtagged.darkgaia.txt -text
-texts/peru1.raw.txt -text
-texts/peru1.tagged.txt -text
-texts/pirates.handtagged.porsi98.txt -text
-texts/pirates.raw.txt -text
-texts/sahara1.handtagged.andrei.txt -text
-texts/sahara1.handtagged.darkgaia1.txt -text
-texts/sahara1.handtagged.niroulapradeep.txt -text
-texts/sahara1.raw.txt -text
-texts/sahara1.tagged.txt -text
-texts/sunday1.handtagged.alvin.txt -text
-texts/sunday1.handtagged.andrei.txt -text
-texts/sunday1.handtagged.kristian-hansen.txt -text
-texts/sunday1.handtagged.yoonglong.txt -text
-texts/sunday1.raw.txt -text
-texts/sunday1.tagged.txt -text
-texts/tin_soldier.handtagged.wenxuanteoh.txt -text
-texts/tin_soldier.raw.txt -text
-texts/tokamak1.handtagged.andrei.txt -text
-texts/tokamak1.handtagged.jorge-correia.txt -text
-texts/tokamak1.raw.txt -text
-texts/tokamak1.tagged.txt -text
-texts/transmission1.handtagged.spectie.txt -text
-texts/transmission1.raw.txt -text
-texts/transmission1.tagged.txt -text
-texts/turing1.handtagged.spectie.txt -text
-texts/turing1.raw.txt -text
-texts/turing1.tagged.txt -text
-texts/vietnam1.handtagged.andrei.txt -text
-texts/vietnam1.handtagged.yuang-siang-ng.txt -text
-texts/vietnam1.raw.txt -text
-texts/vietnam1.tagged.txt -text
+*.prob binary
 *.dix linguist-language=XML linguist-detectable=true
+*.metadix linguist-language=XML linguist-detectable=true
 *.lrx linguist-language=XML linguist-detectable=true
 *.lsx linguist-language=XML linguist-detectable=true
 *.tsx linguist-language=XML linguist-detectable=true

.gitignore

Lines changed: 7 additions & 0 deletions
@@ -18,3 +18,10 @@
 /.deps
 /test/*-output.txt
 /dev/testvoc/duplicates.*.txt
+/tagger-data/eng.tagged
+/tagger-data/eng.tagged.txt
+/tagger-data/eng.untagged
+/tagger-data/eng.dic
+/tagger-data/eng.dic.expanded
+/tagger-data/eng.crp
+/tagger-data/eng.crp.txt

apertium-eng.eng.dix

Lines changed: 5 additions & 6 deletions
@@ -111,7 +111,6 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
 <sdef n="obj" c="Object"/>
 <sdef n="subj" c="Subject"/>
 <sdef n="pers" c="Personal (pronoun)"/>
-<sdef n="file" c="Filename"/>
 </sdefs>
 <pardefs>
 <pardef n="guionet">
@@ -251,10 +250,6 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
 
 </pardef>
 
-<pardef n="filenames">
-<e> <re>[a-z\-_0-9]+.[a-z0-9]+</re><p><l></l><r><s n="n"/><s n="file"/></r></p></e>
-</pardef>
-
 <pardef n="Aa">
 <e> <i>A</i></e>
 <e r="LR"><p><l>a</l> <r>A</r></p></e>
@@ -8319,11 +8314,13 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
 <e lm="northbound"> <i>northbound</i><par n="expensive__adj"/></e>
 <e lm="northeasterly"> <i>northeasterly</i><par n="expensive__adj"/></e>
 <e lm="northeastern"> <i>northeastern</i><par n="expensive__adj"/></e>
+<e r="LR" lm="north-eastern"><p><l>north-eastern</l><r>northeastern</r></p><par n="expensive__adj"/></e>
 <e lm="northerly"> <i>northerly</i><par n="expensive__adj"/></e>
 <e lm="northern"> <i>northern</i><par n="expensive__adj"/></e>
 <e lm="northernmost"> <i>northernmost</i><par n="expensive__adj"/></e>
 <e lm="northwesterly"> <i>northwesterly</i><par n="expensive__adj"/></e>
 <e lm="northwestern"> <i>northwestern</i><par n="expensive__adj"/></e>
+<e r="LR" lm="north-western"><p><l>north-western</l><r>northwestern</r></p><par n="expensive__adj"/></e>
 <e lm="nostalgic"> <i>nostalgic</i><par n="expensive__adj"/></e>
 <e lm="nosy"> <i>nosy</i><par n="expensive__adj"/></e>
 <e lm="not-for-profit"> <i>not-for-profit</i><par n="expensive__adj"/></e>
@@ -16390,6 +16387,7 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
 <e lm="4-3-3 system"> <i>4-3-3<b/>system</i><par n="house__n"/></e>
 <e lm="4-4-2 system"> <i>4-4-2<b/>system</i><par n="house__n"/></e>
 <e lm="(x)th and (y)th centuries (AD|BC)"><par n="ordinals-century"/><i><b/>and<b/></i><par n="ordinals-century"/><i><b/>centuries</i><par n="BC"/><i></i><par n="personnel__n"/></e>
+<e lm="decades"> <re>[12][0-9][0-9]0s</re><p><l></l><r><s n="n"/><s n="pl"/></r></p></e>
 
 
 <!-- SECTION: Nouns -->
@@ -69352,6 +69350,7 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
 <e lm="suck"> <i>suck</i><par n="accept__vblex"/></e>
 <e lm="suckle"> <i>suckl</i><par n="liv/e__vblex"/></e>
 <e lm="suffer"> <i>suffer</i><par n="accept__vblex"/></e>
+<e lm="suffer from"> <i>suffer</i><par n="accept__vblex"/><p><l><b/>from</l><r><g><b/>from</g></r></p></e>
 <e lm="suffer together"> <i>suffer</i><par n="accept__vblex"/><p><l><b/>together</l><r><g><b/>together</g></r></p></e>
 <e lm="suffice"> <i>suffic</i><par n="liv/e__vblex"/></e>
 <e lm="suffix"> <i>suffix</i><par n="accept__vblex"/></e>
@@ -69750,6 +69749,7 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
 <e lm="use as a proverb"><i>us</i><par n="liv/e__vblex"/><p><l><b/>as<b/>a<b/>proverb</l><r><g><b/>as<b/>a<b/>proverb</g></r></p></e>
 <e lm="use bellows"> <i>us</i><par n="liv/e__vblex"/><p><l><b/>bellows</l><r><g><b/>bellows</g></r></p></e>
 <e lm="use to"> <i>us</i><par n="liv/e__vblex"/><p><l><b/>to</l><r><g><b/>to</g></r></p></e>
+<e lm="use to"> <i>us</i><par n="liv/e__vblex"/><p><l><b/></l><r><j/></r></p><i>to</i><par n="at__pr"/></e>
 <e lm="usher"> <i>usher</i><par n="accept__vblex"/></e>
 <e lm="usurp"> <i>usurp</i><par n="accept__vblex"/></e>
 <e lm="utilise"> <i>utili</i><par n="analy/se__vblex"/></e>
@@ -70110,7 +70110,6 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
 <e> <par n="coma"/></e>
 <e> <par n="cometa"/></e>
 <e> <par n="emails"/></e>
-<e> <par n="filenames"/></e>
 <e> <par n="guionet"/></e>
 <e> <par n="lainausmerkit"/></e>
 <e> <par n="numeros"/></e>
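
Note on the new `decades` entry: it uses the regular expression `[12][0-9][0-9]0s` so that tokens such as `1990s` are analysed as plural nouns. As a rough illustration only, the Python sketch below checks which tokens that pattern accepts. It assumes the `<re>` pattern has to cover the whole token, and the helper name `is_decade` is hypothetical, not part of lttoolbox.

```python
import re

# Pattern copied verbatim from the <re> element of the "decades" entry.
DECADE_RE = re.compile(r"[12][0-9][0-9]0s")


def is_decade(token: str) -> bool:
    """Return True if the whole token matches the decades pattern.

    Full-match semantics are an assumption made for this sketch.
    """
    return DECADE_RE.fullmatch(token) is not None


if __name__ == "__main__":
    # A few candidate tokens: only four-digit decades starting with 1 or 2 pass.
    for token in ["1990s", "2020s", "1995s", "990s", "1990"]:
        print(token, is_decade(token))
```

The pattern only covers four-digit decades beginning with 1 or 2, so forms like `90s` or `'90s` would not be matched by this entry.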

eng.hmm.prob

-46.2 KB
Binary file not shown.

eng.perceptron.prob

1.11 MB
Binary file not shown.

eng.prob

-1.06 MB
Binary file not shown.

modes.xml

Lines changed: 3 additions & 0 deletions
@@ -25,6 +25,9 @@
 <program name="cg-proc -w">
 <file name="eng.rlx.bin"/>
 </program>
+<program name="apertium-tagger -g">
+<file name="eng.prob"/>
+</program>
 </pipeline>
 </mode>
 
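
Note on the `modes.xml` change: with this hunk applied, the pipeline runs `apertium-tagger -g` with `eng.prob` immediately after the `cg-proc -w` step. The Python sketch below shows one way to list the steps of such a pipeline in order. The XML fragment is rebuilt from the hunk above; the mode name `example-mode` and the function `pipeline_steps` are hypothetical and do not come from the repository.

```python
import xml.etree.ElementTree as ET
from typing import List

# Fragment reconstructed from the hunk; the mode name is a placeholder (assumption).
MODE_XML = """
<mode name="example-mode">
  <pipeline>
    <program name="cg-proc -w">
      <file name="eng.rlx.bin"/>
    </program>
    <program name="apertium-tagger -g">
      <file name="eng.prob"/>
    </program>
  </pipeline>
</mode>
"""


def pipeline_steps(mode_xml: str) -> List[str]:
    """Return each pipeline step as 'program file ...' in document order."""
    mode = ET.fromstring(mode_xml)
    steps = []
    for program in mode.findall("./pipeline/program"):
        # Collect the file arguments attached to this program element.
        files = [f.get("name", "") for f in program.findall("file")]
        steps.append(" ".join([program.get("name", "")] + files))
    return steps


if __name__ == "__main__":
    # Show the steps joined the way a shell pipeline would chain them.
    print(" | ".join(pipeline_steps(MODE_XML)))
```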

tagger-data/README.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+# Corpora for tagger training
+
+This folder contains the corpora required to train the statistical tagger. They are grouped in two folders, `tagged` and `crp`.
+
+## tagged
+
+Corpora for [supervised training](https://wiki.apertium.org/wiki/Supervised_tagger_training). Files must have the extension `.tagged` and each line must contain a disambiguated Apertium lexical unit, in [Apertium stream format](https://wiki.apertium.org/wiki/Apertium_stream_format). For example:
+
+```
+^Stars/star<n><pl>$
+^shine/shine<vblex><pres>$
+^brightly/brightly<adv>$
+^in/in<pr>$
+^the/the<det><def><sp>$
+^sky/sky<n><sg>$
+^./.<sent>$
+```
+
+Before training, files from the folder are merged into a file named `eng.tagged`.
+
+## crp
+
+Corpora for [unsupervised training](https://wiki.apertium.org/wiki/Unsupervised_tagger_training). Files must have the extension `.txt` and contain **plain (raw)** text.
+
+Before training, files from the folder are merged into a file named `eng.crp.txt`.
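
Note on the `.tagged` format: the README requires one disambiguated lexical unit per line, written as `^surface/lemma<tag>...$`. As a hedged sketch, the Python below splits such a line into surface form, lemma and tags, which could serve as a sanity check on a corpus file before training. The names `LexicalUnit` and `parse_unit` and the regular expression are illustrative assumptions; they handle only the simple case shown in the README (a single analysis, no escaped characters).

```python
import re
from typing import List, NamedTuple, Optional


class LexicalUnit(NamedTuple):
    surface: str
    lemma: str
    tags: List[str]


# ^surface/lemma<tag1><tag2>...$ with exactly one analysis (simplifying assumption).
UNIT_RE = re.compile(r"^\^([^/^$]+)/([^/<>^$]+)((?:<[^<>]+>)+)\$$")


def parse_unit(line: str) -> Optional[LexicalUnit]:
    """Parse one line of a .tagged file; return None if it is not a single disambiguated unit."""
    match = UNIT_RE.match(line.strip())
    if match is None:
        return None
    surface, lemma, tag_string = match.groups()
    # Pull the tag names out of the <...> sequence, in order.
    return LexicalUnit(surface, lemma, re.findall(r"<([^<>]+)>", tag_string))


if __name__ == "__main__":
    for line in ["^Stars/star<n><pl>$", "^./.<sent>$", "Stars"]:
        print(line, "->", parse_unit(line))
```

Lines that still carry more than one analysis, or no tags at all, are rejected by this sketch, which matches the README's requirement that each unit be disambiguated.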

tagger-data/crp/example.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+This is an example text for unsupervised training.
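
Note on the merge step: the README states that, before training, the files in `tagged` are merged into `eng.tagged` and the files in `crp` into `eng.crp.txt`. This excerpt does not show how the merge is done, so the Python below is only a sketch of one plausible approach: plain concatenation in sorted filename order, with the outputs written to `tagger-data/` as the new `.gitignore` entries suggest. The function `merge_corpus` is hypothetical, and the repository's own scripts may work differently.

```python
from pathlib import Path


def merge_corpus(folder: Path, pattern: str, output: Path) -> int:
    """Concatenate the files in `folder` matching `pattern` into `output`.

    Returns the number of files merged. Files are taken in sorted name order,
    and a newline is added after any file that does not end with one.
    """
    sources = sorted(folder.glob(pattern))
    with output.open("w", encoding="utf-8") as out:
        for source in sources:
            text = source.read_text(encoding="utf-8")
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(sources)


if __name__ == "__main__":
    base = Path("tagger-data")
    if base.is_dir():
        # Supervised corpora: tagged/*.tagged -> eng.tagged
        merge_corpus(base / "tagged", "*.tagged", base / "eng.tagged")
        # Unsupervised corpora: crp/*.txt -> eng.crp.txt
        merge_corpus(base / "crp", "*.txt", base / "eng.crp.txt")
```

Writing the merged files next to the `tagged` and `crp` folders, rather than inside them, keeps the outputs from being picked up as inputs on the next run.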
