lookup2cg - script
Presentation
The script lookup2cg reformats the output of lookup so that it can be used as input to CG, i.e. to the tool vislcg3. lookup2cg is a Perl script and, like all other scripts, it is located in the gt/script directory.
The implementation
The input to the script is the output of lookup. The command to produce the input is, for example:
$ echo "Dán" | lookup -flags mbTT -utf8 ~/main/gt/sme/bin/sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
Dán dát+Pron+Dem+Sg+Acc
Dán dát+Pron+Dem+Sg+Gen
lookup gives a list of the available analyses for a given word form. The output of the lookup2cg script is the input to vislcg, which requires a format where the analyzed word form comes before its analyses. Each analysis line consists of a base form in double quotes followed by the morphological tags:
"<Dán>"
    "dát" Pron Dem Sg Acc
    "dát" Pron Dem Sg Gen
The script reads one cohort at a time and reorganizes the different parts of the analysis. In addition to this basic processing, lookup2cg has some special functions. It constructs the base form of an analyzed compound by comparing the analyses with the original input form. The compounds are rated according to the number of compounding points, and only the analyses with the fewest compounding points are preserved. In addition to the compound processing, the derivational tags which are not taken into account in CG are marked in the analysis with an asterisk (*). These special functions of lookup2cg are discussed in detail in the following sections.
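The basic reformatting step can be illustrated with a short Python sketch. It is not the Perl code of lookup2cg itself. It assumes that each lookup line consists of a word form, whitespace, and an analysis whose parts are joined by "+", and that consecutive lines with the same word form belong to one cohort. The function name lookup_to_cg is made up for this illustration; compounds, derivations and unknown words are not treated here.

```python
def lookup_to_cg(lookup_output):
    """Turn lookup lines ("form analysis") into CG cohorts (sketch only)."""
    lines = []
    current_form = None
    for line in lookup_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            # Blank lines and the progress bar carry no analysis;
            # they also end the current cohort.
            current_form = None
            continue
        form, analysis = parts
        if form != current_form:
            # A new word form starts a new cohort.
            lines.append('"<%s>"' % form)
            current_form = form
        # The first "+"-separated field is the base form, the rest are tags.
        base, *tags = analysis.strip().split("+")
        lines.append('\t"%s" %s' % (base, " ".join(tags)))
    return "\n".join(lines)


sample = "Dán\tdát+Pron+Dem+Sg+Acc\nDán\tdát+Pron+Dem+Sg+Gen\n"
print(lookup_to_cg(sample))
```

The word form is printed once, in angle brackets, and every analysis becomes an indented reading line below it.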
Compounds
Building a base form of a compound
The input to CG consists of the analyzed word form followed by a list of possible analyses. Each analysis contains a base form and the morphological tags. Compounds are problematic in this respect: in the lookup output, the analysis of a compound also contains the complete analyses of its parts. For example,
$ echo "bohccobuktaga" | lookup -flags mbTT -utf8 ~/main/gt/sme/bin/sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
bohccobuktaga boazu+N+SgGenCmp+Cmp#buvtta+N+Sg+Acc
bohccobuktaga boazu+N+SgGenCmp+Cmp#buvtta+N+Sg+Gen
However, in CG, only the tags of the last compounding word are examined, and the analyses of the compounding parts are redundant information. The intermediate tags may thus be removed. On the other hand, the base form of the compound as a whole is not available, but has to be constructed in lookup2cg.
The problematic part is identifying the compound boundary. Just taking the first part from the analysis will not do, as there may be changes of three kinds: the final vowel (á, i, u) may have been weakened to (a, e, o), as in dállodoall_o_ekonomiija; there may be consonant gradation in the form (as when 'alimus#riekti#duopmu' becomes 'alimusrievttiduomuin', with a kt:vtt change); and the compound form may be shortened (and possibly changed), as when 'geahččat + vuohki' becomes 'geahččanvuogi'.
The base form is constructed basically by taking the analyzed word form, in this case "bohccobuktaga" and replacing the last word "buktaga" by its base form "buvtta". The output of the lookup2cg is then:
"<bohccobuktaga>"
    "bohcco#buvtta" N Sg Gen
    "bohcco#buvtta" N Sg Acc
The compound boundary is not marked in the input word "bohccobuktaga", so it has to be searched for. The search is done by looking for the first letters of the base form in the input word form. First, the first 4 letters of the base form are searched for in the input word; then 3, 2, and, as a last resort, 1. In the previous example, the matching string consisted of two letters: "bu".
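The descending prefix search can be sketched in Python as follows. This is an illustration of the procedure described above, not the actual Perl code. The function name find_boundary_candidates and its max_len parameter are invented for the example. The sketch returns every position where the matched prefix occurs, which makes it visible when the match is ambiguous.

```python
def find_boundary_candidates(word_form, last_base, max_len=4):
    """Find the longest prefix of last_base (at most max_len letters)
    that occurs in word_form, and list all positions where it occurs."""
    for n in range(min(max_len, len(last_base)), 0, -1):
        prefix = last_base[:n]
        positions = []
        start = word_form.find(prefix)
        while start != -1:
            positions.append(start)
            start = word_form.find(prefix, start + 1)
        if positions:
            return prefix, positions
    # No letter of the base form prefix was found at all.
    return None, []


# "buvt" and "buv" are not found, "bu" is found once.
print(find_boundary_candidates("bohccobuktaga", "buvtta"))
```

When more than one position is returned, the search alone cannot tell which one is the compound boundary.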
This method is a source of a number of errors, since it is common that a string of two letters occurs several times in the compound, not to mention a string of only one letter. For example the compound "sierravuoigatvuođaid" has among others the following reading:
sierravuoigatvuođaid sierra+A+Attr+Cmp#vuoigat+A+Der/vuohta+N+Pl+Acc
The base form "vuohta" is searched for in the analyzed form "sierravuoigatvuođaid", first by comparing the first 4 letters of "vuohta", namely "vuoh". This string cannot be found. Then the first 3 letters are searched for: "vuo". That string occurs twice in the input word, and there is basically no way to determine which occurrence is the correct one. A heuristic rule selects the latter occurrence and replaces the word "vuođaid" with "vuohta". The resulting base form is thus:
"<sierravuoigatvuođaid>"
    "sierravuoigatvuohta" N Pl Acc
(This is not completely true, since lookup2cg does not remove the derivational tags (here A); the output above shows what the result would be if it did.) The word form "sierravuoigatvuođaid" has another analysis as well, namely one which does not involve the derivational tag:
sierravuoigatvuođaid sierra+A+Attr+Cmp#vuoigatvuohta+N+Pl+Acc
Now the base form of the last part is "vuoigatvuohta". The first 4 letters, "vuoi", are found in the analyzed form. Importantly, the first 3 letters would not suffice to determine the correct word boundary, since the string "vuo" occurs twice in the analyzed word form. If the heuristic that selects the latter occurrence were used, the wrong base form would be produced: "sierravuoigatvuoigatvuohta".
Consider the word "sealgeetniin", which has among others the following readings:
sealgeetniin sealgi+N+SgNomCmp+Cmp#eadni+N+Sg+Com
sealgeetniin sealgi+N+SgNomCmp+Cmp#eadni+N+Pl+Loc
The strings to be searched for in the form "sealgeetniin" are "eadn", "ead", "ea" and "e", taken from the base form of the last part, "eadni". By chance, the string "ea" is found in the input word, but not in the correct place. The correct string to search for would have been the last "e" in the input word form. The procedure wrongly generates:
"<sealgeetniin>"
    "s#eadni" N Pl Loc
    "s#eadni" N Sg Com
This bug seems to have been solved (?), even without lexicalising the word:
"<vealgeetniin>"
    "vealge#eadni" Hum N Pl Loc
    "vealge#eadni" Hum N Sg Com
Clearly, basic string-comparison operations are not a satisfactory method for producing base forms for compounds. The alternative would be to use a generative lexicon to find the base form. This is not implemented, mainly for practical reasons.
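The weakness of the string comparison can be reproduced with a small Python sketch. It combines the descending prefix search with the heuristic of choosing the latter occurrence, as described above. It is an illustration under these assumptions, not the Perl implementation. The function name compound_base_form and the parameters max_len and mark are invented for the example. Note that the sketch always inserts the "#" mark, whereas some base forms in the text are written without it.

```python
def compound_base_form(word_form, last_base, max_len=4, mark="#"):
    """Replace the last part of word_form with last_base (sketch only).

    The boundary is taken to be the rightmost occurrence of the longest
    prefix of last_base (at most max_len letters) found in word_form.
    """
    for n in range(min(max_len, len(last_base)), 0, -1):
        pos = word_form.rfind(last_base[:n])
        if pos == -1:
            continue          # try a shorter prefix
        if pos == 0:
            return last_base  # match at the start: no first part to keep
        return word_form[:pos] + mark + last_base
    return last_base


print(compound_base_form("bohccobuktaga", "buvtta"))
print(compound_base_form("sealgeetniin", "eadni"))
print(compound_base_form("sierravuoigatvuođaid", "vuoigatvuohta", max_len=3))
```

The first call yields "bohcco#buvtta". The second yields the wrong "s#eadni", because "ea" only matches inside the first part. The third, restricted to 3 letters, duplicates "vuoigat", as in the discussion of "sierravuoigatvuođaid" above.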
Note that dropping the analyses of the compound parts also makes it possible to get rid of "ambiguities" like the following:
rámmaeaktu rámma+N+SgNomCmp+Cmp#eaktu+N+Sg+Nom
rámmaeaktu rámma+N+SgGenCmp+Cmp#eaktu+N+Sg+Nom
And to produce only one analysis for CG:
"<rámmaeaktu>"
    "rámma#eaktu" N Sg Nom
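A minimal Python sketch of this effect: once only the tags of the last compound part are kept, the two lookup analyses collapse into a single reading. The function names final_reading and unique_readings are invented for the illustration, and the construction of the full base form ("rámma#eaktu") is left out.

```python
def final_reading(analysis):
    """Keep only the lemma and the tags of the last compound part."""
    last_part = analysis.split("#")[-1]
    lemma, *tags = last_part.split("+")
    return lemma, tuple(tags)


def unique_readings(analyses):
    """Drop readings that become identical once the first parts are removed."""
    seen = []
    for analysis in analyses:
        reading = final_reading(analysis)
        if reading not in seen:
            seen.append(reading)
    return seen


analyses = [
    "rámma+N+SgNomCmp+Cmp#eaktu+N+Sg+Nom",
    "rámma+N+SgGenCmp+Cmp#eaktu+N+Sg+Nom",
]
print(unique_readings(analyses))
```

The SgNomCmp and SgGenCmp analyses differ only in the tags of the first part, so a single reading remains.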
Rating compounds according to the word boundaries
The compounds are rated according to (Fred) Karlsson's law: "In a compound word analysis, the analysis with the fewest word boundaries is the correct one." Only the compounds with the fewest word boundaries are preserved.
For example, the following input to the lookup2cg:
$ echo "bohccobiergobuktagiid" | lookup -flags mbTT -utf8 ~/main/gt/sme/bin/sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
bohccobiergobuktagiid boazu+Ani+N+SgGenCmp+Cmp#biergu+N+SgNomCmp+Cmp#buvtta+N+Pl+Gen
bohccobiergobuktagiid boazu+Ani+N+SgGenCmp+Cmp#biergu+N+SgNomCmp+Cmp#buvtta+N+Pl+Acc
bohccobiergobuktagiid boazu+Ani+N+SgGenCmp+Cmp#biergobuvtta+N+Pl+Gen
bohccobiergobuktagiid boazu+Ani+N+SgGenCmp+Cmp#biergobuvtta+N+Pl+Acc
bohccobiergobuktagiid bohccobiergu+N+SgNomCmp+Cmp#buvtta+N+Pl+Gen
bohccobiergobuktagiid bohccobiergu+N+SgNomCmp+Cmp#buvtta+N+Pl+Acc
The compounds are rated as soon as they arrive at lookup2cg. Only the readings with the fewest compounding points are subject to further processing, in this example the lines:
    "bohccobiergo#buvtta" N Pl Acc
    "bohccobiergo#buvtta" N Pl Gen
    "bohcco#biergobuvtta" N Pl Gen
    "bohcco#biergobuvtta" N Pl Acc
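The rating step can be sketched in Python by counting the "#" marks in each analysis and keeping only the analyses with the lowest count. The function name fewest_boundaries is invented for the illustration. It assumes that "#" occurs in the lookup analysis only as a compound boundary.

```python
def fewest_boundaries(analyses):
    """Keep only the analyses with the fewest compound boundaries ("#")."""
    if not analyses:
        return []
    counts = [analysis.count("#") for analysis in analyses]
    best = min(counts)
    return [a for a, c in zip(analyses, counts) if c == best]


analyses = [
    "boazu+Ani+N+SgGenCmp+Cmp#biergu+N+SgNomCmp+Cmp#buvtta+N+Pl+Gen",
    "boazu+Ani+N+SgGenCmp+Cmp#biergu+N+SgNomCmp+Cmp#buvtta+N+Pl+Acc",
    "boazu+Ani+N+SgGenCmp+Cmp#biergobuvtta+N+Pl+Gen",
    "boazu+Ani+N+SgGenCmp+Cmp#biergobuvtta+N+Pl+Acc",
    "bohccobiergu+N+SgNomCmp+Cmp#buvtta+N+Pl+Gen",
    "bohccobiergu+N+SgNomCmp+Cmp#buvtta+N+Pl+Acc",
]
for kept in fewest_boundaries(analyses):
    print(kept)
```

The two analyses with two boundaries are discarded, and the four analyses with one boundary are kept.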
Derivational tags
Since the input to the parser is a human-readable dictionary, many derivations are already present in the dictionary. Due to the dynamic derivation component, they come out with a double or even multiple analysis, as the analysis with the derivational affix added in the parsing process is given as well. Thus, we have "ambiguities" like the following:
$ echo "mearkkašupmi" | lookup -flags mbTT -utf8 ~/main/gt/sme/bin/sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
mearkkašupmi mearkkašit+V+TV+Der/PassL+V+Der/upmi+N+Sg+Nom
mearkkašupmi mearkkašupmi+N+Sg+Nom
$ echo "ealáhusheiveheapmi" | lookup -flags mbTT -utf8 ~/main/gt/sme/bin/sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
ealáhusheiveheapmi ealáhus+N+SgNomCmp+Cmp#heivet+V+IV+Der/h+V+Der/NomAct+N+Sg+Nom
ealáhusheiveheapmi ealáhus+N+SgNomCmp+Cmp#heivehit+V+TV+Der/NomAct+N+Sg+Nom
ealáhusheiveheapmi ealáhus+N+SgNomCmp+Cmp#heiveheapmi+N+Sg+Nom
ealáhusheiveheapmi ealáhusheiveheapmi+N+Sg+Nom
Here is the list of the derivational tags for North Sámi:
+Der/adda +Der/ahtti +Der/alla +Der/asti +Der/d +Der/NomAct +Der/eamoš +Der/eapmi +Der/g +Der/geahtes +Der/h +Der/heapmi +Der/hudda +Der/huhtti +Der/huvva +Der/j +Der/l +Der/laš +Der/meahttun +Der/muš +Der/n +Der/š +Der/st +Der/stuvva +Der/us +Der/vuohta +Der/goahti +Der/lágan +Der/Dimin +Der/PassL +Der/PassS
The derivational tags are associated with at least a POS tag (N, V, Adv, A). These POS tags are marked with an asterisk (*) to distinguish them from the POS tag of the compound. The output of lookup2cg then looks, for example, like this:
"<mearkkašupmi>"
    "mearkkašit" V* TV Der/PassL V* Der/upmi N Sg Nom
    "mearkkašupmi" N Sg Nom
"<ealáhusjurddašeapmi>"
    "ealáhus#jurddašit" V* TV Der/NomAct N Sg Nom
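The marking can be sketched in Python as follows. The rule used here is inferred from the example output above and is an assumption: every POS tag (N, V, Adv, A) that stands before the last Der/ tag gets an asterisk, while the POS tag after the last Der/ tag is left alone. The function name mark_derivation_pos is invented for the illustration.

```python
POS_TAGS = {"N", "V", "Adv", "A"}


def mark_derivation_pos(tags):
    """Add "*" to POS tags that precede the last Der/ tag (assumed rule)."""
    der_positions = [i for i, tag in enumerate(tags) if tag.startswith("Der/")]
    if not der_positions:
        return list(tags)  # no derivation, nothing to mark
    last_der = der_positions[-1]
    return [
        tag + "*" if i < last_der and tag in POS_TAGS else tag
        for i, tag in enumerate(tags)
    ]


tags = "V+TV+Der/PassL+V+Der/upmi+N+Sg+Nom".split("+")
print(" ".join(mark_derivation_pos(tags)))
```

For the analysis of "mearkkašupmi" this gives "V* TV Der/PassL V* Der/upmi N Sg Nom", the same tag string as in the output above.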
The marking of the tags and the construction of the base form of a word with derivational suffixes have to be reconsidered. Derivational suffixes are discussed further in the following section. The improvements listed there are not implemented in lookup2cg.
Points to consider when building a preprocessor geared towards disambiguation
The goal is to feed only syntactically relevant information to the disambiguator. So, in the analysis of "bargiin", the correct analysis is that it is Sg Com of "bargi". Since this word is lexicalised, it is found as a noun in the lexicon.
"bargiin" S:1995
    "bargat" V* TV Der/NomAg N Sg Com
    "bargi" N NomAg Sg Com
What we want is thus to treat all Actor nouns as if they were found in the lexicon in the first place. The problem is then to reverse the morphological process, and find the stem.
Der/NomAct
"lohkamat" S:631, 631, 631
    "lohkat" V* TV Der/NomAct N Pl Nom
    "lohkan" N Pl Nom
Derivations
These ones do not induce consonant gradation in the stem:
- Der/alla
  - Remove the -it part from the basic form and insert "alla"
- Der/ahtti
  - Remove the -it part from the basic form and insert "ahtti"
- Der/NomAg
  - Remove the -it part from the basic form and insert "eaddji"
- Der/NomAct
  - Remove the -it part from the basic form and insert "eapmi"
- Der/l
  - Remove the -t part from the basic form and insert "l"
- Der/vuohta
  - Just add "vuohta" to the basic form, removing the intervening A tag. Problem: there is often a tag 'las1' to the left of 'vuohta', this tag causes CG. In these cases, vuohta cannot be added easily.
These ones do:
- Der/heapmi
- Der/d
- Der/h
For the non-gradating verb-to-noun suffixes, remove the V label preceding the N.
"" S:1708
    "čuovvut" V* TV Der/l V* Der/NomAct N Sg Nom
    "čuovvulit" V* TV Der/NomAct N Sg Nom
""
    "vuodjat" V* IV Der/d V* Der/NomAct N Sg Acc
    "vuodjat" V* IV Der/d V* Der/NomAct N Sg Gen
""
    "jorgalit" V* TV Der/ahtti V* TV Der/NomAct N Sg Ill
    "jorgalahttit" V* TV Der/NomAct N Sg Ill
"" S:662
    "mearridit" V Der/NomAg N Pl Ill
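The rules listed above for the suffixes that do not induce consonant gradation can be written down as a small table. The Python sketch below applies the rules literally as they are stated. It is not implemented in lookup2cg, and the names DERIVATION_RULES and derive are invented for the illustration. A base form that does not end in the expected string is left underived (None), since the rules say nothing about such cases.

```python
# tag -> (ending to remove from the basic form, string to insert)
DERIVATION_RULES = {
    "Der/alla": ("it", "alla"),
    "Der/ahtti": ("it", "ahtti"),
    "Der/NomAg": ("it", "eaddji"),
    "Der/NomAct": ("it", "eapmi"),
    "Der/l": ("t", "l"),
    "Der/vuohta": ("", "vuohta"),
}


def derive(base_form, der_tag):
    """Apply one non-gradating derivation rule literally (sketch only)."""
    ending, suffix = DERIVATION_RULES[der_tag]
    if not base_form.endswith(ending):
        return None  # the stated rule does not cover this base form
    stem = base_form[: len(base_form) - len(ending)]
    return stem + suffix


print(derive("jurddašit", "Der/NomAct"))
print(derive("vuoigat", "Der/vuohta"))
print(derive("lohkat", "Der/NomAct"))
```

The first two calls give "jurddašeapmi" and "vuoigatvuohta". The third returns None, because "lohkat" does not end in -it; this shows that the rules as stated do not cover all verbs.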
For the gradating suffixes, we should think more before doing anything.
"" S:636, 1479
    "lassi" N* Der/heapmi A Sg Nom
by Trond Trosterud