Discussions with colleagues
Introduction
As the work has progressed, several problems have come up and been resolved, not least thanks to generous input from colleagues at Xerox. Parts of the discussions are reprinted here for future reference.
The Letters
Shortening
[Trond:] I have problems understanding a set of rules, cf. the following quote from my rule file (it works now, but I do not understand why). (Note that "Shortening" below is the Saami linguistic term for the u-o and i-e alternation; the lax vowels are perceived as shorter than the tense ones.)

  "Vowel Shortening in Compounds after Long 1st Syllable 1"
  Vx:Vy <=> [ Vow Vow | á ] Cns:+ Cns: _ StemCns:* X7: ;
        where Vx in (u i)
              Vy in (o e)
        matched ;

  ! The following 1st line was removed from this rule:
  ! Vx:Vy <=> [ Vow Vow | á ] Cns:+ Cns _ StemCns:* [ X7: | X8: ] ;
  ! the problem with this one was that it was not restricted to compounds !!
  ! The X7: | X8: line was commented out since no lexicon in sme-lex.txt
  ! pointed from X8 to R. This move should be checked for compounds later.

KRB: I don't understand the use of X7 and X8, of course, but I note that there are a couple of differences between the current rule and the line commented out:

1. The current rule has Cns:+ Cns: in the left context. The commented-out rule had Cns:+ Cns. I.e. in the current case, you need to match at least 2 Cns symbols ON THE UPPER SIDE. In the commented-out context, you need to match at least one Cns on the upper side, followed by a Cns that appears on both the upper and lower sides.

2. The other difference, of course, is that the current rule needs an upper-side X7: to fire. It won't fire with an X8:.

Is that helpful? I assume you know how "local variables" work in twolc rules, but just in case: here Vx and Vy are local variables, defined in the 'where' clause. Vx ranges over the set (u i) and Vy ranges over the set (o e). The 'where' clause also states that the ranges are "matched", meaning that u corresponds to o, and i corresponds to e. So the rule, as written, is a shorthand for the two following rules:

  "Rule 1"
  u:o <=> [ Vow Vow | á ] Cns:+ Cns: _ StemCns:* X7: ;

  "Rule 2"
  i:e <=> [ Vow Vow | á ] Cns:+ Cns: _ StemCns:* X7: ;

[Trond:]

  ! "Vowel Shortening in Compounds after Long 1st Syllable 2"
  ! Vx:Vy <=> [ #: | Cns ] Vz Cns+ Cns: Cns+ _ StemCns:* [ X7: | X8: ] ;
  !       where Vz in (a i o u) ;
  !       where Vx in (u i)
  !             Vy in (o e)
  !       matched ;
  ! XXX Remainder! This rule is commented out since including it
  ! spoils the i:á alternation of verbs (boahtiQ4n:boad1á), for some
  ! reason.
  ! This should be looked into.

KRB: There are several very difficult problems encountered when writing two-level rules.

1. You have to be very careful about writing the contexts, which are themselves "two-level". Note that the example above has Vz Cns+ Cns: Cns+ in the left context, which is somewhat suspicious. Vz is equivalent to Vz:Vz, i.e. the same symbol on both sides. Then Cns+ matches one or more Cns symbols (on both sides), Cns: matches a Cns JUST ON THE UPPER SIDE, and another Cns+ matches one or more Cns symbols ON BOTH SIDES.

In two-level rules, it is as easy to make your contexts TOO specific as it is to make them too general. Very often you need to write something like c:, matching only on the upper side, if some other rule in your system is (simultaneously) mapping that c to something else, like a k, or an empty string, on the lower side. In such a case, writing c, which is equivalent to c:c, is too specific. When you write or edit a two-level rule, you have to be aware of what all the other rules in the set are trying to do at the very same time.

2. Rule clashes. The rule spoils the i:á alternation probably because it _clashes_ with the rule controlling i:á. I.e. there exist one or more upper-side input strings that match both rules, and one rule tries to map the lexical i to a surface e, while the other rule tries (simultaneously) to map the same lexical i to a surface á. Instant rule clash.
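Ken's expansion of the "matched" where-clause can be mimicked with a short Python sketch. This is illustrative only: the rule is treated as a plain text template, and expand_matched is an invented helper, not part of any Xerox tool.

```python
# "matched" semantics: the variable ranges are stepped through in
# parallel (u with o, i with e), not as a cross-product.
def expand_matched(template, var_ranges):
    """Instantiate a rule template once per matched tuple of values.

    template   -- rule text with placeholders like {Vx}, {Vy}
    var_ranges -- dict mapping each variable name to its range (a list)
    """
    names = list(var_ranges)
    for values in zip(*(var_ranges[n] for n in names)):
        yield template.format(**dict(zip(names, values)))

template = "{Vx}:{Vy} <=> [ Vow Vow | á ] Cns:+ Cns: _ StemCns:* X7: ;"
rules = list(expand_matched(template, {"Vx": ["u", "i"], "Vy": ["o", "e"]}))
print(rules[0])   # u:o <=> [ Vow Vow | á ] Cns:+ Cns: _ StemCns:* X7: ;
print(rules[1])   # i:e <=> [ Vow Vow | á ] Cns:+ Cns: _ StemCns:* X7: ;
```

Without "matched", the variables would instead range freely over the cross-product of the two sets, giving four rules here (u:o, u:e, i:o, i:e).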
Looking at the two clashing rules together, you need to understand exactly WHY they are clashing and rewrite one or both of them so that the i:e alternation occurs always and only in the places where it is appropriate, and at the same time, the i:á alternation occurs always and only in the places where it is appropriate. The twolc rule compiler is very good at detecting the POSSIBILITY of a clash, even if your language never presents such a possibility to the rules. In such a case, you might consider using the load-lexicon feature (see the twolc documentation) to block the clash messages that just don't matter for your language. In general, debugging twolc rule sets is a headache, which is why so many of our people prefer the xfst Replace Rules.

[Trond:]

  "Vowel Shortening in Compounds of Contract Stems"
  Vx:Vy <=> [ Vow Vow | á ] Cns:+ Cns: _ StemCns:* X4: X7: ;
        where Vx in (u i)
              Vy in (o e)
        matched ;

  "Vowel Shortening for buorre"
  Vx:Vy <=> [ Vow Vow | á ] Cns:+ Cns _ X8: ;
        where Vx in (u i)
              Vy in (o e)
        matched ;

  ! This rule was actually made for one word only, buorre:buoret
  ! It replaces the missing "| X8:" sequence that is commented out in the
  ! "Vowel shortening ... 1" rule above. Why X8: does not work there is
  ! unclear.

KRB: Note that in the first rule the left context has Cns:+ Cns:, while the second has Cns:+ Cns (note the Cns rather than Cns:). That could be significant. If there is another rule in the set that maps the Cns just before _ to something else, then the second rule won't match. The second rule requires Cns:Cns just before the replace position _ .

[Trond:]

  ! In any case, X4 and X8 trigger cns gradation, whereas X7 does not.

This looks like a mess. As the comment stated, the Shortening ... 1 rule earlier contained the more elegant [ X7: | X8: ], but that did not work. So I removed X8:, got the problem with buorre, and reintroduced X8: in a separate rule. The result works, but is confusing, and makes me believe in ghosts in machines.
I see that the left contexts are not quite identical (Cns vs. Cns: in the last LC position of the last rule), but I do not remember why I have this difference.

KRB: In two-level rules the difference between Cns and Cns: can make all the difference in the world, especially if there is another rule in the set that is mapping lexical Cns to something else on the surface.

[Trond:] In retrospect, I cannot understand how the Shortening ... 2 rule could cause the (then very annoying) problem with failed i > á in present 1st, 2nd of verbs, but the problem went away when Shortening ... 2 was removed. Here is the rule for Present tense 1st, 2nd i -> á; really a safe rule for boahtiQ4n:boad1á, as it seems (Q4 triggers cns gradation as well):

  "Stem Vowel in 1st and 2nd Person Singular Present"
  i:á <=> Cns _ Q4: ;

KRB: There are two kinds of two-level rule clash possible:

1. Right-arrow clashes, e.g.

     a:b => l _ r ;
     a:b => q _ m ;

   The first rule says that the pair a:b occurs only in the context l _ r. The second rule says that the pair a:b occurs only in the context q _ m. Clash. Easy to resolve: just combine the two rules into one rule with two contexts:

     a:b => l _ r ;
            q _ m ;

2. Left-arrow clashes, e.g.

     i:e <=   _ x ;
     i:á <= y _ ;

   The problem here is what to do when the input is the string "yix", which matches both rules. The first wants to output yex, and the second yáx. Clash.

Understanding why two-level rules clash takes some careful looking and a complete mastery of the semantics of the rules. Because the rules all apply simultaneously, and because the contexts of all the rules match on both sides of the relation, the difference between Cns and Cns: can be very significant.

I hope that helps,
Ken
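The left-arrow clash described above can be simulated with a toy Python sketch. Caveats: forced_outputs is an invented helper, and contexts here are checked on the lexical side only, whereas real twolc contexts are themselves two-level.

```python
# A left-arrow rule forces a lexical symbol to one surface symbol
# whenever its context matches; two rules clash when they force
# different surface symbols at the same position of some input.
def forced_outputs(lexical, rules):
    """rules: (lex_sym, surf_sym, left_ctx, right_ctx) tuples, with
    contexts given as lexical-side strings ('' = no requirement)."""
    forced = {}
    for i, sym in enumerate(lexical):
        for lex, surf, left, right in rules:
            if (sym == lex
                    and lexical[max(0, i - len(left)):i] == left
                    and lexical[i + 1:i + 1 + len(right)] == right):
                forced.setdefault(i, set()).add(surf)
    return forced

rules = [("i", "e", "", "x"),   # i:e <=   _ x ;
         ("i", "á", "y", "")]   # i:á <= y _ ;
forced = forced_outputs("yix", rules)
print(forced)   # position 1 is forced to both 'e' and 'á': a clash
```

On the input "qix", only the first rule fires and there is no conflict, which is why clashes can lurk unnoticed until a word like "yix" comes along.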
Umlaut
[Trond:] I struggle with twolc, on different topics. Here I consider Umlaut and Vowel Harmony issues. In your Monish task in the Book, you use archiphonemes. For VH this is OK, I can make archiphonemic suffixes. But for Umlaut I would like to have a clean lexicon representation (just enter roots from the dictionary as is, and indicate exceptional Umlaut stems by sending them to trigger-containing sublexica, rather than writing double entries in the lexicon). Thus rather

  LEXICON NounRoot
  bok UMLAUTFEMNOUN ;
  grop FEMNOUN ;

than

  LEXICON NounRoot
  bok:b^Ok FEMNOUN ;
  grop FEMNOUN ;

KRB: Trond, there's nothing wrong with your approach. However, keep in mind that when using Xerox technology, you are no longer restricted to the two-level geometry. In a classic two-level (KIMMO-style) system, the forms you put in the lexicon are the forms that the ultimate user will see. But with Xerox tools, you can write the core dictionary to look like this

  LEXICON NounRoot
  b^Ok FEMNOUN ;

and then, after all the alternation rules have been applied on the lower side, you can compose the following trivial rule ON TOP of the lexicon FST

  o <- %^O
  .o.
  LexiconFST

mapping the ^O UP to a plain o, so that the user will see "bok" instead of "b^Ok" in final solution strings.

[Trond:] The dictionaries I want to pour into the system already contain grammatical codes that match the former alternative, and I thus prefer the former solution, unless there are good reasons to choose the latter one (well, one reason is a smaller number of lexica, but in this case I can live with that).

KRB: No, it sounds like your approach is perfectly reasonable.

[Trond:]

  bokZ1er => bøker    ! book.IndefPl with Umlaut, since o:ø <=> _ Cns:* Z1: ;
  groper  => groper   ! hollow.IndefPl without Umlaut

  b^Oker  => bøker    ! ditto, since ^O:ø <=> _ Cns: +Pl: ;
  groper  => groper   ! (can I make reference to grammatical categories like this?)
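Ken's suggestion, keeping b^Ok in the core lexicon and composing a cleanup on the upper side, can be mocked up in Python. The dict-based "lexicon" and the analysis tags below are invented for the illustration; only the idea of rewriting ^O to plain o after lookup comes from the discussion.

```python
# Toy stand-in for a lexicon FST: surface form -> analysis strings.
lexicon = {
    "bøker":  ["b^Ok+N+Fem+Indef+Pl"],   # umlauted stem stored with ^O
    "groper": ["grop+N+Fem+Indef+Pl"],   # no umlaut
}

def cleanup_upper(analysis):
    # corresponds to composing  o <- %^O  on top of the lexicon FST:
    # the archiphoneme is mapped UP to a plain o
    return analysis.replace("^O", "o")

def apply_up(surface):
    return [cleanup_upper(a) for a in lexicon.get(surface, [])]

print(apply_up("bøker"))    # ['bok+N+Fem+Indef+Pl']
print(apply_up("groper"))   # ['grop+N+Fem+Indef+Pl']
```

The user never sees the archiphoneme: the cleanup layer lives on top of the lexicon, so "bok" appears in solution strings even though "b^Ok" is what is stored.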
[Trond:] I take it that the Z1 (a diacritic without being one) is something you do not like, but that is the format I inherited (btw., should I keep it as such (cf. attachment), or should I use the Diacritics section that you deprecate in the book?).

KRB: There's nothing wrong with using diacritics. After they've served their purpose, and you're done with them, you can map them to the empty string to keep the solution strings clean (if you like):

  0 <- Z1
  .o.
  LexiconFST

[Trond:] Now, if you compile the two files I sent you, you will see that nob.fst cannot analyse bøker, but if you invert the stack to inob.fst, then it can generate it. How come? And how do I ensure analysis capability as well as generation?

KRB: This looks like a bug. The application routines don't like any of the surface words containing the letter ø. Thanks for sending the actual FST so that I could reproduce the problem. I've sent a report to the coders.

[Trond:] I have the same problem with Finnish, when I try to make a vowel harmony rule based on the Monish one. I generate, but do not analyse (source file forthcoming if needed). I take it the problem is the same, since I get the message "o:ö is not a feasible pair" when I do a lex-pair test in twolc.

KRB: This sounds like part of the same problem. Let's see what the coders have to say. Thanks for the report. Sorry for the confusion.

Ken
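The "not a feasible pair" message refers to twolc's notion of feasible pairs: the set of lexical:surface symbol pairs sanctioned by the alphabet and rules. A toy Python version of the check, with an invented pair set and a hand-made alignment:

```python
# A lexical/surface correspondence is only considered if every aligned
# symbol pair is feasible; an undeclared pair like o:ö blocks analysis.
feasible = {("b", "b"), ("o", "o"), ("o", "ø"), ("k", "k"),
            ("Z1", ""), ("e", "e"), ("r", "r")}

def pairs_feasible(lexical, surface):
    """lexical, surface: equal-length tuples of symbols ('' = zero)."""
    return all(pair in feasible for pair in zip(lexical, surface))

# bokZ1er : bøker, aligned symbol by symbol (Z1 pairs with zero)
lex  = ("b", "o", "k", "Z1", "e", "r")
surf = ("b", "ø", "k", "",   "e", "r")
print(pairs_feasible(lex, surf))        # True
print(pairs_feasible(("o",), ("ö",)))   # False: o:ö is not in the set
```

If a rule mentions a pair that never made it into the alphabet, every path through that pair is dead on arrival, which produces exactly the generate-but-not-analyse asymmetry described above when one direction happens to avoid the pair.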
Flag diacritics
[Trond:] This time, the real diacritics. Here is the problem: Saami can have N+N compounds, and I install a loop lexicon, R, so that nominative and genitive forms of nouns are redirected to R (with due vowel shortening, as described in an earlier mail). Unfortunately, some nouns do not start out as such, but are derived from verbs and adjectives. This derivation process is handled by going from the appropriate adjectival and verbal stems via derivational suffixes to the relevant nominal sublexica, like this:

  verbstem+V => deraff+N, giving the wordform verbstem+V+deraff+N

These (productively) derived nouns may participate in compounding. I thus want to allow the following compounds:

  stem+N & stem+N
  stem+N & [ stem+V aff+N ]
  stem+N & [ stem+A aff+N ]

At the moment, I let my recursive lexicon R accept continuation to N, V and A alike, and thus I also allow the illicit compounds

  stem+N & stem+A
  stem+N & stem+V

Cf. these examples:

  viesumuitalus    viessu+N+Sg+Gen#muitalit+V+us+N+Sg+Nom
  viesumuitalit    viessu+N+Sg+Gen#muitalit+V+Inf

The first one is correct, as the 2nd component verb (muitalit) is turned into a noun (muitalus) via us-derivation (yes, the tag is +us). The 2nd one I want to avoid, since it is ungrammatical (an N+V compound), but as you can see, it is allowed by the (at the moment) overgenerating parser.

KRB: You could, of course, produce the overgenerating lexicon and then remove the overgeneration by composing filters on the top. They would have to be carefully written to allow multiple-part legal compounds, but it would almost certainly be possible to match and filter out illegal compounds that way. But let's assume that you want to use flag diacritics.

[Trond:] It seems I need a flag diacritic at LEXICON R saying "ok to continue to +V or +A only if I find an +N tag later on in the string".
Since I want to use diacritics sparingly, I would like it as follows (I use the flag diacritic X instead of {U, P, N, R, D, C} in this illustration, since I am not sure which one will fill my needs):

  LEXICON R
  # NounRoot ;
  #@X.compound.illicit@ VerbRoot ;
  #@X.compound.illicit@ AdjectiveRoot ;

KRB: Flag diacritics are tricky for many users. Be sure to read the Flag Diacritics chapter and learn the semantics of each type. Remember that features are set, checked and perhaps cleared linearly as analysis goes from beginning to end of the word. The semantics you suggest above correspond to "Forward-Looking Feature Requirements", which is treated in the last section of the current chapter on Flag Diacritics. You want to allow something, but only if something else is eventually found/satisfied down the road.

The compounding significantly complicates the picture. As you go around the compound loops, you may find a need to set, check and clear features carefully at critical points. I had to do this a lot in Aymara, and it wasn't easy. For my own mnemonic convenience, I name such forward-looking feature requirements with "Need" or "Require" in the title, e.g. NeedNom for "needs nominalization".

Let's just look at Nouns and Adjs, to make it slightly simpler. If and only if the Adjs are nominalized, they can participate in compounds.

  Multichar_Symbols
  @P.NeedNom.ON@    ! positive setting of NeedNom = ON
  @D.NeedNom.ON@    ! disallow NeedNom = ON
  @C.NeedNom@       ! clear the NeedNom feature back to neutral

  LEXICON Root
  NounRoot ;        ! start of normal noun
  AdjRoot ;         ! start of normal adj

  LEXICON NounRoot
  dog N ;
  cat N ;
  rat N ;

  ! at the end of each noun root, or nominalized noun stem, go through
  ! LEXICON N

  LEXICON N
  NounEndings ;     ! whatever they might be, leading to end of word
  Compound ;        ! or look for a noun or nominalized adj to compound onto

  LEXICON Compound
  %#:0 NounRoot ;   ! add a compound boundary, loop back to NounRoot; easy case
  < %#:0 @P.NeedNom.ON@ > AdjRoot ;
                    ! or add a compound boundary, set the feature NeedNom=ON,
                    ! and continue to the AdjRoot lexicon

  LEXICON AdjRoot
  blue A ;
  black A ;

  ! all adjectives pass through the following LEXICON A

  LEXICON A
  @D.NeedNom.ON@ AdjEndings ;
                    ! go to normal adj endings,
                    ! but DISALLOW this path if NeedNom is ON
  ness@C.NeedNom@ N ;
                    ! nominalizing suffix, makes the result a noun;
                    ! continue to LEXICON N, which leads to nominal
                    ! endings or further compounding. Clear any
                    ! outstanding "need" for nominalization;
                    ! it has been satisfied. Continue like any noun.

Handle the nominalized verbs similarly to the nominalized adjectives above. That's probably not completely right, but it's all I have time for now. The example above is relatively simple because it is written so that every noun root or nominalized stem has to pass through LEXICON N, and every AdjRoot has to pass through LEXICON A. LEXICON A, in particular, is the place where you need to check and clear feature values. If your grammar has branches all over the place, and no central LEXICON through which all adjectives are funneled, then it will be harder to identify all the key points for checking and clearing features. Not impossible, but a bit harder.

[Trond:] Then derivational sublexica might look like

  LEXICON DeverbalNounsDOHPPE-
  +n+N@X.compound.licit@+Actio:m BOAHTIN ;
  +mus1+N@X.compound.licit@:mus1s1 MUSH ;
  +meahttun+A:X7#meahttum MEAHTTUN ;

The idea being that a warning flag is set at R (value = illicit), and that it is removed only if the path ends in +N (value changed to licit); otherwise the string is blocked.

KRB: Right. Setting NeedNom=ON is not difficult. You do that just before continuing to AdjRoots or VerbRoots that are acceptable only if they are eventually nominalized. And "removing" (clearing) NeedNom features when you find a suitable nominalizing suffix is easy. The harder part is identifying all the other paths out of AdjRoots and VerbRoots that do _not_ involve nominalization, and blocking them if NeedNom=ON.
That's what the first entry in LEXICON A does. It lets you follow a non-nominalizing path out of AdjRoot, but only if NeedNom is not set to ON.

[Trond:] What diacritic should I use?

KRB: I have suggested a combination of P (positive setting), D (disallow) and C (clear) diacritics. There are no doubt other ways, using U (unification) and perhaps R (require) features, but this one fits my intuition about "forward-looking feature requirements".

[Trond:] Must I have a line in the xfst part of the Makefile to remove strings with a certain flag diacritic value, or will one of your flags do that by itself (the string is removed if the flag value is inherently illegal, so to speak)?

KRB: Re-read the chapter on Flag Diacritics, which explains how they work. It's the application routines themselves (apply up, apply down, and the 'lookup' application) that are sensitive to flag diacritics and will block illegal paths at runtime. It's automatic. (In xfst, you can manually turn off the "obeying" of flag diacritics, but it's on by default.)

[Trond:] But from what I see, you check from the suffix's point of view (is the input I get legal?) rather than from the stem's point of view (I'll be legal only if saturated by a certain type of suffix later in this derivation).

KRB: See the last section in the chapter, "Forward-Looking Feature Requirements", for a somewhat simple example of "allow this tentatively, only if something else turns up later". When compounding is involved, it gets a bit trickier, but not impossible.

Good luck,
Ken
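The @P...@/@D...@/@C...@ semantics that the NeedNom example relies on can be sketched as a toy Python path-walker. This is an invented helper, not the Xerox runtime; in reality, apply up, apply down and 'lookup' perform this check as they traverse the network.

```python
# Toy runtime check of the flag-diacritic scheme: walk a path of lexc
# symbols, maintaining a feature store, and reject the path if a
# @D...@ is met while its feature currently has the disallowed value.
def path_allowed(path):
    feats = {}
    for sym in path:
        if sym.startswith("@") and sym.endswith("@"):
            op, *rest = sym.strip("@").split(".")
            if op == "P":                        # @P.Feat.Val@ : set Feat=Val
                feats[rest[0]] = rest[1]
            elif op == "C":                      # @C.Feat@ : clear Feat
                feats.pop(rest[0], None)
            elif op == "D" and feats.get(rest[0]) == rest[1]:
                return False                     # @D.Feat.Val@ : disallow
    return True

# dog#blue-ness: the adjective gets nominalized, so the path survives
ok  = ["dog", "#", "@P.NeedNom.ON@", "blue", "ness", "@C.NeedNom@"]
# dog#blue + plain adj endings: blocked by @D.NeedNom.ON@ in LEXICON A
bad = ["dog", "#", "@P.NeedNom.ON@", "blue", "@D.NeedNom.ON@"]
print(path_allowed(ok), path_allowed(bad))   # True False
```

The sketch shows why the scheme counts as "forward-looking": the @P...@ at the compound boundary tentatively allows the adjective, and the path only survives if a later symbol (the nominalizing suffix with its @C...@) discharges the requirement before any @D...@ checkpoint is hit.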