Lexc Twolc Development
This document explains how to improve the analysers. We assume everything is
You know you have reached this stage when the command make check gives you
SUMMARY for the gt-desc fst(s): PASSES: 36 / FAILS: 232 / TOTAL: 268
An example
Consonant gradation in Inari Saami
When debugging errors, you must investigate what happens when the erroneous
At least 4 files are involved in giving us the genitive form, namely:
- src/morphology/root.lexc
- src/morphology/stems/nouns.lexc
- src/morphology/affixes/nouns.lexc
- src/phonology/smn-phon.twolc
We will return to the first one. The lemma (ito) and the stem
grep '^ito:' src/morphology/stems/nouns.lexc
The answer (i.e. the entry for ito) is
ito: i%^RVto%^SV PARGO ;
This means that the lemma is ito, and the stem is i%^RVto%^SV.
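The split into lemma, stem and continuation lexicon can be made explicit with a small Python sketch. It is not part of the Giella tools; the function name parse_entry and the regular expression are illustrative assumptions, and the pattern only covers simple entries of the kind shown here (no escaped colons or spaces).

```python
import re

# Simplified pattern for a lexc entry: "upper:lower CONTINUATION ;"
# The ":lower" part is optional; escaped colons (%:) are NOT handled.
ENTRY = re.compile(
    r"^\s*(?P<upper>[^:\s]+)(?::\s*(?P<lower>\S+))?\s+(?P<cont>\S+)\s*;"
)

def parse_entry(line):
    """Return lemma (upper side), stem (lower side) and continuation lexicon."""
    m = ENTRY.match(line)
    if m is None:
        return None
    upper = m.group("upper")
    # Without a colon, upper and lower side are the same string.
    lower = m.group("lower") or upper
    return {"upper": upper, "lower": lower, "continuation": m.group("cont")}

if __name__ == "__main__":
    print(parse_entry("ito: i%^RVto%^SV PARGO ;"))
    print(parse_entry("+N+Sg+Gen:%^WG K ; ! kisá"))
```

If the lower side printed for a lemma is not the stem you expected, the error is in the stem file rather than in the rules.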
+N+Sg+Gen:%^WG K ; ! kisá
Both these entries contain a colon. The left of the colon we call the
ito+N+Sg+Gen
---------------
i%^RVto%^SV%^WG
The symbols %^RV, %^SV, %^WG (and similar symbols for other words)
Now, what we want is not i%^RVto%^SV%^WG, but iđo. In order to see
The twolc file takes the lower level of lexc as its upper level, and
ito+N+Sg+Gen      = lexc upper
---------------   ----------
i%^RVto%^SV%^WG   = lexc lower
i%^RVto%^SV%^WG   = twolc upper
---------------   ----------
iđo               = twolc lower
We see that
For the t:đ change, let us look for the twolc rule responsible for it.
"t:đ gradation"
t:đ <=> Vow: _ (k4:) Vow (Cns) (Dummy:*) %^WG:0 ;
The rule says: There is a t:đ alternation whenever there is an underlying vowel to
The net result is that gradation takes place, and that we get the form we want.
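The effect of the rule on our example can be imitated in a few lines of Python. This is only a rough sketch and not how twolc works internally: the vowel set is an illustrative guess, the optional (k4:) part of the context is left out, every ^-marker other than ^WG is simply skipped when checking the context, and all function names are made up for this example.

```python
import re

# Illustrative vowel set; NOT taken from smn-phon.twolc.
VOWELS = set("aáâäeiouyö")

def tokenize(lexc_lower):
    """Split a lexc lower string into symbols; ^XX markers stay in one piece.
    Assumes marker names are upper-case letters or digits."""
    return re.findall(r"\^[A-Z0-9]+|.", lexc_lower.replace("%", ""))

def is_marker(sym):
    return sym.startswith("^")

def apply_t_gradation(symbols):
    """Simplified imitation of: t:đ <=> Vow: _ Vow (Cns) (Dummy:*) %^WG:0"""
    out = list(symbols)
    for i, sym in enumerate(symbols):
        if sym != "t":
            continue
        # Left context: the nearest non-marker symbol must be a vowel.
        left = [s for s in symbols[:i] if not is_marker(s)]
        if not left or left[-1] not in VOWELS:
            continue
        # Right context: vowel, optional consonant, markers, then ^WG.
        rest = symbols[i + 1:]
        j = 0
        if j < len(rest) and rest[j] in VOWELS:
            j += 1
        else:
            continue
        if j < len(rest) and not is_marker(rest[j]) and rest[j] not in VOWELS:
            j += 1
        while j < len(rest) and is_marker(rest[j]) and rest[j] != "^WG":
            j += 1
        if j < len(rest) and rest[j] == "^WG":
            out[i] = "đ"
    return out

def surface(lexc_lower):
    """Apply the sketch rule, then realise all markers as zero."""
    symbols = apply_t_gradation(tokenize(lexc_lower))
    return "".join(s for s in symbols if not is_marker(s))

if __name__ == "__main__":
    print(surface("i%^RVto%^SV%^WG"))  # weak grade trigger present
    print(surface("i%^RVto%^SV"))      # no trigger, t stays
```

The second call shows the other half of the <=> operator: without the %^WG trigger the t is left alone.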
Debugging
Now, this all went fine. What we are interested in are the cases where we get no analysis, or
- Errors in the stems/nouns.lexc file
- The lemma is missing from the lexicon file (here: stems/nouns.lexc)
- The lemma is there, but the stem (to the right of : ) is not what I expected it to be
- There is a typo in either the lemma or the stem
- The lemma has another continuation lexicon than it should
- Errors in the affixes/nouns.lexc file
- The entry (here: +N+Sg+Gen) is missing from the continuation lexicon
- The entry is there, but it has the wrong form (e.g. there should have
- Errors in the twolc file
- Look at the lower lexc string (here: i%^RVto%^SV%^WG
- Everything may be fine with the rule you intended to use, but there may be
- Multicharacter symbols errors
- Is the multicharacter symbol defined? If the entry contains symbols like
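The last check in the list above, whether a multicharacter symbol is declared, is easy to automate. The following Python sketch assumes the usual lexc layout, where the declarations follow the keyword Multichar_Symbols and end at the first LEXICON line. The function names are hypothetical, and comment stripping is simplified (an escaped %! would be misread).

```python
import re

def declared_multichars(lexc_text):
    """Collect the symbols listed in the Multichar_Symbols section."""
    symbols = set()
    in_section = False
    for line in lexc_text.splitlines():
        line = line.split("!", 1)[0].strip()  # drop lexc comments (simplified)
        if line.startswith("Multichar_Symbols"):
            in_section = True
            line = line[len("Multichar_Symbols"):]
        elif line.startswith("LEXICON"):
            in_section = False
        if in_section:
            symbols.update(line.split())
    return symbols

def used_markers(entry):
    """Find symbols of the form %^XX in a lexc entry."""
    return set(re.findall(r"%\^[A-Z0-9]+", entry))

def undeclared(entry, lexc_text):
    """Markers used in the entry but missing from the declarations."""
    return used_markers(entry) - declared_multichars(lexc_text)

if __name__ == "__main__":
    # Made-up declaration section, for illustration only.
    sample = "Multichar_Symbols\n+N +Sg +Gen ! tags\n%^RV %^WG\nLEXICON Root\n"
    print(undeclared("ito: i%^RVto%^SV PARGO ;", sample))
```

An undeclared symbol is read by lexc as a sequence of single characters, so the twolc rules that refer to it never see it.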
Testing
So, how do we know there is an error?
We may check a word with the usmn command, and see that it gets no
grep ' ito+' test/src/gt-norm-yamls/*
The file was N-even-o_gt-norm.yaml.
After running make check, we may search the terminal output for the file
pushd /Users/trond/main/langs/smn/test/src; /opt/local/bin/python3.3 /Users/trond/main/giella-core/scripts/morph-test.py -c -i -v -S xerox --app "/usr/local/bin/lookup -flags mbTT" --morph ././../../src/analyser-gt-norm.xfst --gen ././../../src/generator-gt-norm.xfst ./gt-norm-yamls/N-even-o_gt-norm.yaml; popd
(the /Users/trond part is obviously different for other users)
Paste this command into any terminal window (opening a new one may be a good idea).
---------------------------------------
Test 2: Noun - ito (Lexical/Generation)
---------------------------------------
[ 1/16][PASS] ito+N+Sg+Nom => ito
[ 2/16][PASS] ito+N+Sg+Gen => iđo
[ 3/16][PASS] ito+N+Sg+Acc => iđo
[ 4/16][FAIL] ito+N+Sg+Ill => Missing results: iton
[ 4/16][FAIL] ito+N+Sg+Ill => Unexpected results: iiton
[ 5/16][FAIL] ito+N+Sg+Loc => Missing results: iiđoost
[ 5/16][FAIL] ito+N+Sg+Loc => Unexpected results: iđost
[ 6/16][PASS] ito+N+Sg+Com => iđoin
[ 7/16][PASS] ito+N+Sg+Abe => iđottáá
...
-------------------------------------
Test 6: Noun - ito (Surface/Analysis)
-------------------------------------
[ 1/14][PASS] ito => ito+N+Sg+Nom
[ 2/14][PASS] iđo => ito+N+Sg+Gen
[ 2/14][PASS] iđo => ito+N+Sg+Acc
[ 3/14][FAIL] iton => Missing results: ito+N+Sg+Ill
[ 4/14][FAIL] iiđoost => Missing results: ito+N+Sg+Loc
[ 5/14][PASS] iđoin => ito+N+Sg+Com
[ 6/14][PASS] iđottáá => ito+N+Sg+Abe
The most interesting one in this context is the generation one,
In our case, the genitive form is ok, but the illative
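When the test output is long, it can help to filter out the failing lines and pair each missing form with the unexpected form generated instead. The Python sketch below does this for output of the shape shown above. It is not part of morph-test.py; the function name and the regular expression are assumptions based only on the lines quoted in this document.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [ 4/16][FAIL] ito+N+Sg+Ill => Missing results: iton
LINE = re.compile(r"\[\s*\d+/\d+\]\[(PASS|FAIL)\]\s+(\S+)\s+=>\s+(.*)")

def collect_failures(output):
    """Map each failing test item to its missing and unexpected results."""
    failures = defaultdict(lambda: {"Missing": [], "Unexpected": []})
    for match in LINE.finditer(output):
        status, item, rest = match.groups()
        if status != "FAIL":
            continue
        kind, _, forms = rest.partition(" results: ")
        if kind in ("Missing", "Unexpected"):
            failures[item][kind].append(forms.strip())
    return dict(failures)

if __name__ == "__main__":
    sample = (
        "[ 3/16][PASS] ito+N+Sg+Acc => iđo\n"
        "[ 4/16][FAIL] ito+N+Sg+Ill => Missing results: iton\n"
        "[ 4/16][FAIL] ito+N+Sg+Ill => Unexpected results: iiton\n"
    )
    for item, result in collect_failures(sample).items():
        print(item, result)
```

Seeing the expected and the generated form side by side (iton versus iiton) often points directly at the symbol that caused the difference.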
The procedure for finding the errors is exactly the same as
- go through the automaton step by step, and find the stems
- Ill: i%^RVto%^SV%^RLEN%>n K ; ! kiisán
- Loc: i%^RVto%^SV%^SV%^WG%^CLEN%^SLEN%>st K ; ! kissáást
- look at errors in the lexc file, if there are no errors,
- look at the twolc rules
In this particular case, it seems we have a lexc error:
For the locative, two conventions seem to have clashed
These errors will be fixed, but in principle, this is the type of
twolc debugging
The program twolc may be used in order to see whether the twolc
cd src/phonology
twolc
read-grammar smn-phon.twolc
compile
The computer now prints strange messages to you for, say, half a minute
In the twolc file, there are test cases (lines starting with !€).
lex-test
and paste in the upper line of a test pair, e.g.
i%^RVto%^SV%^WG
The result should be
iđo
i ^RV:0 t:đ o ^SV:0 ^WG:0
If this is not the case (e.g. you get no result, or another result),
pair-test
then write your input, ENTER, and the output that you want
If things do not work, you will get a message telling what rule
If you change the twolc rule file and want to try again, leave
When done, leave the twolc program by saying quit.
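The pair string printed by lex-test (i ^RV:0 t:đ o ^SV:0 ^WG:0 in the example above) lists one symbol pair per position, with 0 standing for zero. A minimal Python sketch for reading such a line back into its upper and lower strings is given here. The function name is hypothetical, and the sketch assumes that symbols are separated by spaces and that no symbol itself contains a colon.

```python
def split_pairs(pair_string):
    """Turn 'i ^RV:0 t:đ o' into (upper string, lower string)."""
    upper, lower = [], []
    for token in pair_string.split():
        up, sep, low = token.partition(":")
        if not sep:
            low = up  # a bare symbol stands for an identity pair
        # 0 is the zero symbol: nothing is written on that side.
        upper.append("" if up == "0" else up)
        lower.append("" if low == "0" else low)
    return "".join(upper), "".join(lower)

if __name__ == "__main__":
    print(split_pairs("i ^RV:0 t:đ o ^SV:0 ^WG:0"))
```

Comparing the two strings position by position shows which symbols were changed or deleted, which is where to look when the result differs from the expected form.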