Bidix Improvement Plan
Procedure for bidix improvement:
The bidix file
The file is found as follows:
- cd apertium/nursery/apertium-sme-smn
  - see apertium-sme-smn.sme-smn.dix
After 75 initial lines of definitions, the bidix contains, in this order:
- The initial bidix chapter A starts at approx. line 75
  - Manual additions from text + some loanwords
- tEQ1 chapter B starts at approx. line 670
  - Words from Cip's bidix run having a 1-1 match sme-smn
- 1-m chapter C starts at approx. line 5500
  - Word pairs with one sme and more than one smn, ordered according
- Names chapter F starts at approx. line 9500
  - These are foreign names, just ignore them for now
Todo: Choose the right smn for each sme in chapter C.
Procedures
Use XML or XSL mode in SubEthaEdit.
Procedure for editing existing word pairs
Start at the top of chapter C.
There will be more than one line for each sme lemma, as follows:
<e><p><l>hiljážii<s n="adv"/></l><r>kuuloold<s n="adv"/></r></p></e>
<e><p><l>hiljážii<s n="adv"/></l><r>šiäđust<s n="adv"/></r></p></e>
<e><p><l>divttásmuvvat<s n="vblex"/><s n="iv"/></l><r>sovđâđ<s n="vblex"/></r></p></e>
<e><p><l>divttásmuvvat<s n="vblex"/><s n="iv"/></l><r>suáhuđ<s n="vblex"/></r></p></e>
The procedure for editing is:
- Remove whole lines
  - Remove the lines that give a wrong translation
  - In cases where more than one translation is ok, remove the less general (or less common) ones
- You are allowed to leave two translations only in the following case:
  - You are able to state explicitly when to use one, and when to use the other, e.g.
    - This verb is translated to X for human subjects but to Y for non-human subjects
    - This adjective is translated to X when it modifies words for food, but to Y when it does not
    - ..
  - In that case, you do the following:
    - Keep both lines
    - Open the file apertium-sme-smn.sme-smn.lrx, and write an explanation at the beginning of that file.
    - Note that if we are not able to formalise the difference, we should just keep one pair. Remove the one you do not want, and remove the whole line.
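To see which sme lemmas still have more than one smn candidate, something like the following sketch could be used. It assumes one entry per line, as in the examples above; the name one_to_many and the sample list are illustrative only and not part of the apertium tools.

```python
import re
from collections import defaultdict

# Matches one bidix entry and captures the left (sme) and right (smn) sides.
ENTRY = re.compile(r"<e><p><l>(.*?)</l><r>(.*?)</r></p></e>")


def one_to_many(lines):
    """Return {sme side: [smn sides]} for sme sides that occur more than once."""
    groups = defaultdict(list)
    for line in lines:
        for left, right in ENTRY.findall(line):
            groups[left].append(right)
    return {left: rights for left, rights in groups.items() if len(rights) > 1}


# Illustrative sample: two lines for hiljážii, one line for birget.
sample = [
    '<e><p><l>hiljážii<s n="adv"/></l><r>kuuloold<s n="adv"/></r></p></e>',
    '<e><p><l>hiljážii<s n="adv"/></l><r>šiäđust<s n="adv"/></r></p></e>',
    '<e><p><l>birget<s n="vblex"/><s n="iv"/></l><r>piergiđ<s n="vblex"/></r></p></e>',
]

# Only hiljážii is left in the result, since birget has a single translation.
remaining = one_to_many(sample)
```

The sme side is compared together with its tags, so the same lemma with different tags counts as two different entries.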
Correction of errors:
- Do not correct the sme entries. If they contain errors, delete the whole line
- If none of the smn translations are correct, you may take one of the sme-smn lines
When the smn translation should consist of more than one word, the blank is written as <b/>, as in this entry with a multiword sme lemma:
<e><p><l>ovddos<b/>guvlui<s n="adv"/></l><r>ovdâskuávlui<s n="adv"/></r></p></e>
In most cases, we do not want multiword translations in the bidix; they should be handled in the transfer rules instead.
When you are done editing, do the following:
- At the point in the file where you are, make a new empty line.
- Write a note, roughly like this: <!-- Checked until this line 1.11.15. TT -->
- Save the file
- Run make, and look for error messages saying e.g.
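Many problems after manual editing are plain XML errors, such as a line that was only partly removed. As a quick check before running make, the sketch below tests whether a piece of bidix text is well-formed. It uses Python's standard xml.etree.ElementTree; the function name check_well_formed is an assumption for illustration, and the check does not replace the messages from make.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses as XML, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column where parsing failed.
        return str(err)
    return None


# Illustrative fragments: the second one lacks the closing </e>.
good = '<section><e><p><l>haga<s n="po"/></l><r><s n="po"/></r></p></e></section>'
bad = '<section><e><p><l>haga<s n="po"/></l><r><s n="po"/></r></p></section>'

good_result = check_well_formed(good)
bad_result = check_well_formed(bad)
```

For the whole file, the text can be read from apertium-sme-smn.sme-smn.dix and passed to the same function.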
Procedure for adding new word pairs
Give the lemma of both sme and smn. Check the analysis, e.g. ávvudoalut:
ávvudoalut ávvodoalut+Err/Orth+N+Pl+Nom <= the lemma is ávvodoalut
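The lemma can also be read off an analysis line mechanically. The sketch below assumes that the surface form and the analysis are separated by whitespace and that the lemma is everything before the first +; split_analysis is a hypothetical helper name.

```python
def split_analysis(line):
    """Split 'surface lemma+Tag+Tag' into (surface, lemma, [tags])."""
    surface, analysis = line.split(None, 1)
    lemma, _, tag_string = analysis.strip().partition("+")
    tags = tag_string.split("+") if tag_string else []
    return surface, lemma, tags


surface, lemma, tags = split_analysis("ávvudoalut ávvodoalut+Err/Orth+N+Pl+Nom")

# An Err/Orth tag means the surface form is a misspelling, so the lemma
# from the analysis is the one to use in the bidix.
is_misspelling = "Err/Orth" in tags
```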
Be aware that some verbs are IV and others are TV. For the time being, we add this tag only to the sme lemma:
<e><p><l>birget<s n="vblex"/><s n="iv"/></l><r>piergiđ<s n="vblex"/></r></p></e>
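To avoid typing errors in the markup, new entries could be generated from the lemmas and tags. The following is a minimal sketch, assuming the one-line entry format shown above; make_entry and its parameters are hypothetical names, and the choice of tags still has to be made by the bidix-worker.

```python
from xml.sax.saxutils import escape


def tag_string(tags):
    """Turn ['vblex', 'iv'] into '<s n="vblex"/><s n="iv"/>'."""
    return "".join('<s n="{}"/>'.format(tag) for tag in tags)


def side(lemma, tags):
    """Build one side of an entry; spaces in the lemma become <b/>."""
    return escape(lemma).replace(" ", "<b/>") + tag_string(tags)


def make_entry(sme, sme_tags, smn, smn_tags):
    """Build a one-line bidix entry with sme on the left and smn on the right."""
    return "<e><p><l>{}</l><r>{}</r></p></e>".format(
        side(sme, sme_tags), side(smn, smn_tags)
    )


# The iv tag is given on the sme side only, as described above.
entry = make_entry("birget", ["vblex", "iv"], "piergiđ", ["vblex"])
```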
Special cases - and how to handle them
sme lemma is Pl, smn lemma is Sg – or the other way round
Some lemmas are lexicalised as plurals. As long as it is the same for sme and smn, it is no problem. But if the number is not the same for these two languages, then the number tags must be given in the bidix.
E.g. ávvodoalut+N+Pl vs. juhlálâšvuotâ+N+Sg. Add plural and singular tags to the bidix:
<e><p><l>ávvodoalut<s n="n"/><s n="pl"/></l><r>juhlálâšvuotâ<s n="n"/><s n="sg"/></r></p></e>
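The rule can be stated as a small check: number tags are added only when the two analyses disagree in number. The sketch below assumes analyses written as in the example (lemma+N+Pl); number_of and number_tags are hypothetical names.

```python
# Map the FST number tags to the bidix tag names.
NUMBER = {"Sg": "sg", "Pl": "pl"}


def number_of(analysis):
    """Return 'sg', 'pl' or None for an analysis such as 'ávvodoalut+N+Pl'."""
    for tag in analysis.split("+")[1:]:
        if tag in NUMBER:
            return NUMBER[tag]
    return None


def number_tags(sme_analysis, smn_analysis):
    """Return the extra bidix tags for (sme, smn); empty lists if the number agrees."""
    sme_number = number_of(sme_analysis)
    smn_number = number_of(smn_analysis)
    if sme_number and smn_number and sme_number != smn_number:
        return [sme_number], [smn_number]
    return [], []


# Plural in sme, singular in smn: both sides get a number tag.
mismatch = number_tags("ávvodoalut+N+Pl", "juhlálâšvuotâ+N+Sg")
```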
sme lemma is an adverb, smn lemma is not lexicalised as adverb, but a noun in locative.
Many adverbs are really inflected nouns, usually locatives, illatives or genitives. Sometimes the lemma can be lexicalised as an adverb in one of the languages, but not in the other language. One could consider if the word should be lexicalised also in the other language. If the bidix-worker is not responsible for the FST for the language in question, she should just leave a comment about it.
E.g. iđđes vs. iđedist. Give correct tags, and a comment:
<e><p><l>iđđes<s n="adv"/></l><r>iiđeed<s n="n"/><s n="sg"/><s n="loc"/></r></p></e> <!-- not same PoS -->
sme lemma is not lexicalised
Sometimes the lemma can be lexicalised as a postposition in one of the languages, but not in the other language. One could consider if the word should be lexicalised also in the other language. If the bidix-worker is not responsible for the FST for the language in question, she should just leave a comment about it.
E.g. háldui+Po vs. haaldun+Po. Add a comment:
<e><p><l>háldui<s n="po"/></l><r>haaldun<s n="po"/></r></p></e> <!-- not in sme -->
sme lemma has no counterpart in smn, instead smn has an inflection of the noun:
e.g. haga+Po vs. abessive case in smn.
Give explanations and examples in the contrastive grammar (or another common file for such notes) and a comment about it in the bidix:
<e><p><l>haga<s n="po"/></l><r><s n="po"/></r></p></e> <!-- abessive, explained in the contrastive grammar -->
For historical reference: This was done to create the bidix:
- Diff the manual work done for
- build a new bidix from fresh data, as follows:
  - take the 1-1 pair from words/finsmn/trans-dict/all_sme2smn.csv - DONE
  - for the 1-m (one-to-many) pairs,
    - take the cognates (= Levenshtein distance <= 3) from
    - take the remaining 1-m sme words, and order them after sme POS,
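The cognate criterion above can be illustrated as follows. This is a sketch of the standard Levenshtein (edit) distance with the threshold of 3 mentioned in the list; it is not the script that was actually used, and the names levenshtein and is_cognate are assumptions.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute or match
            ))
        previous = current
    return previous[-1]


def is_cognate(sme, smn, max_distance=3):
    """Treat a word pair as cognates when the edit distance is at most max_distance."""
    return levenshtein(sme, smn) <= max_distance
```

The distance is counted per Unicode code point, so both words should use the same normalisation (precomposed or decomposed letters) before they are compared.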