
Lexicon for the Komi analyser

The lexicon file format

The Komi lexicon files are used both for dictionary creation and for the transducer.

The interplay between lexicon XML files, derived lexc files and morph files

The main Komi file is kt/kom/src/kom-lex.txt. It contains the lexicon Root (the initial lexicon). The same src directory also contains the subdirectory working_files (cf. here for a look).

During compilation, the entries in the dictionary XML files are extracted and placed in the directory kt/tmp/out/ (two levels up).

To take an example:

The file working_files/PRON-PERS_kom-lex.xml has the following entry:

   <entry>
      <lemma>ме</lemma>
      <stem/>
      <contlex>PRON-PERS-SG1-NOM</contlex>
      <pos>PRON-PERS</pos>
      <article>
         <eng>
            <choice>
               <variant>I</variant>
            </choice>
         </eng>
         <fin>
            <choice>
               <variant>minä</variant>
            </choice>
         </fin>
      </article>
   </entry>

From this file, the compilation process derives a lexc file and writes it to the directory kt/tmp/out. There we find the derived file PRON-PERS_kom-lex.txt. The first three lines of that file are:

LEXICON PRON-PERS

ме PRON-PERS-SG1-NOM  "I" ;

The prefix of the XML file name (PRON-PERS) is the name of the continuation lexicon. Each entry has a lemma (here ме) and a stem (here the stem element is empty, i.e. the stem is identical to the lemma). Then comes a space, followed by the contlex (here PRON-PERS-SG1-NOM) and the English translation in quotes. The contlex is defined in the file kt/kom/src/pron-kom-morph.txt.
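
The mapping from an XML entry to a lexc line can be illustrated with a small Python sketch. This is not the actual build script; the function names entry_to_lexc and lexicon_name are hypothetical, and the lemma:stem form used when the stem differs from the lemma is an assumption based on ordinary lexc notation, not something the example above confirms. The Finnish part of the article is left out for brevity.

```python
import os
import xml.etree.ElementTree as ET

# The example entry from working_files/PRON-PERS_kom-lex.xml (English gloss only).
ENTRY_XML = """
<entry>
   <lemma>ме</lemma>
   <stem/>
   <contlex>PRON-PERS-SG1-NOM</contlex>
   <pos>PRON-PERS</pos>
   <article>
      <eng>
         <choice>
            <variant>I</variant>
         </choice>
      </eng>
   </article>
</entry>
"""


def lexicon_name(xml_path):
    """Derive the lexicon name from the file name prefix (hypothetical helper)."""
    base = os.path.basename(xml_path)
    return base.split("_kom-lex")[0]


def entry_to_lexc(entry):
    """Turn one <entry> element into a lexc line (hypothetical helper)."""
    lemma = (entry.findtext("lemma") or "").strip()
    stem = (entry.findtext("stem") or "").strip()
    contlex = (entry.findtext("contlex") or "").strip()
    gloss = (entry.findtext("article/eng/choice/variant") or "").strip()
    # Assumption: an empty <stem/> means the stem equals the lemma;
    # otherwise the usual lexc "lemma:stem" pair is written.
    form = lemma if (not stem or stem == lemma) else f"{lemma}:{stem}"
    return f'{form} {contlex}  "{gloss}" ;'


if __name__ == "__main__":
    entry = ET.fromstring(ENTRY_XML)
    print("LEXICON", lexicon_name("working_files/PRON-PERS_kom-lex.xml"))
    print()
    print(entry_to_lexc(entry))
```

Run on the entry above, the sketch reproduces the three lines shown for the derived file PRON-PERS_kom-lex.txt.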

The lexicon files

The Komi lexicon files are found here (you may have to choose "show source code" in the browser):