Lexicalising Norwegian
For analysis of Norwegian we may use either the Oslo-Bergen tagger
(obt) or the nob finite state transducer (nob.fst) from Giellatekno.
The Giellatekno fst is based
upon a wordform list and contains approximately 2000 unclassified
verbs and 2700 unclassified nouns. At the outset, the obt pipeline
is thus better. On the positive side for the gt fst is its flexibility.
For Neahttadigisánit we use the gt fst, and therefore we lexicalise
all compouds found in the dictionary.
The gt fst is found in $GTHOME/langs/nob, and is thus part of
the new infrastructure, with the stems in src/morphology/stems.
The nouns, verbs and adjectives are given the continuation lexica
found in Bokmålsordboka, the inflection
code system is also found at the top of the files in both the
stems/ and the affixes/ catalogues.
Lexicalisation
The nob.fst may be set up to include or exclude dynamic compounds.
To check today's behaviour, check for the words hybelkanin
(lexicalised) and hybelhest (not lexicalised). If both are accepted,
dynamic compounding is ON, if only the former is accepted, it is OFF.
The behaviour is regulated by commenting in and out 3 lines of the
lexicon R in src/morphology/root.lexc.
Turn dynamic compounding off (if needed), and find unknown verbs
for example as follows:
cat file|preprocess|rev|sort|rev|uniq|unob|grep '?'|cut -f1
Add words to the files in src/morphology/stems/ by following
the pattern indicated on the top of each file. When words may be
both masculine and feminine (like boka vs. boken), choose
feminine. The analyser treats all feminines as potential masculines.