Korp Karp Installation
Korp is a Corpus tool and Karp a Lexicon tool from the Swedish
Links
Work plan
- Download Korp code
- Install at gtweb
- Install corpora
- Make interface
Corpora available
- Free
  - skuvlahistorja1-6
  - fad
- Bound
  - news
  - ficti
  - NT
Corpus mixes
- smesme: news + ficti
- nob2sme: fad + skuvlahistorja1-6
- smedep: news + ficti + facta/skuvlahistorja1-6 + bibel/newtestament
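The mixes above can be kept in one small lookup so that the install scripts and the interface agree on what each mix contains. This is only a sketch: the function name mix_components is hypothetical, and the component names are copied as written in the list above.

```shell
#!/bin/sh
# Print the component corpora of a corpus mix, one per line.
# mix_components is a hypothetical helper name; the mix names and
# their components are taken from the "Corpus mixes" list.
mix_components() {
    case "$1" in
        smesme)  printf '%s\n' news ficti ;;
        nob2sme) printf '%s\n' fad skuvlahistorja1-6 ;;
        smedep)  printf '%s\n' news ficti facta/skuvlahistorja1-6 bibel/newtestament ;;
        *)       echo "unknown corpus mix: $1" >&2; return 1 ;;
    esac
}

# Example: list what goes into the parallel mix.
mix_components nob2sme
```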
Interface
- search for sme wordforms (kwic-snt in corpus ccat) – corpus: smesme
- search for sme lemmas (kwic-snt? in analysed corpus syn) – corpus choices: smesme, nob2sme
- search for sme and nob in translations (lemma search in sentence aligned sentences) – corpus: nob2sme
- deepdict sme (lemma search -> dependency daughters in corpus dep) – corpus: smedep
Lemgram
Definitions
- lexeme = member of an open lexical category, having meaning and form but being neither
- lemma = wordform used as representative for lexeme
- grammatical word = pair of (lemma + grammatical properties) and wordform
- paradigm = set of grammatical words realising a lemma
- lemgram = set of wordforms in paradigm
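To make the last two definitions concrete, here is a minimal sketch that collapses a paradigm into a lemgram. It assumes, hypothetically, that the generator output is tab-separated with "lemma+tags" in the first column and the wordform in the second; the function name lemgrams and the sample lines are for illustration only. Each input line is one grammatical word, and the output line per lemma is the set of its distinct wordforms.

```shell
#!/bin/sh
# Collapse "lemma+tags<TAB>wordform" lines into one lemgram per lemma.
# Assumption: the lemma is everything before the first "+" in column 1.
lemgrams() {
    awk -F'\t' '
        {
            split($1, parts, "+")          # parts[1] is the lemma
            lemma = parts[1]
            key = lemma SUBSEP $2
            if (key in seen) next          # a lemgram is a set: skip repeated wordforms
            seen[key] = 1
            if (lemma in forms) {
                forms[lemma] = forms[lemma] " " $2
            } else {
                order[++n] = lemma         # remember first-seen order of lemmas
                forms[lemma] = $2
            }
        }
        END {
            for (i = 1; i <= n; i++) print order[i] "\t" forms[order[i]]
        }
    '
}

# Illustrative input: four grammatical words, two of which share a wordform.
printf 'guolli+N+Sg+Nom\tguolli\nguolli+N+Sg+Gen\tguoli\nguolli+N+Sg+Acc\tguoli\nguolli+N+Pl+Nom\tguolit\n' | lemgrams
# prints: guolli<TAB>guolli guoli guolit
```

Keying on the lemma alone follows the definitions above; if homonymous lemmas with different parts of speech need separate lemgrams, the key would have to include the part-of-speech tag as well.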
Generation
Use dict-isme-norm.fst, generator-dict-gt-norm.xfst, or generator-dict-gt-norm.hfst. We remove the tags v1, v2, ... from the fst, because it is better for the user that all variants of the same paradigm end up in the same lemgram. Many fst lemmas have more than one entry in lexc, so the lemma list should be deduplicated before generating forms. I suggest that we start with these files:
For nouns, we pick three different lists: the ordinary nouns, the actor nouns (NomAg), and the G3-marked nouns.
noun-sme-lex.txt:
- Ordinary words:
egrep -v "(G3|ACTOR|CmpN/Only|ShCmp|RCmpnd|\+V\+|^\!)"
- ACTOR:
grep N+NomAg
- G3:
grep N+G3
verb-sme-lex.txt:
egrep -v "(ENDLEX|\+V|^\!)"
adj-sme-lex.txt:
egrep -v "(LEXICON|Der| Rreal | R |^\!)"
adv-sme-lex.txt:
egrep -v "(LEXICON| K |^\!)"
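The filters above only select lexc lines; the deduplication step still has to reduce each line to its lemma. A minimal sketch of that step follows. It assumes, without having checked the actual files, that each entry starts with the lemma, optionally followed by "+tags" and ":stem", then whitespace and a continuation lexicon. The function name unique_lemmas, the sample entries, and the continuation name CONT are made up for the example. Cutting at the first "+" also drops any v1, v2 tags that sit on the lemma side.

```shell
#!/bin/sh
# Reduce filtered lexc entries (on stdin) to a sorted list of unique lemmas.
# Assumed entry shape: lemma[+tags][:stem] CONTINUATION ;
unique_lemmas() {
    sed -e 's/[[:space:]:].*$//' \
        -e 's/+.*$//' |
    grep -v '^$' |
    sort -u
}

# Illustrative run on made-up entries; with the real files the input would
# come from one of the filters above, for example:
#   grep -E -v '(LEXICON| K |^!)' adv-sme-lex.txt | unique_lemmas
printf '%s\n' \
    'guolli+N:guolli CONT ;' \
    'guolli+N:guoll CONT ;' \
    'beana+v1+N:beana CONT ;' \
    'beana CONT ;' \
    '! a comment line' \
    'LEXICON NounRoot' |
grep -E -v '(LEXICON|^!)' |
unique_lemmas
# prints:
# beana
# guolli
```

The first sed expression keeps only the upper side of the entry, and the second strips the tags. Lexc escapes such as %: inside a lemma are not handled here and would need extra care.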