Test diary
Test results for the morphology and lexicon files
This document contains test results for the Lule Saami parser. We will move to an automatic test regime, but in the meantime the initial steps are documented here.
Test results for the lexicon
The following table records recall for word forms in various texts. Here we measure coverage of the vocabulary, by recording all word forms that are not recognised.
-----------------------------------------------------------------------------------
zcorp/gt/smj/bible/nt/lule_sami_new_testament.html.xml
                          Token recall testing            Type recall testing
-----------------------------------------------------------------------------------
Test 1   lex      Wf-total   Wf-tkn   %-recall    Tytot   Wf-typ   %-recall
070627            120070     119752   99,7 %
060228   19742    135662     131212   96,7 %      13289   11831    89,0 %   ← 978 inc missing.
060228   18307    135662     125367   92,4 %      13289   11385    85,7 %   ← More rare words.
060227   17997    135662     123368   90,1 %      13289    9938    74,8 %   ← More Kintel, äöŋ fix
060226   17723    135662     108573   80,0 %      13289    8952    67,4 %   ← More Kintel
060222            135662      82748   70,0 %      13289    2195    16,5 %   ← First Kintel import
060124    3368    135662      75018   55,3 %      13289    2084    15,6 %   ← Still no lexicon
-----------------------------------------------------------------------------------
A lower token than type percentage would indicate that the parser misses common words more often than rare ones. A lower type than token percentage (which is what we see here) indicates that the parser handles the core vocabulary well, but has lower coverage of rarer words.
Each text is given a separate section in the table, ordered chronologically with the oldest test run (Test 1) at the bottom. The first line of each section gives the name of the file. Each subsequent line represents a test run: the first column gives the test date (in the format yymmdd), Wf-total the total number of word forms in the file in question, Wf-tkn the number of recognised word form tokens, and %-recall the percentage relative to the total. The following three columns give the same figures for word form types (cf. below for the commands used to calculate the numbers).
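As a worked example, take the 060228 test run in the table above (131212 recognised tokens out of 135662, and 11831 recognised types out of 13289). The percentages can be recomputed on the command line, here simply using bc for the division (bc is only used for illustration and is not part of the test setup itself):

  # Token recall for the 060228 run
  echo "scale=1; 131212 * 100 / 135662" | bc    # 96.7
  # Type recall for the same run
  echo "scale=1; 11831 * 100 / 13289" | bc      # 89.0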
Test 1 does not cover proper nouns, as they have not been added to the lexicon yet. The commands used to get the numbers are:
- The file command (below referred to as "ccat file"):
- ccat zcorp/gt/smj/bible/nt/lule_sami_new_testament.html.xml | ...
- Wftot (total number of wordform tokens):
- ccat file | preprocess | grep -v '^[A-Z]' | wc -l
- Non_recognised_wf (number of non-recognised wordform tokens):
- ccat zcorp/gt/smj/bible/nt/lule_sami_new_testament.html.xml | preprocess | grep -v '^[A-Z]' | lookup -flags mbTT -utf8 bin/smj.fst | grep '\?' | grep '[a-z]' | wc -l
- Wf-tkn (number of recognised wordform tokens) = Wftot - Non_recognised_wf
- %-recall (tokens) = Wf-tkn * 100 / Wftot
- Tytot (total number of wordform types):
- ccat file | preprocess | grep -v '^[A-Z]' | sort | uniq | wc -l
- Non_recognised_wt (number of non-analysed wordform types):
- ccat file | preprocess | sort | uniq | lookup -flags mbTT -utf8 bin/smj.fst | grep '\?' | grep -v CLB | wc -l
- Wf-typ (number of recognised wordform types) = Tytot - Non_recognised_wt
- %-recall (types) = Wf-typ * 100 / Tytot
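Since the long-term goal mentioned above is an automatic test regime, the commands can be collected into a single script. The following is only a sketch, assuming that ccat, preprocess and lookup are available on the PATH and that the corpus file and transducer paths are the ones used above; it simply chains the commands listed here and prints the token and type recall.

  #!/bin/sh
  # Sketch: compute token and type recall for one corpus file.
  # Assumes ccat, preprocess and lookup are on the PATH.
  FILE=zcorp/gt/smj/bible/nt/lule_sami_new_testament.html.xml
  FST=bin/smj.fst

  # Token counts (proper nouns excluded, as in Test 1)
  wftot=$(ccat "$FILE" | preprocess | grep -v '^[A-Z]' | wc -l)
  missed_tkn=$(ccat "$FILE" | preprocess | grep -v '^[A-Z]' |
    lookup -flags mbTT -utf8 "$FST" | grep '\?' | grep '[a-z]' | wc -l)
  wf_tkn=$((wftot - missed_tkn))

  # Type counts
  tytot=$(ccat "$FILE" | preprocess | grep -v '^[A-Z]' | sort | uniq | wc -l)
  missed_typ=$(ccat "$FILE" | preprocess | sort | uniq |
    lookup -flags mbTT -utf8 "$FST" | grep '\?' | grep -v CLB | wc -l)
  wf_typ=$((tytot - missed_typ))

  echo "Token recall: $wf_tkn / $wftot = $(echo "scale=1; $wf_tkn * 100 / $wftot" | bc) %"
  echo "Type recall:  $wf_typ / $tytot = $(echo "scale=1; $wf_typ * 100 / $tytot" | bc) %"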
Last modified $Date: 2013-01-23 21:26:54 +0100 (gask, 23 ođđj 2013) $, by $Author: sjur $