Adding Morphological Test Data
Presently we have three types of morphology testing:
- lemma generation
- yaml tests
- lexc tests
These will briefly be presented here, with instructions on how to adapt or
augment them.
Lemma generation
The und template includes a simple shell script that tests lemma generation
for nouns. The basic idea is simple: extract all lemmas in the lexicon (in the
src/morphology/stems/ dir), and try to generate each of them. Generation
should always succeed.
In practice it is a bit more complicated, and the script may also need some
adaptation to each language.
The adaptation is basically to check that the tag string used for generating
the lemma form actually corresponds to what is used in the language. There are
also languages where the concept of a "lemma" doesn't make that much sense. If
that is so, disable the test script by removing it from the TESTS variable in
test/src/morphology/Makefile.am.
Complicating factors include nouns that do not inflect in the singular (the
usual lemma form) and other kinds of irregular lemma formation.
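The idea behind the lemma test can be sketched in a few lines of Python. This is only an illustration of the principle, not the shell script shipped with the template: the helper names extract_lemmas and failed_lemmas are hypothetical, the LexC parsing is deliberately naive (it ignores escapes and multichar symbol declarations), and generate stands in for whatever lookup you run against the generator fst.

```python
def extract_lemmas(lexc_text):
    """Collect lemmas from LexC stem entries such as 'atim:atim N_AN ;'."""
    lemmas = []
    for line in lexc_text.splitlines():
        line = line.split("!", 1)[0].strip()      # drop LexC comments
        if not line or line.startswith("LEXICON") or not line.endswith(";"):
            continue
        entry = line.split()[0]                   # 'lemma:stem' or bare 'lemma'
        lemma = entry.split(":", 1)[0]            # upper side of the entry
        lemma = lemma.split("+", 1)[0]            # drop any tags on the lemma
        if lemma:
            lemmas.append(lemma)
    return lemmas


def failed_lemmas(lemmas, tags, generate):
    """Return the lemmas that are NOT generated from lemma + tags.

    'generate' is a hypothetical callable: analysis string -> set of word forms.
    """
    return [lemma for lemma in lemmas if lemma not in generate(lemma + tags)]
```

The tags argument is where the language-specific adaptation goes: it must be the tag string that yields the lemma form in the language at hand, for example +N+AN+Sg for the animate nouns used elsewhere on this page.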
The template only gives noun lemma generation, but it is easy to use that script
as a template for doing the same for verbs, adjectives and proper nouns. At least
North, Julev and South Sámi have more elaborate test scripts for all of these
parts of speech. Have a look there for inspiration.
Note that this setup does not work for languages whose gender systems divide
nouns into different classes.
Yaml tests
The most widely used morphological tests are the Yaml tests. The data format
is simple and straightforward, with a short header followed by the actual test
data:
Config:
  hfst:
    Gen: ../../../src/generator-gt-norm.hfst
    Morph: ../../../src/analyser-gt-norm.hfst
  xerox:
    Gen: ../../../src/generator-gt-norm.xfst
    Morph: ../../../src/analyser-gt-norm.xfst
    App: lookup
Tests:
  Noun - atim - ok : # -m animate noun
    atim+N+AN+Sg: atim # this is a comment
    atim+N+AN+Pl: atimwak # test
    atim+N+AN+Loc: atimohk # really rare form
The yaml syntax is simple, but relies on indenting: two spaces for each level of
data structure nesting.
The header is started by the keyword Config, and lists fst's to be used for
analysis and generation, for both Xerox and Hfst. The path is relative to the
test dir test/src/morphology/.
The test data is similarly started by the keyword Tests, followed by a line
containing the name of the test (Noun - atim - ok in the example above).
On the following lines there is one line for each morphosyntactic form, using
the notation: analysis string, followed by a colon, followed by the wordform
string. If there is more than one possible wordform, they are all given
on the same line, separated by comma and space, and enclosed in square brackets:
ненэцьʼ+N+Sg+Loc: [ненэцяӈгана, ненэцяӈгна]
Remember to always indent properly!
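To make the notation concrete, here is a minimal Python sketch that splits one test data line into its analysis string and its list of word forms. It is not the parser used by the test bench (which reads the file as Yaml); parse_test_line is a hypothetical helper, and it assumes that a comment starts at a space followed by #.

```python
def parse_test_line(line):
    """Split 'analysis: wordform' or 'analysis: [form1, form2]' into parts."""
    line = line.split(" #", 1)[0].strip()      # strip trailing comment and indent
    analysis, _, forms = line.partition(": ")  # colon + space separates the columns
    forms = forms.strip()
    if forms.startswith("[") and forms.endswith("]"):
        wordforms = [form.strip() for form in forms[1:-1].split(",")]
    else:
        wordforms = [forms]
    return analysis, wordforms
```

A single word form and a bracketed list both come back as a list, so later checks can treat the two cases alike.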
Negative Yaml tests
Sometimes it can be valuable to specify negative tests. Usually they should
not be needed, since any overgeneration will be reported as a FAIL. It might
still be a good idea to test for word forms that are known to have caused
problems.
To specify a negative test, add a tilde in front of the word form in the Yaml
data, as follows:
gierehtse+N+Sg+Acc: [gierehtsem, ~gieriehtsem]
Now the Yaml test will only pass if the last word form given is NOT generated
and does NOT receive any analysis.
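The effect of the tilde on the generation side can be illustrated with a small Python sketch. The function names are hypothetical and the logic is a simplified reading of the description above, not the test bench's own code: generated is assumed to be the set of word forms the generator returned for the analysis string.

```python
def split_expectations(wordforms):
    """Separate ordinary word forms from negative ones marked with '~'."""
    positive = [w for w in wordforms if not w.startswith("~")]
    negative = [w[1:] for w in wordforms if w.startswith("~")]
    return positive, negative


def generation_failures(wordforms, generated):
    """Return (missing, forbidden) for one test line.

    missing:   expected forms that were not generated
    forbidden: negative forms that were generated anyway
    """
    positive, negative = split_expectations(wordforms)
    missing = [w for w in positive if w not in generated]
    forbidden = [w for w in negative if w in generated]
    return missing, forbidden
```

A test line passes this check only when both lists are empty.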
Filenames for Yaml tests
The filenames for the yaml tests are built up with the following components:
- a descriptive part, which may contain anything except an underscore
- an underscore
- an fst specifier
- an optional .ana or .gen specifier
- the suffix .yaml
The underscore is the separator between the "free" part and the fst specifier.
By specifying the fst as part of the filename, it is possible to write tests for
all of the produced fst's.
By specifying .ana or .gen before the .yaml suffix, only
analysis or generation testing will be done on the data. This is useful
for testing transducers that do not naturally come in generation/analysis pairs.
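The naming scheme can be expressed as a regular expression. The Python sketch below is an illustration only: the pattern is derived from the component list above rather than taken from the test bench, and the example filenames in the usage note are made up.

```python
import re

# description (no underscore) + "_" + fst specifier + optional .ana/.gen + .yaml
YAML_NAME = re.compile(
    r"^(?P<description>[^_]+)_(?P<fst>.+?)(?:\.(?P<direction>ana|gen))?\.yaml$"
)


def parse_yaml_test_name(filename):
    """Return (description, fst, direction), or None if the name does not fit."""
    match = YAML_NAME.match(filename)
    if match is None:
        return None
    return match.group("description"), match.group("fst"), match.group("direction")
```

With this sketch, a hypothetical file N-atim_gt-norm.yaml would be tested in both directions against the gt-norm fst's, whereas V-raattad_dict-gt-norm.gen.yaml would only be used for generation against dict-gt-norm.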
Lexc tests
It is also possible, and often a very good idea, to add test cases directly to
the LexC source code. The syntax is very similar to the Yaml syntax (and is
parsed and tested by the same machinery that uses the yaml files), and looks
like the following:
!!€gt-norm: adjectives
!!€ isvelihks: isvelihks+A+Attr
!!€ isveligs: isvelihks+A+Attr
!!€ isvelihks: isvelihks+A+Sg+Nom
!!€ isveligs: isvelihks+A+Sg+Nom
The first line specifies which transducer to run the test data against, followed
by colon and space, and then the name of the test. There must be no space
between the Euro sign and the transducer specifier, and no space between the
transducer specifier and the following colon. The string !!€gt-norm: is
obligatory (you can replace gt-norm with another fst specifier if you want
to test against e.g. a descriptive fst, or an fst with a different tagset), but
the name of the test (adjectives in the case above) is optional. If not
specified, the name will be the last seen lexicon name above.
The rest of the lines specify the test data, one line per word form, in two
columns: the first column contains the surface wordform, and the second column
the corresponding analysis string.
Positive tests are specified with the string !!€ at the very beginning
of the line, whereas negative tests are specified by the string !!$ at
the beginning of the line. Then both are followed by a space, then the word
form, then a colon followed by whitespace, and finally the lemma+tags:
! Test data:
!!€gt-norm: gierehtse # Odd-syllable test
!!€ gierehtse: gierehtse+N+Sg+Nom
!!€ gierehtsen: gierehtse+N+Sg+Gen
!!€ gieriehtsasse: gierehtse+N+Sg+Ill
!!€ gierehtsem: gierehtse+N+Sg+Acc
!!$ gieriehtsem: gierehtse+N+Sg+Acc ! Block diphthongues in odd syll
Note the last line, where we explicitly check that the illegal word form
gieriehtsem is never generated or accepted.
Note that there must be a space between !!€ or !!$ and the following
word form in the test data, and there must be a colon followed by whitespace
after the word form and before the lemma+tag string. This syntax allows
multiword expressions as test data.
It is ok to have LexC style comments after the second column, as shown in the
last line.
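The line syntax described above can be summarised in a short Python sketch. It is not the parser used by the test bench; the regular expressions and the function parse_lexc_tests are hypothetical, and the sketch assumes that # starts a comment on the header line and that the analysis string contains no whitespace.

```python
import re

# Header: "!!€" immediately followed by the fst specifier and a colon.
HEADER = re.compile(r"^!!€(?P<fst>[^\s:]+):\s*(?P<name>[^#!]*)")
# Test line: "!!€ " (positive) or "!!$ " (negative), word form, colon, analysis.
TEST = re.compile(r"^!!(?P<kind>[€$]) (?P<form>.+?):\s+(?P<analysis>[^\s!]+)")


def parse_lexc_tests(text):
    """Return (fst, test name, is_positive, word form, analysis) tuples."""
    tests = []
    fst = name = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            fst = header.group("fst")
            name = header.group("name").strip() or None
            continue
        test = TEST.match(line)
        if test and fst is not None:
            tests.append((fst, name, test.group("kind") == "€",
                          test.group("form"), test.group("analysis")))
    return tests
```

Because the word form is matched up to the first colon followed by whitespace, it may itself contain spaces, which is what makes multiword expressions possible as test data.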
NB! Possible pitfall: due to the way the parsed test data is stored internally
by the test bench, you cannot use the same lemma more than once for the same
fst within the same lexc file. That is, check that the words you use for testing
are only used in one test each, and you should be fine.
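A quick way to guard against this pitfall is to check the test data of one lexc file for lemmas that turn up in more than one test. The Python sketch below shows one possible check; reused_lemmas is a hypothetical helper, and it assumes you already have the test lines as (fst, test id, analysis string) triples, where the test id is anything that tells the tests apart (a name or a running number).

```python
from collections import defaultdict


def reused_lemmas(entries):
    """Return (fst, lemma) pairs that occur in more than one test.

    entries: iterable of (fst, test_id, analysis) triples from ONE lexc file.
    """
    seen = defaultdict(set)
    for fst, test_id, analysis in entries:
        lemma = analysis.split("+", 1)[0]   # text before the first tag
        seen[(fst, lemma)].add(test_id)
    return sorted(key for key, test_ids in seen.items() if len(test_ids) > 1)
```

Several lines for the same lemma inside one test are fine; only a lemma shared between different tests for the same fst is reported.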
One-sided (half) tests in lexc
In some cases you may want to run the tests in only one direction: only analysis
or only generation tests. This is required when testing specialised fst's
that do not exist in pairs. Here is one example (from Inari Sámi):
!!€dict-gt-norm.gen: # Even-syllable test, generation only
!!€ raattâđ: raattâđ+V+Inf
The dict-gt-norm fst is only used for generation. The dictionary analyser
is descriptive, not normative, so that dictionary lookup accepts non-normative
input. The analyser and the generator therefore cover different sets of word
forms and need different sets of test data.
To tell the test bench that the fst and the following test data should only be
used for generation testing, add the «suffix» .gen to the fst name just before
the colon. If you need to run certain tests only for the analyser, add the
«suffix» .ana instead.