141021

Contents:

FST
CG
MT
Oahpa
Next meeting

SamEst meeting 21.10.2014

Present: Fran, Heiki, Heli, Jaak, Sjur, Trond

Issues/topics:

FST
CG
MT
Oahpa

FST

Jaak checked in a todo list in langs/est/doc/, things that we might want to or should do:

Tag conversion
- Proper gt-style tagging of components (+Cmpnd-stuff)
- Ask HJK for improved verb paradigm
- Decide how to (not to?) encode "defaults"
Improvements waiting to happen
- proper gt-style punctuation in FST
- stress information in FST
Known bugs: süüa/sööa*, juua/jooa*, lapse/lapsu*

Status for todo-list issues

Cmpnd not done yet. An article on this?
Defaults: not discussed
Punctuation: Fran has something
Stress info: For later
The bugs: Something wrong in the twol rules, probably

Jaak made a command-line setup (two aliases) that makes it possible to generate word forms from lemmata + MS categories.

alias dest='$LOOKUP $GTHOME/langs/est/src/generator-gt-desc.xfst'
alias hdest='$HLOOKUP $GTHOME/langs/est/src/generator-gt-desc.hfstol'
dest
sööma+V+Pers+Prs+Ind+Sg2+Aff
sööma+V+Pers+Prs+Ind+Sg2+Aff        sööd

cf. 
sme: borrat+V+TV+Ind+Prs+Sg2    borat
fin: syödä+V+Act+Ind+Prs+Sg2    syöt
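A quick way to see where the tag sets differ is to split the analysis strings on + and compare the tag lists. This is only a minimal sketch with standard tools: the function names tags_of and tags_only_in_first are made up for illustration, and tag order is ignored.

```shell
#!/bin/sh
# Print the tags of one analysis string (lemma+Tag+Tag...), one per line,
# sorted, with the lemma itself dropped.
tags_of() {
    printf '%s\n' "$1" | tr '+' '\n' | sed 1d | LC_ALL=C sort
}

# Print the tags that occur in the first analysis but not in the second.
tags_only_in_first() {
    a=$(mktemp) && b=$(mktemp) || return 1
    tags_of "$1" > "$a"
    tags_of "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example: est tags that the fin analysis lacks
tags_only_in_first 'sööma+V+Pers+Prs+Ind+Sg2+Aff' 'syödä+V+Act+Ind+Prs+Sg2'
```

For the est/fin pair above this prints Aff and Pers; swapping the arguments gives the tags found only in the Finnish analysis.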

Let us work with pronouns and determiners, while waiting for verb work to happen elsewhere. Thereafter conjunctions, subjunctions, and then one noun (per paradigm). Then plural nouns, like häät. Then adjectives, and we wait with the verbs.

Method:

Generate all the forms in Finnish, e.g. for tämä
Replace tämä with ta
Try to generate the same for Estonian
Have a look, and correct
1. Add +Dem tag to Estonian
2. See that Nom and Gen work, note that Tra does not work
Go to next pronoun, minä -> mina
1. etc. (there are 81 pronouns in the Finnish source code)
Sum up as to what matches and what does not.
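The lemma replacement step of the method can be sketched with sed. The helper name swap_lemma is hypothetical, and the sample analysis strings are only illustrative; the real input would be the full list of analyses generated for the Finnish lemma.

```shell
#!/bin/sh
# Hypothetical helper: replace the lemma at the start of each analysis
# string (lemma+Tag+Tag...) read from stdin.
# Usage: swap_lemma OLD NEW < analyses.txt
swap_lemma() {
    sed -e "s/^$1+/$2+/"
}

# Illustrative input: two analyses of the Finnish pronoun, rewritten
# with the Estonian lemma so they can be sent to the Estonian generator.
printf '%s\n' 'tämä+Pron+Dem+Sg+Nom' 'tämä+Pron+Dem+Sg+Gen' |
    swap_lemma 'tämä' 'ta'
```

The output can then be piped to the Estonian generator (e.g. the dest alias) to see which analyses produce a form.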

While waiting, Jaak could make a fullform list of a Finnish verb and show Heiki-Jaan.

Regular verbs: sanoa, antaa.

Command + output for checking tag matches:

echo 'm a j a "+N" [ ? - [ "#" | "+Dim/ke" | "+Dim/kene" ] ] *' |
  hfst-regexp2fst |
  hfst-compose -F -2 - -1 src/analyser-gt-norm.hfst |
  hfst-project -p lower |
  hfst-fst2strings |
  sort -u |
  sed -e 's/^maja/talo/' |
  hfst-lookup -q ../fin/src/generator-gt-desc.hfstol |
  grep -v ^$ |
  less

talo+N+Der/lt+Adv       talo+N+Der/lt+Adv+?     inf
talo+N+Pl+Abe   taloitta        0,000000
talo+N+Pl+Abl   taloilta        0,000000
talo+N+Pl+Ade   taloilla        0,000000
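To sum up such output, one can count how many tag strings the other generator accepted and how many it rejected. The sketch below assumes tab-separated lookup output in which rejected input has +? at the end of the second column, as in the first sample line above; the function name summarise_lookup is made up.

```shell
#!/bin/sh
# Read tab-separated lookup output (input, result, weight) on stdin and
# count accepted vs. rejected analyses.
# Assumption: a rejected analysis has "+?" at the end of column 2.
summarise_lookup() {
    awk -F'\t' '
        NF < 2        { next }            # skip blank lines
        $2 ~ /\+\?$/  { fail++; next }    # no form was generated
                      { ok++ }
        END { printf "generated=%d failed=%d\n", ok, fail }
    '
}

# Example with three lines modelled on the output above
printf 'talo+N+Der/lt+Adv\ttalo+N+Der/lt+Adv+?\tinf\ntalo+N+Pl+Abe\ttaloitta\t0,000000\ntalo+N+Pl+Abl\ttaloilta\t0,000000\n' |
    summarise_lookup
```

In the pipeline it would take the place of the final less.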

Issue: The makefile generates optimised hfst transducers (.hfstol). The .hfst ones are removed by automake, as they are seen as intermediate files. They may be defined as goals in themselves, and thus kept. File to keep: analyser-gt-desc.hfst (in addition to analyser-gt-desc.hfstol). Sjur will add this file for all languages, as default. Update: done.

Tag conversion

Status: See above.

Other fst issues

Testing

The fst should be tested by using it (in Oahpa), but also via the yaml test procedure.

  Noun - kajava: # Noun 'seagull'
     kajava+N+Sg+Nom: kajava
     kajava+N+Sg+Gen: kajavan
     kajava+N+Sg+Par: kajavaa
     kajava+N+Sg+Tra: kajavaksi
     kajava+N+Sg+Abe: kajavatta
     kajava+N+Sg+Ine: kajavassa

TODO: Heiki-Jaan to find/script key paradigms for yaml testing (stored in test/src/gt-norm/N-maja_gt-norm.yaml etc.)
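Scripting the paradigms could start from a list of analysis/form pairs and print them in the layout shown above. This is a rough sketch only: the function name yaml_entries is hypothetical, the input is assumed to be tab-separated (analysis, form), and only the test entries are produced, not whatever header the yaml files need.

```shell
#!/bin/sh
# Hypothetical helper: turn "analysis<TAB>form" lines on stdin into
# yaml test entries under the given title.
# Usage: yaml_entries 'Noun - kajava' < pairs.txt
yaml_entries() {
    printf '  %s:\n' "$1"
    awk -F'\t' 'NF >= 2 { printf "     %s: %s\n", $1, $2 }'
}

# Example using two of the kajava forms from the notes
printf 'kajava+N+Sg+Nom\tkajava\nkajava+N+Sg+Gen\tkajavan\n' |
    yaml_entries 'Noun - kajava'
```

The forms should be checked by hand before they are stored as expected test results.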

Hint: Look at other languages for inspiration. sms, sma, fkv, rus, yrk are languages with different approaches.

Tag harmony article

Could this be for NoDaLiDa? (Deadline: January 19th.) We sign up for the article.

The article should be about the harmonious tendencies in sme<fin>est tag comparison

CG

r101345 | trond | 2014-10-20 07:53:01 +0200 (Mon, 20 Oct 2014) | 1 line

Moving file mrfdis.cg3 to the standard name disambiguation.cg3
in order to make scripts and setup work. The old file was just
a copy of the fao one, put there to make compilation work.
----------------------------------------------------------------
r101342 | tiina | 2014-10-20 00:31:30 +0200 (Mon, 20 Oct 2014) | 1 line

Current morphological disambiguation rules using Filosoft/UT morphological analyser

Getting rid of inline sets!

Tiina will be looking for a way of doing that.

(<*1 X) means "can look in previous windows"

Tag conversion

Tiina's file has tags that are close to the plamk tags, so we need to do tag conversion here as well. Tiina has a script that converts estmorf output to mrfdis.cg3 input.

Plan

Get rid of inline sets
Tag conversion

TODO: Discussions: Tiina and Fran.

MT

fin-est

State of bidix.

The work done just after Tartu was very impressive. Some of it is pending (the "unreliable" list). Status is open.

Test cases

Some testcases can be found at:

Tag normalisation

Already discussed, see above, but the Apertium page has some relevant information.

http://wiki.apertium.org/wiki/Finnish_and_Estonian#Tagset_stuff

fin-sme

Here we wait for the Finnish CG, the old cg1. It has been converted to cg3, but needs work. This is for next year.

$ findis "Minä tulen."
"<Minä>"
        "minä" Pron Pers Sg Nom @SUBJ→
"<tulen>"
        "tulla" V Act Ind Prs Sg1 @+FMAINV
"<.>"
        "." Punct CLB

We also need tag harmonisation fin-sme, or rather we need to take sme into account when doing fin-est.

Oahpa

Lexicon

Heli has a lexicon of 3000 words; each of them should have a POS tag. One may analyse all of them with the plamk est.fst, or with Filosoft, and pick the unambiguous ones. For the rest, one may add the POS tagging of the Finnish and English, and pick the harmony sets.
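Picking the unambiguous words can be sketched as a filter over analyser output. The assumptions here are that the output is tab-separated (word, analysis, weight), that the POS is the first tag after the lemma, and that unknown words end in +?; the function name unambiguous_pos is made up.

```shell
#!/bin/sh
# Read tab-separated analyser output (word, analysis, weight) on stdin and
# print "word<TAB>POS" for every word whose analyses all share one POS.
# Assumption: the POS is the first tag after the lemma in column 2.
unambiguous_pos() {
    awk -F'\t' '
        NF < 2 || $2 ~ /\+\?$/ { next }   # skip blanks and unknown words
        {
            split($2, parts, "+")
            p = parts[2]
            if (!($1 in pos)) pos[$1] = p
            else if (pos[$1] != p) ambiguous[$1] = 1
        }
        END { for (w in pos) if (!(w in ambiguous)) print w "\t" pos[w] }
    ' | LC_ALL=C sort
}

# Illustrative input: maja is N in both analyses, tee is N or V
printf 'maja\tmaja+N+Sg+Nom\t0,0\nmaja\tmaja+N+Sg+Gen\t0,0\ntee\ttee+N+Sg+Nom\t0,0\ntee\ttegema+V+Pers+Imprt+Sg2\t0,0\n' |
    unambiguous_pos
```

Words left out by this filter are the ones that need the Finnish and English POS comparison.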

Next step is then to generate paradigms for the N, A, V in the 3000 list.

Teacher input

Morfa-C: Not giving the correct case, but turning the task upside down and asking for the correct case.

The task thus resembles Morfa-C Mix, but is wider: it includes verb valency, oblique objects, etc.

One of the students should turn these tasks into morfa-C xml frames.

Heli is waiting for more input.

Numra

... is not working. The shoemaker's kids, etc.

Next meeting

Anyone going to SLTC 2014?

Fran, Sjur at least

Not enough people going there, thus a regular meeting.

Next meeting: Tuesday November 11, at 12.00 Norwegian time.

Also Tiina and Kadri should attend

Sjur to send out a (calendar) invitation.