141021
SamEst meeting 21.10.2014
Present: Fran, Heiki, Heli, Jaak, Sjur, Trond
Issues/topics:
- FST
- CG
- MT
- Oahpa
FST
Jaak checked in a todo list in langs/est/doc/, things that we might want to or should do:
- Tag conversion
- Proper gt-style tagging of components (+Cmpnd-stuff)
- Ask HJK for improved verb paradigm
- Decide how to (not to?) encode "defaults"
- Improvements waiting to happen
- proper gt-style punctuation in FST
- stress information in FST
- Known bugs: süüa/sööa*, juua/jooa*, lapse/lapsu*
Status for todo-list issues
- Cmpnd not done yet. An article on this?
- Defaults: not discussed
- Punctuation: Fran has something
- Stress info: For later
- The bugs: Something wrong in the twol rules, probably
Jaak made a command line something that makes it possible
    alias dest='$LOOKUP $GTHOME/langs/est/src/generator-gt-desc.xfst'
    alias hdest='$HLOOKUP $GTHOME/langs/est/src/generator-gt-desc.hfstol'

    dest
    sööma+V+Pers+Prs+Ind+Sg2+Aff
    sööma+V+Pers+Prs+Ind+Sg2+Aff    sööd

    cf.
    sme: borrat+V+TV+Ind+Prs+Sg2    borat
    fin: syödä+V+Act+Ind+Prs+Sg2    syöt
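The cf. comparison above shows that the three languages do not use the same tags for the same form. A minimal sketch for listing the tags that one analysis has and another lacks, assuming tags are the '+'-separated parts after the lemma; `tag_diff` is a hypothetical helper, not part of the existing infrastructure:

```shell
# tag_diff A B: print the tags found in analysis A but not in analysis B.
# Assumption: the first '+'-separated field is the lemma and is skipped.
tag_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        nb = split(b, tb, "[+]")
        for (i = 2; i <= nb; i++) seen[tb[i]] = 1
        na = split(a, ta, "[+]")
        for (i = 2; i <= na; i++) if (!(ta[i] in seen)) print ta[i]
    }'
}

# Estonian tags that the Finnish analysis does not have:
tag_diff 'sööma+V+Pers+Prs+Ind+Sg2+Aff' 'syödä+V+Act+Ind+Prs+Sg2'
# Finnish tags that the Estonian analysis does not have:
tag_diff 'syödä+V+Act+Ind+Prs+Sg2' 'sööma+V+Pers+Prs+Ind+Sg2+Aff'
```

The comparison ignores tag order, so it only reports differences in the tag inventory, not in the sequence.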
Let us work with pronouns and determiners, while waiting
Method:
- Generate all the forms in Finnish, e.g. for tämä
- Replace tämä with ta
- Try to generate the same for Estonian
- Have a look, and correct
- Add +Dem tag to Estonian
- See that Nom and Gen work, note that Tra does not work
- Go to next pronoun, minä -> mina
- etc. (there are 81 pronouns in the Finnish source code)
- Sum up as to what matches and what does not.
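The lemma replacement step in the method above could be sketched as follows. This is only an illustration: `swap_lemma` is a hypothetical helper, and the two Finnish analysis strings are made-up examples, not taken from the Finnish source code.

```shell
# swap_lemma OLD NEW: read analysis strings on stdin and replace the lemma
# OLD with NEW. Only a lemma at the start of the line, directly followed by
# the first '+', is touched; the tags are left as they are.
# Assumption: OLD and NEW contain no characters that are special to sed.
swap_lemma() {
    sed -e "s/^$1+/$2+/"
}

# Illustrative input; real input would be the generated Finnish analyses.
printf '%s\n' 'tämä+Pron+Dem+Sg+Nom' 'tämä+Pron+Dem+Sg+Gen' \
    | swap_lemma 'tämä' 'ta'
```

The resulting strings can then be sent to the Estonian generator to see which forms come out.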
While waiting, Jaak could make a fullform list of a Finnish
Regular verbs: sanoa, antaa.
Command + output for checking tag matches:
    echo 'm a j a "+N" [ ? - [ "#" | "+Dim/ke" | "+Dim/kene" ] ] *' \
      | hfst-regexp2fst \
      | hfst-compose -F -2 - -1 src/analyser-gt-norm.hfst \
      | hfst-project -p lower \
      | hfst-fst2strings \
      | sort -u \
      | sed -e 's/^maja/talo/' \
      | hfst-lookup -q ../fin/src/generator-gt-desc.hfstol \
      | grep -v ^$ \
      | less

    talo+N+Der/lt+Adv    talo+N+Der/lt+Adv+?    inf
    talo+N+Pl+Abe    taloitta    0,000000
    talo+N+Pl+Abl    taloilta    0,000000
    talo+N+Pl+Ade    taloilla    0,000000
Issue: The makefile generates optimised hfst transducers (.hfstol).
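To sum up what matches and what does not, the lookup output shown above can be counted automatically. A minimal sketch, assuming the output is tab-separated (input, result, weight) and that a result ending in `+?` means the form was not generated; `summarise_lookup` is a hypothetical helper:

```shell
# summarise_lookup: read lookup output on stdin, list the analyses that were
# not generated, and print a count of generated vs. missing forms.
# Assumption: fields are tab-separated and failures end in "+?".
summarise_lookup() {
    awk -F'\t' '
        NF == 0      { next }                                # skip empty lines
        $2 ~ /\+\?$/ { fail++; print "MISSING\t" $1; next }  # no form generated
                     { ok++ }
        END          { printf "generated: %d, missing: %d\n", ok, fail }
    '
}

# Two lines taken from the output above:
printf 'talo+N+Der/lt+Adv\ttalo+N+Der/lt+Adv+?\tinf\ntalo+N+Pl+Abe\ttaloitta\t0,000000\n' \
    | summarise_lookup
```

In the pipeline above it would replace the final `less`.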
Tag conversion
Status: See above.
Other fst issues
Testing
The FST should be tested in use (i.e. by building Oahpa), but also
    Noun - kajava: # Noun 'seagull'
      kajava+N+Sg+Nom: kajava
      kajava+N+Sg+Gen: kajavan
      kajava+N+Sg+Par: kajavaa
      kajava+N+Sg+Tra: kajavaksi
      kajava+N+Sg+Abe: kajavatta
      kajava+N+Sg+Ine: kajavassa
TODO: Heiki-Jaan to find/script key paradigms for yaml
Hint: Look at other languages for inspiration.
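Scripting the paradigms could start from generator output. A rough sketch, assuming tab-separated lookup output (analysis, form) and the "analysis: form" layout of the kajava example; `to_yaml_tests` is a hypothetical helper, and the rest of the yaml file (header, configuration) is not covered here:

```shell
# to_yaml_tests TITLE: read "analysis<TAB>form" lines on stdin and print them
# as a yaml test block under the heading TITLE.
# Assumption: analyses that failed to generate end in "+?" and are skipped.
to_yaml_tests() {
    printf '%s:\n' "$1"
    awk -F'\t' 'NF >= 2 && $2 !~ /\+\?$/ { printf "  %s: %s\n", $1, $2 }'
}

printf 'kajava+N+Sg+Nom\tkajava\nkajava+N+Sg+Gen\tkajavan\n' \
    | to_yaml_tests 'Noun - kajava'
```

The generated forms still have to be checked by hand before they are used as tests, since the generator may itself be wrong.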
Tag harmony article
Could this be for NoDaLiDa? (Deadline = Jan 19th).
The article should be about the
See also: http://universaldependencies.github.io/docs/.
CG
    r101345 | trond | 2014-10-20 07:53:01 +0200 (Mon, 20 Oct 2014) | 1 line

    Moving file mrfdis.cg3 to the standard name disambiguation.cg3 in order to
    make scripts and setup work. The old file was just a copy of the fao one,
    put there to make compilation work.
    ----------------------------------------------------------------
    r101342 | tiina | 2014-10-20 00:31:30 +0200 (Mon, 20 Oct 2014) | 1 line

    Current morphological disambiguation rules using Filosoft/UT morphological
    analyser
Getting rid of inline sets!
Tiina will look for a way of doing that.
(<*1 X) means "can look in previous windows"
Tag conversion
Tiina's file has close to plamk tags, so we need to do tag
Plan
- Get rid of inline sets
- Tag conversion
TODO: Discussions: Tiina and Fran.
MT
fin-est
State of bidix.
The work done just after Tartu was very impressive.
Test cases
Some testcases can be found at:
- http://wiki.apertium.org/wiki/Finnish_and_Estonian/Pending_tests
- https://docs.google.com/spreadsheets/d/1kqxyXLqrj3w1F7usDMDRJMgY02zX5oz-9DQhSccshDA/edit?usp=sharing
Tag normalisation
Already discussed, see above, but the Apertium page has some relevant
fin-sme
Here we wait for the Finnish CG, the old cg1. It has been
    $ findis "Minä tulen."
    "<Minä>"
        "minä" Pron Pers Sg Nom @SUBJ→
    "<tulen>"
        "tulla" V Act Ind Prs Sg1 @+FMAINV
    "<.>"
        "." Punct CLB
We also need tag harmonisation fin-sme, or rather we need
Oahpa
Lexicon
Next step is then to generate paradigms for the N, A, V in the
Teacher input
Morfa-C: Not giving the correct case, but turning
The task thus resembles Morfa-C Mix, but is wider:
One of the students should turn these tasks into
Heli is waiting for more input.
Numra
... is not working. The shoemaker's kids, etc.
Next meeting
Anyone going to SLTC 2014?
- Fran, Sjur at least
Not enough people going there, thus a regular meeting.
Next meeting: Tuesday November 11, at 12.00 Norwegian time.
Tiina and Kadri should also attend.
Sjur to send out a (calendar) invitation.