Meeting_2010-10-18

Contents:

Meeting setup
Agenda
Opening, agenda review, participants
Sma issues
Forthcoming finsme dictionary
Info topics

Meeting setup

Date: 18.10.2010
Time: 09.30 Norw. time
Place: Internet
Tools: SubEthaEdit, iChat

Agenda

Sma issues
Forthcoming finsme dictionary
Info topics

Cf. one of the following, depending on context:

the upper bar of the SEE window (provided you use the JSPWiki syntax mode)
the TOC in Forrest-rendered output, like HTML and PDF

Opening, agenda review, participants

Opened at 10: 02.
Present: Børre, Ciprian, Lene, Maja, Sjur, Tomi, Trond, Biret Ánne
Absent: Thomas

Sma issues

Need to correct dictionary against speller

Wrong dict forms in speller? There are misspellings in the smanob dictionary, and often those misspellings are found as well in the speller. Check that misspellings in the dictionary are not replicated in the speller.

The fst and the dictionary show different Sg3 forms (where the Sg3 forms (marked p3p) are taken from the Hemnes dictionary. The p3p forms missing in the fst have now been manually marked with XXX in the dictionary.

The XXX marks can mean:

there is a difference between analyser and p3p or wordclass info (comes often from smaswe)
there are several forms given by the analyser, one can correspond with the p3p, and that means that we have to mark more than more verb class in smanob (if both are correct) - how do we do that?

Status quo for narmativity

The letter on loan words was sent to SGM on friday. We also sent a letter in June-July last year. Are there now normativity issues unresolved? Not that we are aware of.

How shall Divvun and Gt cooperate on the sma work during what is left of this year?

Lene has a list of dictionary verbs not in the analyser. During this work some issues have turned up:

missing verbs (and other words)
misspellings and errors in both the transducer and smanob (cf above)

Way of work:

Read through smanob and look for errors (adj, noun)
Check whether these errors are in the normative analyser
mark problematic words as non-includes
reverse sorting the fst lexicon, look at continuation lexicon
unify lemma forms for all fst entries (ie multiple entries for the same word should have the same baseform/lemma)
sort and unique fst entries - double forms in sma lexicon shall be unified
learn typical error patterns from the dictionary correction work (-e vs -ie as stems, -a- vs -aa- or -ie- vs. -ïe- as root vowel), grep similar patterns out of the fst code, and proofread them

The three i-s of sma - is this true?

i as i
ï as i and ï (klinïgke / klinigke)
ï shall not be i (but is made so via spellrelax)

Before we implement it we shall find out whether the second category exists.

TODO:

sort and unique nav entries in sma (Trond)
discuss the possible third i in sma (Maja, Thomas)
go through the XXX in the dictionary verbs ( Maja, Marit, Sissel, Thomas, Toini)
sma linguist meeting - date & time to be decided ( Lene, Maja, Marit, Sissel, Thomas, Toini, Trond)

Forthcoming finsme dictionary

We do not quite know when it comes, but what shall we do with it when it comes? The dictionary is made by KOTUS (fin-sme-fin).

Hvis penger - Johanna Ijäs kan gjøre noe med dette?

Info topics

a new programmer in the sma-oahpa project

Ryan for 2 months.

Cips journey to Iceland

Here is my very minimalistic notes on tools:

corpus_search tool (try it!) - http://corpussearch.sourceforge.net/
DATR (vs. XFST try it!) - http://www.informatics.sussex.ac.uk/research/groups/nlp/datr/
GLOSSA (Anders Nøklestad - GLOSSA is FLOSS, use it)
- http://www.hf.uio.no/iln/tjenester/sprak/glossa/index.html
- http://tekstlab.uio.no/glossa//html/GLOSSA_manual.html
CLAN (Carnegy Mellon University) (try it) - http://childes.psy.cmu.edu/clan/
MOLTO (MT open source Aarne Raanta on GF) - http://www.molto-project.eu/
CORD (corpora historical English) - http://www.helsinki.fi/varieng/CoRD/index.html
TVÄRSLÅ (the scandinavian lexicon is accessible for single-word queries on the net, but not available for developers) - http://ordbok.nada.kth.se/

The amount of work for installing GLOSSA here is noticeable, yes, but it will make data updating easier. Also, we might easier be able to taylor the interface to our need.

The obt uses the Oslo fullform list for morphological analysis, with a separate component for compound analysis. For syntax, they use CG.

In Bergen they are going to implement a separate corpus interface, corpuscle.

Summing up the sma-oahpa week

Aajege var med 3 personer i Tromsø 11-15.10
de har samlet 2500 ord til Oahpa-leksikonet fra lærebøker - nå unifisert med smanob
resultert i 900 nye lemmaer i smanob-dict for subst, verb og adjektiv (de andre ordklassene er ikke telt opp)
Aajege arbeider med Oahpa-lemmaene rett i smanob: forbedrer oversettinger (som er tatt rett fra lærebøkene hulter til bulter), og legger til semantiske sett og evt. info om retning i Leksa - til dels erstatter de den opprinnelige smanob-oversettelsen
vi må finne ut på hvilket nivå vi legger til oversettelser til andre språk i samme leksikonet (sme, swe). Konsekvensen vil bli at vi vil ha sma-fil istenfor smanob, smaswe osv. og nob-fil istedenfor nobsma, nobsme...
vi begynner å bli venner med XMLmind
alfa-MorfaC, beta-MorfaS og beta-Leksa skal være ferdig i løpet av de nærmeste ukene, testes på voksne kurselever 19-21.11 - til da bør også ny versjon av ÅD være kompilert
en del nye funksjoner legges til i Leksa -> kommer også med i smeOahpa:
- pronomener
- oversettingskommentar
- nettordbok inkludert i Oahpa-grensesnitt
neste arbeidsmøte er i Røros 22.11 (Trond og Lene drar dit), dessuten bruker vi ichat-møter - torsdager etter lunsj

Whitespace and empty element diffs:

-              <book name="s4" />
+               <book name="s4"/>
-      <stem class="trisyllabic"></stem>
+      <stem class="trisyllabic"/>

TODO:

compile new version of ÅD (Ciprian)
fix the XMLMind XML Editor whitespace issue (Sjur)