150616

Meeting on smn-dictionaries and cifu-talk

Edmonton, Rotterdam 16.6.

Present: Lene, Marja-Liisa, Trond

  • Status
    • What did we promise
  • Our plan
  • What to do
  • Division of labour
  • time plan
  • Who are going
  • Next meeting

Status

The new dictionary

The new dictionary has arrived, in two files. It is added (in .xlsx and .csv format) in the folders

  • smnfin/inc/2015
  • finsmn/inc/2015

It was made according to the following principles

  • many fin --> one smn
  • one smn --> one fin

Lacunas, e.g. myettiđ: muáttá (to snow)

What did we promise

We will

  1. present a preliminary finite-state transducer for Aanaar Saami, and
  2. combine it with different Aanaar Saami dictionaries and word lists:
    1. A large Aanaar Saami - Finnish dictionary
    2. A North Saami - Aanaar Saami transfer lexicon

For each of the dictionaries / word lists, we will show

  1. what degree of coverage the combination of dictionary and transducer will give on relevant text types, including
    1. school textbooks,
    2. children's fiction,
    3. biblical and other religious texts,
    4. writings on language and
    5. blog/Facebook-type prose.
  2. We will run the coverage tests both on analysers representing the standard language, and on analysers including a component tolerating a certain amount of orthographic variation.

Our plan

What to do

Where to publish this?

  • SDÁ: 31.8.2015 - possibly not the best channel for this
  • Saami scientific article in Finland
  • CIFU proceedings
  • Other channels?

A possible further article

School dictionary adjusted to school children, on the basis of similar dictionary for sme.

This will be more grounded in Saami linguistics, less in language technology, and has thus higher relevance for SDÁ. Example of problem: What is a student dictionary and what that means for Saami lexicography.

The next step would then be to link this dictionary to the corpus. How can we selet the best sentences from the corpus? Cf. the literature on SketchEngine, and in Gothenburg.

Documentation about dictionary work

Division of labour

  • Francis MT,
  • Ciprian sme-fin/fin-smn
  • Ryan: NDS implementation
  • ML/E:
    • translate smesmn missing list
    • translate and adjust NDA tags and configure-files
    • make lists: what smn-texts to use for the evaluation of the dictionary
    • Improve on the smefin dictionary
    • Evaluate the result of the smesmn parallellisation
      • Add translations to words/dicts/smefin/inc
      • Read and correct translations in words/dicts/smefin/src
  • LA: Analyse and write
  • TT: Analyse and write
cd 
cd main/words/dicts/
ls

All smefin etc catalogues contain folders
inc = incoming
bin = gbinary files from doing sh smefin.sh etc
src = 

time plan

  • Parallel first phase
    • Ciprian to build first sme-smn candidate bidix, as soon as possible
    • Ryan to implement ferst version of NDS, have something done before thursday
  • Parallel second phase
    • ML, ES to evaluate and correct Cip's smesmn
    • When phase 1 is done, there will remain residues
      • sme from smefin not found in sme-smn
      • fin from smefin not found in fin-smn
      • smn from new finsmn not found in sme-smn
    • ES, LA to translate residue lists
    • LA, TT, ML to analyse smnfin / finsmn
  • Parallel second phase
    • Evaluate result of Ciprians first run
    • Evaluate Ryan's NDS
  • Third phase
    • Start working on the issues mentioned in the promise list above

Meeting in Tromsø late next week (thursday after lunch?) where we plan the summer.

Who are going

At least ML and Trond, Ryan?

Next meeting

Thursday 18.6. Topic: NDS implementation issues

Notes

f4       f8   
aakkos  järjestys  => aakkosjärjestys
aallon  murtaja    => aallonmurtaja

-f4
alimmainen, alin
eteläisempi, etelämpi 
aapa(suo)               
alkeis-
armelias ~ armias ~ armollinen
ellen, -et, -ei (jne.)
ensi(-)
hallussa, -sta
-hiuksinen
istuallaan, -een 
itsestään, -sään 
jämät (mon.) 
120 commas, 20 parentheses, 21 tilde in column f4

cat finsmn/inc/2015/Suoma_saami_16062015.csv |grep 'aarto'

-f8                                    4       8
-uutiset                            aamu    -uutiset
 (arttukatos)                       aarto    (arttukatos)
-orvokki                            aho     -orvokki
(mänty)                             aihki   (mänty)
                                    alus    lakki, -huppu
                                    etu     puolella, -puolelta
                                    joko     - ta(h)i    => joko - tahi, joko - tai
                                    kaari   sulut (mon.)
                                    kalan   perkeet (mon.)
18 commas, 13 parentheses, 3 tilde in column f8

cat finsmn/inc/2015/Suoma_saami_16062015.csv |cut -f4|grep '-'

Algorithm

Algorithm for building lemmas (resolving formatting) in the columns 4, 8 in finsmn:

TODO list for dictionary processing, how to handle smn-fin:

  • comma: duplicate line
    • eteläisempi, etelämpi => eteläisempi AND etelämpi
  • parentheses written as one word: Expand:
    • aapa(suo) = aapa AND aapasuo
    • joko - ta(h)i => joko - tahi, joko - tai
  • space parenthesis, hence written as two words: Duplicate
    • aarto (arttukatos) => aarto AND arttukatos
  • TODO with tilde: Expand:
    • armelias ~ armias ~ armollinen => armelias AND armias AND armollinen
  • komma space hyphen (sometimes multiple cases, sometimes compounds):
  • Expand for cases: Remove as many letters as you add (3 letters away and add the same 3):
    • tasalla, -lta => tasalla AND tasalta
  • Expand for compounds: Here, the first part is in f4 and the rest in f8
    • aluslakki, -huppu => aluslakki AND alushuppu (aluslakki + aluhuppu
  • hyphen initially in f8: It belongs to the lemma
    • aho -orvokki => aho-orvokki
  • The string '(mon.)' occurs 3 times in f8 and ome in f4, it means (Plural) and should be deleted (we recover it for fin later)
    • sulut (mon.)
    • perkeet (mon.)
    • liinat (mon.)
    • jämät (mon.)
  • (jne.) means "et cetera" and should be ignored

The relevant columns are 4, 8. Ff. readme documentation, cf. also:

A11=CONCATENATE(D11,H11," ",J11," ",K11,)