150616
Meeting on smn-dictionaries and cifu-talk
Edmonton, Rotterdam 16.6.
Present: Lene, Marja-Liisa, Trond
- Status
- What did we promise
- What did we promise
- Our plan
- What to do
- Division of labour
- time plan
- Who are going
- Next meeting
Status
The new dictionary
The new dictionary has arrived, in two files.
- smnfin/inc/2015
- finsmn/inc/2015
It was made according to the following principles
- many fin --> one smn
- one smn --> one fin
Lacunas, e.g. myettiđ: muáttá (to snow)
What did we promise
We will
- present a preliminary finite-state transducer for Aanaar Saami, and
- combine it with different Aanaar Saami dictionaries and word lists:
- A large Aanaar Saami - Finnish dictionary
- A North Saami - Aanaar Saami transfer lexicon
- A large Aanaar Saami - Finnish dictionary
For each of the dictionaries / word lists, we will show
- what degree of coverage the combination of dictionary and transducer will give on relevant text types, including
- school textbooks,
- children's fiction,
- biblical and other religious texts,
- writings on language and
- blog/Facebook-type prose.
- school textbooks,
- We will run the coverage tests both on analysers representing the standard language, and on analysers including a component tolerating a certain amount of orthographic variation.
Our plan
What to do
Where to publish this?
- SDÁ: 31.8.2015 - possibly not the best channel for this
- Saami scientific article in Finland
- CIFU proceedings
- Other channels?
A possible further article
School dictionary adjusted to school children, on the basis of similar dictionary for sme.
This will be more grounded in Saami linguistics, less in language technology, and has thus higher relevance for SDÁ. Example of problem: What is a student dictionary and what that means for Saami lexicography.
The next step would then be to link this dictionary to the corpus. How can we selet the best sentences from the corpus? Cf. the literature on SketchEngine, and in Gothenburg.
Division of labour
- Francis MT,
- Ciprian sme-fin/fin-smn
- Ryan: NDS implementation
- ML/E:
- translate smesmn missing list
- translate and adjust NDA tags and configure-files
- make lists: what smn-texts to use for the evaluation of the dictionary
- Improve on the smefin dictionary
- Evaluate the result of the smesmn parallellisation
- Add translations to words/dicts/smefin/inc
- Read and correct translations in words/dicts/smefin/src
- Add translations to words/dicts/smefin/inc
- translate smesmn missing list
- LA: Analyse and write
- TT: Analyse and write
cd cd main/words/dicts/ ls All smefin etc catalogues contain folders inc = incoming bin = gbinary files from doing sh smefin.sh etc src =
time plan
- Parallel first phase
- Ciprian to build first sme-smn candidate bidix, as soon as possible
- Ryan to implement ferst version of NDS, have something done before thursday
- Ciprian to build first sme-smn candidate bidix, as soon as possible
- Parallel second phase
- ML, ES to evaluate and correct Cip's smesmn
- When phase 1 is done, there will remain residues
- sme from smefin not found in sme-smn
- fin from smefin not found in fin-smn
- smn from new finsmn not found in sme-smn
- sme from smefin not found in sme-smn
- ES, LA to translate residue lists
- LA, TT, ML to analyse smnfin / finsmn
- ML, ES to evaluate and correct Cip's smesmn
- Parallel second phase
- Evaluate result of Ciprians first run
- Evaluate Ryan's NDS
- Evaluate result of Ciprians first run
- Third phase
- Start working on the issues mentioned in the promise list above
Meeting in Tromsø late next week (thursday after lunch?) where we plan the summer.
Who are going
At least ML and Trond, Ryan?
Next meeting
Thursday 18.6. Topic: NDS implementation issues
Notes
f4 f8 aakkos järjestys => aakkosjärjestys aallon murtaja => aallonmurtaja -f4 alimmainen, alin eteläisempi, etelämpi aapa(suo) alkeis- armelias ~ armias ~ armollinen ellen, -et, -ei (jne.) ensi(-) hallussa, -sta -hiuksinen istuallaan, -een itsestään, -sään jämät (mon.) 120 commas, 20 parentheses, 21 tilde in column f4 cat finsmn/inc/2015/Suoma_saami_16062015.csv |grep 'aarto' -f8 4 8 -uutiset aamu -uutiset (arttukatos) aarto (arttukatos) -orvokki aho -orvokki (mänty) aihki (mänty) alus lakki, -huppu etu puolella, -puolelta joko - ta(h)i => joko - tahi, joko - tai kaari sulut (mon.) kalan perkeet (mon.) 18 commas, 13 parentheses, 3 tilde in column f8 cat finsmn/inc/2015/Suoma_saami_16062015.csv |cut -f4|grep '-'
Algorithm
Algorithm for building lemmas (resolving formatting) in the columns 4, 8 in finsmn:
TODO list for dictionary processing, how to handle smn-fin:
- comma: duplicate line
- eteläisempi, etelämpi => eteläisempi AND etelämpi
- eteläisempi, etelämpi => eteläisempi AND etelämpi
- parentheses written as one word: Expand:
- aapa(suo) = aapa AND aapasuo
- joko - ta(h)i => joko - tahi, joko - tai
- aapa(suo) = aapa AND aapasuo
- space parenthesis, hence written as two words: Duplicate
- aarto (arttukatos) => aarto AND arttukatos
- aarto (arttukatos) => aarto AND arttukatos
- TODO with tilde: Expand:
- armelias ~ armias ~ armollinen => armelias AND armias AND armollinen
- armelias ~ armias ~ armollinen => armelias AND armias AND armollinen
- komma space hyphen (sometimes multiple cases, sometimes compounds):
- Expand for cases: Remove as many letters as you add (3 letters away and add the same 3):
- tasalla, -lta => tasalla AND tasalta
- tasalla, -lta => tasalla AND tasalta
- Expand for compounds: Here, the first part is in f4 and the rest in f8
- aluslakki, -huppu => aluslakki AND alushuppu (aluslakki + aluhuppu
- aluslakki, -huppu => aluslakki AND alushuppu (aluslakki + aluhuppu
- hyphen initially in f8: It belongs to the lemma
- aho -orvokki => aho-orvokki
- aho -orvokki => aho-orvokki
- The string '(mon.)' occurs 3 times in f8 and ome in f4, it means (Plural) and should be deleted (we recover it for fin later)
- sulut (mon.)
- perkeet (mon.)
- liinat (mon.)
- jämät (mon.)
- sulut (mon.)
- (jne.) means "et cetera" and should be ignored
The relevant columns are 4, 8. Ff. readme documentation, cf. also:
A11=CONCATENATE(D11,H11," ",J11," ",K11,)