150519

Meeting, Inari Saami project 19.05.2015

Adobe Connect

People: Marja-Liisa, Erika, Miina, Trond, Lene, Francis

Agenda

Status of the analyser
Work onwards on the analyser:
- 2-syll nouns
  - yaml
  - lexc-affixes
  - lexc-stems
  - twolc
- 3-syll nouns
- 4-syll nouns
- verbs
Content discussion:
- 2-syllabic stems & CG patterns
Methods and work practice
- for lexc
- for svn (how to avoid conflicts)
Milestones (time planning)
- Analyser before Canada
- Dictionary before Canada
- Work before summer holiday
- Work before school starts
- Work this year
Overall goal for the two projects
- sme2smX
- AKS project
Principles for research work within the project

Status of the analyser

Overview:

For comparison:

sms: 69% wf 40% uniq
smn: 53% wf 26% uniq

Last week

97% pass on bisyllabic yaml files
Only now are the yaml files reliable (thanks to yaml and analyser work done in tandem)
Lexica are still not optimized
The twolc rules are worse than before, due to the transition period
There is still a dual trigger system: "surface" vs "linguistic"
Overview of work lacking

Content discussion:

2-syllabic stems & CG patterns

offline: smn/doc/consonantgradation.txt

A list of affix lexicons and examples, not quite updated:

/lang/smn/nounscontlex.html

Correct the affixes and systematize them. We may drop lexicons. The first idea was to use at least some lexicons.

Improve the analyser in a top-down way, and model triggers according to vowels. Then the CG will follow.

Defining the terms:

Linguistic system = Always use triggers for all morphophonological processes, independent of stem shape
- e.g. we have ^WG even if we do not need to change it in the orthography.
- Also, we both shorten (WG) and lengthen (CLEN)
Surface system = Use only the triggers needed to change stems to surface form, and divide stems in classes accordingly
- we only have the triggers needed for the orthography

The linguistic system implies:

less lexica
more specific and complex twolc rules
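
As a toy illustration of the two definitions above (not the project's lexc/twolc code), the sketch below generates the same surface forms under both systems. Everything in it is an assumption made for illustration: the stems `tassa` and `tala` are invented placeholders rather than Inari Saami words, the suffix `n` and the lexicon names are hypothetical, and the single regular expression only stands in for a real twolc rule.

```python
import re

WG = "^WG"  # weak-grade trigger, written here as a plain string marker


def realise(form: str) -> str:
    """Apply the toy weak-grade rule, then remove the trigger symbol."""
    # Toy rule (hypothetical): a double consonant before the stem-final
    # vowel is shortened when the trigger follows that vowel.
    form = re.sub(r"([^aeiou])\1([aeiou])\^WG", r"\1\2^WG", form)
    return form.replace(WG, "")


# "Linguistic" system: one lexicon, the trigger is always present,
# and the rule decides whether the orthography changes.
linguistic_lexica = {
    "N_ALL": {"stems": ["tassa", "tala"], "suffix": WG + "n"},
}

# "Surface" system: stems are split into classes, and only the class
# whose orthography changes carries the trigger.
surface_lexica = {
    "N_CHANGING": {"stems": ["tassa"], "suffix": WG + "n"},
    "N_STABLE": {"stems": ["tala"], "suffix": "n"},
}


def generate(lexica: dict) -> list[str]:
    """Generate the surface forms of every stem in every lexicon."""
    return sorted(
        realise(stem + lexicon["suffix"])
        for lexicon in lexica.values()
        for stem in lexicon["stems"]
    )


print(generate(linguistic_lexica))
print(generate(surface_lexica))
print(len(linguistic_lexica), "lexicon vs", len(surface_lexica), "lexica")
```

Both variants yield the same forms; the difference is where the knowledge sits: in the rule (linguistic system) or in the division into lexica (surface system).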

One specific point: Should we shorten (ss -> s) and lengthen (s -> ss) the consonant groups even when the surface form is identical to the stem?

We then have two alternatives to today's practice:

Linguistic system and few lexica
Surface system and more lexica
Flexible system ("linguistic" in some cases and "surface" in others), i.e. today's practice

In some lexica, WG is used; in others it could have been used but it is not.
Example: sukká --> corrected with WG+CLEN, which gave rise to another error

Wednesday 8.30: linguistic meeting

Work onwards on the analyser:

2-syll nouns

Yaml files due by tomorrow

3-syll nouns

We want larger yaml files also for 3-syll nouns: one yaml file for each lexicon.
Status now: PASSES: 278 / FAILS: 516 / TOTAL: 794 (35% pass), perhaps
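
The pass percentage in the status line can be recomputed from the counts. A minimal sketch, where the function name `pass_rate` is only an illustrative choice and not part of the test tooling:

```python
def pass_rate(passes: int, fails: int) -> float:
    """Return the share of passing tests as a percentage."""
    total = passes + fails
    if total == 0:
        raise ValueError("no tests were run")
    return 100 * passes / total


passes, fails = 278, 516
total = passes + fails
print(f"PASSES: {passes} / FAILS: {fails} / TOTAL: {total} "
      f"({pass_rate(passes, fails):.0f}% pass)")
```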

N-lex-3st-...... (for each lexicon)
N-lex-2st-......
N-lex-4st-......

The triggers for 3syll and 2syll are now different, but should be harmonized. The stem will be the strong-grade stem.

4-syll nouns etc.

Wait until 2st and 3st are more stable. Yaml could be made ready, though.

Contracted stems

Starting work before Canada.

verbs

Most (all?) verbs from the dictionary are in stems/verb.lexc
We need yaml-files
- For each yaml file: one full paradigm and the rest are core
How many lexica do we need?

other POS

Adjectives

Wait with adjectives

Closed classes without morphology

Just add them (they are partly added already)

Closed classes with morphology, e.g. pronouns

Copy the grammar into lexc

Examples: sme, sms, smj, smn, fkv, ..., and in each folder: src/morphology/stems/pronouns.lexc

Methods and work practice

for svn (how to avoid conflicts)

We should have an item on work habits, to avoid conflicts

always be visible online when working
never hold on to a file; check in as you go instead. That is, if you take a half-hour break, check in first
always save the files before svn up !!! (otherwise you risk overwriting)
read the svn log, and write informative svn log messages, so that the status of a file can be found out

Milestones (time planning)

Analyser before Canada

Nouns in place minus contracted stems
Verbs started
Yamls reordered and checked for all N and V contlexes

Dictionary

A version of the dictionary ready before 5.6.?
Implementing a version of NDS with Ryan in Canada

Work before summer holiday

Closed classes
Starting on adjectives

Work during holiday (!)

Work on missing lists

ccat -l smn -r ~/freecorpus/converted/smn|preprocess|usmn|grep '+?'|sort|uniq -c | sort -nr

Collect bilingual texts for MT?
Work on bidix?
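
The counting step of the missing-list pipeline above (`grep '+?'|sort|uniq -c | sort -nr`) can be sketched in Python as follows. This assumes, as the grep suggests, that the analyser prints one analysis per line and that lines containing `+?` are the word forms of interest; the sample lines are made-up placeholders, not real analyser output.

```python
from collections import Counter


def missing_list(analysis_lines):
    """Count the lines containing '+?' and return them most frequent first."""
    counts = Counter(
        line.rstrip("\n") for line in analysis_lines if "+?" in line
    )
    return counts.most_common()


# Made-up placeholder lines standing in for analyser output
sample = [
    "foo\tfoo\t+?",
    "bar\tbar+N+Sg+Nom",
    "foo\tfoo\t+?",
    "baz\tbaz\t+?",
]

for line, count in missing_list(sample):
    print(count, line)
```

To use it on real output, the lines could be read from standard input (`missing_list(sys.stdin)`) at the end of the existing pipeline instead of the grep and sort steps.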

Holiday plan

Lene: Canada: 1.6-22.6. Holiday 6.7-16.8
Marja-Liisa: Holiday 13.7.-?9.8.
Trond: Oslo + Canada 1-22.6. Holiday: 6.7. - 3.8.
Erika: Travelling Italy 26.5.-3.6. Seminar 8.-9.6. Holiday 6.7.-3.8.
Miina: Holiday 1.8.-16.8.
Fran: Bloomington, US (June), UK (?-13 July), Malta (13-23 July), Russia (August)
Sjur: Canada ca. 6-14.7

Meeting time while in Canada

European evening meeting:
- Finland 1600-1800
- Norway 1500-1700
- Bloomington 0900-1100
- Edmonton 0700-0900
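
The meeting slots can be cross-checked with the standard-library `zoneinfo` module. This is a sketch under two assumptions that are not stated in the notes: Bloomington is taken to be Bloomington, Indiana (zone `America/Indiana/Indianapolis`), and 10 June 2015 is an arbitrary date within the Canada stay.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Place names from the notes mapped to IANA time zones
ZONES = {
    "Finland": "Europe/Helsinki",
    "Norway": "Europe/Oslo",
    "Bloomington": "America/Indiana/Indianapolis",  # assumption: Indiana
    "Edmonton": "America/Edmonton",
}


def local_times(start: datetime) -> dict[str, str]:
    """Return the local clock time (HHMM) of `start` for each place."""
    return {
        place: start.astimezone(ZoneInfo(zone)).strftime("%H%M")
        for place, zone in ZONES.items()
    }


# Assumed example date; the meeting starts at 1600 Finnish time
start = datetime(2015, 6, 10, 16, 0, tzinfo=ZoneInfo("Europe/Helsinki"))

for place, hhmm in local_times(start).items():
    print(place, hhmm)
```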

Work before school starts

Work on making the dictionary presentable
Oulu congress (workshop August 16. -- on the dictionary) -> next meeting
The analyser covers 90 % of running text by Sept. 1st?

Work this year

Beta version of an Inari Saami spell checker (= useful but with flaws)
An alpha version of sme2smn that actually translates
A dictionary with morphology and corpus search
A sloppy syntactic parser (useful for corpus analysis for corpus search)
Collection of bilingual texts for MT
Bidix sme-smn (bilingual wordlist for MT)

Overall goal

sme2smx:

Goal: an MT system from sme to smn, good enough to

(a) be used as support for manual translation
(b) be used as translation, and then postedited

(b) is of course what we want, but since we had no analyser at the outset I would say that (a) is a realistic and (b) an optimistic goal. It remains to be seen.

AKS:

Realistic goals:

NDS-type dictionary smn-fin-smn, with click-in-text and paradigm generation
NDS-type dictionary smn-sme-smn will be easy to implement as a side product of the MT program
Proofing tool (spellchecker) good enough to be useful for writers
A third possible goal, pedagogical programs, i.e., a version of http://oahpa.no/aanaar/ including morphological programs, is dependent upon cooperation with and interest from teachers. Technically, we are able to implement it as soon as the analyser is ready, though.

Principles for research work within the project

Research output should follow and be based upon the development of the analysers and other programs. There will be no problem finding topics for research articles. Here is some brainstorming:

General issues

impact of programs
- how does smn language technology change the Inari Saami society?
The impact on revitalisation of having grammar generation and a dictionary available on the Internet

Linguistic issues

contrastive aspects sme-smn
smn morphophonology - perhaps we find some new generalisations?
issues in morphology which have not been covered in scientific papers before
what is the morphology like in the texts we will analyse?

Orthography and spell checker issues

variation in texts
smn orthography: variation between generations? families?
How does the Inari Saami writing community cope without a spell checker?
How does the Inari Saami writing community cope when the spell checker arrives?
who is to decide the norm? How much variation can/should a spell checker allow?
how good must a spell checker be to be useful? How important is it that it covers non-smn proper nouns and so on?

Dictionary issues

How to make the linguistics: e.g. solve homonymies for morphological paradigms
How to adjust the dictionary for students
Study the pupils' use of the dictionary and its usefulness for them
How can the dictionary be integrated in learning materials on the Internet?
Issues connected to corpus search directly from the dict:
- what kind of sentences does the user need? How to get the most appropriate ones to the top? Does the user find the examples she needs for knowing how to use the words?
- the coverage dict vs corpus
- Finnish-Inari Saami - for whom? what kind of dictionary is it? Coverage of Finnish words.
Terminology in the dictionary. Who decides? How to get the new words into use?

MT issues

Are there North Saami texts which should be translated to smn? What kind of texts?
Evaluation of MT system
MT bidix vs. dictionary
How useful is a smn-sme-smn dictionary?

Corpus issues

What kind of corpus do we have, what is missing, what can it be used for

For examples, see e.g. the publications on lexicography and proofing for North Saami on our publication pages.

Next plenary meeting

Probably while in Canada.

Other meetings as needed.