150519

Meeting, Inari Saami project 19.05.2015

Adobe Connect

People: Marja-Liisa, Erika, Miina, Trond, Lene, Francis

Agenda

  • Status of the analyser
  • Work onwards on the analyser:
    • 2-syll nouns
      • yaml
      • lexc-affixes
      • lexc-stems
      • twolc
    • 3-syll nouns
    • 4-syll nouns
    • verbs
  • Content discussion:
    • 2-syllabic stems & CG patterns
  • Methods and work practice
    • for lexc
    • for svn (how to avoid conflicts)
  • Milestones (time planning)
    • Analyser before Canada
    • Dictionary before Canada
    • Work before summer holiday
    • Work before school starts
    • Work this year
  • Overall goal for the two projects
    • sme2smX
    • AKS project
  • Principles for research work within the project

Status of the analyser

Overview:

For comparison:

  • sms: 69% wf 40% uniq
  • smn: 53% wf 26% uniq

Last week

  • 97% pass on bisyllabic yaml files
  • First now the yaml files are reliable (due to yaml + analyser work in tandem)
  • Lexica are still not optimized
  • Twolc are worse than before, due to the transition period
  • Still dual trigger system: "surface" vs "linguistic"
  • Overview of work lacking

Content discussion:

2-syllabic stems & CG patterns

  • offline: smn/doc/consonantgradation.txt

A list of affix lexicons and examples, not quite updated:

Correct affixes and make system We may drop lexicons. The first idea was to use at least some lexicons.

Improve the analyser in a top-down way, and model triggers according to vowels Then the CG will follow

Defining the terms:

  • Linguistic system = Always use triggers for all mph processes, independent on stem shape
    • e.g. we have ^WG even if we do not need to change it in the orthography.
    • Also, we both shorten (WG) and lengthen (CLEN)
  • Surface system = Use only the triggers needed to change stems to sufrace formorm, and divide stems in classes accordingly
    • we only have the triggers needed for the orthography

The linguistic system implies:

  1. less lexica
  2. more specific and complex twolc rules

One specific point: Should we shorten ss -> s and lengthen s -> ss the consonant groups also when the surface form is identical to the stem?

We then have two alternatives to today's practice:

  1. Linguistic system and few lexica
  2. Surface system and more lexica
  3. Flexible system ("linguistic" in some cases and "surface" in other)

In some lexica, WG is used, in others it could have been used but it is not Example: sukká --> corrected with WG+CLEN, which gave rise to another error

Wednesday 8.30: linguistic meeting

Work onwards on the analyser:

2-syll nouns

Yaml files until tomorrow

3-syll nouns

We want larger yaml files also for 3-syll nouns: One yaml file for each lexicon Status now: PASSES: 278 / FAILS: 516 / TOTAL: 794 (35% pass) perhaps

  • N-lex-3st-...... (for each lexicon)
  • N-lex-2st-......
  • N-lex-4st-......

The triggers for 3syll and 2syll are now different, but should be harmonized. The stem will be the stronggrade stem.

4-syll nouns etc.

Wait until 2st andt 3st are more stable. Yaml could be made ready, though.

Contracted stems

Starting work before Canada.

verbs

  • Most (all?) verbs from the dictionary is in stems/verb.lexc
  • We need yaml-files
    • For each yaml file: one full paradigm and the rest are core
  • How many lexica do we need?

other POS

Adjectives

Wait with adjectives

Closed classes without morphology

Just add them (they are partly added already)

Closed classes with morphology, e.g. pronouns

Copy the grammar into lexc

Examples: sme, sms, smj, smn, fkv, ..., and in each folder: src/morphology/stems/pronouns.lexc

Methods and work practice

for svn (how to avoid conflicts)

Vi bør ha et punkt om arbeidsvaner, for å unngå konflikt

  • alltid være synlig online når man jobber
  • aldri oppbevare ei fil, men heller sjekke inn etterhvert. Dvs at hvis man tar en halvtimes pause, så sjekker man inn først
  • alltid lagre filene før svn up !!! (hvis ikke risikerer man overskriving)
  • les svn log -- og skriv informativ svn log, for å finne ut statusen til fila

Milestones (time planning)

Analyser before Canada

  • Nouns in place minus contracted stems
  • Verbs started
  • Yamls reordered and checked for all N and V contlexes

Dictionary

  • A version of the dictionary ready before 5.6.?
  • Implementing a version of NDS with Ryan in Canada

Work before summer holiday

  • Closed classes
  • Starting on adjectives

Work during holiday (!)

  • Work on missing lists
ccat -l smn -r ~/freecorpus/converted/smn|preprocess|usmn|grep '+?'|sort|uniq -c | sort -nr
  • Collect bilingual texts for MT?
  • Work on bidix?

Holiday plan

  • Lene: Canada: 1.6-22.6. Holiday 6.7-16.8
  • Marja-Liisa: Holiday 13.7.-?9.8.
  • Trond: Oslo + Canada 1-22.6. Holiday: 6.7. - 3.8.
  • Erika: Travelling Italy 26.5.-3.6. Seminar 8.-9.6. Holiday 6.7.-3.8.
  • Miina: Holiday 1.8.-16.8.
  • Fran: Bloomington, US (June), UK (?-13 July), Malta (13-23 July), Russia (August)
  • Sjur: Canada ca. 6-14.7

Meeting time while in Canada

European evening meeting: Finland 1600 - 1800 Norway 1500-1700 Bloomington 0900 - 1100 Edmonton 0700 - 0900

Work before school starts

  1. Work on making the dictionary presentable
  2. Oulu congress (workshop August 16. -- on the dictionary) -> next meeting
  3. The analyser covers 90 % of running text Sept. 1st.?

Work this year

  • Beta version of an Inari Saami spell checker (= useful but with flaws)
  • An alpha version of sme2smn that actually translates
  • A dictionary with morphology and corpus search
  • A sloppy syntactic parser (useful for corpus analysis for corpus search)
  • Collection of bilingual texts for MT
  • Bidix sme-smn (bilingual wordlist for MT)

Overall goal

sme2smx:

Goal: an MT system from sme to smn, good enough to

  1. be used as support for manual translation
  2. be used as translation, and then postedited

(2) is of course what we want, but since we had no analyser at the outset I would say that (a) is a realistic and (b) an optimistic goal. It remains to be seen.

AKS:

Realistic goals:

  • NDS-type dictionary smn-fin-smn, with click-in-text and paradigm generation
  • NDS-type dictionary smn-sme-smn will be easy to implement as a side product of the MT program
  • Proofing tool (spellchecker) good enough to be useful for writers
  • A third possible goal, pedagogical programs, i.e., a version of http://oahpa.no/aanaar/ including morphological programs, is dependent upon cooperation with and interest from teachers. Technically, we are able to implement it as soon as the analyser is ready, though.

Principles for research work within the project

Research output should follow and be based upon the development of the analysers and other programs. There will be no problem finding topics for research articles. Here some brain storming:

General issues

  • impact of programs
    • how does smn language technology change the Inari Saami society
  • The impact on revitalisation of having grammar generation and dictionary available on Internet

Linguistic issues

  • contrastive aspects sme-smn
  • smn morphophonology - perhaps we find some new generalisations?
  • issues in morphology which is not covered in scientific papers before
  • how is the morphology in the texts we will analyse?

Orthography and spell checker issues

  • variation in texts
  • smn orthography: variation between generations? families?
  • How does the Inari Saami writing community cope without a spell checker
  • How does the Inari Saami writing community cope when the spell checker arrives
  • who is to decide the norm? How much variation can/should spell checker allow
  • how good must a spellchecker be to be useful? How important that it covers not-smn propernouns and so on?

Dictionary issues

  • How to make the linguistics: e.g. solve homonymies for morphological paradigms
  • How to adjust the dictionary for students
  • Study the pupils' use and usefulness of the dictionary for pupils
  • How can the dictionary be integrated in learning materials on Internet?
  • Issues connected to corpus search directly form the dict,
    • what kind of sentences do the user need? How to get the most appropriate up on the top? Do the user find the examples she needs for knowing how to use the words?
    • the coverage dict vs corpus
    • Finnish-Inari Saami - for whom? what kind of dictionary is it? Coverage of Finnish word.
  • Terminology in the dictionary. Who decides? How to get the new words into use?

MT issues

  • Are there North Saami texts, which should be translated to smn? What kind of texts?
  • Evaluation of MT system
  • MT bidix vs. dictionary
  • How useful is a smn-sme-smn dictionary?

Corpus issues

  • What kind of corpus do we have, what is missing, what can it be used for

For examples, see e.g. the publications on lexicography and proofing for North Saami on our publication pages.

Next plenary meeting

Probably while in Canada.

Other meetings as needed.