150519
Meeting, Inari Saami project 19.05.2015
Adobe Connect
People: Marja-Liisa, Erika, Miina, Trond, Lene, Francis
Agenda
- Status of the analyser
- Work onwards on the analyser:
  - 2-syll nouns
    - yaml
    - lexc-affixes
    - lexc-stems
    - twolc
  - 3-syll nouns
  - 4-syll nouns
  - verbs
- Content discussion:
  - 2-syllabic stems & CG patterns
- Methods and work practice
  - for lexc
  - for svn (how to avoid conflicts)
- Milestones (time planning)
  - Analyser before Canada
  - Dictionary before Canada
  - Work before summer holiday
  - Work before school starts
  - Work this year
- Overall goal for the two projects
  - sme2smX
  - AKS project
- Principles for research work within the project
Status of the analyser
Overview:
For comparison:
- sms: 69% wf 40% uniq
- smn: 53% wf 26% uniq
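The two figures per language can be read as token coverage and type coverage. The sketch below shows how such a pair of numbers could be computed, assuming that "wf" means running word forms (tokens) and "uniq" means unique word forms (types). The function name `coverage`, the `is_known` callback and the toy data are hypothetical and not part of the project tools.

```python
from collections import Counter


def coverage(tokens, is_known):
    """Return (token_coverage, type_coverage) as fractions between 0 and 1.

    tokens   -- iterable of word forms from running text
    is_known -- callable returning True if the analyser recognises a word form
                (hypothetical stand-in for a real analyser lookup)
    """
    counts = Counter(tokens)
    if not counts:
        raise ValueError("no tokens given")
    total_tokens = sum(counts.values())
    # Token coverage weights every word form by its frequency.
    known_tokens = sum(n for word, n in counts.items() if is_known(word))
    # Type coverage counts every distinct word form once.
    known_types = sum(1 for word in counts if is_known(word))
    return known_tokens / total_tokens, known_types / len(counts)


# Toy data, not real corpus material: "a" is frequent and known.
tokens = ["a", "a", "a", "b", "c", "d"]
known = {"a", "b"}
wf, uniq = coverage(tokens, known.__contains__)
print(f"wf: {wf:.0%}  uniq: {uniq:.0%}")
```

Token coverage is normally the higher of the two, because the frequent word forms are the ones most likely to be in the lexicon. That matches the pattern in the figures above.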
Last week
- 97% pass on bisyllabic yaml files
- Only now are the yaml files reliable (because the yaml files and the analyser were worked on in tandem)
- Lexica are still not optimized
- The twolc rules are worse than before, due to the transition period
- Still a dual trigger system: "surface" vs "linguistic"
- Overview of work lacking
Content discussion:
2-syllabic stems & CG patterns
- offline: smn/doc/consonantgradation.txt
A list of affix lexicons and examples, not fully up to date:
Correct affixes and make system
Improve the analyser in a top-down way, and model triggers according to vowels
Defining the terms:
- Linguistic system = Always use triggers for all morphophonological processes, independent of stem shape
  - e.g. we have ^WG even if we do not need to change it in the orthography.
  - Also, we both shorten (WG) and lengthen (CLEN)
- Surface system = Use only the triggers needed to change stems to surface form, and divide stems into classes accordingly
  - we only have the triggers needed for the orthography
The linguistic system implies:
- fewer lexica
- more specific and complex twolc rules
One specific point: Should we shorten (ss -> s) and lengthen (s -> ss) the consonant groups even when the surface form is identical to the stem?
We then have two alternatives to today's practice:
- Linguistic system and few lexica
- Surface system and more lexica
- Flexible system ("linguistic" in some cases and "surface" in other)
In some lexica, WG is used, in others it could have been used but it is not
Wednesday 8.30: linguistic meeting
Work onwards on the analyser:
2-syll nouns
3-syll nouns
- N-lex-3st-...... (for each lexicon)
- N-lex-2st-......
- N-lex-4st-......
The triggers for 3-syll and 2-syll nouns are now different, but should be harmonized. The stem will be the strong-grade stem.
4-syll nouns etc.
Contracted stems
verbs
- Most (all?) verbs from the dictionary are in stems/verb.lexc
- We need yaml-files
  - For each yaml file: one full paradigm and the rest are core
- How many lexica do we need?
other POS
Adjectives
Closed classes without morphology
Closed classes with morphology, e.g. pronouns
Examples: sme, sms, smj, smn, fkv, ..., and in each folder:
Methods and work practice
for svn (how to avoid conflicts)
- always be visible online when you are working
- never hold on to a file; check in as you go. I.e. if you take a half-hour break, check in first
- always save the files before svn up !!! (otherwise you risk overwriting)
- read the svn log, and write informative svn log messages, so that the status of the file can be found
Milestones (time planning)
Analyser before Canada
- Nouns in place minus contracted stems
- Verbs started
- Yamls reordered and checked for all N and V contlexes
Dictionary
- A version of the dictionary ready before 5.6.?
- Implementing a version of NDS with Ryan in Canada
Work before summer holiday
- Closed classes
- Starting on adjectives
Work during holiday (!)
- Work on missing lists
ccat -l smn -r ~/freecorpus/converted/smn|preprocess|usmn|grep '+?'|sort|uniq -c | sort -nr
- Collect bilingual texts for MT?
- Work on bidix?
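The missing-list pipeline above (grep '+?' | sort | uniq -c | sort -nr) can also be sketched in Python, for cases where the list needs further filtering. This is a minimal sketch. It assumes that the analyser output is tab-separated, with the input word form in the first column and "+?" somewhere on the line for unknown words. The function name `missing_list` and the sample lines are hypothetical placeholders, not real smn data.

```python
from collections import Counter


def missing_list(lookup_lines):
    """Count unanalysed word forms in lookup output, most frequent first.

    Assumption: each line is tab-separated, the first column is the input
    word form, and unknown words carry the marker "+?" somewhere on the line.
    """
    counts = Counter()
    for line in lookup_lines:
        line = line.rstrip("\n")
        if "+?" not in line:
            continue  # analysed word or empty separator line
        word = line.split("\t", 1)[0]
        counts[word] += 1
    # Highest frequency first; ties are ordered alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Placeholder lines imitating lookup output (not real Inari Saami forms).
sample = [
    "foo\tfoo\t+?",
    "bar\tbar+N+Sg+Nom",
    "foo\tfoo\t+?",
    "baz\tbaz\t+?",
    "",
]
result = missing_list(sample)
for word, freq in result:
    print(freq, word)
```

Unlike sort -nr, the sketch breaks ties alphabetically, so the order of equally frequent words may differ from the shell output.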
Holiday plan
- Lene: Canada: 1.6-22.6. Holiday 6.7-16.8
- Marja-Liisa: Holiday 13.7.-?9.8.
- Trond: Oslo + Canada 1-22.6. Holiday: 6.7. - 3.8.
- Erika: Travelling Italy 26.5.-3.6. Seminar 8.-9.6. Holiday 6.7.-3.8.
- Miina: Holiday 1.8.-16.8.
- Fran: Bloomington, US (June), UK (?-13 July), Malta (13-23 July), Russia (August)
- Sjur: Canada ca. 6-14.7
Meeting time while in Canada
European evening meeting:
Work before school starts
- Work on making the dictionary presentable
- Oulu congress (workshop August 16. -- on the dictionary) -> next meeting
- The analyser covers 90 % of running text by Sept. 1st?
Work this year
- Beta version of an Inari Saami spell checker (= useful but with flaws)
- An alpha version of sme2smn that actually translates
- A dictionary with morphology and corpus search
- A sloppy syntactic parser (useful for corpus analysis for corpus search)
- Collection of bilingual texts for MT
- Bidix sme-smn (bilingual wordlist for MT)
Overall goal
sme2smx:
Goal: an MT system from sme to smn, good enough to
- (a) be used as support for manual translation
- (b) be used as translation, and then post-edited
(b) is of course what we want, but since we had no analyser at the outset I would say that (a) is a realistic and (b) an optimistic goal. It remains to be seen.
AKS:
Realistic goals:
- NDS-type dictionary smn-fin-smn, with click-in-text and paradigm generation
- NDS-type dictionary smn-sme-smn will be easy to implement as a side product of the MT program
- Proofing tool (spellchecker) good enough to be useful for writers
- A third possible goal, pedagogical programs, i.e., a version of http://oahpa.no/aanaar/ including morphological programs, is dependent upon cooperation with and interest from teachers. Technically, we are able to implement it as soon as the analyser is ready, though.
Principles for research work within the project
Research output should follow and be based upon the development of the analysers and other programs. There will be no problem finding topics for research articles. Here is some brainstorming:
General issues
- impact of programs
- how does smn language technology change the Inari Saami society
- The impact on revitalisation of having grammar generation and dictionary available on Internet
Linguistic issues
- contrastive aspects sme-smn
- smn morphophonology - perhaps we find some new generalisations?
- issues in morphology which have not been covered in scientific papers before
- how is the morphology in the texts we will analyse?
Orthography and spell checker issues
- variation in texts
- smn orthography: variation between generations? families?
- How does the Inari Saami writing community cope without a spell checker
- How does the Inari Saami writing community cope when the spell checker arrives
- who is to decide the norm? How much variation can/should the spell checker allow?
- how good must a spell checker be to be useful? How important is it that it covers non-smn proper nouns and so on?
Dictionary issues
- How to make the linguistics: e.g. solve homonymies for morphological paradigms
- How to adjust the dictionary for students
- Study the pupils' use of the dictionary and its usefulness for them
- How can the dictionary be integrated in learning materials on Internet?
- Issues connected to corpus search directly from the dict:
- what kind of sentences does the user need? How to get the most appropriate ones to the top? Does the user find the examples she needs for knowing how to use the words?
- the coverage dict vs corpus
- Finnish-Inari Saami - for whom? What kind of dictionary is it? Coverage of Finnish words.
- Terminology in the dictionary. Who decides? How to get the new words into use?
MT issues
- Are there North Saami texts which should be translated to smn? What kind of texts?
- Evaluation of MT system
- MT bidix vs. dictionary
- How useful is a smn-sme-smn dictionary?
Corpus issues
- What kind of corpus do we have, what is missing, what can it be used for
For examples, see e.g. the publications on lexicography and proofing for North Saami on our publication pages.
Next plenary meeting
Probably while in Canada.
Other meetings as needed.