Computational Workshop For Indigenous Languages Summary
Discussion of how to cope with multiword constructions, or wordforms
Issues with East Cree and FSTs
If we allow spaces between morphemes and expect a preverb to be followed by a verb stem, medial, and ending, then spelling NEEDS to be correct: our FST system will not recognize an incorrectly spelled preverb, and so will not recognize that a stem, medial, etc. must follow it.
As I understand it, our theory of grammar is modular.
FST-based preprocessing (or: tokenisation as part of morphology) presupposes a particular alphabet of well-formed strings between whitespace (referred to as "atoms" below), and a separate module (the FST implementation of morphology and phonology) checks the well-formedness of these atoms. The danger of a misspelling is that it takes an atom outside the language: when the FST-based preprocessing comes in to enforce dependencies between well-formed atoms, it simply has nothing to work with in the case of misspelled atoms.
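The failure mode described above can be sketched in a few lines of Python. This is only a toy illustration, not our FST setup: the atoms (`pv1`, `stem1`, `noun1`) are hypothetical placeholders rather than real East Cree forms, the lexicon is a plain dictionary instead of an FST, and the dependency is simplified to "a preverb must be followed by another preverb or a verb stem" (medials and endings are left out).

```python
# Hypothetical toy lexicon: placeholder atoms, NOT real East Cree forms.
ATOM_CLASSES = {
    "pv1": "PREVERB",
    "pv2": "PREVERB",
    "stem1": "VERB_STEM",
    "noun1": "NOUN",
}


def classify_atoms(text):
    """Module 1: look up each whitespace-separated atom in the lexicon.

    Unknown (for example misspelled) atoms get the class None.
    """
    return [(atom, ATOM_CLASSES.get(atom)) for atom in text.split()]


def check_dependencies(classified):
    """Module 2: enforce dependencies between recognized atoms.

    A recognized preverb must be followed by a preverb or a verb stem.
    Unrecognized atoms can only be reported as unknown.
    """
    problems = []
    for i, (atom, cls) in enumerate(classified):
        if cls is None:
            problems.append(f"unknown atom: {atom}")
        elif cls == "PREVERB":
            nxt = classified[i + 1][1] if i + 1 < len(classified) else None
            if nxt not in ("PREVERB", "VERB_STEM"):
                problems.append(f"preverb {atom} lacks a following verb stem")
    return problems


if __name__ == "__main__":
    for sentence in ("pv1 stem1", "pv1 noun1", "pvv1 noun1"):
        print(sentence, "->", check_dependencies(classify_atoms(sentence)))
```

With the correctly spelled `pv1 noun1`, the missing verb stem is reported. With the misspelled `pvv1 noun1`, the only complaint is about an unknown atom; the dependency check never fires, because nothing tells it that a preverb was intended.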
In the case of a language like East Cree, a problem emerges in that people happily introduce whitespace between what we consider to be parts of a word.
Because legitimately distinct syntactic words are also separated by whitespace, having our spell-checkers evaluate sentences rather than atoms would be a major expansion of their mission. Leaving some intermediate steps unspecified, I think this is the central challenge of writing systems like that used in East Cree: syntactically distinct items are treated in the writing system in the same way as syntactically dependent items, but our spell checker can only reasonably be expected to handle one of these types of data (the close-knit atoms).
(If this understanding is faulty, please let me know :)
We may treat spaces alike, and include all atoms
In order to find the illicit combinations, one will then have to turn to an alternative.
The alternative model would be to distinguish between real word boundaries
The obvious solution to this is to treat each morpheme that is separated by
Use of CG as a pattern-matcher to enhance spellcheck.
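As a rough illustration of what such pattern-matching could look like, here is a sketch in plain Python. It is not CG rule syntax and does not call any CG tool; it only imitates the general idea of testing the context of a token's readings and adding an error tag. The tags (`PV`, `V`, `N`, `ERR_STRANDED_PV`) and the atoms are hypothetical placeholders, and the single rule is an assumption chosen for the example.

```python
def flag_stranded_preverbs(cohorts):
    """Add an error tag to preverbs that no verb reading can follow.

    `cohorts` is a list of (wordform, set_of_tags) pairs, one per
    whitespace-separated atom. A cohort with a PV reading is flagged when
    its right neighbour has neither a PV nor a V reading, or is missing.
    """
    flagged = []
    for i, (form, tags) in enumerate(cohorts):
        new_tags = set(tags)
        if "PV" in tags:
            right = cohorts[i + 1][1] if i + 1 < len(cohorts) else set()
            if not ({"PV", "V"} & set(right)):
                new_tags.add("ERR_STRANDED_PV")
        flagged.append((form, new_tags))
    return flagged


if __name__ == "__main__":
    # Placeholder input: a preverb followed by something that is only a noun.
    sentence = [("pv1", {"PV"}), ("noun1", {"N"})]
    for form, tags in flag_stranded_preverbs(sentence):
        print(form, sorted(tags))
```

Because the rule looks at all readings of the neighbouring atom, an ambiguous neighbour that could be a verb does not trigger the error tag; only a neighbour with no verb or preverb reading does.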
- Your input was: i can not eat
- Our translation of this sentence is: gaawiin nindaa-wiisinisii
The three things to accomplish in the near future
- BLARK: Dustin & Katie with the help of Arok, Marie-Odile and company to put together a survey of Algonquian language resources and community pull to create language-technological tools. AKA "you give us resources, we help you create tools".
- nehiyawêtân 1: describing the creation of a Plains Cree ICALL application: Lene, Megan, Antti, et co., to be followed by further development based on feedback from Cree instructors (Arok, Jean, Dorothy) and then impact on language learning -> LREC (I don't think LREC is the place for this)
- nehiyawêtân 2: implementation in language courses and evaluation of the students' use of it and their learning outcome
- FST modeling of Plains Cree nouns and an evaluation of the FST based on corpus analysis, with a special look at locatives, diminutives, and possessives. Atticus, Lene, Trond, Antti, Arok.
- FST modeling of Plains Cree verbs: Atticus, Lene, Trond, Antti, Arok.
- itwewina - why and how to combine an existing dictionary with an FST.
We want to be mindful of how to evangelize this work. Finding organizations, conferences where we can talk to people. Communities that are relevant include linguists, teachers, less so computer scientists.
Computer scientists don't necessarily think our work is interesting (not fashionable, too little data ...). An interesting topic could be finding the lower limit of data needed for inference with various tools; perhaps there is a paper here.