Meeting_2014-06-11
Keyboards and preprocessing
Plans for the summer and autumn:
- keyboards for iOS 8 and Android (Lavangen and India? Announcement)
- preprocessing
- work for Mike
keyboards for iOS 8 and Android (Lavangen and India? Announcement)
Funding: the Divvun budget for extra initiatives
Design goals:
- as similar to Apple's keyboard as possible
- completion suggestions and correction suggestions from hfst (but perhaps only list-based in
- Norwegian and non-Sami characters as a popup list (as in the Apple keyboards)
- clear separation between language-independent and language-specific code
- keyboard layout as an XML file (or something similar)
- we are building for North Sami now, but it should be easy to build for all our languages
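To make the split between language-independent and language-specific parts concrete, here is a minimal sketch in Python. The XML element and attribute names (`keyboard`, `row`, `key`, `char`, `popup`) are invented for illustration; no layout format has been decided. The loader knows nothing about North Sami, and everything language-specific lives in the layout data.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout file for North Sami (sme). The schema is an assumption;
# a long press on a key would show the characters listed in "popup".
LAYOUT_XML = """
<keyboard lang="sme">
  <row>
    <key char="á"/>
    <key char="a" popup="å ä"/>
    <key char="š"/>
    <key char="o" popup="ø ö"/>
  </row>
  <row>
    <key char="đ"/>
    <key char="ŋ"/>
  </row>
</keyboard>
"""

def load_layout(xml_text):
    """Language-independent loader: turns any layout file into plain data."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.findall("row"):
        keys = []
        for key in row.findall("key"):
            # A missing popup attribute means the key has no long-press list.
            popup = key.get("popup", "").split()
            keys.append({"char": key.get("char"), "popup": popup})
        rows.append(keys)
    return {"lang": root.get("lang"), "rows": rows}

layout = load_layout(LAYOUT_XML)
print(layout["lang"], [k["char"] for k in layout["rows"][0]])
```

Adding another language would then mean writing a new layout file, with no change to the loader.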
Possible future variants:
- swipe-inspired?
- (more advanced) use of hfst technology for spell checking and word completion
preprocessing
- hfst-pmatch? https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
- hfst-ataq
- something else?
Possible issues with hfst-pmatch:
- char-by-char processing? Just like any other fst: state-by-state
- processing of formatting? It can deal with any text - as long as the formatting is expressed as text (in-stream markup) it should be no problem
- speed? We don't know yet, but fst speed anyway
- compilation speed? Unknown until we try
Tommi: With pmatch you cannot get a tokeniser that analyses as it goes and keeps ambiguous readings in the middle of the string; if "in order to" is matched lrlm (left-to-right longest match), the pmatch applicator will not also give you "in" "order" "to".
Sjur: Can this be changed in the pmatch code to collect all paths up until a common tokenisation point?
Tommi: Wouldn't that in the end be just as much work as rewriting from scratch, and probably harder? Using pmatch for this with these specs is like having a hammer and trying very hard to use it on screws because they kind of look like nails.
See http://www.stanford.edu/~laurik/publications/pmatch for details on how to
work for Mike
Mike to try out hfst-pmatch for a month, then we evaluate the feasibility of hfst-pmatch as an analysing tokeniser.
wishlist for tokeniser
- allow whitespace in the middle of words, e.g. \n and soft hyphen
- string: lettersequence - whitespacesequence - othersequence - ...
- LR longest match for token-sharing boundaries
- within the token, all the analyses
- input: 12345678901235463; possible tokenisations:
  - 12 34 5678 90 12 35 463
  - 123 45678 90 12 35 463
  - ^12345678/12+34+5678/123+45678$ ^90/90$ ^12/12$ ...
  - thus: get both tokenisations between 1 and 8, then analyse
- input: "the cat's mother, in order to"; possible tokenisations:
  - the cat 's mother, in order to
  - the cat's mother, in order to
  - ^the/the$ ^cat's/cat+'s/cat's$ ^mother/mother$ ^,/,$ ^in order to/in+order+to/in order to$
Two possible tokenisations:
- as a single token:
  "<in order to>"
      "in order to" pr
- as three tokens:
  "<in>"
      "in" pr
  "<order>"
      "order" vblex pres
      "order" n sg
  "<to>"
      "to" pr
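The idea of collecting all tokenisations up to a common tokenisation point can be sketched as follows. This is a toy illustration only: it uses neither pmatch nor any hfst API, `LEXICON` is an invented stand-in for the analyser, and the brute-force path enumeration is exponential in the worst case. It shortens the digit example to its first ten characters, finds the positions that every full tokenisation shares as a boundary, and prints each block with all its readings.

```python
# Toy stand-in for the analyser: the set of strings it accepts as tokens.
LEXICON = {"12", "34", "5678", "123", "45678", "90"}

def edges(text, lexicon):
    """All (start, end) spans of text that are known tokens."""
    return [(i, j) for i in range(len(text))
            for j in range(i + 1, len(text) + 1) if text[i:j] in lexicon]

def paths(spans, start, end):
    """All sequences of adjacent spans leading from start to end."""
    if start == end:
        return [[]]
    result = []
    for (i, j) in spans:
        if i == start and j <= end:
            for rest in paths(spans, j, end):
                result.append([(i, j)] + rest)
    return result

def tokenise(text, lexicon):
    """Return one '^surface/reading/...$' block per unambiguous stretch."""
    full = paths(edges(text, lexicon), 0, len(text))
    if not full:
        return []
    # A common tokenisation point is a boundary shared by every full path.
    boundary_sets = [{0} | {j for _, j in p} for p in full]
    common = sorted(set.intersection(*boundary_sets))
    blocks = []
    for a, b in zip(common, common[1:]):
        readings = []
        for p in full:
            reading = "+".join(text[i:j] for i, j in p if a <= i and j <= b)
            if reading not in readings:
                readings.append(reading)
        blocks.append("^" + text[a:b] + "/" + "/".join(readings) + "$")
    return blocks

print(" ".join(tokenise("1234567890", LEXICON)))
# prints: ^12345678/12+34+5678/123+45678$ ^90/90$
```

A real implementation would walk the transducer state by state and keep the lattice compact instead of enumerating paths, but the output shape is the one asked for in the wishlist.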
- output an ambiguous lattice?
- use backoff automata? E.g. analyser -> regex -> Unicode database
- Sane handling for Finnic(?) coordinated compounds with hanging hyphen:
  - ”koira- ja kissajuttu” ?= koira+juttu ja kissa+juttu
  - it would be neat if hyphenated words were not in the morphological analyser ... maybe
- Case mangling:
  - "Thing" -> thing
  - an tAerfort -> an t+aerfort
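The case mangling and the backoff chain (analyser -> regex -> Unicode database) could fit together roughly like this. It is a sketch under assumptions: `LEXICON` is a toy stand-in for the analyser, the number regex is just one example of a regex layer, and the function names are invented. Only the `unicodedata` calls are real library API.

```python
import re
import unicodedata

# Toy stand-in for the morphological analyser (assumed entries).
LEXICON = {"thing", "an", "aerfort"}

def normalise(token):
    """Try the token as written, then case-mangled variants."""
    if token in LEXICON:
        return token
    if token.lower() in LEXICON:
        return token.lower()              # "Thing" -> thing
    # Lowercase prefix letter before a capital: tAerfort -> t+aerfort
    if len(token) > 1 and token[0].islower() and token[1].isupper():
        rest = token[1:].lower()
        if rest in LEXICON:
            return token[0] + "+" + rest
    return None

def analyse(token):
    """Backoff chain: lexicon, then regex, then Unicode category.

    Assumes a non-empty token.
    """
    form = normalise(token)
    if form is not None:
        return (form, "lexicon")
    if re.fullmatch(r"[0-9]+([.,][0-9]+)?", token):
        return (token, "regex:number")
    # Last resort: label the token by the category of its first character.
    return (token, "unicode:" + unicodedata.category(token[0]))

for tok in ["Thing", "an", "tAerfort", "3,14", "€"]:
    print(analyse(tok))
```

Each layer is only consulted when the previous one fails, so the analyser always wins where it has an answer.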
Re unicode regexes: "You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}." See http://www.regular-expressions.info/unicode.html for details.
Which tools support Unicode regexes? PCRE? Yes, I believe so. Any decent and recent programming language with proper ICU-based Unicode support :)
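As a small illustration of the Unicode categories, here is a sketch of the "lettersequence - whitespacesequence - othersequence" split from the wishlist. Python's built-in `re` module does not understand `\p{L}` (the third-party `regex` module does), so the sketch asks the Unicode database directly through the standard `unicodedata` module. The function names are invented for the example.

```python
import unicodedata
from itertools import groupby

def char_class(ch):
    """Classify one character as letter, space or other."""
    # General categories starting with "L" are what \p{L} matches.
    if unicodedata.category(ch).startswith("L"):
        return "letter"
    if ch.isspace():
        return "space"
    return "other"

def split_sequences(text):
    """Split text into maximal runs of characters of the same class."""
    return [(cls, "".join(chars)) for cls, chars in groupby(text, key=char_class)]

print(split_sequences("guolli, čáhci 42"))
```

Note that a soft hyphen (U+00AD, category Cf) ends up as "other" here, so a real tokeniser would need an extra rule to keep it inside a word.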