Meeting_2014-06-11

Contents:

Tastatur og preprosessering
tastatur for iOS8 og Android (Lavangen og India? Utlysing)
preprosessering
arbeid til Mike
wishlist for tokeniser

Tastatur og preprosessering

Planar for sumaren og hausten:

tastatur for iOS8 og Android (Lavangen og India? Utlysing)
preprosessering
arbeid til Mike

tastatur for iOS8 og Android (Lavangen og India? Utlysing)

Finansiering: Divvun-potten for ekstra satsingar Timeplan: ferdig til offentleg lansering av iOS8 (rykte: september) - vi satsar på 15. september for ein beta, ferdig så fort som mogleg etter det

Design-mål:

så lik Apple sitt tastatur som mogleg
fullføringsforslag og retteforslag frå hfst (men kanskje berre listebasert i fyrste versjon for å få han ut)
norske og ikkje-samiske teikn som popup-liste (som Apple-tastatura)
klårt skilje mellom språkuavhengig og språkspesifikk kode
tastaturlayout som xml-fil (eller noko liknande)
vi lagar for nordsamisk no, men skal enkelt kunna lagast for alle språka våre

Moglege framtidsvariantar:

swipe-inspirert?
(meir avansert) bruk av hfst-teknologi for stavekontroll og ordfullføring

preprosessering

Basert på:

hfst-pmatch? https: //kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
hfst-ataq
something else?

Possible issues with hfst-pmatch:

char-by-char processing? Just like any other fst: state-by-state
processing of formatting? It can deal with any text - as long as the formatting is expressed as text (in-stream markup) it should be no problem
speed? We don't know yet, but fst speed anyway
compilation speed? Unknown until we try

Tommi: You cannot get your tokeniser as you analyse with ambiguos readings in middle of the string from pmatch; if "in order to" is lrlm there won't be "in" "order" "to" using pmatch applicator.

Sjur: Can this be changed in the pmatch code to collect all paths up until a common tokenisation point?

Tommi: Wouldn't it in the end be just as much work as rewriting from scratch and probably harder? Like, using pmatch for this with these specs is like having a hammer and trying very hard to use it on screws cause they kind of look like a nail.

See http://www.stanford.edu/~laurik/publications/pmatch for details on how to use (hfst-)pmatch.

arbeid til Mike

Mike to try out hfst-pmatch for a month, then we evaluate the feasibility of hfst-pmatch as an analysing tokeniser.

wishlist for tokeniser

have whitespace in the middle of words, e.g. \n and softhyphen
string: lettersequence - whitespacesequence - othersequence - ...
LR longest match for token-sharing boundaries
within the token, all the analyses
input: 12345678901235463; possible tokenisations:
- 12 34 5678 90 12 35 463
- 123 45678 90 12 35 463
  - ^12345678/12+34+5678/123+45678$ ^90/90$ ^12/12$ ...
  - thus: get both tokenisations between 1 and 8. then analyse
input: "the cat's mother, in order to", possible tokenisations:
- the cat 's mother, in order to
- the cat's mother, in order to
  - ^the/the$ ^cat's/cat+'s/cat's$ ^mother/mother$^,/,$ ^in order to/in+order+to/in order to$

Two possible tokenisations:

"<in order to>"
    "in order to" pr

"<in>"
    "pr"

"<order>"
    "order" vblex pres
    "order" n sg 

"<to>"
    "to" pr

output an ambiguous lattice ?
do backoff automata ? e.g. analyser -> regex -> unicode database
Sane handling for Finnic(?) coordinated compounds with hanging hyphen:
- ”koira- ja kissajuttu” ?= koira+juttu ja kissa+juttu
- it'd be neat if hyphenated words were not in morph. analyser.. maybe
Case mangling:
- "Thing" -> thing
- an tAerfort -> an t+aerfort

Re unicode regexes: "You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}." See http://www.regular-expressions.info/unicode.html for details.

Which tools support Unicode regexes? pcre? Yes, I believe so. Any decent and recent programming language with proper ICU-based Unicode support : )