Grammatikkontrollmøte 14.6.2016

Til stades: Kevin, Sjur

Tema: fleirtydig tokenisering

< {skuvla} 0:" " "@P.Pmatch.Backtrack@" {busse} "+N":0 > ENDLEX;
skuvla+N:skuvla ENDLEX;
busset+V:busse ENDLEX;

Leksikonet over bør kunna gi denne analysen:

"<skuvla busse>"
   "skuvlabusse" N Err/SpaceCmp
   "busset" V
       "skuvla" N

Forslag til pmatch-filter:

define filter_flags(net) net .o. [?* flag:0 ?*]*;
! Would this happen online or during compilation?
! Compilation of the pmatch rules
! (Leads to overgeneration but can that be limited mechanically?)
! (Probably scratch this idea, I forgot about the overgeneration problem)

! the tag LexCmp indicates a lexicalized compound
define lexicalized_compounds Lexicon .o. ?* LexCmp ?* ;
define allowed_prefixes Lexicon .o. ?* PrefixForms ?* ;
define multitoken_surfaces [filter_flags(lexicalized_compounds.i) .o. [[ allowed_prefixes 0:" " ]+ Lexicon ]].o;
define multitokens multitoken_surfaces .o. [ Lexicon " " ]+ ;
define lexicalized_compounds_with_erroneous_spaces multitoken_surfaces .o. [?* " ":0 ?*] .o. lexicalized_compounds ;

Kommentarar frå Kevin i koden:

! Issues: flag diacritics probably both break the morphosemantics and
! cause huge memory consumption
! Idea: if lexicalized_compounds could be made flag-free, that might suffice? What about the flags in the forms ambiguous with lex.cmps? We don't know which forms are ambig. with lex.cmp's, that's why we intersect.
! Below: that still doesn't work
! Krister: have recent developments in restricting the compound correction perhaps made this possible and we
! should try again?
!! Tried heavily restricting, even to just simple nouns, still too much mem
! Another idea: since we really want to start with surface forms, could we just output a text file with
! a list of the lexicalized compound surface forms?
!! Ie. analyse a bunch of forms and then … script the lexc to add a tag to lemmas that are ambiguous?
! Maybe use eg. ospell to generate them
! It's suggested that I (Sam) try to do this for omorfi where there are no flag diacritics just to validate the idea
! I did originally implement this as form-intersection, which worked just fine where there were no flags :) but unfortunately anything interesting in sme has flags …
! I don't see how this is less hacky than online backtracking :/ 
! Neither do I
! so what about this RC mentioned in the email thread
! That's mainly for avoiding doing multitoken analysis even when there's no possibility of misspelled compound
! But thinking again about the rules above ... Isn't it possible to filter out flags from the surfaces
! Our meeting is running out of time... but I'll think some more about this possibility and write it up in an email

Og vidare:

! Trying to remove flag diacritics with foma in order to intersect on forms-only:
! runs out of ram. Trying to grep them out manually: runs out of ram during
! minimize or composition. Even grepping out only parts from only the parts of
! the lexicon we need runs out of ram during later steps of compilation.

define TOP
[ Lexicon |
lexicalized_compounds_with_erroneous_spaces |
RC(lexicalized_compounds_with_erroneous_spaces) multitokens
] RC(boundary);

> [ RC(verb) noun] meaning that the current match is first tested to be in the input
> set of "verb" and then processed as "noun". So you could test that you
> have an ambiguity and then trigger that sort of tokenization
! So something like:
define word lexicon RC([blank|#]) LC([blank|#]);
! All analyses that have the SpaceCmp tag:
define spacecmp lexicon .o. ?* Err/SpaceCmp ?*; 
! If something had the SpaceCmp tag, try reanalysing that string as if it were two tokens with a space in the middle:
define token [ RC(spacecmp) word " " word ]; 

Tankar om løysing:

skuvlabusse:skuvla#busse CONTLEX ;
# (->) " " ; ! And also add +Err/SpaceCmp

Problemstilling: Korleis kan vi få analysene:


i tillegg til:

skuvla busse+N

utan å eksplisitt ha

< {skuvla} "+N":0 "@P.Pmatch.Loc@" busset:busse "+V":0 >

i lexc (alt for mange kombinasjonar til å handtera manuelt), og utan å køyra intersection A∩A" "A på formar (som ikkje går pga. flagg vs RAM), dvs. slik at me berre seier i lexc at «herfrå vil me ha online backtracking».

Me har ikkje noko behov for å spesifikt seia at "der, der er backtrack-punktet", berre at "denne analysestrengen krev at me tek med reanalyser som fleire token", der reanalysen har dei vanlege tokeniseringsgrensene (mellomrom i dette tilfellet).

Abbreviations require the same treatment as compounds unless we want to manually specify paths into PUNCT for all forms ambiguous with punctuated abbreviations:

Abbreviations vs other POS + full stop
su. (sunnuntai or su + .)

We already have a solution for: Numerals: ordinals vs cardinals + full stop as in: 1000., since numbers have fairly unambiguous forms : )

Sitat frå tekstchat under møtet:

June 14, 2016
14:06 Kevin: su. er jo dekka av pmatch_input_mark
14:06 Sjur: men no prøver vi å finna ei løysing som ikkje bruker det merket - eller?
14:06 Kevin: nei, det blir noko anna
14:06 Sjur: og eg får ikkje su. til å funka
14:07 Sjur: ok
14:07 Kevin: me kan sjå på det, men det skal uansett ikkje trenga backtracking
14:07 Sjur: kvifor ikkje? vi vil ha su. = ABBR + su. = Pron & CLB
14:08 Sjur: altså treng vi backtracking etter det eg kan forstå
14:09 Kevin: Åh, viss me skal gjera det utan å spesifisera fullt ut i leksikon, ok. Eg såg på det som ekvivalent med «3.», der det er veldig lett å fullt ut gi begge i leksikon
14:10 Sjur: ok - eg såg for meg ei generell løysing der vi ikkje byggjer leksikonet for tokeniseringa
14:11 Sjur: men om det er enkelt kan vi sjølvsagt gjera det der :)
14:13 Sjur: «og utan å køyra intersection på formar» - meiner du composition?