propnouns-meeting-20051221
Agenda
- finalize the proper name xml structure
- prepare integration of the kvensk project, if SD accepts
Participants: Børre, Linda, Sjur, Tomi, Trond
Questions:
- What (the content):
- make an overview of all info we want to store
- ... and how to organise it
- How (the xml structure):
- one or two files? (two actually implies three)
- what info to split into common parts and project / language specific parts
Views:
- Iconic id better than arbitrary id.
- Single linking, or automatically made double linking
- pro links in common: lg-specific files are not cluttered
- pro links in lg files: that is where the info is
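A minimal sketch of "automatically made double linking", assuming the links are maintained by hand only in the language files (as `<sense ref="..."/>`) and that the back-links in the common file (`<langentry lang="..." ref="..."/>`) are generated from them. The element names follow the xml example format in these notes; the wrapper elements `<common>` and `<lang>`, the function name `add_back_links`, and the trimmed sample data are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample data modelled on the xml example format.
# The <common> and <lang> wrapper elements are assumptions.
COMMON = """<common>
  <entry id="India"><sem><plc/></sem></entry>
  <entry id="India_2"><sem><fem/></sem></entry>
</common>"""

LANG_FILES = {
    "sme": """<lang>
      <entry id="India"><senses>
        <sense ref="India_2"/><sense ref="India"/>
      </senses></entry>
    </lang>""",
    "fin": """<lang>
      <entry id="Intia"><senses><sense ref="India"/></senses></entry>
      <entry id="India"><senses><sense ref="India_2"/></senses></entry>
    </lang>""",
}


def add_back_links(common_xml, lang_files):
    """Derive <langentry> back-links in the common file from the
    <sense ref="..."/> links stored in the language files."""
    common = ET.fromstring(common_xml)
    concepts = {e.get("id"): e for e in common.findall("entry")}
    for lang, xml_text in sorted(lang_files.items()):
        for entry in ET.fromstring(xml_text).findall("entry"):
            for sense in entry.findall("senses/sense"):
                concept = concepts.get(sense.get("ref"))
                if concept is None:
                    continue  # dangling link; a real tool should report it
                ET.SubElement(concept, "langentry",
                              lang=lang, ref=entry.get("id"))
    return common


common = add_back_links(COMMON, LANG_FILES)
# Collect the generated links per concept as (lang, ref) pairs.
links = {
    e.get("id"): [(l.get("lang"), l.get("ref")) for l in e.findall("langentry")]
    for e in common.findall("entry")
}
```

With this direction of generation the human-edited information stays in the lg files, and the common file only carries derived convenience links.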
Work process:
- Timbuktu: Add iconic id and semantics once, to common.
- The machine MAKES the lg entries, based upon these assumptions:
- one form, inherited from iconic id
- one sem, inherited from common
- and we will decide a default declension class for each lg
- we may leave a tag in place saying "untouched by human hands"
- Helsinki: same case, but here we need heavy manual editing
- make a tag saying "now touched (by native speaker)"
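The work process above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `status` attribute with the values "untouched" / "touched" is a hypothetical stand-in for the tag discussed above, the function names are invented, and the default declension classes in `DEFAULT_LEXC` are placeholders borrowed from the lexc values in the xml examples, not decided defaults.

```python
import xml.etree.ElementTree as ET

# Assumed per-language default declension classes (placeholders only).
DEFAULT_LEXC = {"sme": "ACCRA", "fin": "14"}


def make_lang_entry(common_entry, lang):
    """Build a language entry from a common entry:
    one form (inherited from the iconic id), one sense (pointing back
    to the common concept), and the default declension class."""
    concept_id = common_entry.get("id")
    # status="untouched" is a hypothetical marker: no human has edited this.
    entry = ET.Element("entry", id=concept_id, status="untouched")
    ET.SubElement(entry, "stem")  # left empty: form is inherited from the id
    ET.SubElement(entry, "infl", lexc=DEFAULT_LEXC[lang])
    senses = ET.SubElement(entry, "senses")
    ET.SubElement(senses, "sense", ref=concept_id)
    return entry


def mark_touched(entry):
    """Record that a native speaker has reviewed or edited the entry."""
    entry.set("status", "touched")


common_entry = ET.fromstring('<entry id="Timbuktu"><sem><plc/></sem></entry>')
sme_entry = make_lang_entry(common_entry, "sme")
xml_text = ET.tostring(sme_entry, encoding="unicode")
```

For a Timbuktu-type name the generated entry can be used as it is; for a Helsinki-type name it is the starting point for manual editing, after which `mark_touched` would be called.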
Conclusions:
Double linking, iconic id
- Place names: pick Norwegian, Swedish, Finnish, English names.
- Other names: pick the most common (the one which gives the most "identical" hits among our lgs).
With the principle of inheritance (lemma inherited from common file):
- inherit right away / at creation time (= larger files, more duplicate info)
common         | swe           | fin
India_2        | India         | Intia
->lg=a         | ->India_2     | ->India_2
(->lg=b Intia) | ->India       |
sem plc
...
Timbuktu       | Timbuktu      | Timbuktu
->lg=a id      | ->Timbuktu    | ->Timbuktu
->lg=b id      |               |
sem plc        | <ab>Tmb.</ab> | ...

               | sme:   | ... | nor:   | fin | swe | eng
Tana           | Deatnu | ... | Tana   |     |     |
->lg=a id      | ->Tana | ... | ->Tana |     |     |
->lg=b id      |        | ... |        |     |     |
sem plc        | ...    |     |        |
... ...        |        |     |
What do we store in the "common" file?
What is stored in the lang-specific ones? Linguistic info:
- inflection
- stem
- lemma
- derivation class?
- compounding?
- senses (pointers to concepts)
- orthographical variants (incl. (common) misspellings)
- acronym(s) and abbreviation(s):
- as separate entries or as part of the name entry?
NATO => OTAN
NRL => NBR, Ap => Bb

KRD
KRD KRD
KRD KRD+N+ACR+Sg+Acc
KRD KRD+N+ACR+Sg+Gen
KRD KRD+N+ACR+Sg+Nom

NATO
NATO NATO+N+ACR+Sg+Acc
NATO NATO+N+ACR+Sg+Gen
NATO NATO+N+ACR+Sg+Nom
NATO NATO+N+Prop+Org+Sg+Acc
NATO NATO+N+Prop+Org+Sg+Attr
NATO NATO+N+Prop+Org+Sg+Gen
NATO NATO+N+Prop+Org+Sg+Nom

"<NATO>" S:1732, 1732, 1732, 1732, 5423, 5849, 5849, 9980
    "NATO" N Prop Org Sg Nom <<< S:1285 @HNOUN
Different aspects of abbreviations and acronyms:
- expansion (requires linking/common entry):
- abbr needs to be expanded for IR and text-to-speech
- translation systems want to transl. them to other lg abbrs (possibly requiring
- linguistic analysis/properties:
- the preprocessor is concerned about abbr's behaviour wrt. sentence delimitation (TRAB, ITRAB)
- speller programs want to correct them whenever wrongly spelled (possibly
- disambiguators want their underlying POS analysis (in addition to their ABBR
- they have inflections of their own
- St.dieđ. 10 / St. dieđáhus OR St. dieđáhusa... (implicit case)
- NRK: as (explicit case, except for Acc/Gen, which may be left unexpressed)
- can take part in compounding, possibly derivation
Lexicon conclusion:
- store abbr. that are coming from names as separate entries?
- store acr. as separate entries in the name database, with type="acr"
- store alternative names as separate entries
- all linked together or to the same concept (open??? If to the concept, forces us to allow
Transducer conclusions:
- Leave things at status quo for the abbreviations and the acr generator
- We will return to the issue of double abbrs if they turn up (They probably don't)
- Double acrs are already taken care of in the sme-dis.rle rule set (lexical acronyms
xml example format:
Concept center (common file):
<entry id="India" type="full (default)/abr/acr/alt/err">
  <sem>
    <plc type="xxx" ssrcode="">  <!-- type=5., ssrcode=6. -->
      <geo>
        <country>IN</country>
        <region/>  <!-- "fylke" or similar, 11. -->
        <munic/>   <!-- 10. -->
        <coord/>   <!-- 14. -->
      </geo>
      <regul>
        <gnr/>  <!-- 7. -->
        <bnr/>  <!-- 7. -->
      </regul>
    </plc>
  </sem>
  <!-- These links are convenience entries, to speed up processing -->
  <langentry lang="sme" ref="India"/>
  <langentry lang="smj" ref="India"/>
  ...
  <langentry lang="fin" ref="Intia"/>
</entry>

<entry id="India_2">
  <sem>
    <fem/>
  </sem>
  <langentry lang="sme" ref="India"/>
  <langentry lang="smj" ref="India"/>
  ...
  <langentry lang="fin" ref="India"/>
</entry>
Language file for, say, sme:
<entry id="India">
  <!-- Do we need the stem, or can it be inferred/inherited from the id?
       NO, only if different from the id. -->
  <stem/>
  <infl lexc="ACCRA">(example?)</infl>
  <name-parts/>
  <etym/>
  <rel-name ref="xyz"/>
  <senses>
    <sense ref="India_2"/>
    <sense ref="India"/>
  </senses>
</entry>
Language file for fin:
<entry id="Intia">  <!-- 1. -->
  <stem lexc="14">(only if different from id/headword)</stem>  <!-- 2. and 3. -->
  <name-parts/>  <!-- 4. -->
  <variants>  <!-- 15. -->
    <variant ref="xyz"/>
  </variants>
  <etym/>  <!-- 24. -->
  <rel-name ref="xyz"/>  <!-- 18. -->
  <senses>
    <sense ref="India"/>
  </senses>
</entry>

<entry id="India">
  <stem/>
  <infl lexc="14">(example?)</infl>
  <name-parts/>
  <etym/>
  <rel-name ref="xyz"/>
  <senses>
    <sense ref="India_2"/>
  </senses>
</entry>
Language file for kvensk:
<entry id="Porsanki">  <!-- 1. -->
  <stem lexc="14">(only if different from id/headword)</stem>  <!-- 2. and 3. -->
  <name-parts/>  <!-- 4. -->
  <variants>  <!-- 15. -->
    <variant ref="xyz"/>
  </variants>
  <etym/>  <!-- 24. -->
  <rel-name ref="xyz"/>  <!-- 18. -->
  <senses>
    <sense ref="Porsanger">
      <legal>
        <status/>    <!-- 8. -->
        <decision/>  <!-- 9. -->
      </legal>
      <source>
        <informants>
          <informant id="some-id">  <!-- 20. -->
            <explanation date=""/>  <!-- 19. -->
            <explanation date=""/>
          </informant>
        </informants>
        <collectors>
          <collector id="" year=""/>  <!-- 21. -->
          <collector id="" year=""/>
        </collectors>
        <archive/>
        <other>
          <print/>
        </other>
      </source>
      <comment/>  <!-- 28. -->
    </sense>
  </senses>
</entry>

In the case that stem = lemma, we have the entry:
<stem lexc="14"/>
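The rule that the stem is only written out when it differs from the id can be sketched as a small lookup. This is a hedged illustration, not an agreed implementation: the function name `effective_stem` is invented, and the second sample entry (id "Xyz", stem "Xy") is purely hypothetical data.

```python
import xml.etree.ElementTree as ET


def effective_stem(entry):
    """Return the stem of a language entry.

    An empty or missing <stem> means stem = lemma, so the stem is
    inherited from the entry id; otherwise the explicit text wins."""
    stem = entry.find("stem")
    text = (stem.text or "").strip() if stem is not None else ""
    return text or entry.get("id")


# stem = lemma: only the declension class is given.
inherited = ET.fromstring('<entry id="Intia"><stem lexc="14"/></entry>')
# Hypothetical entry whose stem differs from the id.
explicit = ET.fromstring('<entry id="Xyz"><stem lexc="14">Xy</stem></entry>')

stems = [effective_stem(inherited), effective_stem(explicit)]
```

Resolving the stem at lookup time like this corresponds to not inheriting "right away / at creation time", which keeps the language files smaller.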
These points from Irene's list are still open:
Print info - does it belong in the common or language-specific sections?:
12. map product (kartprodukt)
13. map sheet (kartblad)
Unclassified:
25. arrow reference (pilhenvisning, nuoliviite) to another article -> How is this different from 18.?
Multimedia - does it belong in the common or language-specific sections?:
26. sound file (lydfil)
27. picture(s), illustration(s) (bilde(r), illustrasjon(er))