propnouns-meeting-20051221

Agenda

  • finalize the proper name xml structure
  • prepare integration of the kvensk project, if SD accepts

Participants: Børre, Linda, Sjur, Tomi, Trond

Questions:

  • What (the content):
    • make an overview of all info we want to store
    • ... and how to organise it
  • How (the xml structure):
    • one or two files? (two actually implies three)
    • what info to split into common parts and project / language specific parts

Views:

  • Iconic id better than arbitrary id.
  • Single linking, or automatically made double linking
  • pro links in common: lg-specific files are not cluttered
  • pro links in lg files: that is where the info is

Work process:

  • Timbuktu: Add iconic id and semantics once, to common.
    • The machine MAKES the lg entries, based upon these assumptions:
      • one form, inherited from iconic id
      • one sem, inherited from common
      • and we will decide a default declension class for each lg
      • we may leave a tag in place saying "untouched by human hands"
  • Helsinki: same case, but here we need heavy manual editing
    • make a tag saying "now touched (by native speaker)"
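
The Timbuktu case above could be automated roughly as follows. This is only a sketch under assumptions: the `status` attribute, the `DEFAULT_INFL` table and its values, and the rule that a numeric `_2`-style suffix is stripped to get the lemma are all hypothetical, since the default declension classes and the tag name were not decided at the meeting.

```python
import xml.etree.ElementTree as ET

# Hypothetical default declension class per language (not yet decided).
DEFAULT_INFL = {"sme": "ACCRA", "fin": "14"}

def make_lang_entry(common_entry, lang):
    """Build a language-file entry from one common-file entry."""
    iconic_id = common_entry.get("id")
    # One form, inherited from the iconic id. Assumption: a numeric
    # suffix such as "_2" only disambiguates concepts and is dropped.
    base, sep, suffix = iconic_id.rpartition("_")
    lemma = base if sep and suffix.isdigit() else iconic_id
    # Hypothetical marker for "untouched by human hands".
    entry = ET.Element("entry", {"id": lemma, "status": "untouched"})
    ET.SubElement(entry, "infl", {"lexc": DEFAULT_INFL[lang]})
    # One sem, inherited from common via a pointer back to the concept.
    senses = ET.SubElement(entry, "senses")
    ET.SubElement(senses, "sense", {"ref": iconic_id})
    return entry

common = ET.fromstring('<entry id="Timbuktu"><sem><plc/></sem></entry>')
sme_entry = make_lang_entry(common, "sme")
print(ET.tostring(sme_entry, encoding="unicode"))
```

For the Helsinki case the same function would give the starting point, and a native speaker would then edit the entry and change the marker.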

Conclusions:

Double linking, iconic id.

The iconic id is decided by the following principle:

  • Place names: pick Norwegian, Swedish, Finnish, English names.
  • Other names: pick the most common (the one which gives the most "identical" hits among our lgs: sme, smj, sma, nor/nob/nno, swe, fin, eng (sms, smn))
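
The "most identical hits" rule for other names could be sketched like this. The function name and the tie-breaking rule are assumptions for illustration; the forms are the ones used in the example entries in this memo.

```python
from collections import Counter

def pick_iconic_id(forms_by_lang):
    """Return the form shared by the most languages."""
    counts = Counter(forms_by_lang.values())
    best = max(counts.values())
    # Assumption: on a tie, the form of the earliest-listed language wins.
    for form in forms_by_lang.values():
        if counts[form] == best:
            return form

forms = {"sme": "India", "smj": "India", "swe": "India", "fin": "Intia"}
print(pick_iconic_id(forms))  # India
```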

With the principle of inheritance (lemma inherited from common file):

  • inherit right away / at creation time (= larger files, more duplicate info)
common        |     swe          |    fin           |    
India_2       |     India        |    Intia         |    
->lg=a        |     ->India_2    |    ->India_2     |    
(->lg=b Intia)|     ->India      |                  |    
sem plc
...

Timbuktu      |     Timbuktu     |    Timbuktu      |    
->lg=a id     |     ->Timbuktu   |    ->Timbuktu    |    
->lg=b id     |                  |                  |    
sem plc       |     <ab>Tmb.</ab>|
...

              |  sme             | ... | nor    | fin | swe | eng
Tana          |  Deatnu          | ... | Tana   |     |     |
->lg=a id     |  ->Tana          | ... | ->Tana |     |     |
->lg=b id     |                  | ... |        |     |     |
sem plc       |                  | ... |        |     |     |
...           |                  | ... |        |     |     |

What do we store in the "common" file?

  • the iconic id
  • the semantics + info about the world (encyclopedic info)
  • links to the lg specific files

What is stored in the lang-specific ones? Linguistic info:

  • inflection
  • stem
  • lemma
  • derivation class?
  • compounding?
  • senses (pointers to concepts)
  • orthographical variants (incl. (common) misspellings)
  • acronym(s) and abbreviation(s):
    • as separate entries or as part of the name entry?
NATO => OTAN
NRL => NBR, Ap => Bb

KRD
KRD     KRD
KRD     KRD+N+ACR+Sg+Acc
KRD     KRD+N+ACR+Sg+Gen
KRD     KRD+N+ACR+Sg+Nom

NATO
NATO    NATO+N+ACR+Sg+Acc
NATO    NATO+N+ACR+Sg+Gen
NATO    NATO+N+ACR+Sg+Nom
NATO    NATO+N+Prop+Org+Sg+Acc
NATO    NATO+N+Prop+Org+Sg+Attr
NATO    NATO+N+Prop+Org+Sg+Gen
NATO    NATO+N+Prop+Org+Sg+Nom

"<NATO>" S:1732, 1732, 1732, 1732, 5423, 5849, 5849, 9980
        "NATO" N Prop Org Sg Nom <<< S:1285 @HNOUN

Different aspects of abbreviations and acronyms:

  • expansion (requires linking/common entry):
    • abbr needs to be expanded for IR and text-to-speech
    • translation systems want to transl. them to other lg abbrs (possibly requiring (intermediate) expansion)
  • linguistic analysis/properties:
    • the preprocessor is concerned about abbr's behaviour wrt. sentence delimitation (TRAB, ITRAB)
    • speller programs want to correct them whenever wrongly spelled (possibly storing misspellings of abbrs)
    • disambiguators want their underlying POS analysis (in addition to their ABBR tag)
    • they have inflections of their own
      • St.dieđ. 10 / St. dieđáhus OR St. dieđáhusa... (implicit case)
      • NRK: as (explicit case, except for Acc/Gen, which may be left unexpressed)
    • can take part in compounding, possibly derivation
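
Translating an acronym to another language's acronym by way of a shared concept could look roughly like this. It is a sketch under assumptions: the table layout and concept ids are hypothetical, the pairs NRL => NBR and Ap => Bb are taken from the examples above, and reading them as nor => sme is an assumption. It also presupposes that acronyms are linked to a common concept, which is still marked as open in this memo.

```python
# Hypothetical link tables: acronym -> common concept id, per language.
TO_CONCEPT = {
    "nor": {"NRL": "NRL", "Ap": "Ap"},
    "sme": {"NBR": "NRL", "Bb": "Ap"},
}

def translate_acronym(acr, src, tgt):
    """Map an acronym in src to the tgt acronym sharing its concept."""
    concept = TO_CONCEPT[src].get(acr)
    if concept is None:
        return None  # unknown acronym: nothing to link through
    for form, linked in TO_CONCEPT[tgt].items():
        if linked == concept:
            return form
    return None  # the concept has no acronym in the target language

print(translate_acronym("NRL", "nor", "sme"))  # NBR
```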

Lexicon conclusion:

  • store abbr. that are coming from names as separate entries? (we probably have no dotted abbrs for names)
  • store acr. as separate entries in the name database, with type="acr"
  • store alternative names as separate entries
  • all linked together or to the same concept (open??? If to the concept, forces us to allow more than one entry/language in the common file)

Transducer conclusions:

  • Leave things at status quo for the abbreviations and the acr generator
  • We will return to the issue of double abbrs if they turn up (They probably don't)
  • Double acrs are already taken care of in the sme-dis.rle rule set (lexical acronyms are preferred over generated ones)

xml example format:

Concept center (common file):

<entry id="India" type="full (default)/abr/acr/alt/err">
 <sem>
  <plc type="xxx" ssrcode="" > <!-- type=5., ssrcode=6. -->
   <geo>
     <country>IN</country>
     <region/> <!-- "fylke" or similar, 11. -->
     <munic/> <!-- 10. -->
     <coord /> <!-- 14. -->
   </geo>
   <regul>
     <gnr/> <!-- 7. -->
     <bnr/> <!-- 7. -->
   </regul>
  </plc>
 </sem>
 <!-- These links are convenience entries, to speed up processing -->
 <langentry lang="sme" ref="India"/>
 <langentry lang="smj" ref="India"/>
 ...
 <langentry lang="fin" ref="Intia"/>
</entry>

<entry id="India_2">
 <sem>
  <fem/>
 </sem>
 <langentry lang="sme" ref="India"/>
 <langentry lang="smj" ref="India"/>
...
 <langentry lang="fin" ref="India"/>
</entry>

Language file for, say, sme:

<entry id="India">
 <!-- Do we need the stem, or can it be inferred/inherited from the id?
      NO, only if different from the id. -->
 <stem/>
 <infl lexc="ACCRA">(example?)</infl>
 <name-parts/>
 <etym/>
 <rel-name ref="xyz"/>
 <senses>
  <sense ref="India_2"/>
  <sense ref="India"/>
 </senses>
</entry>

Language file for fin:

(numbers refer to Irene's draft, see below)

<entry id="Intia"> <!-- 1. -->
 <stem lexc="14">(only if different from id/headword)</stem> <!-- 2. and 3. -->
 <name-parts/> <!-- 4. -->
 <variants> <!-- 15. -->
  <variant ref="xyz"/>
 </variants>
 <etym/> <!-- 24. -->
 <rel-name ref="xyz"/> <!-- 18. -->
 <senses>
  <sense ref="India"/>
 </senses>
</entry>

<entry id="India">
 <stem/>
 <infl lexc="14">(example?)</infl>
 <name-parts/>
 <etym/>
 <rel-name ref="xyz"/>
 <senses>
  <sense ref="India_2"/>
 </senses>
</entry>
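
The "automatically made double linking" could be derived from the language files roughly as follows: the sense pointers are the hand-maintained links, and the `<langentry>` convenience links for the common file are generated from them. This is a sketch only; the `<entries>` wrapper element and the function name are assumptions, and the two entries are cut down from the fin example above.

```python
import xml.etree.ElementTree as ET

# Reduced fin language file; the <entries> root is hypothetical.
FIN_FILE = """
<entries>
 <entry id="Intia"><senses><sense ref="India"/></senses></entry>
 <entry id="India"><senses><sense ref="India_2"/></senses></entry>
</entries>
"""

def back_links(lang, lang_xml):
    """Return {common id: [<langentry> elements]} built from sense refs."""
    links = {}
    for entry in ET.fromstring(lang_xml).iter("entry"):
        for sense in entry.iter("sense"):
            link = ET.Element("langentry",
                              {"lang": lang, "ref": entry.get("id")})
            links.setdefault(sense.get("ref"), []).append(link)
    return links

for common_id, links in back_links("fin", FIN_FILE).items():
    for link in links:
        print(common_id, ET.tostring(link, encoding="unicode"))
```

Run over all language files, the collected links would fill the `<langentry>` lines of each common entry without anyone editing them by hand.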

Language file for kvensk:

(numbers refer to Irene's draft, see the meeting memo from Nov. 28)

<entry id="Porsanki"> <!-- 1. -->
 <stem lexc="14">(only if different from id/headword)</stem> <!-- 2. and 3. -->
 <name-parts/> <!-- 4. -->
 <variants> <!-- 15. -->
  <variant ref="xyz"/>
 </variants>
 <etym/> <!-- 24. -->
 <rel-name ref="xyz"/> <!-- 18. -->
 <senses>
  <sense ref="Porsanger">
   <legal>
    <status/> <!-- 8. -->
    <decision/> <!-- 9. -->
   </legal>
   <source>
    <informants>
     <informant id="some-id"> <!-- 20. -->
      <explanation date="" /> <!-- 19. -->
      <explanation date="" />
     </informant>
    </informants>
    <collectors>
     <collector id="" year=""/> <!-- 21. -->
     <collector id="" year=""/>
    </collectors>
    <archive/>
    <other>
     <print/>
    </other>
   </source>
   <comment/> <!-- 28. -->
  </sense>
 </senses>
</entry>

In the case that stem = lemma, we have the entry:
 <stem lexc="14"/>

These points from Irene's list are still open:

Print info - does it belong to the common or language-specific sections?:
12. kartprodukt (map product)
13. kartblad (map sheet)

Unclassified:
25. pilhenvisning / nuoliviite (arrow reference) to another article
    -> How is this different from 18.?

Multimedia - does it belong to the common or language-specific sections?:
26. lydfil (sound file)
27. bilde(r), illustrasjon(er) (picture(s), illustration(s))