Meeting_2005-05-30

Meeting setup

  • Date: 30.05.2005
  • Time: 10.00 Norwegian time
  • Place: Wherever we are :-)
  • Tools: Phone, iChat, SubEthaEdit

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from a week ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Term db
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10.30. Agenda accepted as is.

Present: Sjur, Thomas, Tomi, Trond, Børre

Away: Maaren

Main secretary: Sjur

2. Reviewing the task list from the last meeting

  • Børre:
    • Finish the divvun e-mail alias
      • Leif Åge has added Trond
      • Trond will test it himself.
    • Contact people about Lule Sámi texts
      • Nothing done
    • Contact skolelinux and Knut Hofland about webcrawler.
      • Here are Knut's thoughts on the issue from 1999: http://gandalf.aksis.uib.no/knut/mons/kh-avis.htm
      • Nothing done
    • Finish the setup of divvun.no
      • Børre has now got an account on the server, and will try today to set up Tomcat with our Forrest distro
    • Look at how forrest handles i18n, try to get work done on that.
      • Done some brief work
      • Børre, Tomi and Sjur did quite a lot of work in Guovdageaidnu and found some problems, which were reported to the Forrest list. We are encouraged to open a bug report.
    • Contact Roy Dragseth about cvs-commit mailing
      • Sent an e-mail, phoned him a few days later. He is away until the 26th of May.
      • Børre will contact him today, or cooperate with Sjur or Tomi today to try it out.
      • Trond will ask Roy to give Børre su(per) power.
    • Finish the setup of http://giellatekno.uit.no/
      • Some pages are ready
      • Encoding problems with the form pages (analyzer, generator)
      • Not yet evaluated what should be included in the new site
  • Tomi:
    • Install "Tiger" (10.4), document all issues, together with steps to resolve them
      • Postponed to Kautokeino
      • Installation went fine, all have Tiger now
    • Look at the twolc part of how to handle shortening of the middle part in three-part compounds. Discuss the data / linguistic side with the linguists.
      • Nothing done, even in Guovdageaidnu
    • Look at decapitalisation of proper nouns when compounded or derived to a general noun
      • Nothing done, as above.
  • Sjur:
    • Discuss with Kimmo:
      • Bring up the kwic-snt issue with Kimmo: make it UTF-8 compatible. At present each byte counts as one, instead of each character counting as one. What is needed is a counting mechanism that counts only bytes starting in 0 or 11 (single-byte characters and lead bytes), ignoring bytes starting in 10 (continuation bytes).
    • Still some work on the Termdb
      • Fix bugs that are coming up
    • Decide upon the public opening of divvun.no with Anne Britt
      • Giellaráđđi will send out a press release together with the opening of risten.no
    • Ask Leif Åge to check the opening/forwarding of the port numbers needed for iChat voice and video conferencing
      • We believed it was working. One step forward: single voice and video chats now work, but group chats still do not. This will have to be worked out, or we have to find another solution. Børre suggests Skype, which is fine for voice but has no video part; video isn't working for groups anyway :-)
  • Maaren:
    • continue with proper nouns, work on the Bugzilla bug list
    • Translate divvun.no front page to Finnish
      • done?
  • Thomas:
    • continue with verb valency; about 2500 verbs left now
  • Trond:
    • Work with the bug list, the lexicon, disambiguation. Prepare Lule Sámi etc. for the meeting.
      • Made a plan for Lule Sámi, did not go into the linguistic details, will wait for them until Thomas has his sme verbs done.
  • All:
    • Have a look at the bug list in Bugzilla (mainly Trond's bugs, but give me a hand, will you?) Bugzilla at uit
      • We are now down to 14 open bugs.
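
The byte-versus-character counting problem noted for kwic-snt under Sjur's tasks can be illustrated with a minimal sketch in Python. This is not kwic-snt code: the function name count_utf8_chars and the sample word are assumptions made purely for illustration. The idea is to count one character for every byte that is not a UTF-8 continuation byte (bit pattern 10xxxxxx), and it assumes the input is valid UTF-8.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count characters in valid UTF-8 data.

    Every character contributes exactly one byte that is NOT a
    continuation byte (continuation bytes look like 10xxxxxx),
    so counting those bytes gives the character count.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# Hypothetical sample: "Kárášjohka", written with escapes so the
# accented letters are guaranteed to be single precomposed code points.
text = "K\u00e1r\u00e1\u0161johka"
sample = text.encode("utf-8")

byte_count = len(sample)                # what a byte-based counter reports
char_count = count_utf8_chars(sample)   # what a UTF-8-aware counter reports

print(byte_count, char_count)
```

For the sample word the byte count is larger than the character count, because each of the three accented letters takes two bytes in UTF-8. A context window measured in bytes would therefore be cut short (or split a character) in Sámi text.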

3. Documentation - divvun.no

  • Public opening will be 22.6.2005
  • To be done before that:
    • Technical issues:
      • make sure Tomcat is working
      • make sure cvs updating is working
      • have i18n in place
    • Content of the common (internal) documentation
      • Go through the linguistic documentation, tidy up as many loose ends as possible.
      • Ditto for the technical parts
    • User/public oriented documentation
      • Plan how much there should be, what we should include
      • Write it, translate it.
      • Overall project plan - should this be translated?
      • Official budget - translation?
      • More user oriented documentation: how to give feedback, deliver texts, more?
    • Language policy
      • Our site needs a language policy: What content should be in what languages?
      • Our language policy should be documented on the home page.
      • Project goal: keep the costs and workload down, but the information value up
      • Language policy of the disamb site: Everything in eng and sme, except the sma and smj transducer pages, these are in eng and nno.
      • We continue the discussion of the language policy in the newsgroup
    • Moral, method, whatever re documentation:
      • Document what you do, when you do it.

4. Corpus gathering

Børre has been in contact with Áššu, and they will provide us with both corrected and uncorrected versions of their texts each Friday.

Børre and Sjur discussed the legal infrastructure on the way back home on Friday. Among other things, they discussed the pros and cons of the Helsinki and the Oslo models. The Helsinki model is outlined below (the different licenses are numbered 1-3; 4 is an attachment to 3):

Samediggi  ---------- Data central
(collector)      2       (UiT)
  |          \             |
1 |            4\          |3
  |                \       |
Author/               User / researcher
Publisher                (end user)
(text owners)

The Oslo model: Compared to the above, the collector and the data central are one and the same.

Other differences:

  • Oslo:
    • remove some parts of the texts, to make it impossible to duplicate whole works
    • a simpler, one-page contract, less "legalese" to understand
  • Helsinki:
    • license whole texts
    • a more thorough license text, much more "legalese", but also a clearer division of responsibility and legal relations between the parties involved
    • easier to regulate the use of the texts

We will go for a variant of the Helsinki model.

Actions:

  • we have received contract 1 in Finnish, Sjur will scan it and give it to Tomi/Maaren for rough translation
  • the rough translation should be given to the Sámediggi lawyer, who will reformulate it into correct "legalese"
  • do we need to have official translations of the contract in all relevant languages? Which languages?
  • Sjur will get copies of 2-4 from Kimmo Koskenniemi or CSC
  • licenses 2 and 3 can wait, 1 is important now for the text collecting

5. Corpus infrastructure

We did some testing in Guovdageaidnu. Using the corpus infrastructure is quite easy; some points for improvement were discussed.

Feature requests can be sent to the newsgroup. Possible changes to be evaluated:

  • include titles as default?
  • include lists as default?
  • remove text in other languages than the one you are working on? Should be implemented.

Seen from a linguistic point of view, we need as much text as possible. For some tasks we may analyse the texts automatically, and here we need large quantities. For other tasks we will have to analyse manually; here we will (often) have much text to choose from, and we would like to analyse only the ordinary text paragraphs.

6. Linguistics

Normative questions to be clarified:

  • eastern / western forms to be included / excluded from the norm as in the proofing tools
  • what separator to use between an abbreviation and its case endings when inflected: NRK-as, NRKas, NRK'as, NRK:as, NRKs, NRK:s. We believed NRK:as to be correct, but need either:
    • confirmation of who decided this, where and when, if so (with a reference to be included in our documentation); or
    • an official decision from a normative body
      • Thomas will check the situation
  • more issues?
    • Vocabulary: Are there words that should not be accepted by the speller as Sámi?
    • Loan words: Are there procedures for how to spell loan words (sjarmeret, pensjonista)?
    • Foreign place names: How should they be spelled?
    • Kárášjoht nisu. Truncated 3-part compounds written as two words. Can they be treated like that, or should they always be written together?
    • Cf. the Bugzilla base.
  • We should bring all outstanding issues to the giellaráđđi meeting in June. Maaren will bring those issues there (and formulate the papers in preparation for the meeting), and Thomas will help her :-)
  • The deadline for adding issues to the Giellaráđđi meeting is next Tuesday, so we have time until the next meeting to find more topics for them.

7. Term db

Bug fixing, cosmetic tuning.

8. Other issues

8.1 Cgi-bin scripts

Pekka tried to use our form-based tools, which didn't work, and the feedback link pointed to a non-working address. It should go to giellatekno@uit.no. The part "-utf8" was removed from our scripts as part of last week's update. We must keep an eye on that. The digraph thing is still not working. Will be added to Bugzilla.

8.2 Trond about the faculty:

  • The faculty will give us 2 places in a reading-room, with Macs and net access; they can be used for student workers
  • There was no new office, but they will refurnish one of the existing ones, to make room for two persons in the same office
  • To be able to get more offices, Trond might need to look outside the faculty
  • The contract with the University will have to be reformulated/renegotiated to reflect the upcoming situation with 3 persons in Tromsø, instead of the planned 2.
  • Most likely a new office will be in the same or a nearby building

9. Summary, task list

Børre:

  • Disamb documentation with Trond.
  • Contact Thor Øivind
  • Add corpus texts that are unproblematic from a juridical point of view
  • Contact people about Lule Sámi texts
  • Contact skolelinux and Knut Hofland about webcrawler.
  • Finish the setup of divvun.no
  • Contact Roy Dragseth about cvs-commit mailing
  • continue Forrest i18n
  • continue http://giellatekno.uit.no, encoding problems

Maaren:

  • continue missing list
  • find and prepare issues for the language board meeting
  • translate Helsinki contract into rough Norwegian; send it to Sjur, then lawyer

Sjur:

  • Scan and OCR the Helsinki contract; mail it to Maaren
  • Ask Anne-Britt to update the contract with the University (we are getting 3, not 2, persons there from the fall)
  • write the LIST option to our test tools, as requested in the discussion group.
  • complete the action summary after our half-year evaluation
  • fix risten.no bugs
  • contact Kimmo Koskenniemi about kwic-snt and UTF-8 bug
  • voice group-chat not working to Sámediggi - should be fixed

Thomas:

  • Find issues for the giellaráđđi čoahkkin (the language board meeting)
  • help Maaren prepare those same issues
  • Continue with the verbs

Tomi:

  • Start looking at the smj infrastructure, turn it into utf-8, look at whether updates and improvements done for sme should be done for smj as well, in order to get a uniform infrastructure.
  • update catxml to remove unwanted language
  • Convert the texts in the sme/corp/correct/ directory to utf-8.
  • Work on the nonrec transducer: Get the nonrec files out of src/ and into int/, get the "make nonrec" process up and running.
  • Look into shortening of three-part compounds, middle part
  • decapitalisation of proper nouns when (if?) compounded, and when derived (the oslolaš issue)

Trond:

  • Disamb documentation with Børre.
  • Systematize the linguistic issues ahead of the Language Board meeting
  • Add cgi-bin errors to Bugzilla.
  • continue the office discussion with the faculty
  • Linguistic work.
  • computational linguistics, with Tomi.

All:

  • The www.divvun.no release.
  • The Bugzilla bugs, and a quip or two?
  • continue the discussion on the language policy of divvun.no in the newsgroup

10. Next meeting, closing

06.06.2005 10.00

Closed at 12.52