Meeting_2005-09-12
Meeting setup
- Date: 12.09.2005
- Time: 10.00 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- Speller infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:12.
Present: Maaren, Saara, Sjur, Thomas, Tomi, Trond
Absent: Børre
Main secretary: Trond
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- Finish crontab specification for the cvs update/export script Tomi made
- Worked on it, it progresses, but there's still an issue open.
- reopen the jspwiki + UTF-8 issue
- Added new issue to the Forrest issue tracker
- Add issue to forrest issue tracker about utf-8 ihtml documents.
- Done
- Contact Svenska bibelsällskapet
- The Lule Sámi NT is on Olavi Korhonen's computer. There was some uncertainty
- discuss with Anders Kintel about possible cooperation
- Not done
- Follow up on CVS mailing:
- set up Maaren
- Meet up with Trond about directory structure
- Done, but needs more work - the notes from Helsinki did not contain
- Contact oahpahusossodat and the rest of the SD about texts
- Not done
- Fixing the machine for the new coworker
- Mostly done
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
- done a little bit
- Got mainly through the missing list from risten.no
- Start working on grammatical issues with Thomas and Trond???
- Not done.
Saara
- Get acquainted with the project status quo
- almost done
- Look at the corpus infrastructure issue
- work started
- Look at the corpus interface issue with Lars
- not done
Sjur
- risten.no bugs and fixes
- Nothing this week
- complete the action summary after our half-year evaluation
- Not done
- follow up on:
- voice group-chat not working to Sámediggi
- requires a new firewall, Geir Kaaby will ask for a cost
- Maaren has problems with SubEthaEdit (can't connect)
- It is now finally working again
- To the board:
- write proposal for permanent maintenance organisation
- Done
- write draft specification for the outsourced tasks
- continued, not finished
- write half-yearly project report with progress and budget status
- started, not finished
- write agenda
- Done
- Deadline for the board tasks: 3 weeks ahead of the meeting (the meeting is
- project planning with Trond
- Not done
- Other things done:
- Contacted Øystein Johannessen about synchronised Sámi - Norwegian
- wrote e-mail to the SFST author with suggestions about how to integrate
- read through and commented on an article by Trond about our projects
- got a working phone and Internet connection :-)
Thomas
- work on Lule Sami compounding and derivation
- finished with deverbals, now working with denominals
- Look at Linguistic bugs with Trond.
- looked at some, solved some
- Work on the name agreement with "Norge digitalt" with Trond
- forwarded it to Sjur
Tomi
- Aspell: Continue working on the affix file
- Problems with affixfile encoding (Latin-6)
- three-part compounding
- Not done
- Add downcasing to makefile and CVS
- Done
- corpus infrastructure: dtd location (both public and internal)
- corpus infrastructure: file and dir organisation
- Removed circularity from non-recursive transducer
Trond
- Work on the bug list (Lule Sámi).
- Done some work.
- Work on compounds (three-part, with Tomi)
- Not done.
- Work on the corpus interface (with Lars)
- Had a short discussion with Lars and Saara, more work needed.
- Corpus infrastructure: dtd location
- Not done. The issue is not only the dtd-s.
- Work on the name agreement with "Norge digitalt" with Thomas
- Not done. The problem was that we couldn't read the CD they had sent us,
- Look at the linguistic aspects of the speller clitics, with
- Had a look at it, will discuss it with Tomi.
- Get the new version of the New Testament
- Have an unofficial version, still not the official one.
- Check Hans-Ragnars names.
- Done. They are multilingual, and on my machine.
- New coworker
- Work in progress. Vesa Guttorm will be temporarily hired for the rest of this year.
- translate contract
- Done a first translation.
- check the new giellatekno site
- Had a look at it.
- project planning with Sjur
- Not done.
- Most of the week has been spent preparing a presentation of our project for a conference in Bolzano.
3. Documentation
Documentation tasks:
- Add documentation on our corpus infrastructure and our corpus work in general
- add/update Aspell documentation (Tomi)
- finish divvun2web script (Børre)
- as always: document what you're doing :-) (all)
4. Corpus gathering
Since last meeting:
- Børre has updated info on smj NT
- The Helsinki contract has been translated.
- Someone other than Trond (Sjur or Børre) reads through the translation, and,
From last meeting:
Børre, Trond and Sjur had their meeting, and the Helsinki contract is quite good
Paths forward: We have a contract suggestion. Sjur and Børre should start the
How to proceed:
- Get the contract suggestion ready
- Translated part 1 ok, part 2 and 3 missing. Done this week
- Get part 4 from Kimmo, and translate it
- Contact our lawyers, at SD and UIT (today, tomorrow).
- When the Norwegian versions of the contracts are ready, make
- Approach the text owners (see ordered list below)
Independent of the contract work
- Bible: The new testament (Trond)
- Bureaucratic text:
- Sámi Parliament (Børre)
- Sámi Oahpahusráđđi (Børre)
- KRD (Børre, check whether we miss texts (discuss with Trond))
- the Sámi municipalities (Børre)
- Textbooks
- To the extent that text can be got directly from SO.
After the contracts are ready
Sjur and Børre should probably take a Tour-de-Sápmi, and meet with the
The tour should be planned, not in this meeting, but before the contracts
- Commercially published texts
- Author organisations' meetings
- Key authors one by one
- (list of author names) Kerttu Vuolab, Kirsi Paltto,
- Iđut and key authors there (Børre)
- Davvi Girji and key authors there
- Newspaper text:
- Sámi Instituhtta's (for the old archive of Min Áigi and Áššu)
- Áššu has been making a CD since the end of May, there should be a pile
- Min Áigi
List of texts with lower priority (to be gathered when the above list is
- the Sámi municipalities,
- Authors with smaller production
- Textbooks
5. Corpus infrastructure
Do documentation.
Naming conventions and directory structure
We have a decision from Helsinki:
- have the same directory structure in all three levels, and we also decided
- Path forward: Tomi and Trond to implement the directory structure
We do not have any notes from our Helsinki meeting (they were left on the
There are three directories, with the same substructure (a 6-way partition
orig
  (substructure)
    filename.doc, filename.html, filename.pdf
int
  (substructure)
    filename.int.xml
    filename.xsl
gt (and we want a new name for gt)
  (substructure)
    filename.xml
There is a substructure division according to genre:
bible (NT, OT, perhaps other liturgical texts)
newspaper
  Min Áigi
  Áššu
  Other
fiction
administrative
  central (Oslo, Stockholm, Helsinki)
  samediggi (Kárášjohka, Giron, Anár)
  municipalities
factual (educational)
legal
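The three-level layout above can be sketched as a small path-mapping helper. This is only an illustration, assuming the corp/<language>/<level>/<genre>/ layout and the file suffixes listed in these notes; the function name and the sample path are hypothetical, and "gt" is kept as the level name although a new name was wanted.

```python
from pathlib import PurePosixPath


def sibling_paths(orig_path: str) -> dict:
    """Return the int/ and gt/ counterparts of a file stored under orig/."""
    path = PurePosixPath(orig_path)
    parts = list(path.parts)
    if "orig" not in parts:
        raise ValueError(f"not under orig/: {orig_path}")
    idx = parts.index("orig")
    stem = path.stem  # file name without its original extension

    def in_level(level: str, filename: str) -> str:
        # Same substructure below the level directory, different level and file name.
        return str(PurePosixPath(*parts[:idx], level, *parts[idx + 1:-1], filename))

    return {
        "int_xml": in_level("int", f"{stem}.int.xml"),
        "int_xsl": in_level("int", f"{stem}.xsl"),
        "gt_xml": in_level("gt", f"{stem}.xml"),
    }


# Hypothetical example file in the newspaper genre.
paths = sibling_paths("corp/sme/orig/newspaper/article.doc")
print(paths["gt_xml"])
```

Keeping the substructure identical at all three levels means the mapping needs no lookup table, only a swap of the level directory and the file suffix.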
For the linguistic search interface, all texts will probably be published
Things to do next, and persons to do it:
- Rewrite the corpus directory (Børre)
- Document the corpus directory (Børre)
- Continue the work on translating texts from orig/ via int/ to gt/.
- Make a sister catalogue for smj, but with a completely flat structure within
- corp/sme/(orig/int/gt)
- corp/smj/(orig/int/gt)
- Document the xsl conversion and scripts (Tomi)
- Make conversion for html documents (Tomi)
- Start looking at conversion of pdf documents (Saara)
6. Linguistics
Note: The Bugzilla bug categories for lexica and morphophonology are now
North Sámi
- three-part compounds issue still open
- Johnny Andersen has written a letter to us on the treatment of Sámi place
- Sjur has written an e-mail to the UFD contact person, Øystein Johannessen
- New place names received, should be added to our lexicons
- The problem is that the CD from Statens Kartverk is unreadable.
- Other names should be added.
- Place names should keep their cross-lingual alignment in the lexicon
- Propernoun lexicon structure:
- We need to discuss our lexicon structure for the proper nouns. Should we have
- Geographical names
- Personal names
Names are inherently multilingual as well as cross-lingual. Cf. Appendices B
Examples of place names:
Karasjok Produkter deatnulačča Nils Porsangera (82) go máhtii eanet deatnulačča Nils Porsangera (82) go máhtii eanet juoigi Nils Porsangera go máhtii eanet Deanu drosjeeaiggát (NAF avd. Hammerfest-Karasjok 1984: 21).
Example of person names:
Báđár - Paadar
Guhtur - Guttorm
Dámmot - Blind
Bieská - Pieska
Bieskán/Bieski - Pieski
Dommá - Tommi
Duomis - Thomas
Niilas - Nils
Duommá - Thomas
A first step towards an xml infrastructure for a language-independent
named_entity Porsáŋgu
  semantic class information
    place name
  sme: Porsáŋgu
    continuation lexica
      norw stem and norw gr info
      sme stem and sme gr info
      ...
  nob: Porsanger
    continuation lexica
      norw stem and norw gr info
      sme stem and sme gr info
      ...
  fin: Porsanki
    continuation lexica
      norw stem and norw gr info
      sme stem and sme gr info
      ...
named_entity Porsanger
  semantic class information
    person name
  name_lg1 / all
    continuation lexica
      norw stem and norw gr info
      sme stem and sme gr info
      ...
  name_lg2 (-)
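As a rough illustration of what one entry in such an xml name base might look like, here is a minimal sketch. The element and attribute names (semclass, form, lang, id) are assumptions made for this example only; the actual format and DTD were still to be decided by the name project.

```python
import xml.etree.ElementTree as ET


def build_name_entry(entity_id, sem_class, forms):
    """Build one <named_entity> element; `forms` maps a language code to a name form."""
    entry = ET.Element("named_entity", {"id": entity_id})
    # Semantic class ("place name", "person name", ...) is stored once per entity.
    ET.SubElement(entry, "semclass").text = sem_class
    # One <form> per language keeps the cross-lingual alignment in a single entry.
    for lang, form in forms.items():
        ET.SubElement(entry, "form", {"lang": lang}).text = form
    return entry


entry = build_name_entry(
    "Porsáŋgu",
    "place name",
    {"sme": "Porsáŋgu", "nob": "Porsanger", "fin": "Porsanki"},
)
print(ET.tostring(entry, encoding="unicode"))
```

Continuation lexica and per-language grammatical information are left out here; they would presumably hang off each language-specific form.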
Conclusion: We need a name project.
Issues:
- What format do we want for our common base?
- What semantic information do we want to add to the names?
Planning:
- Who shall work on this?
- Name lexicon work group: Sjur, Trond, Maaren?
- What time plan shall it have?
- Kickoff at Oct 05, in Kautokeino
- Plans ready not too long after that
- Do what we have to do during the winter
- Implement the name base as input for our parsers at some later point.
Classification:
- Preparatory work
- Talk to Kari Pitkänen in Tampere, who did a semantic classification for
- Look into other projects (Maaren)
- Make a draft (each) of What We Ideally Want (Maaren, Sjur, Trond)
- Substantial work
- Make a semantic theory
- Make a proposal with DTD and examples
Making the new base
- Identify status quo and a goal
- Write tools for semiautomatic transition
- Do a pilot test
- Move (parts of?) the name lexicon to the new format
- Part of the manual work could perhaps be given to part-time-workers
Incorporating the new base in the parser
- decide on a proper location for the name base
- conversion tool xml -> lexc
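The xml -> lexc conversion tool could start out as something like the sketch below. It assumes a hypothetical xml format in which each language-specific form carries a contlex attribute naming its continuation lexicon; the sample data, the attribute names and the lexicon names are all invented for illustration and are not the project's format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in an assumed format; the real DTD was still to be drafted.
SAMPLE = """
<names>
  <named_entity id="Porsáŋgu">
    <form lang="sme" contlex="PlaceSme">Porsáŋgu</form>
    <form lang="nob" contlex="PlaceNob">Porsanger</form>
  </named_entity>
</names>
""".strip()


def xml_to_lexc(xml_text, lang, lexicon_name="ProperNouns"):
    """Emit a lexc lexicon holding the forms of one language from the name base."""
    root = ET.fromstring(xml_text)
    lines = [f"LEXICON {lexicon_name}"]
    for form in root.iter("form"):
        if form.get("lang") == lang:
            # lexc entry: stem, continuation lexicon, terminating semicolon.
            lines.append(f"{form.text} {form.get('contlex')} ;")
    return "\n".join(lines)


print(xml_to_lexc(SAMPLE, "sme"))
```

Generating one lexc file per language from the same base is what would let the names keep their cross-lingual alignment while still feeding the existing parsers.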
TODO:
- A Kickoff meeting in Kautokeino.
- Before the kickoff meeting, Sjur, Maaren and Trond to do some preparatory work
Lule Sámi
Lexicon work
The goal is to establish a mode of work where people in Árran and in the
Work on Lule Sámi in general
We also need input from the other persons working with Lule Sámi when it
Status quo on the parser:
- All the major POSes have been covered
- closed POSes have been checked
- compounding and derivation
- Thomas has finished with deverbals, now working with denominals
- Suffix boundary symbol has not been added, we are not sure whether we should
TODO:
- Continue the work on the lexicon (Børre, Thomas, Sjur)
- Plan a meeting between our Lule Sámi team and the people at Árran working
- Carry on the linguistic work (Thomas, Trond)
Numerals
We need
- An empirical overview
- Numeral generation
- Numeral inflection
- Numerals as parts of compounds
- A clear concept of how we want to treat them
- Tagging
- A treatment
TODO:
- Make a documentation chapter on numerals, identifying the open linguistic issues
- Look at implementation
Action plan: Trond and Maaren look into it.
7. Speller infrastructure
Aspell
Write documentation here as well.
The munch-list is working, and the affix file is improving. See previous meeting memo.
The problem with the affix file was that it did not accept UTF-8. It accepted Latin 4,
There were problems with the Latin 6 encoding of the suffix file. After updating to cvs,
- A possible workaround is to keep the affix list stored as UTF-8, but to
- We should also contact the aspell developer and make them (him) fix this bug.
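One possible reading of the workaround is to keep the master affix list in UTF-8 and produce a Latin 6 (ISO 8859-10) copy at build time. The notes do not spell out the conversion step, so the sketch below is an assumption, and the function name is hypothetical.

```python
def utf8_to_latin6(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO 8859-10 (Latin 6).

    Raises UnicodeEncodeError if the text contains a character that
    Latin 6 cannot represent, so bad input is caught at build time.
    """
    return data.decode("utf-8").encode("iso8859_10")


# The North Sámi letters outside ASCII all exist in Latin 6,
# so each becomes exactly one byte.
sample = "čđŋšŧžá".encode("utf-8")
converted = utf8_to_latin6(sample)
print(len(sample), len(converted))
```

In a makefile this would be one extra step between the UTF-8 source and the file handed to Aspell, which keeps the CVS copy readable in any UTF-8 editor.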
Issues:
- The phonetic file should be systematically looked into.
- Check that it works
- Add more correspondences on an impressionistic basis
- Start work on collecting systematic spelling errors:
- Our in-house file typos.txt
- The soon-to-arrive error texts from newspapers
- The holes in the affix list should be mended
- Adjectives still to be done
- The munching process gets killed at cochise today
- Persons to talk to are Roy Dragseth and Steinar Trædal-Henden. Tomi contacts them.
- We should, at some point, evaluate whether this is The Correct Approach to
- Affix file UTF-8 problem should be checked and reported.
- Contact the Aspell author and ask for updates/fixes
- The clitics issue: Today we have a manually created affix file in order to
- stems + affixlist + cliticlist
- where all 11 clitics (found in the K lexicon) are mapped onto each and every affix.
- Today, substandard forms are marked as "!SUB". The speller should not include
- Documentation
- We must create subcomponents under the Speller
- one for Aspell, another for MySpell, Hunspell, etc.
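The clitics issue in the list above (every clitic mapped onto each and every affix) amounts to a cross product, which could be generated rather than maintained by hand. The sketch below only illustrates that idea; the affix and clitic strings are placeholders, not entries from the K lexicon, and the function name is hypothetical.

```python
from itertools import product


def expand_affixes(affixes, clitics):
    """Return every affix on its own, then every affix followed by each clitic."""
    expanded = list(affixes)
    expanded.extend(affix + clitic for affix, clitic in product(affixes, clitics))
    return expanded


# Placeholder data for illustration only.
result = expand_affixes(["t", "id"], ["ge", "go"])
print(result)  # ['t', 'id', 'tge', 'tgo', 'idge', 'idgo']
```

With 11 clitics every affix yields 12 entries, so the generated list grows twelvefold, which is one reason to keep clitics as a separate list in the source and expand only at build time.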
TODO:
- Investigate
- Write procedures for doing so, for the .
- Directory structure:
(spell) (src) (bin) (dist)
The conversion from aspell to myspell will work trivially as soon as the myspell
Issue left open.
Hunspell
Hunspell is presently already working with OOo, and is a much better speller
Issue left open.
Other engines
Børre and Sjur had a long discussion with the author of the SFST library/tool
Sjur repeated the suggestion from our SFST man that we could do this ourselves. The whole SFST
The question is:
- Do we want to do that?
- Would Tomi be able to do it alone, or do we need more resources for it?
- The issue has wide-reaching implications - basically that of replacing the
TODO:
8. Other
Technical issues
- Fixing the machine for the new coworker (Børre)
- Mostly done
- The Mac OS / Perl bug (at least Trond and Sjur have it):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82.
- Not done, these issues are still open.
- Sjur has an unsolved Backspace + UTF-8 issue
- 27 open bugs - 2 down from last week but still too many! Have a look at what you can fix.
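The Perl message quoted under the technical issues above typically appears when bytes that are not valid UTF-8 (for instance Latin 1 text, where 0xC4 is "Ä") are read as UTF-8. The cause in this case is not established in these notes, so the following is only a diagnostic sketch, written in Python rather than as a patch to script/preprocess, for locating such bytes in an input file. The function name is hypothetical.

```python
def find_invalid_utf8(data: bytes):
    """Return (line_number, byte_offset_in_line) for each line that is not valid UTF-8."""
    problems = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first offending byte in this line.
            problems.append((lineno, err.start))
    return problems


# Line 2 holds a lone 0xC4 byte, which cannot start a valid UTF-8 sequence here.
sample = b"ok\n" + b"\xc4rran\n" + "Árran\n".encode("utf-8")
print(find_invalid_utf8(sample))  # [(2, 0)]
```

Running a check like this over the input before preprocessing would show whether the problem lies in the data or in the Perl installation.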
9. Summary, task list
Børre
- Finish crontab specification for the cvs update/export script Tomi made
- reopen the jspwiki + UTF-8 issue
- Add issue to forrest issue tracker about utf-8 ihtml documents.
- Contact Svenska bibelsällskapet
- discuss with Anders Kintel about possible cooperation
- Follow up on CVS mailing:
- set up Maaren
- Meet up with Trond about directory structure
- Contact oahpahusossodat and the rest of the SD about texts
- Fixing the machine for the new coworker
- Document the corpus infrastructure
- Read through the Helsinki contracts (new translations)
- Reorganise the directory structure
- Continue converting text from input format to our xml
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
- shall get mainly through the missing list from risten.no this week
- Start working on grammatical issues with Thomas and Trond
- Work on the name project with Trond and Sjur
- Start looking at normativity issues
- Work on the numerals project with Trond
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Convert texts from .doc to .xml, to get a grasp of our corpus format
- Have a look at the pdf-to-xml issue (known problem: Keep the Sámi
Sjur
- risten.no bugs and fixes
- complete the action summary after our half-year evaluation
- follow up on:
- voice group-chat not working to Sámediggi
- Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
- To the board:
- write draft specification for the outsourced tasks
- write half-yearly project report with progress and budget status
- Deadline for the board tasks: 3 weeks ahead of the meeting (the meeting is
- project planning with Trond
- Work on the name project with Trond and Maaren
- Prepare for a Lule Sámi meeting with Árran
- Follow up on place names from Norge Digitalt
- Read through the Helsinki contracts (new translations)
- Talk to Bitte about the Lule Sámi lexicon
- Evaluate SFST as speller (and analyzer) lexicon
Thomas
- work on Lule Sami compounding and derivation
- Look at Linguistic bugs with Trond.
- Prepare for a Lule Sámi meeting with Árran
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- corpus infrastructure: file and dir organisation
- Document aspell and corpus infrastructure
- Add html-to-xml conversion to corpus infra
Trond
- ( Trond will be absent at next week's meeting, or perhaps
- Work on the bug list.
- Work on compounds (three-part, with Tomi)
- Work on the corpus interface (with Lars and Saara)
- Work on the name agreement with "Norge digitalt" with Thomas
- Look at the linguistic aspects of the speller clitics, with
- Get the new version of the New Testament
- Introduce the new coworker to the work routines
- project planning with Sjur
- Work on the name project with Maaren and Sjur
- Prepare for a Lule Sámi meeting with Árran
- Work on the numerals project with Maaren
10. Next meeting, closing
19.09.2005 10:00
Closed at 13:20