Meeting_2006-08-28
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Public tender
- 10. Tromsø meeting round-up
- 11. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 28.08.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:47.
Present: Saara, Sjur, Thomas, Børre, Tomi, Trond
Absent: Maaren
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- contact Ája (Kåfjord)
- send contracts to Čálliid Lágádus
- All the above not done
- Contact Richard Valkepää at NSI about older Min Áigi and Áššu files.
- Not heard anything from him
- corpus conversion:
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- All above not done
- Move Norwegian documents in Min Áigi from sme to nob
- Not done
- corpus access:
- possibly deploy the user account form as an HTML form
- not done
- make a test user
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
- Not done
- set up Bugzilla automatic reminders for open issues
- My attempt doesn't work
- create document & document entry for name double-tagging
- ?
- Update forrests to latest svn version
- Done before vacations
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- tested
- Implement parallel corpus upload in web upload script
- not done, waiting for decision of the cgi-bin scripts
- Install Gobby
- done
- Test the aligners once again
- done
- refine the xml output of the xml-tagged analyses
- done
- convert or adapt the received PHP for paradigm generation to our needs
- discussed possible solutions in the news
- remove headers and footers from antiword documents, other improvements
- antiword done, pdf-documents are to be fixed next
- fix bugs!
Sjur
- public tender:
- review letters to tenderers, contract for subcontractor
- contract finished, signed and countersigned. First common meeting held.
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
- implement editing functions
- more done; plans for the rest are made.
- finalise refactoring for multiple collections (regular search interface)
- mostly done, but a bug in eXist or Cocoon is blocking the final step
- review user and admin documentation for corpus access
- not done
- write user account form, probably ask for copy of existing ones from the IT
- not done
- fix bugs!
Thomas
- investigate productivity of even-syllable Actio compounding
- done
- investigate and identify under which conditions even-syllable Actio
- done
- discuss findings with the rest of us
- done
- add proper numeral analysis/treatment to smj
- done to the same extent as in sme
- add loanwords (e.g. Latin -ere verbs) to smj
- done
- sme G3 issue
- started with a load of help from my friends
- review user documentation for corpus access
- not done
- create smj abbr file
- copied the sme abbr file and lulefied it
- review the document
- done
- Redirected following three syllable verbs and prevent them from being
- Reflexives on -dit
- Reciprocals on -dit, -(a)lit
- Momentatives on -dit, -(a)lit, -ádit, -ihit
- Frequentatives on -(a)lit, -(u)hit, -dit
- Continuatives on -dit, -(u)hit, -nit
- Inchoatives in -nit
- Translatives on -dit
- Essives on -dit and -stit
- Causatives on -dit, -stit
- done
- find and study all derived verbs in our corpus (depends on Trond)
- not done
- suggest which derivations could be generated (depends on Trond)
- not done
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- read aligner docu, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- No bible work during summer, not received anything, will have to look into this again.
- install aligner, test it and give feedback
- Aligner installed; it is good, but slow and operates manually. Will call Bergen today.
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- Not done.
- make shell script wrappers for the most common commands for user friendliness
- Some done, not others.
- write user account form, probably ask for copy of existing ones from the IT
- not done
- write documentation for our bound users, with pointers to the ordinary
- not done
- write documentation on double-tagging names
- not done
- discuss web-only user access management with Oslo
- not done
- change tagging of derived stems in the disamb output, to facilitate much
- prepared for the conversion, not done the actual change.
- fix bugs!
3. Documentation
TODO:
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
4. Corpus gathering
Nothing has happened during the summer.
Olavi Korhonen's Lule Sámi dictionary.
Waiting for an answer.
Bible texts
Will have a second round with the Word versions.
TODO:
- get nob and nno NT and OT in paratext format. (Trond)
- convert smj NT to paratext (Børre)
- convert fin, swe to paratext or directly to our XML (Børre)
Kåfjord
TODO:
- contact Ája (Børre)
- talk to Lene about Kåfjord (Børre)
Sámi Instituhtta
When will we get the corpus? We still don't know; Børre will contact Richard Valkepää at NSI again.
TODO:
- contact NSI again (Børre)
Čálliid Lágádus
http://www.calliidlagadus.org/
TODO:
- send contracts (Børre)
Árran
TODO:
- contact Bård Eriksen again (Børre)
5. Corpus infrastructure
General
What we would like: A make-type system that kept track of the file.xml and
At first sight, this sounds like a good idea.
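A minimal sketch of the make-style idea, assuming the rule is "reconvert a file when its converted version is missing or older than any of its inputs". The function name and the example file names are hypothetical; the memo does not say which input files would be tracked.

```python
import os


def needs_rebuild(target, sources):
    """Return True if `target` is missing or older than any file in `sources`.

    This is the same timestamp comparison that make uses to decide
    whether a target is out of date.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Any input modified after the target was written makes it stale.
    return any(os.path.getmtime(src) > target_mtime for src in sources)
```

With this check, a conversion run could skip every file.xml for which needs_rebuild("file.xml", ["original.doc"]) is false and only reconvert the rest.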
TODO:
- remove headers and footers from antiword documents, other improvements
- give the make issue a second sight (Tomi)
- comment Trond's suggestion in the news (all)
User accounts and access
For details, see a previous meeting memo, as well as the
Shell access
TODO:
- export to /opt (with cron) tools that the project team members find in
- ccat
- make shell script wrappers for the most common commands for user friendliness
- (first version of first script, teaksta.sh, was checked in, but it is still
- write user account form, probably ask for copy of existing ones from the IT
- possibly deploy the user account form as an HTML form (Børre)
- write documentation for our bound users, with pointers to the ordinary
- write documentation for how to apply for a user account (where's the form, to
- make our own guidelines for the user application processing (Børre)
- make a test user (Børre)
- test corpus access as test user (Trond)
Web browser access
TODO:
- discuss with Oslo (Trond)
- delay other tasks until we are ready to go public?
- user management for access to bound texts
- short user guide needed before going public (either write one or take whatever has been made in Oslo) (Trond)
More texts to the graphical corpus interface:
TODO:
- add text to the server (Lars)
Aligner
The aligner aligns fine, better than its competitors. Unfortunately it is slow, and dependent upon manual input.
TODO:
- contact Bergen to discuss these issues, ask for a non-manual interface, etc.
Language recognition
Still waiting for more smj and sma text to improve it. We need South
Corpus summary
The time-based statistics are still missing.
TODO:
- add time-based display as a feature request to Bugzilla (Sjur)
6. Infrastructure
Xerox tools wrapped as servers
To improve throughput and response time on heavy loads, it would really be nice
TODO:
- decide the programming language to use (Saara)
- find some (almost-)ready-to-use code to build on (Saara)
- implement it (Saara)
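One possible shape for the wrapping described above, as a hedged sketch: keep a single analysis process alive and feed it one line per request, so the start-up cost is paid once. It assumes the wrapped tool reads one line on stdin and answers with exactly one line on stdout, which may not hold for the real tools. LineServer and ECHO_CHILD are hypothetical names, and ECHO_CHILD is a stand-in child process, not a Xerox tool. The programming language is still to be decided, so Python here is only for illustration.

```python
import subprocess
import sys


class LineServer:
    """Keep one child process running and send it one line per request."""

    def __init__(self, command):
        # Line-buffered text pipes so each request is delivered immediately.
        self.proc = subprocess.Popen(
            command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def ask(self, line):
        """Send one input line and return the single-line reply."""
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        """Close stdin so the child sees end-of-file, then wait for it."""
        self.proc.stdin.close()
        self.proc.wait()


# Stand-in child (assumption): upper-cases each input line and flushes.
ECHO_CHILD = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n",
]
```

A caller would create LineServer(ECHO_CHILD) once, call ask() for each word, and call close() at shutdown. A tool that returns several lines per input would need an end-of-answer marker instead of a single readline().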
Paradigm generation
Goal: Reuse Greenlandic code for paradigm generation.
Saara has given a report on the PHP code in News. Please read.
Conclusion:
Hyphenator
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
- Update the sma hyphenator rule set with the insights gained from smj updates
Automatic Bugzilla reminder for untouched bugs
TODO:
- give mail reminders a second try; ask Thor-Øivind for help if needed
M4
Tomi and Saara did a lot in Tromsø. How far is it now? Probably finished today!
TODO:
- finish the work, and check it in (Saara)
7. Linguistics
Derivation and spellers like Aspell
To make it easier to extract all derived stems, we should enhance the tags used
Problematic issue: the disamb output will presently give information only
TODO:
- change tagging of derived stems in the disamb output, to facilitate much
- Issues: Rewrite the sme-lex.txt, sme-dis.rle, sme-tdis.rle, twol-sme.txt
For file i, take taglist 33-37 and replace with 39-43, respectively
- find and study all derived verbs in our corpus (Thomas)
- suggest which derivations could be generated (Thomas)
- see source code above, but also consider overgeneration problems, as well as
- lexicalise the rest (Thomas)
Semantic double-tagging of names
The policy needs documentation. Thus:
TODO:
- Make a section under gt/doc/lang/smi/, add a chapter
- write guidelines for annotators wrt. name tagging and put them under Systematic
- make sure all linguists are aware of the guidelines (Trond, Sjur)
- write disamb rules to implement the system above (Trond, Linda)
North Sámi
The following already derived verbs (verbs ending in -šit, -skit, etc.) are not
LEXICON MUITTASJ !Words ending -šit, -skit, -smit, -idit, -ldit, -git and 5-syllables, formerly directed to MUITAL
+V+TV: MUITALStem ;
!These derived verbs have now been redirected to MUITTASJ and similar lexica.
!Reflexives on -dit
!Reciprocals on -dit, -(a)lit
!Momentatives on -dit, -(a)lit, -ádit, -ihit
!Frequentatives on -(a)lit, -(u)hit, -dit
!Continuatives on -dit, -(u)hit, -nit
!Inchoatives in -nit
!Translatives on -dit
!Essives on -dit and -stit
!Causatives on -dit, -stit
Examples:
- muittašit > *muittašallat
TODO:
- investigate and identify under which conditions Actio compounding is possible
- done
- discuss findings with all of us (Thomas)
- done
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- convert roughly 100 smj names from that file (lines 740-843) to XML
- add inc abbr to a new abbr lexicon file (Thomas)
- copied the sme abbr file and lulefied it
- add proper numeral analysis/treatment (Thomas)
- done to the same extent as in sme
- add loanwords (e.g. Latin -ere verbs) (Thomas)
- done
8. Name lexicon infrastructure
Decided in Tromsø:
- add smj proper noun lexicon file to the output
- remove ^ # 0 from the center ID
- replace spaces with underscores in all IDs
- remove occurrence indicator from language IDs: Agalin_1 (the center/concept ID) => Agalin (the language ID), and thus the two Agalins should become one language entry (but two different concept entries)
- store a redundant copy of the center-file semantic information in the language-specific files, for processing speed
- add logging facilities
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
- all names in all languages by default
- tag for excluding/including a name from certain applications
- hide / display ^ during browsing
- future expansion: choose what info to display in the single language browser
- search by (single) language
- done
- display existing language entries when adding a new language to a record
- make searches behave predictably (the hits should be the expected ones)
- done
- add editor to change single, existing entries
Details can be found in the meeting memo (/admin/physical_meetings/tromso-2006-08-propnoun.html).
Language entry example illustrating both the sem-tag on the sense elements, and the removal of the occurrence indicator:
<entry id="Agalin">
  <infl lexc="BERN" />
  <senses>
    <sense sem="plc" ref="Agalin"/>
    <sense sem="sur" ref="Agalin_1"/>
  </senses>
</entry>
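Two of the ID rules decided above (spaces replaced with underscores, and the occurrence indicator dropped when going from a center ID to a language ID) can be sketched as follows. This is an illustration only: the function names are hypothetical, the occurrence indicator is assumed to be a trailing underscore plus digits as in Agalin_1, and the removal of ^ # 0 is left out because the memo does not spell out its exact form.

```python
import re
from collections import defaultdict

# Assumption: the occurrence indicator is a trailing "_<digits>", as in "Agalin_1".
OCCURRENCE = re.compile(r"_\d+$")


def normalise_id(raw):
    """Replace spaces with underscores, as decided for all IDs."""
    return raw.strip().replace(" ", "_")


def language_id(center_id):
    """Derive the language ID by dropping the occurrence indicator."""
    return OCCURRENCE.sub("", center_id)


def group_by_language_id(center_ids):
    """Map each language ID to the center/concept IDs that share it."""
    groups = defaultdict(list)
    for cid in center_ids:
        groups[language_id(cid)].append(cid)
    return dict(groups)
```

For the example in the memo, group_by_language_id(["Agalin", "Agalin_1"]) gives one language entry, Agalin, that points to two concept entries. A name that genuinely ends in an underscore and digits after space replacement would be shortened too, so a real implementation would need a stricter marker.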
TODO:
- improve lexc2xml conversion (Saara)
- add default smj entries
- exclude ^ # 0 from the center ID
- add an empty <log/> element to all entries (center and language files)
- add a last-update attribute to the root element of all files
- finish refactoring for multiple collections in the search interface
- waiting for a bug fix (Tomi is investigating it)
- develop the needed XQueries and UI (Sjur, Tomi)
- data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, nothing useful came up. We need to
9. Public tender
TODO:
- write a contract (mostly done by Finnut, review by Sjur)
- done
- get it signed (Finnut, Lennart Mikkelsen)
- done
10. Tromsø meeting round-up
TODO:
- check in meeting memos (Sjur)
- Polderland questions. Thomas did already send requested info.
- speller development - see the meeting memo. Separate follow-up next week.
- Lule Sámi linguist - Sjur has tried to call a possible candidate, but no
- order AirPort Express to Tromsø (Sjur)
11. Other
Bug fixing
43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Please help Saara with bug 279
Gobby
TODO:
- install Gobby (Trond, Sjur)
- review the document (Thomas)
- done - document accepted :-)
Task lists as iCal entries
TODO:
- update all forrest installations to r430284 (Børre)
cd $FORREST_HOME
svn up -r430284
11. Next meeting, closing
Next meeting 4.9.2006 at 9:30.
Closed at 11:25.
Appendix - task lists for the next week
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- contact Ája (Kåfjord), talk to Lene
- send contracts to Čálliid Lágádus
- contact Richard Valkepää at NSI about older Min Áigi and Áššu files
- contact Bård Eriksen again
- corpus conversion:
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
- possibly deploy the user account form as an HTML form
- make a test user
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
- set up Bugzilla automatic reminders for open issues
- create document & document entry for semantic double-tagging of names (for
- Update forrests to svn version r430284
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement parallel corpus upload in web upload script
- remove headers and footers from pdf documents
- Implement server of the analysis tools.
- Add more languages to the lexc2xml propernoun conversion.
- Refine the namelex output
- finish M4 work
- fix bugs!
Sjur
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections
- implement improvements decided upon in Tromsø
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
- add time-based corpus summary as a feature request to Bugzilla
- check in meeting memos from Tromsø
- start hiring process of linguist and programmer
- order AirPort Express to the Tromsø gang
- install Gobby
- fix bugs!
Thomas
- sme G3 issue
- bug-fixing!
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- review user documentation for corpus access
- find and study all derived verbs in our corpus (depends on Trond)
- suggest which derivations could be generated (depends on Trond)
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- implement improvements decided upon in Tromsø
- read aligner docu, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- Do the sme Der/ change (with Trond)
- consider Trond's suggestion of a makefile for corpus conversion
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- contact Bergen about aligner issues
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- make shell script wrappers for the most common commands for user friendliness
- write user account form, probably ask for copy of existing ones from the IT
- write documentation for our bound users, with pointers to the ordinary
- write documentation on semantic double-tagging of names
- discuss web-only user access management with Oslo
- change tagging of derived stems in the disamb output, to facilitate much
- do the sme Der/ change (with Tomi)
- write short user guide for the corpus web interface
- install Gobby
- fix bugs!