Meeting_2006-09-04
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Tromsø meeting round-up
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 04.09.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:47.
Present: Børre, Sjur, Thomas, Tomi, Trond
Absent: Maaren, Saara
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- contact Ája (Kåfjord), talk to Lene
- send contracts to Čálliid Lágádus
- contact Richard Valkepää at NSI about older Min Áigi and Áššu files
- contact Bård Eriksen again
- corpus conversion:
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
- possibly deploy the user account form as an HTML form
- make a test user
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
- set up Bugzilla automatic reminders for open issues
- create document & document entry for semantic double-tagging of names (for
- Update Forrest installations to svn version r430284
- fix bugs!
- None of the above done, due to illness
- Made Forrest font embedding in pdf work by using absolute paths;
Maaren
- On sick leave
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- Implement parallel corpus upload in web upload script
- remove headers and footers from pdf documents
- not done
- Implement server of the analysis tools.
- not done
- Add more languages to the lexc2xml propernoun conversion.
- not done
- Refine the namelex output
- done according to the spec in the last meeting
- finish M4 work
- done
- implement tools for locating problems in the corpus files
- done some
- fix bugs!
Sjur
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- done some with Trond: we now have a transducer that produces no
- name lexicon:
- implement editing functions
- more work, not finished
- finalise refactoring for multiple collections
- waiting for the CInclude bug to be fixed
- implement improvements decided upon in Tromsø
- some done, still more, see below
- review user and admin documentation for corpus access
- not done
- write user account form, probably ask for copy of existing ones from the IT
- not done
- add time-based corpus summary as a feature request to Bugzilla
- done
- check in meeting memos from Tromsø
- done
- start hiring process of linguist and programmer
- started
- order AirPort Express to the Tromsø gang
- done
- install Gobby (using DarwinPorts)
- done
- fix e-mail address for Thomas
- Trond fixed him an address at the Univ., Sjur has asked Roy at SD
- fix bugs!
- looked at several bugs, and commented / updated them
Thomas
- sme G3 issue
- nothing this week
- bug-fixing!
- fixed some
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- not done
- review user documentation for corpus access
- not done
- find and study all derived verbs in our corpus (depends on Trond)
- not done
- suggest which derivations could be generated (depends on Trond)
- not done
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- not done
- XQuery refactoring and code development for our proper noun editor
- not done
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- not done
- implement improvements decided upon in Tromsø
- not done
- read aligner docu, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- Do the sme Der/ change (with Trond)
- consider Trond's suggestion of a makefile for corpus conversion
- fix bugs!
- looked at the CInclude bug
- no luck
Trond
- better smj NT text
- Not done
- get fin, swe, nob and nno NT and OT in paratext format
- Not done
- contact Bergen about aligner issues
- Done, made progress (many improvements); we are still negotiating a command-line version
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- These were two issues, one is fixed, one to go.
- make shell script wrappers for the most common commands for user-friendliness
- Not done.
- write user account form, probably ask for copy of existing ones from the IT
- Not done
- write documentation for our bound users, with pointers to the ordinary
- Not done
- write documentation on semantic double-tagging of names
- Not done
- discuss web-only user access management with Oslo
- Not done
- change tagging of derived stems in the disamb output, to facilitate much
- Done
- do the sme Der/ change (with Tomi)
- Done
- write short user guide for the corpus web interface
- Not done
- install Gobby (using DarwinPorts)
- Not done
- fix e-mail address for Thomas
- Done, password and user name ready.
- fix bugs!
- Some work.
3. Documentation
The xtdoc/sd part in cvs has a branch i18n-reform that has been i18n'ized.
Also the work on getting the Sámi chars into the PDF output has been done here.
TODO:
- finish i18n work by adding a list of available language versions to each
- make pdf set-up work using relative paths (Børre)
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
4. Corpus gathering
Nothing happened last week.
5. Corpus infrastructure
General
Saara has made the makefile and written documentation for the process:
    cd /usr/local/share/corp
    make LANGUAGE=sme GENRE=facta
or
    make bound/sme/admin/sd/dc_1_04.doc.xml
The default language is "sme" if none is given on the command line. GENRE can be omitted
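As a rough illustration of the invocations above, the following Python sketch assembles the make arguments. The helper name corpus_make_args is hypothetical. Only the LANGUAGE and GENRE variables, the working directory, and the single-file target come from this memo.

```python
def corpus_make_args(language=None, genre=None, target=None):
    """Build the argument list for one corpus conversion run.

    Hypothetical helper: LANGUAGE, GENRE and the file target are taken
    from the memo; the function itself is only an illustration.
    """
    args = ["make"]
    if language is not None:
        # When omitted, the makefile itself falls back to sme.
        args.append(f"LANGUAGE={language}")
    if genre is not None:
        args.append(f"GENRE={genre}")
    if target is not None:
        # A single converted file can be requested as an explicit target.
        args.append(target)
    return args


if __name__ == "__main__":
    print(" ".join(corpus_make_args(language="sme", genre="facta")))
    print(" ".join(corpus_make_args(target="bound/sme/admin/sd/dc_1_04.doc.xml")))
```

The resulting list could be handed to subprocess.run(args, cwd="/usr/local/share/corp") on a machine where the corpus makefile is installed.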
TODO:
- remove headers and footers from antiword documents, other improvements
- fix Min Áigi filenames (Saara)
User accounts and access
For details, see a previous meeting memo, as well as the
Shell access
TODO:
- export to /opt (with cron) tools that the project team members find in
- ccat
- discussion started in news by Saara, please reply and follow-up
- make shell script wrappers for the most common commands for user friendliness
- (first version of first script, teaksta.sh, was checked in, but it is still
- write user account form, probably ask for copy of existing ones from the IT
- possibly deploy the user account form as an HTML form (Børre)
- write documentation for our bound users, with pointers to the ordinary
- write documentation for how to apply for a user account (where's the form, to
- make our own guidelines for the user application processing (Børre)
- make a test user (Børre)
- test corpus access as test user (Trond)
Web browser access
TODO:
- discuss with Oslo (Trond)
- delay other tasks until we are ready to go public?
- user management for access to bound texts
- short user guide needed before going public (either write one or take whatever
More texts to the graphical corpus interface:
TODO:
- add text to the server (Lars)
Aligner
The aligner aligns fine, better than its competitors. Unfortunately it is slow,
TODO:
- contact Bergen to discuss these issues, ask for a non-manual interface, etc.
- discussions with Bergen. The aligner now works automatically, but still
Language recognition
Saara has worked on the issue. Short paragraphs (e.g. phone numbers) are
TODO:
- Get more text of the poorly covered languages (Trond, Børre)
- study the paragraphs of 50 words or fewer, where the errors will be (Trond)
- study the mistakes our recogniser makes today (Trond)
- what about paragraphs with mixed content? Needs more investigation (Trond)
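To support the study of short paragraphs, a minimal sketch along these lines could pull out the candidates for manual inspection. It assumes the paragraphs are already available as plain strings. The function name and the sample data are hypothetical; the limit of 50 words comes from the task list above.

```python
def short_paragraphs(paragraphs, max_words=50):
    """Return (index, word_count, text) for every non-empty paragraph
    that has at most max_words whitespace-separated words."""
    selected = []
    for index, text in enumerate(paragraphs):
        word_count = len(text.split())
        if 0 < word_count <= max_words:
            selected.append((index, word_count, text))
    return selected


if __name__ == "__main__":
    # Hypothetical sample: a phone-number line, a long paragraph, an empty one.
    sample = ["Tel. 12 34 56 78", "sátni " * 60, ""]
    for index, count, text in short_paragraphs(sample):
        print(index, count, text)
```

The selected paragraphs can then be compared against the recogniser's output by hand, which is where the errors are expected to show up.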
Corpus summary
The time-based statistics are still missing.
TODO:
- add time-based display as a feature request to Bugzilla (Sjur)
- done
6. Infrastructure
Xerox tools wrapped as servers
To improve throughput and response time on heavy loads, it would really be nice
TODO:
- decide the programming language to use (Saara)
- find some (almost-)ready-to-use code to build on (Saara)
- implement it (Saara)
- nothing so far
Hyphenator
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
- analyser output now without hyphenation marks. The real mark-up of word
- Update the sma hyphenator rule set with the insights gained from smj updates
Automatic Bugzilla reminder for untouched bugs
TODO:
- give mail reminders a second try; ask Thor-Øivind for help if needed
M4
Setup and infra finished. Now we are ready to start using M4.
- What can we use M4 for? (programmers)
- Select and/or exclude different parts of the twol files.
- Specialised make-targets that depend on a profiled twol (=M4-processed)
- What do we want to use M4 for? (linguists)
- Hyphenation
- Regional diphthong simplification (oahpahe(a)ddjiid)
- shortening in 3-part compounds
- explicit output of G3 mark ' and of allophones e2, o2 etc. for text-to-speech applications
- more?
TODO:
- finish the work, and check it in (Saara)
- Done.
- make speller and hyphenator make targets that utilise M4 to produce normative
7. Linguistics
Derivation and spellers like Aspell
To make it easier to extract all derived stems, we should enhance the tags used
Problematic issue: the disamb output will presently give information only
Only regard derivations for now. Two ways:
- Remove complex verbs and nouns from the lexicon files.
- search stalla
- Turn the lookup2cg evaluation upside down.
Output from our transducer as it is now found below, showing that strategy
albmástallat
albmástallat    albmástallat+V+IV+Inf
albmástallat    albmástallat+V+IV+Ind+Prs+Pl1

Why not: almmái+N+Der/stalla+V...

attástallat
No derivations N->V, only v->V

jeagoheapmi+A+Der/huvva+V     jeagohuvvat
jeagoheapmi+A+Der/huhtti+V    jeagohuhttit
jeagoheapmi+A+Der/hudda+V     jeagohuddat
     ^^^^^^ -> disappears.

Does this happen to all A's on -heapmi? With all derivations?
yes to derivations. When A's compound, only with attr. form.
muorahisvuohta
goikkis > goikebiergu vs. gievra > gievrras olmmái

muorra+N+Der/huvvat+V    muorahuvvat
muorra+N+Der/heapmi+A    muoraheapmi

jeagohuvvat
jeagohuvvat    jeagohuvvat+V+IV+Inf
jeagohuvvat    jeagohuvvat+V+IV+Ind+Prs+Pl1
jeagohuvvat    jeagoheapme+A+Der/huvva+V+IV+Inf
jeagohuvvat    jeagoheapme+A+Der/huvva+V+IV+Ind+Prs+Pl1
jeagohuvvat    jeagoheapmi+A+Der/huvva+V+IV+Inf
jeagohuvvat    jeagoheapmi+A+Der/huvva+V+IV+Ind+Prs+Pl1

attástallat
attástallat    attistit+V+TV+Der/alla+Inf
attástallat    attistit+V+TV+Der/alla+Ind+Prs+Pl1
attástallat    attestit+V+TV+Der/alla+Inf
attástallat    attestit+V+TV+Der/alla+Ind+Prs+Pl1
attástallat    attástallat+V+TV+Inf
attástallat    attástallat+V+TV+Ind+Prs+Pl1
attástallat    addit+V+TV+Der/st+Der/alla+Inf
attástallat    addit+V+TV+Der/st+Der/alla+Ind+Prs+Pl1

"<attástallat>"
    "attistit" V TV Der/alla Ind Prs Pl1
    "addit" V TV Der/st Der/alla Inf
    "attástallat" V TV Inf
    "attestit" V TV Der/alla Ind Prs Pl1
    "attestit" V TV Der/alla Inf
    "addit" V TV Der/st Der/alla Ind Prs Pl1
    "attástallat" V TV Ind Prs Pl1
    "attistit" V TV Der/alla Inf

bisuhit IV>TV
we have N+Dim+N+Dim+N goađázaš
TODO:
- change tagging of derived stems in the disamb output, to facilitate much
- Done.
- find and study all derived verbs in our corpus (Thomas)
- suggest which derivations could be generated (Thomas)
- see source code above, but also consider overgeneration problems, as well as
- lexicalise the rest (Thomas)
- consider the problems of lexicalised derivations skewing the analyses
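For the task of finding derived verbs in the corpus, a small sketch like the following could pick out readings that carry Der/ tags. It assumes that analyses come as surface/analysis pairs in the lemma+Tag+Tag form shown in the transducer output in this section. The function names are hypothetical and are not part of the existing tools.

```python
def split_analysis(analysis):
    """Split 'lemma+Tag+Tag...' into (lemma, [tags])."""
    lemma, *tags = analysis.split("+")
    return lemma, tags


def derivation_tags(analysis):
    """Return the Der/... tags of one analysis, in order."""
    _, tags = split_analysis(analysis)
    return [tag for tag in tags if tag.startswith("Der/")]


def derived_readings(pairs):
    """Keep only the (surface, analysis) pairs whose analysis is a derivation."""
    return [(surface, analysis) for surface, analysis in pairs
            if derivation_tags(analysis)]


if __name__ == "__main__":
    pairs = [
        ("attástallat", "attástallat+V+TV+Inf"),
        ("attástallat", "addit+V+TV+Der/st+Der/alla+Inf"),
    ]
    for surface, analysis in derived_readings(pairs):
        print(surface, split_analysis(analysis)[0], derivation_tags(analysis))
```

Counting the Der/ tags per base lemma over the whole corpus would give a first overview of which derivations are frequent enough to generate rather than lexicalise. The overgeneration question still needs linguistic judgement.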
Semantic double-tagging of names
The policy needs documentation. Thus:
TODO:
- Make a section under gt/doc/lang/smi/, add a chapter
- write guidelines for annotators wrt. name tagging and put them under
Systematic - make sure all linguists are aware of the guidelines (Trond, Sjur)
- write disamb rules to implement the system above (Trond, Linda)
North Sámi
Nothing this week, but see above re: derivations.
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- convert roughly 100 smj names from that file (lines 740-843) to XML
8. Name lexicon infrastructure
Decided in Tromsø:
- add smj proper noun lexicon file to the output
- remove ^ # 0 from the center ID
- done
- replace spaces with underscores in all IDs
- done
- remove occurrence indicator from language IDs: Agalin_1 (the center/concept ID) => Agalin (the language ID), and thus the two Agalin's should become one language entry (but two different concept entries)
- done
- store a redundant copy of the center-file semantic information in the language-specific files, for processing speed
- done
- add logging facilities
- added empty log element during conversion to XML
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
- all names in all languages by default
- tag for excluding/including a name from certain applications
- hide / display ^ during browsing
- done
- future expansion: choose what info to display in the single language browser
- search by (single) language
- done
- display existing language entries when adding a new language to a record
- make searches behave predictably (the hits should be the expected ones)
- done
- add editor to change single, existing entries
- started
Details can be found in the meeting memo (/admin/physical_meetings/tromso-2006-08-propnoun.html).
Names found containing double inflection definitions:
Genova adding multiple infl classes.
Guttorm adding multiple infl classes.
Heddy adding multiple infl classes.
Heimo adding multiple infl classes.
J?vreg?ddi adding multiple infl classes.
Klaipeda adding multiple infl classes.
Territory adding multiple infl classes.
These are all wrong, and should be corrected. There should be no names with two
TODO:
- improve lexc2xml conversion (Saara)
- add default smj entries
- exclude ^ # 0 from the center ID
- add an empty <log/> element to all entries (center and language files)
- add a last-update attribute to the root element of all files
- all done
- finish refactoring for multiple collections in the search interface
- waiting for a bug fix (Tomi is investigating it)
- develop the needed XQueries and UI (Sjur, Tomi)
- done some
- data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, nothing useful came up. We need to
- fix multiple inflection for identical names (Trond and Thomas)
- add eXist and the proper noun interface to the G5 using Tomcat
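The ID clean-up decided in Tromsø (spaces to underscores, and dropping the occurrence indicator to get from a center ID to a language ID) can be sketched roughly as follows. This is an illustration, not the actual lexc2xml code. It assumes the occurrence indicator is always an underscore followed by digits at the end of the ID, as in Agalin_1. The function names and the sample name with a space are hypothetical.

```python
import re
from collections import defaultdict


def normalise_id(raw):
    """Replace spaces with underscores in an ID."""
    return raw.strip().replace(" ", "_")


def language_id(center_id):
    """Drop a trailing occurrence indicator: Agalin_1 -> Agalin."""
    return re.sub(r"_\d+$", "", center_id)


def group_by_language_id(center_ids):
    """Map each language ID to the center (concept) IDs that share it."""
    groups = defaultdict(list)
    for center_id in center_ids:
        groups[language_id(center_id)].append(center_id)
    return dict(groups)


if __name__ == "__main__":
    print(normalise_id("Buenos Aires"))          # hypothetical entry
    print(group_by_language_id(["Agalin_1", "Agalin_2"]))
```

With this grouping, the two Agalin concept entries end up under a single language entry, which is the behaviour described in the decision list.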
9. Tromsø meeting round-up
TODO:
- check in meeting memos (Sjur)
- done
- Polderland questions. Thomas did already send requested info.
- done. Send more even-syllable VNAs to cover all stem types, with derivations
- speller development - see the meeting memo. Separate follow-up next week.
- Lule Sámi linguist - Sjur has tried to call a possible candidate, but no
- the response was negative, we need to consider other candidates
- order AirPort Express to Tromsø (Sjur)
- done
10. Other
Bug fixing
43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Please help Saara with bug 279
Gobby
TODO:
- install Gobby (Trond, Sjur)
- Done by Sjur
Compilation on victorio
... DeverbalVerbsVUORDIL...4, DeverbalVerbsALIST...5, DeverbalVerbsSUOTNJAL...4, DeverbalVerbsBOTNJAS...2, DeverbalVerbsLASSAN...1, DeverbalVerbsCOASKKIT...4, DeverbalVerbsARVIL...3, K...14, K-son...2, ENDLEX...1
Reading from 'sme/src/noun-sme-lex.txt'
AspellAffix...21, GuessNoun...1, NounSecond...9, NounRoot...
*** Warning: Ignoring info strings.
10000...20000... 22682
Reading from 'sme/src/verb-sme-lex.txt'
Negativeverb...1, negmood...3, negind...9, negimp...9, negsup...9, Copula...2, Finitecop...15, Prscop...10, Prtcop...11, Impcop...11, Infinitecop...11, STRAYFORMS...1, VerbRoot...10000...14271
Reading from 'sme/src/adv-sme-lex.txt'
Adverb...3005, gadv...2, adv...1, adv-comp...1, adv-sup...1, IHTTAS...10, DABBELAS...2, DABBELACCA-...11, COMPDIRADV...3
Reading from 'sme/src/closed-sme-lex.txt'
input in flex scanner failed
make: *** [sme/bin/sme.save] Error 2
gt$em sme/src/closed-sme-lex.txt.
Earlier fix: Specify right Xerox tools in makefile, and do make clean.
Hypothesis: closed-sme-lex.txt is broken
TODO:
- Fix compilation on Victorio (Tomi, Trond)
Meetings and Marratech
Now that Tomi has moved to Helsinki, Maaren is back from her sick leave, and we
There are two choices: going back to the old phone conference calls (how stone
TODO:
- download and install newest Marratech
Task lists as iCal entries
TODO:
- update all forrest installations to r430284 (Børre)
    cd $FORREST_HOME
    svn up -r430284
11. Next meeting, closing
Next meeting 11.9.2006 at 9:30.
Closed at 11:52.
Appendix - task lists for the next week
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- contact Ája (Kåfjord), talk to Lene
- send contracts to Čálliid Lágádus
- contact Richard Valkepää at NSI about older Min Áigi and Áššu files
- contact Bård Eriksen again
- corpus conversion:
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
- possibly deploy the user account form as an HTML form
- make a test user
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
- set up Bugzilla automatic reminders for open issues
- create document & document entry for semantic double-tagging of names (for
- update all Forrest installations to svn version r430284
- finish Forrest i18n and Sámi in PDF work
- Get more sma, smj texts to improve language recognition
- set up Tomcat for use with eXist and the propnouns db on the G5
- download and install latest Marratech
- fix bugs!
Maaren
- On sick leave
- download and install latest Marratech
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement parallel corpus upload in web upload script
- remove headers and footers from pdf documents
- Implement server of the analysis tools.
- Add more languages to the lexc2xml propernoun conversion.
- Refine the namelex output
- convert roughly 100 smj names from gt/smj/propernoun-smj-lex.txt (lines
- download and install latest Marratech
- fix bugs!
Sjur
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections
- implement improvements decided upon in Tromsø
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
- start hiring process of linguist and programmer
- help Børre finish i18n work of Forrest with a language override menu
- consider the problems of lexicalised derivations skewing analyses of
- install eXist and our local copy of risten.no and propnouns on the G5
- speller follow-up from the Tromsø meeting
- fix bugs!
Thomas
- send more even-syllable VNAs to cover all stem types, with derivations
- fix multiple inflection for identical names
- sme G3 issue
- bug-fixing!
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- review user documentation for corpus access
- find and study all derived verbs in our corpus (depends on Trond)
- download and install latest Marratech
- suggest which derivations could be generated (depends on Trond)
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- implement improvements decided upon in Tromsø
- read aligner docu, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- make speller and hyphenator make targets using M4
- fix compilation on Victorio
- download and install latest Marratech
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- Continue discussion with Bergen about aligner issues
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- make shell script wrappers for the most common commands for user-friendliness
- write user account form, probably ask for copy of existing ones from the IT
- write documentation for our bound users, with pointers to the ordinary
- write documentation on semantic double-tagging of names
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- discuss web-only user access management with Oslo
- write short user guide for the corpus web interface
- install Gobby
- Get more sma, smj texts to improve language recognition
- study corpus for language recognition errors, as well as paragraphs with mixed
- consider the problems of lexicalised derivations skewing analyses of
- fix compilation on Victorio
- download and install latest Marratech
- fix bugs!