Meeting_2009-12-21
Contents:
- Meeting setup
- Agenda
- Opening, agenda review, participants
- Updated task status since last meeting
- Oahpa!
- Corpus gathering
- Promoting Divvun
- Future plans, directions and ideas
- Infrastructure
- Linguistics
- Name lexicon/risten.no infrastructure
- Proofing tools
- Other
- Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 21.12.2009
- Time: 09.30 Norw. time
- Place: Internet
- Tools: SubEthaEdit, iChat
Agenda
Cf. one of the following, depending on context:
- the upper bar of the SEE window (provided you use the JSPWiki syntax mode)
- the TOC in Forrest-rendered output, like HTML and PDF
Opening, agenda review, participants
Opened at 10:13.
Present:
Absent: Maja (ill)
Agenda accepted as is.
Updated task status since last meeting
Børre
- implement language switch for static divvun site
  - not done
- upgrade the XServe to Snow Leopard Server
  - done
- improve XSL script to transform leaflet Forrest XDocs to an OOo Draw document
  - not done
- gt/Makefile remake
  - nothing done
- get translations of thank-you letter
  - not done
- install vislcg3 and hfst on Maja's and Thomas' computers
  - they both have it installed now
- fix bugs!
- other:
  - had to shut down divvun.no/giellatekno.uit.no because some pages spreading malware were found there. When I/we had a look at it, the source of these pages turned out to be html pages attached to bugs in our bugzilla. The html pages contained javascript doing code insertion
Ciprian
- discuss and decide upon tag format
- MT - Autshumato:
  - test Autshumato ITE Beta version on Windows
    - todo
- eXist:
  - get eXist running with the SLT-DB
    - todo
- infrastructure
  - make a schema/dtd description of the lexC-file (experiment with Komi and Romanian)
  - transform sme-lexC files into XML format
  - restructure and clean the script catalogue as suggested in the newsgroup
    - todo
- GT web:
  - coloring output of the disambiguation
  - add a tree visualizer for the dependency trees
  - add input help for special characters on the tool sites
  - automate the web statistics
  - filter (English, German, etc.) input using language detection tools
  - put a note on the sites that these are NOT MT tools
  - input help for generating wordforms (dropdown menus).
    - todo
- Bodø Oahpa:
  - generalize for semantic classes and pos
  - implement a login mechanism
  - implement an upload data mechanism
  - set up the Bodø version on victorio
    - todo
- Sandbox Oahpa:
  - finish the db installation for sb_oahpa
  - implement and test the Finnish lemmata (and if it works transfer it to the official OAHPA implementation)
  - Numra for Skolt Sámi and Finnish
    - todo
- Running Oahpa:
  - add an Oahpa clock and date exercise (cf. Numra)
  - Generate fin/xml/{nouns|verbs|adjectives}.xml, and implement the new Leksa dropdown menu (but first run the sb_oahpa test)
  - email notification when the server goes down
    - todo
- dictionaries, generally:
  - synchronize the source language entries from a specific dictionary with the entries in the morphology component (now especially for sma): that simply means putting the entries from the dict that are NOT analyzed into the lexC files
- KomEngFin:
  - compile Komi dictionary for Mac
  - compile only one version of dictionary and use css for user preferences
  - generate the first StarDict version of komeng and komfin dictionary
  - test the automatic sorting by Komi alphabet in xsl (as discussed with Trond)
    - todo
- Kvensk dictionary for Mac
  - internal deadline: depending on the status of the multiplied entries
- SmeNob:
  - the StarDict on Windows: try the HTML-plugin
  - try to reduce the dict-size on mac: experiment with xPointer, etc.
  - incorporate the passives into the last version of the sme:nob
    - todo
- SmaNobSwe:
  - finish the cleaning up of multiplied entries
    - done
  - extend the smanobswe dictionary (incorporate the data from Maja-Lisa)
    - waiting for data from Maja-Lisa
  - check all bugs in the sma:nob-swe dictionary
  - fix the bug with the string length in the popup window (sma:nob)
    - ongoing
- SjdRus:
  - continue the work on the Kildin-Russian dictionary, next internal deadline is the Bodø workshop in March 2010
    - todo
- Corpus:
  - check the processing of new corpus documents (error logging during conversion, conversion quality, documentation, etc.)
    - todo
- Lexicon workshop
  - evaluate the participants' materials
    - ongoing
  - write feedback emails to each participant with an individual electronic dictionary
    - todo
  - contact Kimberly Mäkäräinen and ask whether she might be willing to share data with us
    - todo
Maja
- discuss and decide upon tag format
- Prepare texts about the normativity issue for SGL/SGM
- more work on sma adjectives
- contact authors of sma articles in 9-10 eds. of the yearbook Åarjel-saemieh
- correct Sma-Nob entries, add missing words to the analyzer lexicon
- fix bugs!
Sjur
- discuss and decide upon tag format
- difftest for fst and PL speller
- @Barents: plan a meeting/seminar with potential cooperation partners
- @Barents: plan for The Real Thing
- @TTS: start public tender process
- make Leif Åge send out CD's to distribution points
- start Nordplus Sprog project
- send eXist log files to Ciprian
- Write a formal letter to Davvi girji about electronic dictionaries
- make XSL script to transform leaflet Forrest XDocs to an OOo Draw document
- continue to write a proofing API & implementation specification
  - the present version will have to do - Voikko will be the basis for the first version
- name db/risten.no
- follow-up on some Polderland-related bugs: 621, 630, 652
- Sámi languages as part of Norsk språkbank
- find and contact the correct person in SD, to get the manuscript for all Sámi teaching material now being given out to the schools
- fix bugs!
- other things
  - application for the Marie Curie Fund(?)
  - paper work before the end of the year
Thomas
- discuss and decide upon tag format
  - not done
- derivation bugs
  - done
- look at buglist and decide whether to close bugs or not
  - done
- look at incoming loanwords
  - done as much as I can, Majja must now take the rest
- prepare texts about the normativity issue for SGL/SGM
  - nothing this week
- Digitalize south saami books
  - scanner now arrived, I will start as soon as possible
- fix bugs!
  - some done
Tomi
- convert comment etc tags to real lexc tags when the tag discussion is over
  - todo
- document how compounding is controlled in the PLX conversion
  - not done
- fix double hyphen bugs
  - not done
- fix PL smj hyphenator bug
  - not done
- fix PL and Hunspell conversion bugs
  - worked
- install vislcg3
  - still problem with installing
- sma speller on Friday
  - done
- fix bugs!
Trond
- discuss and decide upon tag format
  - Not done.
- Komi
  - Automaton work
    - Yes
  - sorting
    - No
- Oahpa
  - Basic numbers for sms for Numra
    - Not done
  - clock and date for Numra
    - Not done
  - Bodø-oahpa (longer-term)
    - Held a plan meeting with L, C.
- Infrastructure
  - Mikogo for Lene and Linda
    - No.
  - Make-remake, look at tags
- Look at the Kven dictionary before compilation
- fix bugs!
Oahpa!
Bodø version as soon as Komi and Sma dictionaries are ready in a new version.
Finnish first, at best also other languages (a generalised multilingual Leksa).
This is the Bodø Oahpa!.
Meeting memos can be found at
TODO
- Register oahpa.no (Trond)
- Generate fin/xml/{nouns|verbs|adjectives}.xml, and implement the new Leksa dropdown menu (Ciprian).
- Finding a volunteer to translate the sme Leksa lexicon to Swedish (XXX)
- From the start: sjd_oahpa Leksa in deu and eng as well
- clock and date for Numra (Trond, Ciprian).
- Numra for Skolt Sámi (Trond, Ciprian)
- implement the Bodø-Oahpa (Ciprian, Trond, Lene)
- email notification when the server goes down (Ciprian)
Corpus gathering
Now we have two scanners, as the book scanner arrived on Friday.
Which books are digitized?
- These books should be OCR'ed/digitized: Anna Jacobsen's books Jupmele rïjhke lea gietskesne, Luste lohkedh, Duedtie Novrlaantesne, Naestie-tjoevkesne, Gullie-tjååtsele.
  - Not digitized yet; should be done this week.
- A. Jacobsen: Goltelidh jih soptestidh and Mojhtesh are already in electronic form
- use the corpus gathering document to keep track of each book
TODO:
- continue gathering sma corpus texts (Maja)
  - get sma articles in Š-bláđđi
  - the Gun Utsi book is almost there - one contract missing (Maja)
- write formal letter to Davvi Girji (Sjur)
- send a copy of the signed contracts back to the authors, translators and publishers, accompanied by the thank-you letter
- find and contact the correct person in SD, to get the manuscript for all Sámi teaching material now being given out to the schools (Sjur)
- get the sma yearbooks from Saemien sïjhte (Maja)
Promoting Divvun
TODO:
- make leaflet to inform about the project (Børre)
  - add InDesign text (Sjur)
  - make XSL script to transform Forrest XDocs to an OOo Draw document (Børre, Sjur)
- distribute CD version through the library bus, the language centres and common sami centres in all of Sápmi. Gaaltije in Östersund for example. (Leif Åge, Sjur)
  - make him send out CD's accordingly (Sjur)
- update online download log statistics page (Børre)
Future plans, directions and ideas
See a separate document in plan/strat/5year.jspwiki.
Northern areas project
TODO:
- Attend a beginners' course in Russian (priority: the alphabet!) near you.
- plan a meeting/seminar with potential cooperation partners (Trond, Sjur)
- plan for The Real Thing (Trond, Sjur)
Infrastructure
HFST issues
To check out from svn:
svn co https://hfst.svn.sourceforge.net/svnroot/hfst/trunk hfstsvn
Remove all old Fink (/sw/) and MacPorts (/opt/) installations, then compile
TODO:
- install HFST for Maja (Børre)
- done for all
VislCG 3
Everyone should have this one installed as well.
TODO:
- install VislCG3 (Tomi)
  - not yet, problems with the building (linking ICU)
  - try using the ICU package from MacPorts
- install VislCG3 for Maja and Thomas (Børre)
  - done
Makefile + tag simplification
Present process:
- tailor the source files (remove lines containing SUB, GG, whatever)
- compile the tmp source files
Goal:
- compile the source files
- tailor the resulting transducer
(subtract the strings containing SUB, GG, whatever)
Also replace xfst script files with -e "action" statements.
Old version:
iclock.fst: $(TARGET)/bin/iclock-$(TARGET).fst
$(TARGET)/bin/iclock-$(TARGET).fst: $(TARGET)/bin/clock-$(TARGET).fst
	@echo "*** iclock-$(TARGET).fst ***"
	@printf "load < $(TARGET)/bin/clock-$(TARGET).fst \n\
	invert net \n\
	save stack $@ \n\
	quit \n" >> tmp/iclock-script-$(TARGET)
	$(XFST) < tmp/iclock-script-$(TARGET)
	@rm -f tmp/iclock-script-$(TARGET)
New version:
iclock.fst: $(TARGET)/bin/iclock-$(TARGET).fst
$(TARGET)/bin/iclock-$(TARGET).fst: $(TARGET)/bin/clock-$(TARGET).fst
	@echo
	@echo "*** $(notdir $@) ***"
	@echo
	@$(XFST) -e "load < $<" \
		-e "invert net" \
		-e "save stack $@" \
		-stop
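The "present process" described above (tailoring the source files before compilation) can be sketched roughly as follows. This is only an illustration in Python, not the actual build code: the function name `tailor_source`, the sample entries and the assumption that the tags appear in the source as `+SUB` and `+GG` are all made up for the example.

```python
from typing import Iterable, List


def tailor_source(lines: Iterable[str], excluded_tags: Iterable[str]) -> List[str]:
    """Present process: drop every source line that mentions an excluded tag."""
    excluded = tuple(excluded_tags)
    return [line for line in lines if not any(tag in line for tag in excluded)]


# Invented lexc-like entries, only here to exercise the filter.
source_lines = [
    "LEXICON Root",
    "alpha+N:alpha N1 ;",
    "beta+N+SUB:beta N1 ;",
    "gamma+N+GG:gamma N1 ;",
]

# The tag spellings "+SUB" and "+GG" are an assumption about the source format.
tmp_source = tailor_source(source_lines, ["+SUB", "+GG"])
print("\n".join(tmp_source))
```

The goal stated above is to drop this pre-filtering step, compile the untouched source files and subtract the unwanted strings from the resulting transducer; the sketch covers the present line-filtering approach only.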
TODO:
- discuss and decide upon tag format (linguists)
- convert comment tags to real lexc tags when the discussion is over (Tomi)
General list
To accomodate future enhancements in different directions (in rough order of
- test bench for all parts of our language technology efforts
  - test bench enhanced, but not yet complete
- set up the Leopard Server features for collaborative support:
  - iCal server / group calendars
  - wiki
- wiki? on G5 (is part of Snow Leopard Server) or other web-based documentation
- improve Forrest stability and i18n support (the divvun crashes)
- reorganise the documentation:
  - differentiate between target groups
  - get better grouping
  - decide what to write in forrest and what in wiki (cf. Apertium and http://xixona.dlsi.ua.es/apertium/ for a similar split)
  - update/add missing parts
- migrate lexc lexicons to XML, splitting the task
  - Name lexica (the Name project)
  - Dictionaries (already in XML, task is to integrate them)
  - At least migrate the lexc open POSes (Komi as a pilot case)
- change the look of the documentation web
- corpus content moved to Max Planck repositories? Norsk språkbank?
- update infrastructure to allow content-restricted spellers for special target groups
TODO:
- upgrade the XServe to Snow Leopard Server (Børre)
  - done
- ask Lene, Linda to install Mikogo (Trond)
- restructure and clean the script/ directory, using subdirs to categorise the scripts (Ciprian)
  - please act as suggested, cf comments in the news group
- infrastructure remake: (Børre, Ciprian, Saara, Sjur, Tomi, Trond)
- make a test-all target that runs all tests we have (Ciprian, Sjur, Trond)
  - delayed until we have restructured the make/build process
- define and document testing routines (Ciprian, Sjur, Trond)
  - delayed until we have restructured the make/build process
Linguistics
North Sámi
(nothing new, see proofing bugs below)
Lule Sámi
(nothing new, see proofing bugs below)
South Sámi
Numerals:
- Bergsland, and Spiik for smj as well: gøøkte luhkie gøøkte
- Majja, and Nickel for sme: gøøkteluhkiegøøkte
- Trond's interpretation: Bergsland and Spiik do not reflect the norm, but a pedagogical convention to make it easier for readers to understand
- Linguistically, joined writing is correct (ref: TO, the native southerner)
Classification of pronouns etc:
- personal: manne "I", dïhte "he/she/it"
- demonstrative: daate "this one"
- determiner: daennie gåetesne "to that house"
Personal vs demonstrative:
- dihte Pron Pers
- dihte Pron Dem
- Given that "dihte maana" is out, the analysis should be +Pron+Pers and not +Det.
- We do not want to disambiguate persons and things as referents, therefore we do not want two analyses +Pron+Pers and +Pron+Dem.
Anna Jacobsen:
Demonstrative vs. determiners
- daate "påpekende pronomen" demonstrative
- gen: daen
- ine: daesnie (Trond: pronoun, sma.fst: +Pron+Dem)
- ine: daennie gåetesne (Trond: determiner, +Det) (sme: Gen)
Here, we should have two paradigms, one pronominal +Pron+Dem, and one
Three types of problem areas:
- missing closed POSes: errors in transducer, or in PL conversion
- missing words found in the smanob dictionary: 110 verbs, 80 adjectives, all core sma words (Cip: these statistics are NOT complete, as far as I know Trond has forgotten to take the Swedish verb file into account!)
- missing list from corpus
We also seem to have problems with derivations:
jealadehtedh jieliedehtedh jieledh+V+Der1+Der/dehte+V+Inf
galmadehtedh galmedehtedh galmedidh+V+Der2+Der/ehte+V+Inf
TODO:
- Prepare texts about the normativity issue for SGL/SGM (Maja, Thomas)
- adjectives (Maja with Thomas, Trond, Sjur)
  - two competing naming conventions of continuation lexicons
    - One naming goes ATTRSUFF-PREDSUFF-STEMTYPE
    - One follows the sme convention of naming key adjectives
    - There are duplicate lexica
  - The comparative issue open here and there
    - The ATTRSUFF-PREDSUFF-STEMTYPE lexica now go to EVENCOMP and ODDCOMP only.
- work meeting on loan words and proper nouns, this week - tomorrow? (Maja, Ove (?), Sjur, Thomas, Trond)
Name lexicon/risten.no infrastructure
Tomi has played with couchdb as a replacement
Instead of building our own webforms and back-end update scripts, use XForms
From the meeting with the terminology and IT teams last week:
- no major rework on the present search interface now
- no work on the editing section; instead:
- add existing lists of sanctioned terminology as separate term entities
- add a dictionary if we can make one with sufficient quality
This means the following tasks:
- find already approved lists, in paper or electronic form (term team)
- convert paper lists to electronic lists (term team)
- convert lists to standard XML (Sjur, Tomi)
- add prepared lists to risten.no (Sjur, Tomi)
TODO:
- send eXist log files to Ciprian (Sjur)
- fix i18n bug in risten.no/G5 (so they will work without the proper locale request) (Sjur)
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
- convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
- implement data synchronisation between risten.no and the cvs repo, and possibly other servers (ie the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way)
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, linguists)
- merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils (linguists)
Dictionaries
Duplicated entries are now mostly removed, and much of the closed POS content
inc catalogue contains nouns.csv (2132 lines, 2030 unique nouns) and verb.csv
- sma:nob-swe
  - over 400 multiplied entries found (these are caused by not so clean work with the data)
  - done: deleted automatically about 300 of them
  - ongoing: Maja-Lisa is checking the entries for correctness
  - ongoing: Maja-Lisa is translating the missing words from her own diploma thesis into nob
  - ongoing: synchronize data from dictionary with that from lexC files so that we are able to analyze everything from the dictionary (Cip: this should be valuable for ALL dictionaries)
- fkv:nob and nob:fkv
  - started to work on them: waiting for Verena to get rid of multiplied entries
- kom:fin-eng
  - moved the original kom-lex.xml to the inc-dir and froze it
  - split it by pos into the working_file dir, the ONLY place to work with the dictionary entries
  - now, the lexC files are generated via XSLT sheets, no perl scripts
  - adjusted the Makefile
  - prepared the pipeline for compiling the mac dict
  - todo: make a pipeline for StarDict also (as far as I know, Jaska has a Linux machine)
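The synchronisation between dictionary and lexC files mentioned for sma:nob-swe amounts to running the dictionary lemmas through the analyser and collecting the ones that get no analysis. A minimal sketch of that last step, assuming the lookup tool prints tab-separated lines and marks unknown words with a final `+?` field; the function name and the sample lemmas are invented for the illustration:

```python
from typing import List


def unanalysed_lemmas(lookup_output: str) -> List[str]:
    """Return the input words whose only analysis is the '+?' marker."""
    missing: List[str] = []
    for line in lookup_output.splitlines():
        fields = line.split("\t")
        # Assumed convention: unknown words end in a '+?' field.
        if len(fields) >= 2 and fields[-1].strip() == "+?" and fields[0] not in missing:
            missing.append(fields[0])
    return missing


# Invented lookup output; lemma_b stands for a dictionary entry the analyser lacks.
sample_output = (
    "lemma_a\tlemma_a+N+Sg+Nom\n"
    "\n"
    "lemma_b\tlemma_b\t+?\n"
    "\n"
    "lemma_c\tlemma_c+V+Inf\n"
)

candidates = unanalysed_lemmas(sample_output)
print(candidates)
```

The resulting list would be the candidates for addition to the lexC files; assigning each lemma to a continuation lexicon remains manual linguistic work.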
TODO:
- set up risten.no on eXist/XServe (as a beta version site) (Sjur)
- set up required infra for smenob on risten.no/XServe (Sjur)
- Continue the dictionary infrastructure discussion (Ciprian, Sjur, Trond)
- end user documentation (how to download and install) (Ciprian, Trond)
- Contact Davvi Girji about cooperation on electronic dictionaries (Sjur)
- developing the mobile phone version of smenob:
  - http://pfed.info/wksite/
- Komi
  - take out the doublets to a separate file (Ciprian)
  - merge the doublets (Jaska, Trond)
  - Completing the automaton to some state (Trond, Jaska, Paula)
- compile Komi dictionary (Ciprian)
- compile Kven dictionary (Ciprian)
Proofing tools
South Sámi
TODO:
- Go through the phonrules file (Thomas)
- Pick a text for proofreading (Thomas)
- Proofread a text with tre eror$error notation (Maja)
- difftest for fst and PL speller (Sjur)
- Compile a new version by Friday (Tomi)
  - cd $GTHOME/prooftools/; sudo make mac-dmg
- Look for external beta testers?
  - David and Jovsset will get the speller, for other people we must wait.
- fix the installer build issues (Sjur)
- correct easter egg to say "South Sámi", not "North Sámi" (Sjur)
- fix sma installer bugs (Sjur)
HFST-based proofing tools
Voikko will be our near future proofing solution, using HFST as
Please join the libvoikko mailing
One article linked:
http://voikko.svn.sourceforge.net/viewvc/voikko/trunk/libvoikko/doc/morphological-analysis.txt
TODO:
- continue to write an API specification, and an implementation specification (Sjur)
Testing
Spelling Error Markup
TODO:
- test new and nested error markup (Sjur)
- nesting still needs to be tested, depends on new ccat feature
Speller testing
TODO:
- test the error type selection feature in ccat (Sjur)
Testing open-source Norwegian spellers
Sjur has invited the open-source group to test their spell-checker using our
We should go to their developer meetings, and present our work and how to work
Speller bugs
List of bugs returned from Polderland:
- 621
- 630
- 652
- 656
- 676
Tag reordering for abbreviations has caused a lot of problems:
smj:
hr. hr. hr+ABBR+Acc
cand.philol. cand.philol. cand.philol+ABBR+N+Acc
Per Per Per+N+Prop+Mal+Sg+Attr
sme:
hr. hr. hr+N+ABBR+Acc
Per Per Per+N+Prop+Mal+Sg+Attr
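Differences of this kind could be spotted mechanically by comparing the tag sequences that two analysers give for the same lemma. The following Python sketch does that for the analysis strings quoted above; the helper names are hypothetical, and it assumes that `+` never occurs inside a lemma.

```python
from typing import Dict, List, Tuple


def split_analysis(analysis: str) -> Tuple[str, List[str]]:
    """Split 'hr+N+ABBR+Acc' into ('hr', ['N', 'ABBR', 'Acc'])."""
    lemma, *tags = analysis.split("+")
    return lemma, tags


def differing_tags(first: List[str], second: List[str]) -> Dict[str, Tuple[List[str], List[str]]]:
    """Map each shared lemma to both tag sequences when they are not identical."""
    first_tags = dict(split_analysis(a) for a in first)
    diffs: Dict[str, Tuple[List[str], List[str]]] = {}
    for analysis in second:
        lemma, tags = split_analysis(analysis)
        if lemma in first_tags and first_tags[lemma] != tags:
            diffs[lemma] = (first_tags[lemma], tags)
    return diffs


# Analysis strings taken from the smj and sme output quoted in the minutes.
smj = ["hr+ABBR+Acc", "cand.philol+ABBR+N+Acc", "Per+N+Prop+Mal+Sg+Attr"]
sme = ["hr+N+ABBR+Acc", "Per+N+Prop+Mal+Sg+Attr"]

diffs = differing_tags(smj, sme)
print(diffs)
```

Lemmas that occur in only one of the two lists (here cand.philol) are ignored, since there is nothing to compare them with.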
Open issues based on test results:
sme
- 399 - missing numerals (plural forms) - still OPEN
- 425 - X not recognised; single letters were left out - still OPEN
- 435 - roman numbers - inflection of single letter numbers rejected, as well as some complex numbers (but is ok in smj) - still OPEN
  - we should pregenerate all numbers once and for all, and store them in a separate lexicon file
- 461 - REGRESSION: missing suggestion (sáhkki)
- 508 - REGRESSION: accepts smj entries (most likely abbreviation missing)
- 520 - REGRESSION: r9 and š9 not defined (abbr. missing)
- 595 - prefix+name without hyphen (ovdaLot instead of ovda-Lot) - still OPEN
- 603 - suomabealdi accepted - still OPEN
- 606 - compound-tags LEXICON VUOHTA - still OPEN
- 613 - short gen. as second compound part - still OPEN
- 619 - numerals and pronouns to NAMÁK and SASJ fails - vihttasoarttat remaining - still OPEN
- 629 - a taking part in compounding without hyphen - still OPEN
  - only open case has word A-finálaid compounded
- 647 - numerals+NOUN - still OPEN, open case has uppercase letters
- 648 - unmotivated suggestions with numeral+noun - still OPEN
- 661 - REGRESSION: abbr. not recognized
- 709 - sámedikkeválga accepted - OPEN
- 728 - vowel shortening GenCmp+Left-tagged - still OPEN
- 779 - caseforms of pronoun okatahat - still OPEN
- 785 - does not recognize alphabet-abbr+noun - OPEN
- 802 - NEW: multiword propernouns
- 803 - NEW: FINJU- words accepted single-handed
- 804 - NEW: guovttilogát, njealjilogát
- 805 - NEW: Nouns+acronyms
smj
- 435 - roman number - single letter numbers now recognised
  - we should pre-generate all numbers once and for all, and store them in a separate lexicon file
  - please note that inflection of single letter numerals is fine in smj, as opposed to sme
- 482 - polardutkamin not recognized - FIXED
- 496 - REGRESSION: unrecognised clitics
- 556 - non-existent word accepted - FIXED
- 594 - lågenanguoktáj not recognized - still OPEN
- 595 - REGRESSION: prefix+name as split comp without hyphen
- 596 - C-giellan is not accepted - still OPEN
- 600 - Gen+hyph compound - FIXED
- 627 - prefix + hyphen does not get accepted - FIXED
- 647 - numerals+NOUN - still OPEN, open case has uppercase letters
- 648 - unmotivated suggestions with numeral+noun - still OPEN
- 650 - REGRESSION: noun prefix+name compound without hyphen
- 652 - UPPERCASE-typos only get acronym-suggestions - still OPEN
- 692 - numeral-variants - all but one fixed (gáktsalågenantjuotakta), but still OPEN
- 744 - REGRESSION: numerals + clitic
- 803 - NEW: VINJU- words accepted single-handed
- 805 - NEW: Nouns+acronyms
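For bug 435, the note about pregenerating all numbers and storing them in a separate lexicon file could look roughly like the sketch below for Roman numerals. It is written in Python purely as an illustration; the lexicon name `RomanNumerals`, the continuation lexicon `ROMNUM` and the upper limit are hypothetical and not taken from the existing lexc sources.

```python
ROMAN_PAIRS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(number: int) -> str:
    """Convert a positive integer to its Roman numeral."""
    parts = []
    for value, symbol in ROMAN_PAIRS:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def roman_lexicon(upper: int, lexicon_name: str = "RomanNumerals",
                  contlex: str = "ROMNUM") -> str:
    """Build lexc-style text with one entry per numeral from 1 to upper."""
    lines = [f"LEXICON {lexicon_name}"]
    lines += [f"{to_roman(n)} {contlex} ;" for n in range(1, upper + 1)]
    return "\n".join(lines) + "\n"


# Both names in the generated text are placeholders, see the note above.
lexicon_text = roman_lexicon(3)
print(lexicon_text)
```

Generating the file once and checking it in would keep single-letter numerals such as I, V and X in the same lexicon as the complex ones, so that both can be handled the same way in sme and smj.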
TODO:
- document how compounding is controlled in the PLX conversion (Tomi)
Hyphenator bugs
Open issues based on test results :
sme
No known issues!
smj
- 670 - Hard hyphen replaced with soft hyphen: 10-biejvvásattja (the word is not rec.; Bug #711) - still OPEN
sma
Command to test the hyphenator:
preprocess dev/corp/pressemelding.txt | lookup bin/hyph-sma.fst | cut -f2 | \
lookup bin/hyphrules-sma.fst | grep -v '^$' | cut -f2 | uniq | see
TODO:
- fix PL hyphenator errors (Tomi)
- almost done - one smj bug left
Installer changes
TODO:
- test InDesign installer (Sjur)
User documentation
TODO:
- InDesign documentation (Sjur)
- Norwegian translation received from Davvi Girji
1.2 release
Content:
- several smj bug fixes
- lexicalisations
- InDesign Mac & Win
- new OOo beta
- improved installers, at least for Mac, preferably also for Windows
Other
Spring planning
Topics:
- speller test project -> Sjur, Børre, Thomas, X
- speech synthesis -> Sjur, Trond, BA as a starter
- dictionaries: smanob, komfineng, fkvnob -> Ciprian, Sjur, Trond
- risten.no -> Ciprian, Sjur, Tomi
- Barents project follow-up meeting
- sme and smj proofing tools, next version -> Thomas, Tomi, Sjur
- HFST-based proofing tools -> Sjur
Dates:
- Feb. 1 and 2: Sjur is presenting Divvun and Oahpa! at Helsinki univ.
- March 1-14: Bodø
Text to speech
TODO:
- refine syntax / dependency rules (Biret Ánne)
- start public tender process (Sjur)
- find appropriate voices (Biret Ánne)
MT and CAT
TODO:
- make A-ITE compile and run on Windows (Ciprian)
Open source
The repository is properly closed/open now, and the availability of the source
TODO:
- announce the availability of our repository on relevant linguistic lists (Sjur)
Christmas vacation / non-working days
SD: 24.12 and 31.12 are half days, and one of them off. One additional day in
UiT: all days in between are half days.
| Who | Days off |
|---|---|
| Børre | off 30th and 31st |
| Ciprian | no vacation planned, non-working on Christmas eve |
| Maja | on sick leave the whole week before Christmas (21-24) |
| Sjur | no vacation planned, non-working on Christmas eve |
| Thomas | no vacation planned, non-working 28 and 31 |
| Tomi | no vacation planned |
| Trond | off 23 (?), 24, leaving slightly earlier 21-22 |
Next meeting, closing
The next meeting is 28.12.2009, 9.30 Norwegian time.
The meeting was closed at 12:08.
Appendix - task lists for the next week
Børre
- implement language switch for static divvun site
- improve XSL script to transform leaflet Forrest XDocs to an OOo Draw document
- gt/Makefile remake
- get translations of thank-you letter
- fix bugs!
Ciprian
- discuss and decide upon tag format
- MT - Autshumato:
  - test Autshumato ITE Beta version on Windows
- eXist:
  - get eXist running with the SLT-DB
- infrastructure
  - make a schema/dtd description of the lexC-file (experiment with Komi and Romanian)
  - transform sme-lexC files into XML format
  - restructure and clean the script catalogue as suggested in the newsgroup
- GT web:
  - coloring output of the disambiguation
  - add a tree visualizer for the dependency trees
  - add input help for special characters on the tool sites
  - automate the web statistics
  - filter (English, German, etc.) input using language detection tools
  - put a note on the sites that these are NOT MT tools
  - input help for generating wordforms (dropdown menus).
- Bodø Oahpa:
  - generalize for semantic classes and pos
  - implement a login mechanism
  - implement an upload data mechanism
  - set up the Bodø version on victorio
- Sandbox Oahpa:
  - finish the db installation for sb_oahpa
  - implement and test the Finnish lemmata (and if it works transfer it to the official OAHPA implementation)
  - Numra for Skolt Sámi and Finnish
- Running Oahpa:
  - add an Oahpa clock and date exercise (cf. Numra)
  - Generate fin/xml/{nouns|verbs|adjectives}.xml, and implement the new Leksa dropdown menu (but first run the sb_oahpa test)
  - email notification when the server goes down
- dictionaries, generally:
  - synchronize the source language entries from a specific dictionary with the entries in the morphology component (now especially for sma): that simply means putting the entries from the dict that are NOT analyzed into the lexC files
- KomEngFin:
  - compile Komi dictionary for Mac
  - compile only one version of dictionary and use css for user preferences
  - generate the first StarDict version of komeng and komfin dictionary
  - test the automatic sorting by Komi alphabet in xsl (as discussed with Trond)
- Kvensk dictionary for Mac
  - internal deadline: depending on the status of the multiplied entries
- SmeNob:
  - the StarDict on Windows: try the HTML-plugin
  - try to reduce the dict-size on mac: experiment with xPointer, etc.
  - incorporate the passives into the last version of the sme:nob
- SmaNobSwe:
  - extend the smanobswe dictionary (incorporate the data from Maja-Lisa)
  - check all bugs in the sma:nob-swe dictionary
  - fix the bug with the string length in the popup window (sma:nob)
- SjdRus:
  - continue the work on the Kildin-Russian dictionary, next internal deadline is the Bodø workshop in March 2010
- Corpus:
  - check the processing of new corpus documents (error logging during conversion, conversion quality, documentation, etc.)
- Lexicon workshop
  - evaluate the participants' materials
  - write feedback emails to each participant with an individual electronic dictionary
  - contact Kimberly Mäkäräinen and ask whether she might be willing to share data with us
Maja
- discuss and decide upon tag format
- Prepare texts about the normativity issue for SGL/SGM
- more work on sma adjectives
- contact authors of sma articles in 9-10 eds. of the yearbook Åarjel-saemieh
- correct Sma-Nob entries, add missing words to the analyzer lexicon
- look at incoming loanwords
- fix bugs!
Sjur
- discuss and decide upon tag format
- difftest for fst and PL speller
- @Barents: plan a meeting/seminar with potential cooperation partners
- @Barents: plan for The Real Thing
- @TTS: start public tender process
- make Leif Åge send out CD's to distribution points
- start Nordplus Sprog project
- send eXist log files to Ciprian
- Write a formal letter to Davvi girji about electronic dictionaries
- make XSL script to transform leaflet Forrest XDocs to an OOo Draw document
- name db/risten.no
- follow-up on some Polderland-related bugs: 621, 630, 652
- Sámi languages as part of Norsk språkbank
- find and contact the correct person in SD, to get the manuscript for all Sámi teaching material now being given out to the schools
- fix bugs!
Thomas
- make south saami numeral generator
- discuss and decide upon tag format
- prepare texts about the normativity issue for SGL/SGM
- Digitalize south saami books
- fix bugs!
Tomi
- convert comment etc tags to real lexc tags when the tag discussion is over
- document how compounding is controlled in the PLX conversion
- fix double hyphen bugs
- fix PL smj hyphenator bug
- fix PL and Hunspell conversion bugs
- install vislcg3
- fix bugs!
Trond
- discuss and decide upon tag format
- Komi
  - Automaton work
  - sorting
- Oahpa
  - Basic numbers for sms for Numra
  - Bodø-oahpa (longer-term)
- Infrastructure
  - Mikogo for Lene and Linda
  - Make-remake, look at tags
- Look at the Kven dictionary before compilation
- fix bugs!

