Meeting_2010-06-14
Contents:
- Meeting setup
- Agenda
- Opening, agenda review, participants
- Updated task status since last meeting
- Oahpa!
- Corpus gathering
- Promoting Divvun
- Future plans, directions and ideas
- Infrastructure
- Linguistics
- Name lexicon/risten.no infrastructure
- Proofing tools
- Other
- Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 14.6.2010
- Time: 09.30 Norw. time
- Place: Internet
- Tools: SubEthaEdit, iChat
Agenda
Cf. one of the following, depending on context:
- the upper bar of the SEE window (provided you use the JSPWiki syntax mode)
- the TOC in Forrest-rendered output, like HTML and PDF
Opening, agenda review, participants
- Opened at 10:00.
- Present: Børre, Ciprian, Maja, Sjur, Thomas, Tomi, Trond
- Absent: none
- Agenda accepted with the following additions:
- sma seminar in August/September
- sma speller beta/update from Knowledge Concepts
- Barents: keyboards, localisation
- CLARIN meeting
Updated task status since last meeting
Børre
- get all the latest Sámi Parliament protocols into our repository
- not done
- contact the Sámi translators at the government and ministries to get more
- not done
- corpus infra:
- make the conversion scripts work with the new svn repository
- add check for almost empty content
- add processing of svg files
- formulate license header
- done
- add license header to all files
- done
- make restricted access to our svn repository work (fit adaption)
- not done
- corpus access on the XServe
- not done
- convert or move the files in the upload dir to the real corpus repo
- done
- turn on corpus summary crontab
- not done
- contact Ávvir about renewed corpus cooperation
- not done, but I fetched all the articles that are available on avvir.no
- implement language switch for static divvun site
- not done
- improve XSL script to transform leaflet Forrest XDocs to an OOo Draw document
- not done
- get translations of thank-you letter
- not done
- make the new SL Server services functional:
- group calendars
- not done
- set up corpus mirroring on the XServe again
- not done
- give Maja a WebSak intro
- not done
- fix bugs!
- other
- Fetched docs from nav.no
Ciprian
- Skolt-Oahpa:
- fine tuning of semantic tags
- todo
- integrating Numra if fst available
- done (if Trond's fst is up-to-date)
- finish localization
- todo (Michael Riessler)
- terminology:
- merged the 2 doc files with the xls file containing the 2006 law terminology
- todo (deadline delayed)
- prepare a StarDict test version of the merged data for demo
- compile a dictionary in StarDict format with all data featuring the
- low priority
- add license header to all files
- todo
- make restricted access to our svn repository work (fit adaption)
- todo
- PhD project description (internal deadline end of June 2010):
- search, read, write (execute): high priority!
- attending the Workshop on Germanic Syntax (very useful insights)
- read and answer the newsgroup messages
- todo
- update corpora (both monolingual and parallel) for Oslo (Glossa)
- check the correctness of parallelity between sme and nob files
- sentence-align sme with nob: fix tca2 problem
- analyse/disambiguate: todo -- waiting for the last version of FSTs
- this task is planned for the summer
- corpus infra:
- make the conversion scripts work with the new svn repository
- add check for almost empty content
- add processing of svg files
- todo
- reorganise subdirs as needed
- todo
- infrastructure
- test cwb
- continue with restructuring and cleaning the script catalogue as suggested
- transform sme-lexC files into XML format
- make a schema/dtd description of the lexC-file (experiment with
- todo
- GT web:
- add a tree visualizer for the dependency trees
- add input help for special characters on the tool sites
- automatise the web statistics
- filter (English, German, etc.) input using language detection tools
- put a note on the sites that these are NOT MT tools
- input help for generating wordforms (dropdown menus).
- todo
- Sandbox Oahpa:
- integrate reCAPTCHA into Django (as Lene suggested)
- Numra for Skolt Sámi and Finnish
- debug the installed sb_oahpa
- update and correct the Oahpa documentation site
- todo
- Running Oahpa:
- fix leksa_n (proper nouns) related bug after adding Finnish to Leksa
- ongoing
- implement a testbench for Vasta and Sahka (as Lene needs)
- for Sahka done, for Vasta todo
- integrate reCAPTCHA after the SB-test
- add an Oahpa clock and date exercise (cf. Numra)
- email notification when the server goes down
- check the XXX?
- todo
- dictionaries, generally:
- synchronize the source language entries from a specific dictionary with the
- the StarDict on Windows: try the HTML-plugin (that means that users can use
- try to reduce the dict-size on mac: experiment with xPointer, etc.
- Fkv: Nob - Nob: Fkv:
- re-compile the dictionaries incorporating the stem information and novel glossary (deadline 7. June)
- done, however without glossary (this is a summer exercise)
- try to implement a web version of the dictionaries using the Odense method
- did some tests towards an implementation of HTML-dicts
- KomEngFin:
- test the automatic sorting by Komi alphabet in xsl (as discussed
- todo
- SmeNob:
- incorporate the passives into the last version of the sme: nob
- start a new compilation of SmeNob and improve it based on the experience
- todo
- SmaNobSwe:
- extend the smanobswe dictionary: waiting for data (incorporate the data from
- todo
- SjdRus:
- continue the work on the Kildin-Russian dictionary, next internal deadline
- todo
- working on the correct localization of sjd and encountered a weird bug
- todo
- Lexicon workshop
- contact Kimberly Mäkäräinen and ask whether she might be willing to share
- todo
- MT
- embed gt_dicts in A_ITE (deadline 10. June)
- internal deadline tomorrow 00 a.m.
- test A-ITE on Windows
- todo
- Permanent education
- prepare/update XLS course materials
- learn UML
- still uneducated
Maja
- Prepare texts about the normativity issue for SGL/SGM
- done
- more work on sma adjectives
- not done
- look at incoming loanwords - do missinglist
- not now
- continue gathering sma corpus texts
- not prio.
- finish compound tags for adjectives
- not prio.
- fix bugs!
- not prio.
Sjur
- test the sma speller on the gold standard document
- still waiting for the gold standard document
- formulate license header
- done
- add voikko support to our proofing test bench
- still delayed
- add all our Sámi analysers and test them as spellers
- still not done
- run tests using Hunspell, Voikko, Polderland for our Sámi lexicons
- still not done
- install & configure the Unison news reader
- still not done
- difftest for fst and PL speller
- still not done
- Northern areas
- plan a meeting/seminar in Tromsø
- make a plan for the first two years, update overall plan
- @TTS: continue public tender process
- make Leif Åge send out CDs to distribution points
- continue Nordplus Sprog project
- announced position as summer trainee (sommarjobb) - got several
- Write a formal letter to Davvi girji about electronic dictionaries
- make XSL script to transform leaflet Forrest XDocs to an OOo Draw document
- name db/risten.no
- follow-up on some Polderland-related bugs: 621, 630, 652
- find and contact the correct person in SD, to get the manuscript for all Sámi
- write new build commands for make
- started to look at Tomi's work, but will need some more time; first
- others could have a look as well
- when the new build infrastructure works as it should, delete the old ones
- read through and comment by Wednesday afternoon
- test the new build commands
- fix bugs!
Thomas
- prepare texts about the normativity issue for SGL/SGM
- not worked on
- digitise South Sámi books
- worked hard
- fix bugs!
- worked some
Tomi
- run tests using Hunspell, Voikko, Polderland for our Sámi lexicons
- not done
- put together the TTS preprocessing transducers and scripts
- not done
- write new build commands
- working
- try to compile voikko
- this is not compiling
- add all our Sámi analysers and test them as spellers
- not done
- when the new build infrastructure works as it should, delete the old ones
- not done, though no complaints about it not working. or is anyone using new
- document how compounding is controlled in the PLX conversion
- fix double hyphen bugs
- fix PL smj hyphenator bug
- fix PL conversion bugs
- fix bugs!
Trond
- corpus infra:
- reorganise subdirs as needed
- Not done
- Northern areas
- plan a meeting/seminar in Tromsø
- Barely started
- make a plan for the first two years, update overall plan
- Not done
- MT/Terminology
- Worked quite a lot with fin-sme. Tag harmonising is needed.
- install updated corpus files in Oslo
- Not done. More preparatory work is needed here first.
- sms number generator
- Done
- fix bugs!
- Active on Bugzilla this week...
Oahpa!
Trond: A lot has happened with localisation. The place name list have been
Ciprian: added Kildin localisation, but it doesn't work - it isn't possible
The Finnish place names are now translated (should be proofread by name
The test bench is made for testing Sahka, and soon Vasta. It makes it possible
TODO
- Register oahpa.no (Trond)
- From the start: sjd_oahpa Leksa in deu and eng as well (Ciprian)
- delayed
- clock and date for Numra (Ciprian)
- Numra for Skolt Sámi (Trond, Ciprian)
- Done, needs to be commented in.
- email notification when the server goes down (Ciprian)
- Finding a volunteer to translate the sme Leksa lexicon to Swedish (Trond)
- found, but he has no time to do it
- add Captcha for the feedback e-mail address (Ciprian)
- check sms number fst for completeness (Trond)
- done
Corpus gathering
Børre: made a python script to fetch articles from avvir.no as they are
TODO:
- get all the latest Sámi Parliament protocols into our repository (Børre)
- contact the Sámi translators at the government and ministries to get more
- continue gathering sma corpus texts (Maja)
- get sma articles in Š-bláđđi
- the Gun Utsi book is almost there - one contract missing (Maja)
- write formal letter to Davvi Girji (Sjur)
- send a copy of the signed contracts back to the authors, translators and
- find and contact the correct person in SD, to get the manuscript for all Sámi
- get the sma yearbooks from Saemien sïjhte (Maja)
- contact certain sma writers (Børre)
- contact Ávvir about renewed corpus cooperation (Børre)
- contact Inga Margrethe Bjørn Eira (Maja)
- give Maja a WebSak intro (Børre)
- restart the letter mailing thing using WebSak (Maja)
Promoting Divvun
TODO:
- make leaflet to inform about the project (Børre)
- add InDesign text (Sjur)
- make XSL script to transform Forrest XDocs to an OOo Draw document
- distribute CD version through the library bus, the language centres and
- make him send out CD's accordingly (Sjur)
- update online download log statistics page (Børre)
Future plans, directions and ideas
See a separate document in plan/strat/5year.jspwiki.
Northern areas project
First major obstacle: make working keyboards and fonts.
- Choice of fonts
- Rendering of fonts in MS Word and other important programs
- Keyboard layout
- Empirical phase: What keyboards are around
- Design phase: Make the optimal keyboards -- for all OSes
Write trustworthy and detailed documentation (in Russian)
What we know:
- you can type on some computers (Michael R: it works), not on others. Q: what are the differences between the two types of computers?
- some fonts can't display the chars, others can (but usually ugly)
TODO:
- Report from journey to FM/UD (Trond)
- Course plan for meeting/seminar, October. Financing must be in place.
- Attend a beginners' course in Russian (priority: the alphabet!) near you.
- make a plan for the first two years, update overall plan (Trond, Sjur)
Infrastructure
Out of the box experiences:
- fao could not be compiled out of the box (the Oslo gang tried) because of a
- our list of software to install is incomplete or lacking installation
- external tools require manual installation
- there are bugs in the gtsetup.sh script
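The missing-tool problems listed above could be caught before compilation starts. A minimal sketch of such a pre-flight check, assuming a hypothetical tool list: the names text_cat, lookup and preprocess are taken from commands quoted elsewhere in these minutes, but the list itself and the function name are illustrative and do not describe what gtsetup.sh actually does.

```python
import shutil

# Hypothetical list of external tools; the real list belongs in the
# installation documentation.
REQUIRED_TOOLS = ["text_cat", "lookup", "preprocess"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


def report(tools=REQUIRED_TOOLS):
    """Build a one-line, human-readable summary of the check."""
    missing = missing_tools(tools)
    if missing:
        return "Missing tools: " + ", ".join(missing)
    return "All required tools found."
```

Calling report() on a fresh machine would list whatever still has to be installed by hand.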
Updated corpus online
See Ciprian's document about the corpus content in
Issues:
- filenames
- organisation of subdirs?
- content (or lack thereof) of original and converted files
- svn repo reorganisation
facta$convert2xml.pl --nolog --corpdir=/usr/local/share/corp L1allOrt.correct.txt
Error message:
sh: /home/sjur/gtmain/gt/script/text_cat: No such file or directory
L1allOrt.correct.txt: ERROR errors in /home/sjur/gtmain/gt/script/text_cat -q \
-x -d /home/sjur/gtmain/gt/script/LM "/usr/local/share/corp/tmp/L1allOrt.correct.txt.tmp0":
text_cat isn't part of our repository, we need to add it - we are using a
TODO:
- make the conversion scripts work with the new svn repository
- add check for almost empty content
- already implemented in convert2xml.pl. Still empty or almost empty files
- add processing of svg files
- not done
- reorganise subdirs as needed (Ciprian, Trond)
- not yet done
- identify parallel nob files - should be automatic, but needs to be checked
- sentence-align sme with nob (Ciprian)
- the aligner wasn't working, Børre has tried to fix it
- not yet fixed - bug not found
- analyse/disambiguate (Ciprian)
- preferably also dep -> check with Oslo if it can be used (Ciprian)
- install in Oslo (Ciprian, Trond)
- add better handling of unknown strings in our analysers (???)
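The "almost empty content" check in the list above is reported as already implemented in convert2xml.pl. Purely as an illustration of the idea, here is a sketch in Python. It assumes the converted files are XML and that "almost empty" means fewer than some number of non-whitespace characters of text; the threshold and the function names are hypothetical and not taken from the Perl script.

```python
import xml.etree.ElementTree as ET

# Hypothetical threshold: files with less text than this count as almost empty.
MIN_CHARS = 50


def text_length(xml_string):
    """Count the non-whitespace characters of all text in an XML document."""
    root = ET.fromstring(xml_string)
    text = "".join(root.itertext())
    return len("".join(text.split()))


def is_almost_empty(xml_string, min_chars=MIN_CHARS):
    """Return True when the document holds less text than the threshold."""
    return text_length(xml_string) < min_chars
```

A summary script could run this over every converted file and list the ones that need a second look.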
Corpus infra remake
Børre: Had a look at convert2xml.pl. As it now is expected to work on
As for access to converted corpus, there is a user apache_corpus on
TODO:
- convert or move the files in the upload dir to the real corpus repo
- almost finished, moved files to both bound/ and free/, will soon
- turn on corpus summary crontab (Børre)
License
Børre and Sjur agreed on a first version of license header. Børre
TODO:
- install & configure the Unison news reader (Sjur)
- read and comment the license discussion (all)
- done
- formulate license header (Børre, Sjur)
- done
- add license header to all files (Børre, Ciprian, everybody)
- done
Corpus interface
This depends on the infrastructure cleanup.
TODO:
- make a simple web search form for the UiT corpus repository (Ciprian, X)
- check out the new version of CWB with Unicode support (Ciprian)
- fall
Makefile + tag simplification
Problems with the proofing tools compilation, now solved.
TODO:
- test latest proofing tools, compare results with previous version (Tomi)
- write new build commands (Sjur, Tomi)
- make new targets in parallel to the old ones, not by remaking them
- use a prefix or suffix to make the new targets easily identifiable during
- a first version committed, prefix is NEW-*
- test the new build commands (Sjur, Trond, Lene, Thomas)
- command: make TARGET=sme NEW-fst etc.
- when the new build infrastructure works as it should, delete the old ones
General list
Meänkieli adaptions in our infrastructure.
Requirements:
- separate Subversion repository
- structured roughly as our main repository
- limited access to the closed repo, in the sense that one language group
- all tools, dtd's, configs work as in and from the main Subversion, such that
- we should have scripts that move a full language dir from one repo to the
- the amount of work for setting this up should be minimal, or we have to ask
Tentative task list
- figure out the svn dir structure in the closed lang repo, set it up
- figure out how to make the infrastructure in the open/main repo available for
- when we have an idea of the amount of work, decide whether we just do it, or
- implement the previous point
- transfer the language files, preferably using a script for automatic transfer
- inform them about the changes
To accommodate future enhancements in different directions (in rough order of
- test bench for all parts of our language technology efforts
- test bench enhanced, but not yet complete
- improve Forrest i18n support with static sites
- reorganise the documentation:
- differentiate between target groups
- get better grouping
- decide what to write in Forrest and what in wiki
- update/add missing parts
- migrate lexc lexicons to XML, splitting the task
- Name lexica (the Name project)
- Dictionaries (already in XML, task is to integrate them)
- At least migrate the lexc open POSes (Komi as a pilot case)
- change the look of the documentation web
- corpus content moved to Max Planck repositories? Norsk språkbank?
- update infrastructure to allow content-restricted spellers for special target
TODO:
- make restricted access to our svn repository work (fit adaption)
- did some more work during the weekend
- make the new SL Server services functional: (Børre)
- group calendars
- set up corpus mirroring on the XServe again (Børre)
- finish the restructuring and cleaning of the script/ directory
- infrastructure remake: (Børre, Ciprian, Sjur, Tomi, Trond)
- more modularised make / build infra (prepare for smn, sms, sjd, others)
- look at omorfi for ideas of how to modularise
- merge gt, kt and st into one
- modularised preprocess and spellrelax
- alternatives to make:
- make a test-all target that runs all tests we have (Ciprian, Sjur, Trond)
- delayed until we have restructured the make/build process
- define and document testing routines (Ciprian, Sjur, Trond)
- delayed until we have restructured the make/build process
Linguistics
North Sámi
(nothing new, see proofing bugs below)
Lule Sámi
(nothing new, see proofing bugs below)
South Sámi
TODO:
- read through and comment by Wednesday afternoon (Sjur)
- adjectives (Maja with Thomas, Trond, Sjur)
- two competing naming conventions of continuation lexicons
- One naming goes ATTRSUFF-PREDSUFF-STEMTYPE
- One follows the sme convention of naming key adjectives
- There are duplicate lexica
- The comparative issue is still open in places
- The ATTRSUFF-PREDSUFF-STEMTYPE lexica now go to EVENCOMP and
- finish compound tags for adjectives (Maja)
Name lexicon/risten.no infrastructure
TODO:
- find already approved lists, in paper or electronic form (term team)
- convert paper lists to electronic lists (term team)
- convert lists to standard XML (Sjur, Tomi)
- add prepared lists to risten.no (Sjur, Tomi)
- fix i18n bug in risten.no/G5 (so they will work without the proper locale
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
- implement data synchronisation between risten.no and
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
- merge placenames which are erroneously in different entries: e.g. Helsinki,
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
Dictionaries
Ciprian has recompiled fkv-nob-fkv, but without glossa... due to
StarDict is useless for Cyrillic languages, mainly because of the scanning
- we need a Russian Windows machine, bought in Russia, using the software they
- we also of course need a western/Norwegian Windows machine for testing
Released:
- FKV dictionary release: we need a web page (see task list below).
- SME: NOB update release
Other things dictionary-related:
- risten.no data as part of our dictionaries
- dictionaries as part of risten.no
- dictionaries and risten.no as part of Autshumato ITE
TODO:
- fkv: nob and nob: fkv is now scheduled for an April release:
- content and webpage update (Trond, Verena)
- re-release the MacDict and StarDict versions with bugfix and version info
- kom: fin-eng
- moved the original kom-lex.xml to the inc-dir and froze it
- split it by pos into the working_file dir, the ONLY place to work with
- now, the lexC files are generated via XSLT sheets, no perl scripts
- adjusted the Makefile
- prepared the pipeline for compiling the mac dict
- todo: make a pipeline for StarDict also (as far as I know, Jaska has)
- set up risten.no on eXist/XServe (as a beta version site) (Sjur)
- set up required infra for smenob on risten.no/XServe (Sjur)
- Continue the dictionary infrastructure discussion (Ciprian, Sjur, Trond)
- end user documentation (how to download and install) (Ciprian, Trond)
- Contact Davvi Girji about cooperation on electronic dictionaries
- developing the mobile phone version of smenob:
- Komi
- take out the doublets to a separate file (Ciprian)
- merge the doublets (Jaska, Trond)
- Completing the automaton to some state (Trond, Jaska, Paula)
- make the sort XSL script available for all languages to keep the source files
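The sorting tasks above are done with an XSL script. Purely to illustrate alphabet-aware sorting, here is a sketch in Python, not in XSLT. The Komi alphabet string is an assumption that should be checked against an authoritative source before use, and the function name is hypothetical.

```python
# Assumed Komi alphabet order (lower case); verify before relying on it.
KOMI_ALPHABET = "абвгдеёжзиійклмноӧпрстуфхцчшщъыьэюя"


def make_sort_key(alphabet):
    """Build a sort key that ranks characters by their position in alphabet."""
    rank = {char: index for index, char in enumerate(alphabet)}
    unknown = len(alphabet)  # characters outside the alphabet sort last

    def key(word):
        return [(rank.get(char, unknown), char) for char in word.lower()]

    return key


komi_key = make_sort_key(KOMI_ALPHABET)
sample = ["пу", "ӧти", "ош"]  # sample strings only
sorted_sample = sorted(sample, key=komi_key)
```

With plain code point order, ӧ would end up after п, because it lies outside the basic Cyrillic block; the key function places it directly after о instead.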
Proofing tools
Spelling feedback from Malta:
- use FST to model suggestions (Krister et al)
- Lene is evaluating the output of the Polderland speller
- first-letter errors rare, but the PLX speller still changes it quite often
- Oahpa and Divvun approaching each other
- using Wikipedia as a source for spelling errors and test material (a French
South Sámi
Beta release: June 15. We should be getting the Polderland (now Knowledge
TODO:
- test the sma speller on the gold standard document (Sjur)
- difftest for fst and PL speller (Sjur)
- External beta testers:
- David
- Jovsset
- the Røros group
- gold standard testing (Sjur)
HFST- and Voikko-based proofing tools
Sjur met with the HFST people yesterday. Two things happening in parallel:
- hfst3 - being made ready for public release, with proper inclusion of
- speller/lookup library:
- speller/lookup library
- voikko integration of this library
TODO:
- Change the license tag to GPL for voikko inclusion. (see above)
- high priority
- run tests using Hunspell, Voikko, Polderland for our Sámi lexicons
- check out the voikko code, see this page
- try to compile it (development is done on Linux, no MacOS X testing so far)
- add voikko support to our proofing test bench (Sjur)
- add all our Sámi analysers and test them as spellers (Tomi, Sjur)
Testing
Testing open-source Norwegian spellers
Sjur has invited the open-source group to test their spell-checker using
Speller bugs
List of bugs returned from Polderland:
- 621
- 630
- 652
- 656
- 676
Tag reordering for abbreviations has caused a lot of problems:
smj:
hr.  hr.  hr+ABBR+Acc
cand.philol.  cand.philol.  cand.philol+ABBR+N+Acc
Per  Per  Per+N+Prop+Mal+Sg+Attr
sme:
hr.  hr.  hr+N+ABBR+Acc
Per  Per  Per+N+Prop+Mal+Sg+Attr
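Differences like the ones in the smj and sme analyses above can be spotted mechanically. A minimal sketch, assuming analyses are plain strings of the form lemma+TAG+TAG; the function names are hypothetical and this is not part of the existing test bench.

```python
def split_analysis(analysis):
    """Split 'lemma+TAG1+TAG2' into the lemma and its list of tags."""
    lemma, *tags = analysis.split("+")
    return lemma, tags


def tag_mismatch(analysis_a, analysis_b):
    """Classify how two analyses of the same word differ in their tags."""
    lemma_a, tags_a = split_analysis(analysis_a)
    lemma_b, tags_b = split_analysis(analysis_b)
    if lemma_a != lemma_b:
        return "different lemma"
    if tags_a == tags_b:
        return "identical"
    if sorted(tags_a) == sorted(tags_b):
        return "same tags, different order"
    return "different tag sets"
```

Run over the abbreviation entries of two analysers, this would separate pure ordering differences from cases where a tag is missing altogether, as with hr. above.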
Open issues based on test results:
sme
- 399 - missing numerals (plural forms) - still OPEN
- 425 - X not recognised; single letters were left out - still OPEN
- 435 - roman numbers - inflection of single letter numbers
- we should pregenerate all numbers once and for all, and store them in a
- 461 - REGRESSION: missing suggestion (sáhkki)
- 508 - REGRESSION: accepts smj entries (most likely abbreviation missing)
- 520 - REGRESSION: r9 and š9 not defined (abbr. missing)
- 595 - prefix+name without hyphen (ovdaLot instead of ovda-Lot) -
- 603 - suomabealdi accepted - still OPEN
- 606 - compound-tags LEXICON VUOHTA - still OPEN
- 613 - short gen. as second compound part - still OPEN
- 619 - numerals and pronouns to NAMÁK and SASJ fails - vihttasoarttat
- 629 - a taking part in compounding without hyphen - still OPEN
- only open case has word A-finálaid compounded
- 647 - numerals+NOUN - still OPEN, open case has uppercase letters
- 648 - unmotivated suggestions with numeral+noun - still OPEN
- 661 - REGRESSION: abbr. not recognized
- 709 - sámedikkeválga accepted - OPEN
- 728 - vowel shortening GenCmp+Left-tagged - still OPEN
- 779 - caseforms of pronoun okatahat - still OPEN
- 785 - does not recognize alphabet-abbr+noun - OPEN
- 802 - NEW: multiword propernouns
- 803 - NEW: FINJU- words accepted single-handed
- 804 - NEW: guovttilogát, njealjilogát
- 805 - NEW: Nouns+acronyms
smj
- 435 - roman number - single letter numbers now recognised
- we should pre-generate all numbers once and for all, and store them in
- please note that inflection of single letter numerals is fine
- 482 - polardutkamin not recognized - FIXED
- 496 - REGRESSION: unrecognised clitics
- 556 - non-existent word accepted - FIXED
- 594 - lågenanguoktáj not recognized - still OPEN
- 595 - REGRESSION: prefix+name as split comp without hyphen
- 596 - C-giellan is not accepted - still OPEN
- 600 - Gen+hyph compound - FIXED
- 627 - prefix + hyphen does not get accepted - FIXED
- 647 - numerals+NOUN - still OPEN, open case has uppercase letters
- 648 - unmotivated suggestions with numeral+noun - still OPEN
- 650 - REGRESSION: noun prefix+name compound without hyphen
- 652 - UPPERCASE-typos only get acronym-suggestions - still OPEN
- 692 - numeral-variants - all but one fixed (gáktsalågenantjuotakta), but
- 744 - REGRESSION: numerals + clitic
- 803 - NEW: VINJU- words accepted single-handed
- 805 - NEW: Nouns+acronyms
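Bug 435 in both lists above suggests pre-generating all numbers once. For the roman numbers this could look like the sketch below. It assumes the usual 1 to 3999 range and only builds the list; where and in what format the result should be stored is left open, since the minutes do not say. The function names are hypothetical.

```python
# Value/symbol pairs in descending order, including the subtractive forms.
ROMAN_VALUES = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(number):
    """Convert an integer in the range 1..3999 to a roman numeral."""
    if not 1 <= number <= 3999:
        raise ValueError("number out of range: %d" % number)
    parts = []
    for value, symbol in ROMAN_VALUES:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def pregenerate_roman(upper=3999):
    """Return the roman numerals for 1..upper as a list of strings."""
    return [to_roman(n) for n in range(1, upper + 1)]
```

The resulting list could then be written out once and included as a static lexicon instead of being generated at every build.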
TODO:
- document how compounding is controlled in the PLX conversion (Tomi)
Hyphenator bugs
Open issues based on test results:
sme
No known issues!
smj
- 670 - Hard hyphen replaced with soft hyphen: 10-biejvvásattja (the word
sma
Command to test the hyphenator:
preprocess dev/corp/pressemelding.txt | lookup bin/hyph-sma.fst | cut -f2 | \
lookup bin/hyphrules-sma.fst | grep -v '^$' | cut -f2 | uniq | see
TODO:
- fix PL hyphenator errors (Tomi)
- almost done - one smj bug left
Installer changes
TODO:
- test InDesign installer (Sjur)
User documentation
TODO:
- InDesign documentation (Sjur)
- Norwegian translation received from Davvi Girji
1.2 release
Content:
- several smj bug fixes
- lexicalisations
- InDesign Mac & Win
- new OOo beta
- improved installers, at least for Mac, preferably also for Windows
Other
sma seminar in August/September
Theme:
- editing dictionaries
- beta tools for sma
- orthographic questions
Who can? | yes/no |
---|---|
Børre | yes |
Ciprian | yes, in September |
Maja | yes |
Sjur | yes |
Thomas | yes |
Tomi | yes |
Trond | yes, 30.8-3.9 (v 35), 13.9.-> |
Should we do this? YES.
- Place: Trondheim
- Date: as suggested above, primary choice week 35.
TODO:
- make preliminary program (Sjur, Maja, Ciprian, Trond)
- find rooms (univ.) (Inger Johansen, Maja)
- write inv. letter (Maja)
- send letter (Maja)
CLARIN meeting
Meeting on Thursday in Oslo, about texts and voice corpora. Trond and Lene will
Thursday inhouse seminar
Next time suggestion list:
- introduction to xslt - Ciprian to start out
- relevant xslt issues:
- basic principles of xslt …
- sorting in xslt … (have a look at the dictionary sort xslt script)
- converting from one xml format to another with xslt (sugg: convert from
Future seminars:
- XQuery
- More XML (needs concretisation)
- UML
- other suggestions?
Summer planning
Topics:
- speller test project -> Sjur, Børre, Thomas, X
- speech synthesis -> Sjur, Trond, BA as a starter
- risten.no -> Ciprian, Sjur, Tomi
- Barents project follow-up meeting -> Trond, Sjur
- sme and smj proofing tools, next version -> Thomas, Tomi, Sjur
- HFST-based proofing tools -> Sjur
- MT terminology project -> Trond, Linda, Fran, Kevin
- HSL centre status -> Trond
Dates:
- June 17th: Sjur in Karasjok, presenting Autshumato, discussing risten.no and
- July: ACL Uppsala
- August: IceTAL, Reykjavik
- September 8: Konvens-workshop Saarbrücken: LT and text-technological methods
- Lene will try to get an abstract accepted, deadline 16.5
- Trond?
Text to speech
There will be regular meetings in this project from now on, every second week.
TODO:
- put together the preprocessing transducers and scripts (Tomi)
- refine syntax / dependency rules (Biret Ánne)
- continue public tender process (Sjur)
CAT
TODO:
- make our dictionaries work in A-ITE (Ciprian)
- OmegaT has basic support for Stardict dictionaries, but it seems it only
Summer vacations
Name | Dates |
---|---|
Børre | 28/6-11/7, 2/8-15/8 |
Ciprian | Dates |
Maja | Dates |
Sjur | Dates |
Thomas | 21/6-23/7 |
Tomi | 5 weeks between 21/6-13/8 |
Trond | Dates |
Next meeting, closing
The next meeting is 21.6.2010, 09:30 Norwegian time.
The meeting was closed at 12:18.
Appendix - task lists for the next week
Børre
- get all the latest Sámi Parliament protocols into our repository
- contact the Sámi translators at the government and ministries to get more
- corpus infra:
- make the conversion scripts work with the new svn repository
- add check for almost empty content
- add processing of svg files
- make restricted access to our svn repository work (fit adaption)
- corpus access on the XServe
- turn on corpus summary crontab
- contact Ávvir about renewed corpus cooperation
- implement language switch for static divvun site
- improve XSL script to transform leaflet Forrest XDocs to an OOo Draw document
- get translations of thank-you letter
- make the new SL Server services functional:
- group calendars
- set up corpus mirroring on the XServe again
- give Maja a WebSak intro
- fix bugs!
Ciprian
- sma seminar in August/September:
- make preliminary program
- Skolt-Oahpa:
- fine tuning of semantic tags
- finish localization
- terminology:
- merged the 2 doc files with the xls file containing the 2006 law terminology
- prepare a StarDict test version of the merged data for demo
- compile a dictionary in StarDict format with all data featuring the
- add license header to all files
- make restricted access to our svn repository work (fit adaption)
- PhD project description (internal deadline end of June 2010):
- search, read, write (execute): high priority!
- read and answer the newsgroups messages
- update corpora (both monolingual and parallel) for Oslo (Glossa)
- check the correctness of parallelity between sme and nob files
- sentence-align sme with nob: fix tca2 problem
- analyse/disambiguate: todo -- waiting for the last version of FSTs
- corpus infra:
- make the conversion scripts work with the new svn repository
- add check for almost empty content
- add processing of svg files
- reorganise subdirs as needed
- infrastructure
- test cwb
- continue with restructuring and cleaning the script catalogue as suggested
- transform sme-lexC files into XML format
- make a schema/dtd description of the lexC-file (experiment with
- GT web:
- add coloring for pos analysis
- add a tree visualizer for the dependency trees
- add input help for special characters on the tool sites
- automatise the web statistics
- filter (English, German, etc.) input using language detection tools
- put a note on the sites that these are NOT MT tools
- input help for generating wordforms (dropdown menus).
- Sandbox Oahpa:
- integrate reCAPTCHA into Django (as Lene suggested)
- debug the installed sb_oahpa
- update and correct the Oahpa documentation site
- Running Oahpa:
- fix leksa_n (proper nouns) related bug after adding Finnish to Leksa
- implement a testbench for Vasta (as Lene needs)
- integrate reCAPTCHA after the SB-test
- add an Oahpa clock and date exercise (cf. Numra)
- email notification when the server goes down
- check the XXX?
- Sjd/Kom/Sms/Etc-Oahpa
- try to implement and embed virtual keyboards for each specific Oahpa
- try to learn something about compiling keyboards in general
- dictionaries, generally:
- synchronize the source language entries from a specific dictionary with the
- the StarDict on Windows: try the HTML-plugin (that means that users can use
- try to reduce the dict-size on mac: experiment with xPointer, etc.
- Fkv:Nob - Nob:Fkv:
- incorporate novel glossary into the dict
- try to implement a web version of the dictionaries in HTML
- KomEngFin:
- test the automatic sorting by Komi alphabet in xsl (as discussed
- SmeNob:
- incorporate the passives into the last version of the sme:nob
- start a new compilation of SmeNob and improve it based on the experience
- SmaNobSwe:
- extend the smanobswe dictionary: waiting for data (incorporate the data from
- SjdRus:
- continue the work on the Kildin-Russian dictionary, next internal deadline
- Lexicon workshop
- contact Kimberly Mäkäräinen and ask whether she might be willing to share
- MT
- embed gt_dicts in A-ITE (deadline 15. June)
- test A-ITE on Windows
- Permanent education
- prepare/update XSL course materials
- learn UML
Maja
- sma seminar in August/September:
- make preliminary program
- find rooms (univ.)
- write inv. letter
- send letter
- prepare texts about the normativity issue for SGL/SGM
- more work on sma adjectives
- look at incoming loanwords
- continue gathering sma corpus texts
- finish compound tags for adjectives
- fix bugs!
Sjur
- sma seminar in August/September:
- make preliminary program
- Course plan for meeting/seminar, October. Financing must be in place.
- test the sma speller on the gold standard document
- add voikko support to our proofing test bench
- add all our Sámi analysers and test them as spellers
- run tests using Hunspell, Voikko, Polderland for our Sámi lexicons
- install & configure the Unison news reader
- difftest for fst and PL speller
- Northern areas
- plan a meeting/seminar in Tromsø
- make a plan for the first two years, update overall plan
- @TTS: continue public tender process
- make Leif Åge send out CDs to distribution points
- continue Nordplus Sprog project
- Write a formal letter to Davvi Girji about electronic dictionaries
- make XSL script to transform leaflet Forrest XDocs to an OOo Draw document
- name db/risten.no
- follow-up on some Polderland-related bugs: 621, 630, 652
- find and contact the correct person in SD, to get the manuscript for all Sámi
- write new build commands for make
- when the new build infrastructure works as it should, delete the old ones
- read through and comment by Wednesday afternoon
- test the new build commands
- fix bugs!
Thomas
- test the new build commands
- prepare texts about the normativity issue for SGL/SGM
- digitise South Sámi books
- fix bugs!
Tomi
- run tests using Hunspell, Voikko, Polderland for our Sámi lexicons
- put together the TTS preprocessing transducers and scripts
- write new build commands
- try to compile voikko
- add all our Sámi analysers and test them as spellers
- when the new build infrastructure works as it should, delete the old ones
- document how compounding is controlled in the PLX conversion
- fix double hyphen bugs
- fix PL smj hyphenator bug
- fix PL conversion bugs
- fix bugs!
Trond
- sma seminar in August/September:
- make preliminary program
- Report from journey to FM/UD
- Course plan for meeting/seminar, October. Financing must be in place.
- test the new build commands
- corpus infra:
- reorganise subdirs as needed
- Northern areas
- plan a meeting/seminar in Tromsø
- make a plan for the first two years, update overall plan
- MT/Terminology
- install updated corpus files in Oslo
- fix bugs!