meeting-2009-10-28
Formalities
Participants:
- Martti
- Antti
- Biret Ánne
- Trond
- Sjur
Tentative agenda:
- software / tools we need to build and test our tts
- how much time you can spend on this, and the payment for the time
- overall project schedule, and how it fits with your plans
Project overview
Financed by unused money from the first Divvun project. Financed by the Sámi Parliament. The work is done in cooperation with UiT and HU. Targeted release date is sometime next year.
Recording
Voices: Do both a male and female voice - at least the recordings. Using the same text will simplify some of the perparations / postrprocessing, but not that much
The text should be something between rich and balanced. 2 hours of recordings, rather more than less. 120 - 150 000 letters is a rough esitmate, at least for Finnish. This would approximate perhaps 2000 sentences?
Whole paragraphs with intersentence dependencies. Even though it isn't possible to model that yet, it will come in the future, and it is good to be preprared.
The text needs to as much as possible reflect the text types we are expecting in final use. Things to consider:
- from single words and numbers to whole paragraphs
- also single letters
- dates
- with statistical synthesis you need several/many examples of each type
The microphone should be a high quality, phase linear condenser microphone. Sampling frequency = basic CD quality is good enough. At least 16 bit resolution. As noise free as possible.
Software / tools needed
easier to generate
start to plan a larger project with more independence for the UiT group, including a new software engineer position.
Output products
Three areas come up as most relevant:
- The society will be interested in web solutions of the kind known for nor, eng:
- Disabled people will need the usual stuff (screen readers etc on your local computer)
- Pedagogical programs and dictionaries would like integrated tts.
There are other future usages that is hard to envisage now, the synthetic voice has to be as general as possible, to be able to cope with as many different text types and usage scenarios as possible.
Tools for testing ling input
Tools for testing prosody prediction
- Data tagged for prosody
- A sharp ear
Testing procedure:
Until now:
- Ling analysis
- Feed to B-Á
- Look at output
- Rewrite ling analysis
- Etc.
Soon:
- Ling analysis
- Feed to syntesizer
- Listen to output <==== (use the prototype for this? No)
- Rewrite ling analysis
- Etc.
Work capacity at the HU
synthesizer made in Hki : -)
Now: Antti in a (too?) key position. Knowledge transfer is thus high on the priority list. It is not about money as long as we do not extend to a new person.
Intermediate solution:
A person working both for Hki and for Tromsø. This person would be
Work plan
UiT - speech corpus:
- Collect text
- make recordings
- cut recordings into sentenes (like for the prototype)
- transcribe recordings
UiT - Building the textual input for training the synthesizer:
- letter to sound (automatic)
- annotation of:
- phrase breaks (3-4 levels, manual work)
- word prominences (3-4 levels, manual work)
- phrase breaks (3-4 levels, manual work)
ortography Mon lean okta sápmelaš, guhte lean bargan visot sámi bargguid -> -> lts rules (automatic) mun leæn okː.htɑ sɑːp.me.lɑʃ , kuh.te leæn pɑrə.kɑn vi.soh sɑː.miː pɑrk.kujht -> manual labeling of prominence and breaks (for training the model) "mun leæn 'okː.htɑ "sɑːp.me.lɑʃ , || kuh.te leæn 'pɑrə.kɑn "vi.soh 'sɑː.miː 'pɑrk.kujht
UiT: enhance the lts system to automatically add full prosody labeling as input to the synthesis. Match as close as possible the trained data, but a perfect match is not required to get a good synthesis.
Needed: A theory of Sámi prosody
- a. a perfect theory of Sámi prosody (from ToBi and onwards)
- b. a theory of tts-relevant Sámi prosody
b. 3-level word prominence, 3-4 levels of phrase breaks
The Hki functional approach:
Prominence:
Antti has a proof of concept ruleset for Finnish
- + of linguistic interest
- - difficult to implement
Machine learning:
- + easy to implement
- - needs larger corpus than 1 speaker training material
Build a spec based upon what has been done for Finnish
- Look at actual Finnish input
- Get a copy of the Finnish orthography-to-syntinput machine
Todolist
- work on the recordings:
- building a suitable text corpus
- prepare recordings (contact voice owners, studio, etc)
- record, split, annotate
- building a suitable text corpus
- Antti wants ot be informed about the annotation and the other work.
- Tromsø wants to get the proof of concept ruleset and some Finnish annotated
- Speech sample forthcoming from Tromsø
Then work proceeds as agreed.