Speech Corpus Assembly
This document gives an overview of the work of assembling and editing the reading texts, i.e. the texts used when recording the voices.
Assembling different types of texts
Since we want our end product to be able to read "everything", the texts must range from formal to colloquial language. The different styles show different preferences for long words, possessive suffixes and particles, which in turn have different implications for prosody. We need a good mixture of these styles.
- Formal language: translated, 'bureaucratic' texts. These usually have much longer sentences than texts originally written in Sámi. A whole paragraph can be one sentence. Other characteristics: many subordinate clauses, passive sentences, mostly 3rd person singular and 1st person plural verb forms (short suffixes), participial constructions preferred to relative clauses (sentences are heavy at the beginning), possessive suffixes are common, particles are uncommon. Bureaucratic, political vocabulary with long words. Compounds in which the first element is trisyllabic are common. Many abbreviations and parentheses.
- Semi-formal language: Original language is Sámi, scholarly/textbook style. Slightly shorter sentences, relative clauses preferred to participial constructions, passive sentences, more verb forms such as dual verbs (disyllabic suffixes). Possessive suffixes are common, particles occur. Everyday vocabulary with elements of technical and traditional terminology, mixture of long and short words. Compounds in which the first element is trisyllabic are common. Many numbers, such as dates, years and amounts. Numbers also come in inflected forms, with colons. Many abbreviations and parentheses. Listings, such as nouns separated by commas.
- Neutral language: Original language is Sámi, children's textbook style. Short sentences, core word order, use of particles, possessive suffixes occur but are not common, everyday vocabulary, mostly short words. Numbers occur, mostly amounts and some years. Many listings, such as nouns separated by commas.
- Semi-colloquial language: Original language is Sámi, conversation style. Long sentences, with relative clauses or non-finite small clauses. Non-finite small clauses with gerunds and agentive preferred to relative clauses. Subjectless sentences. Sentences tend to have more material towards the end. Extensive use of particles, possessive suffixes occur. Traditional vocabulary with many derived verb forms, mixture of long and short words. Numbers occur, mostly years. Uncommon consonants and consonant centres are most likely to occur in these texts.
- List of single words, letters, numbers and dates.
- Fairytales: Fairytales are probably not good for text-to-speech purposes, because of the exaggerated prosody. However, there are some consonants and consonant clusters/groups that are rather uncommon in the other types of text, such as the voiceless sonorants hl, hr, hj, hn. These seem to be more frequent in traditional texts, such as fairytales. I have divided the fairytales between neutral and semi-colloquial language. The voice talents must be instructed to read the fairytales in an ordinary conversational style, and not to stay true to fairytale style.
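The observation that the voiceless sonorants are more frequent in some text types can be checked with a simple count. The following is a minimal sketch, assuming the texts are available as plain strings grouped by style. The function names `count_clusters` and `counts_by_style` and the style labels are hypothetical; only the cluster list hl, hr, hj, hn comes from the description above. Matching is purely orthographic, so it may also count letter sequences (for example across compound boundaries) that are not actually voiceless sonorants.

```python
from collections import Counter

# Clusters named in the text above; extend the tuple as needed.
RARE_CLUSTERS = ("hl", "hr", "hj", "hn")


def count_clusters(text, clusters=RARE_CLUSTERS):
    """Count orthographic occurrences of each cluster in a text (case-insensitive)."""
    lowered = text.lower()
    return Counter({c: lowered.count(c) for c in clusters})


def counts_by_style(texts_by_style, clusters=RARE_CLUSTERS):
    """Sum the cluster counts over all texts belonging to each style."""
    totals = {}
    for style, texts in texts_by_style.items():
        total = Counter()
        for text in texts:
            total.update(count_clusters(text, clusters))
        totals[style] = total
    return totals


if __name__ == "__main__":
    # Placeholder letter strings, NOT real North Sámi; replace with corpus texts.
    sample = {
        "neutral": ["aaa bbb ahla"],
        "semi-colloquial": ["ahla ehre ahja ahna"],
    }
    for style, counts in counts_by_style(sample).items():
        print(style, dict(counts))
```

Comparing the totals per style shows which text types need extra material to cover a given cluster.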
Editing texts
The chosen texts are not entirely authentic: they have been altered to improve reading fluency, and some had not been proofread properly before publishing. The changes are as follows:
- Most parentheses with English or Norwegian words have been removed, as have parentheses with year and page numbers.
- Norwegian sentences have been translated into North Sámi.
- Word order in long sentences has been changed to improve intelligibility.
- Non-fluent sentences have been corrected. (In long sentences the grammar is often faulty: for example, a list of nouns that should all be in the illative appears partly in the illative and partly in the nominative; subject-verb agreement is missing when the verb does not immediately follow the subject; the wrong case is used in relative clauses; or active and passive voice are mixed in the same sentence, with the result that the required subject is left out.)
- Illicit sentences have been improved; for example, all sentences now have a finite verb. (List continuations are not kept as separate sentences, although that is how they appear in the original text. Elliptic 'passages' are sometimes filled out. Semi-colloquial texts make extensive use of ellipsis, which impedes intelligibility.)
- Sometimes a comma is replaced by a full stop, because the two sentences connected this way will be read as two completely separate sentences anyway.
- Some non-Guovdageaidnu words/suffixes have been replaced with Guovdageaidnu words/suffixes for ease of recognition, for example 'goappaš' instead of 'guktuid'.
- Some Guovdageaidnu words are replaced with other words to ensure that all combinations are attested in the texts. See below.
- Extra passages have been added here and there in the middle of authentic texts. Texts have also been written specifically for this reading, to ensure coverage of all consonant centres, onsets etc. Words with rare combinations must be spread throughout the texts, so that they do not all appear in the same passage and make that passage particularly difficult to read.
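One way to check that words with rare combinations are spread out rather than concentrated, and that every combination is attested at least once, is to count them per passage. This is a minimal sketch under assumptions that are not in the text: passages are plain strings, rare combinations are matched as orthographic substrings, and the threshold `max_per_passage` is an arbitrary illustrative value. The names `rare_load`, `crowded_passages` and `missing_combinations` are hypothetical.

```python
def rare_load(passage, rare_combinations):
    """Return the number of words in a passage that contain any rare combination."""
    words = passage.lower().split()
    return sum(1 for w in words if any(c in w for c in rare_combinations))


def crowded_passages(passages, rare_combinations, max_per_passage=3):
    """Return (index, load) for passages whose rare-word count exceeds the threshold."""
    loads = [rare_load(p, rare_combinations) for p in passages]
    return [(i, n) for i, n in enumerate(loads) if n > max_per_passage]


def missing_combinations(passages, rare_combinations):
    """Return the combinations that are not attested in any passage."""
    corpus = " ".join(passages).lower()
    return [c for c in rare_combinations if c not in corpus]


if __name__ == "__main__":
    # Placeholder letter strings, NOT real North Sámi; replace with corpus passages.
    combos = ["hl", "hr", "hj", "hn"]
    passages = ["ahla ehre ahja ahna", "aaa bbb", "ahla ccc"]
    print("crowded:", crowded_passages(passages, combos))
    print("missing:", missing_combinations(passages, combos))
```

Passages flagged as crowded are candidates for moving some rare words elsewhere, and any missing combination indicates where an extra passage is still needed.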