This page documents the metadata categories and subcategories used in the Freiburg-Tromsø Speech Corpora, as well as the labels we use for them.
Project-internally, we collect different kinds of metadata. Not all of it can be made public, for ethical and legal reasons. Here we document only the metadata categories relevant for the corpora published through Korp. The main metadata categories describe:
- Actors (e.g. a recorded speaker, author, translator or annotator)
- Sessions (e.g. an annotated recording or an annotated written text)
- Texts (e.g. modality or genre)
All publicly available metadata is stored in IMDI format on the Session node in the TLA, in files separate from the ELAN annotations. A script (which does not yet exist) will convert the IMDI metadata into a structure that can be read into the Korp interface.
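Since the conversion script does not exist yet, the following is only a minimal sketch of what reading a Session node might look like. It assumes the IMDI 3.0 namespace and the usual `Session`/`MDGroup`/`Content`/`Actors` layout. The function name `session_to_flat_metadata`, the output keys, and the sample values are hypothetical, and the flat dictionary is just one possible intermediate structure, not the format Korp actually expects.

```python
import xml.etree.ElementTree as ET

# Namespace commonly found in IMDI 3.0 files (an assumption; check your own files).
IMDI_NS = {"imdi": "http://www.mpi.nl/IMDI/Schema/IMDI"}

# Hypothetical, heavily reduced IMDI file used only for illustration.
SAMPLE = """<METATRANSCRIPT xmlns="http://www.mpi.nl/IMDI/Schema/IMDI">
  <Session>
    <Name>example_session</Name>
    <MDGroup>
      <Content><Genre>fairy tale</Genre></Content>
      <Actors>
        <Actor><Role>speaker</Role><Code>S1</Code></Actor>
        <Actor><Role>annotator</Role><Code>A1</Code></Actor>
      </Actors>
    </MDGroup>
  </Session>
</METATRANSCRIPT>"""


def text_of(node, path):
    """Return the stripped text of the first match for path, or an empty string."""
    found = node.find(path, IMDI_NS)
    if found is None or found.text is None:
        return ""
    return found.text.strip()


def session_to_flat_metadata(imdi_xml):
    """Flatten a few Session fields into a plain dictionary."""
    root = ET.fromstring(imdi_xml)
    session = root.find("imdi:Session", IMDI_NS)
    if session is None:
        raise ValueError("no Session node found")

    actors = session.findall("imdi:MDGroup/imdi:Actors/imdi:Actor", IMDI_NS)
    return {
        "session_name": text_of(session, "imdi:Name"),
        "genre": text_of(session, "imdi:MDGroup/imdi:Content/imdi:Genre"),
        "actors": [
            {"role": text_of(a, "imdi:Role"), "code": text_of(a, "imdi:Code")}
            for a in actors
        ],
    }


if __name__ == "__main__":
    print(session_to_flat_metadata(SAMPLE))
```

Only fields that may be published should be copied into the output. Selecting them explicitly, as above, keeps internal metadata from leaking into the public corpus by accident.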
- Speakers (e.g. informants/consultants recorded and transcribed or authors/translators of written text included in the corpora)
- Annotators (e.g. PIs or assistants transcribing, translating or otherwise annotating recordings or written text included in the corpora)
- oral (e.g. speech that we have recorded on audio or audio+video and transcribed, or speech that has been transcribed but for which no audio is available, either because it is lost or because the speech was transcribed without being recorded)
- written (e.g. handwritten or printed texts, texts published online)
Other potential values (not relevant for our projects) are:
Note that the way a receiver perceives the text is not relevant for our metadata categories (a written text can be received orally if we use text-to-speech, etc.). Nor does _Modality_ in our sense refer to the actual medium (paper, video, etc.).
Three-letter code in accordance with ISO 639-3
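Because ISO 639-3 identifiers consist of three lowercase Latin letters, a language label can be given a simple shape check before it is stored. The sketch below is illustrative only: the function name `looks_like_iso_639_3` is hypothetical, and the check does not verify that a code is actually assigned in the standard.

```python
import re

# ISO 639-3 identifiers are three lowercase ASCII letters.
ISO_639_3_SHAPE = re.compile(r"[a-z]{3}")


def looks_like_iso_639_3(code):
    """Return True if code has the shape of an ISO 639-3 identifier."""
    return ISO_639_3_SHAPE.fullmatch(code) is not None


if __name__ == "__main__":
    # "sme" (Northern Sami) and "kpv" (Komi-Zyrian) are used here as sample inputs.
    for candidate in ["sme", "kpv", "SME", "se"]:
        print(candidate, looks_like_iso_639_3(candidate))
```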
- fairy tale
Note that the file names we use already include some metadata. For instance:
XXX - ???