Metadata

Contents:

Actors
Sessions
Texts
- Actors
- Date
- Language(s)
- Modality
- Language
- Genre
- Register
- Medium
Other conventions
See also

This page documents metadata categories and subcategories as well as labels we use for these metadata in the Freiburg-Tromsø Speech Corpora.

Within the project we collect different kinds of metadata. Not all of it can be made public, for ethical and legal reasons. Here we document only the metadata categories relevant for the corpora published through Korp. The main metadata categories describe:

Actors (e.g. a recorded speaker, author, translator or annotator)
Sessions (e.g. an annotated recording or an annotated written text)
Texts (e.g. modality or genre)

All publicly available metadata is stored in IMDI format on the Session node in the TLA, in files separate from the ELAN annotations. A script (which does not yet exist) will convert the IMDI files into a structure that can be read into the Korp interface.

Actors

Speakers (e.g. informants/consultants recorded and transcribed or authors/translators of written text included in the corpora)
Annotators (e.g. PIs or assistants transcribing, translating or otherwise annotating recordings or written text included in the corpora)

Sessions

Actors
Date
Equipment
Media
Place
Project
Languages

Texts

Actors

Date

Language(s)

Modality

As a label for this category we use _Modality_, by which we mean the way in which signs are transmitted by a sender. This category has two values:

oral (e.g. speech which we have recorded on audio or audio+video and transcribed or speech which is transcribed, but where there is no audio available because it is lost or the speech was transcribed without being recorded)
written (e.g. handwritten or printed texts, texts published online)

Other potential values (not relevant for our projects) are:

gestured
signed

Note that the kind of perception by a receiver is not relevant for our metadata categories (a written text can be received orally if we use text-to-speech, etc.). Neither does _Modality_ in our sense refer to the actual medium (paper, video, etc.).

Language

Three-letter code in accordance with ISO 639-3

Genre

poetry
fiction
ritual
advertisement
biography
fairy tale
facta
idiom
narrative
teaching
story

Register

formal
informal
neutral
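
The value lists above for _Modality_, _Genre_ and _Register_ behave like closed vocabularies, so labels can be checked mechanically before they are published. The following is only a minimal sketch, not an existing project script: the function name `invalid_labels` and the representation of a text's metadata as a plain dictionary are assumptions made for illustration. The values themselves are copied from the lists on this page.

```python
# Closed vocabularies copied from this page.
# The dictionary layout and the function name are hypothetical.
VOCABULARIES = {
    "Modality": {"oral", "written"},
    "Genre": {
        "poetry", "fiction", "ritual", "advertisement", "biography",
        "fairy tale", "facta", "idiom", "narrative", "teaching", "story",
    },
    "Register": {"formal", "informal", "neutral"},
}


def invalid_labels(text_metadata):
    """Return the entries whose value is not in the documented vocabulary.

    Categories without a closed vocabulary on this page are ignored.
    """
    return {
        category: value
        for category, value in text_metadata.items()
        if category in VOCABULARIES and value not in VOCABULARIES[category]
    }


if __name__ == "__main__":
    sample = {"Modality": "spoken", "Genre": "poetry", "Register": "neutral"}
    print(invalid_labels(sample))
```

An empty result means that every checked label is one of the documented values.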

Medium

Other conventions

Note that the file names we use also include some metadata already. For instance:

sms19610000lagercrantz318
sjd20150609aaa-sport

where the first three letters sms or sjd (in accordance with ISO 639-3) always mark the language (or main language) of a given session, and the following eight digits 19610000 or 20150609 always mark the date of a given session in the format YYYYMMDD. If the exact date is unknown or cannot be specified (e.g. in a book publication where only the year is given), we use the digit 0 for the unknown parts.
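
The file name convention can be read back programmatically. The sketch below is an illustration under stated assumptions, not a project tool: the names `SessionName` and `parse_session_name` are hypothetical, the language code is assumed to be lowercase, and a date part consisting only of zeros is treated as unknown. Whatever follows the eight digits is returned unchanged, since its structure is not documented here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed pattern: three lowercase letters, eight digits (YYYYMMDD),
# then a free-form remainder that this page does not specify further.
_NAME_RE = re.compile(
    r"^(?P<lang>[a-z]{3})(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})(?P<rest>.*)$"
)


@dataclass
class SessionName:
    language: str          # ISO 639-3 code of the (main) language
    year: Optional[int]    # None if given as 0000 (assumption)
    month: Optional[int]   # None if given as 00
    day: Optional[int]     # None if given as 00
    rest: str              # remainder of the file name, uninterpreted


def _zero_as_unknown(digits):
    """Convert a digit string to int, mapping all-zero strings to None."""
    value = int(digits)
    return value if value != 0 else None


def parse_session_name(name):
    """Split a session file name into language code, date parts and remainder."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised session name: {name!r}")
    return SessionName(
        language=match.group("lang"),
        year=_zero_as_unknown(match.group("year")),
        month=_zero_as_unknown(match.group("month")),
        day=_zero_as_unknown(match.group("day")),
        rest=match.group("rest"),
    )


if __name__ == "__main__":
    for example in ("sms19610000lagercrantz318", "sjd20150609aaa-sport"):
        print(parse_session_name(example))
```

The parser does not check that the three letters are a registered ISO 639-3 code or that the digits form a valid calendar date. Both checks would need additional data or rules not given on this page.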
