Kontatto is a corpus of transcribed and annotated spoken data collected by Silvia Dal Negro at the Free University of Bozen/Bolzano. It consists of almost 150,000 orthographic words in 55 recordings, involving 97 different speakers and totalling 18 hours of speech. The corpus is multilingual and contains a variety of spontaneously occurring code-mixing patterns. The languages are not evenly distributed, however: 80.4% of the words are Tyrolean, 11.5% are Italian, 2.6% were classified as Trentino, 0.8% belong to other languages (e.g. Ladin, English), and the remaining 4.7% cannot be confidently attributed to any particular language (e.g. proper names, widespread loanwords, some interjections).
This repository contains the Kontatto-MT subset of the corpus. The data were collected using a collaborative Map Task, in which two speakers and an interviewer interacted to navigate a physical map and reach a given destination. This subcorpus documents a variety of languages and dialects of the Dolomite region, including (some) Tyrolean and Trentino dialects, Italian, Cimbrian and Ladin, usually combined in the same dialogue. At present it consists of 35,453 tokens, 73% of which are classified as local German dialect.
Kontatto was created within the scope of two projects financed by the Autonomous Province of Bozen-Bolzano: “Italiano-tedesco: aree storiche di contatto in Sudtirolo e Trentino” (2011-2014) and “Germanico-Romanzo: discorsi e strutture in contatto nell’area dolomitica” (2016-2019). Over the years, many research assistants and students have contributed to the annotation of the data: Katrin Tartarotti, Mara Leonardi, Marta Ghilardi, Nicole Giaier, Adriana Rasa, Lucia Rossaro, Luigi Parisi and Jay Hevelone. The CLARIN deposit was prepared by Greta Franzini and Luca Ducceschi of Eurac Research.
This deposit consists of:

- audio recordings (.flac)
- textual annotation files (.eaf) with 5 tiers of annotation each:
  - main transcription line (ELAN Type: default-it)
  - tokenization (ELAN Type: Word)
  - part-of-speech (ELAN Type: POS)
  - language (ELAN Type: Language)
  - lemmatization in Standard German or Standard Italian (ELAN Type: Lemma)
- metadata and tagset files (.csv)
- map files (.pdf)
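As a rough illustration of how the annotation tiers listed above might be read, the following sketch groups the annotation values of an .eaf file by ELAN linguistic type. It relies only on the general EAF XML layout (`TIER` elements carrying a `LINGUISTIC_TYPE_REF` attribute and nested `ANNOTATION_VALUE` elements). The function name `tiers_by_type` is hypothetical, and the sketch assumes the type names in the files match those given above (default-it, Word, POS, Language, Lemma). It does not align tokens across tiers.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def tiers_by_type(eaf_source):
    """Return {linguistic type: [annotation values]} for one .eaf file.

    eaf_source may be a file path or a binary file object.
    Assumption: tiers are told apart by LINGUISTIC_TYPE_REF, as in
    standard EAF; values from several tiers of one type are concatenated.
    """
    root = ET.parse(eaf_source).getroot()
    values = defaultdict(list)
    for tier in root.iter("TIER"):
        ling_type = tier.get("LINGUISTIC_TYPE_REF")
        for value in tier.iter("ANNOTATION_VALUE"):
            # Empty annotations have no text node; store them as "".
            values[ling_type].append(value.text or "")
    return dict(values)
```

Calling `tiers_by_type("eaf/Kontatto_MT_TYR_01.eaf")` and looking up the `"Language"` key would then give the language labels of that recording, provided the type names are as assumed.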
├── flac
│   ├── Kontatto_MT_IT_01.flac
│   ├── Kontatto_MT_IT_02.flac
│   ├── Kontatto_MT_TR_01.flac
│   ├── Kontatto_MT_TR_02.flac
│   ├── Kontatto_MT_TR_03.flac
│   ├── Kontatto_MT_TYR_01.flac
│   ├── Kontatto_MT_TYR_02.flac
│   ├── Kontatto_MT_TYR_03.flac
│   ├── Kontatto_MT_TYR_04.flac
│   ├── Kontatto_MT_TYR_05.flac
│   ├── Kontatto_MT_TYR_06.flac
│   ├── Kontatto_MT_TYR_07.flac
│   ├── Kontatto_MT_TYR_08.flac
│   ├── Kontatto_MT_TYR_09.flac
│   ├── Kontatto_MT_TYR_10.flac
│   ├── Kontatto_MT_TYR_11.flac
│   ├── Kontatto_MT_TYR_12.flac
│   ├── Kontatto_MT_TYR_13.flac
│   ├── Kontatto_MT_TYR_14.flac
│   └── Kontatto_MT_TYR_15.flac
├── eaf
│   ├── Kontatto_MT_IT_01.eaf
│   ├── Kontatto_MT_IT_02.eaf
│   ├── Kontatto_MT_TR_01.eaf
│   ├── Kontatto_MT_TR_02.eaf
│   ├── Kontatto_MT_TR_03.eaf
│   ├── Kontatto_MT_TYR_01.eaf
│   ├── Kontatto_MT_TYR_02.eaf
│   ├── Kontatto_MT_TYR_03.eaf
│   ├── Kontatto_MT_TYR_04.eaf
│   ├── Kontatto_MT_TYR_05.eaf
│   ├── Kontatto_MT_TYR_06.eaf
│   ├── Kontatto_MT_TYR_07.eaf
│   ├── Kontatto_MT_TYR_08.eaf
│   ├── Kontatto_MT_TYR_09.eaf
│   ├── Kontatto_MT_TYR_10.eaf
│   ├── Kontatto_MT_TYR_11.eaf
│   ├── Kontatto_MT_TYR_12.eaf
│   ├── Kontatto_MT_TYR_13.eaf
│   ├── Kontatto_MT_TYR_14.eaf
│   └── Kontatto_MT_TYR_15.eaf
├── Kontatto_MT_metadata.csv
├── Kontatto_MT_tagsets.csv
├── Map_Task_1A.pdf
├── Map_Task_1B.pdf
└── Map_Task_2.pdf
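Since each recording in the listing above appears once under flac and once under eaf with the same base name, the two can be paired by file stem. The following is a minimal sketch under that assumption. The helper `pair_recordings` and the regular expression `NAME_RE` are hypothetical names, and the pattern is inferred only from the file names shown (a code such as IT, TR or TYR followed by a running number); it is not a documented naming scheme.

```python
import re
from pathlib import Path

# Pattern inferred from the listed file names, e.g. Kontatto_MT_TYR_01.
NAME_RE = re.compile(r"^Kontatto_MT_(?P<group>[A-Z]+)_(?P<number>\d+)$")


def pair_recordings(root):
    """Pair each .eaf file with the .flac file that shares its stem.

    Returns {stem: {"group", "number", "eaf", "flac"}}; "flac" is None
    when no matching audio file exists under root/flac.
    """
    root = Path(root)
    pairs = {}
    for eaf in sorted((root / "eaf").glob("*.eaf")):
        match = NAME_RE.match(eaf.stem)
        if match is None:
            continue  # skip files that do not follow the inferred pattern
        flac = root / "flac" / (eaf.stem + ".flac")
        pairs[eaf.stem] = {
            "group": match["group"],
            "number": int(match["number"]),
            "eaf": eaf,
            "flac": flac if flac.exists() else None,
        }
    return pairs
```

Run against the deposit root, this would be expected to return 20 entries, one per recording in the listing, each with a non-empty `"flac"` path if the download is complete.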