VinKo (Varieties in Contact) Corpus

Description

VinKo is a spoken corpus of crowd-sourced audio recordings, designed to provide linguistic information about the minority languages and dialects spoken in the area between Innsbruck and the Po Valley. It contains recordings of local languages and varieties spoken in the regions Trentino-Alto Adige/Südtirol, Veneto and Friuli-Venezia Giulia, with a particular focus on language contact between Germanic varieties (Cimbrian, Mòcheno, Tyrolean, Saurano) and Romance varieties (Ladin, Trentino and Veneto dialects). Data collection took place from June 2017 to May 2021.

Part of the recordings can be listened to in the open-access section “Listen and Explore” of the project web page, which is open to participants and to anyone else interested in exploring the collected data.

URL: https://www.vinko.it/index.php
Contact: vinko@ateneo.univr.it

Authors

Stefan Rabanus (University of Verona)
Alessandra Tomaselli (University of Verona)
Andrea Padovan (University of Verona)
Anne Kruijt (University of Verona)
Birgit Alber (Free University of Bozen-Bolzano)
Patrizia Cordin (University of Trento)
Roberto Zamparelli (University of Trento)
Barbara Maria Vogt (University of L’Aquila)

Readme structure

General
Abbreviations
Data Structure
Additional Information
Error Reporting

1. General

Creator: Anne Kruijt
Date issued: 2021-07
Last updated: 2021-07
Size: 37806 audio files; 37878 tokens
License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Italy (CC BY-NC-SA 3.0 IT) https://creativecommons.org/licenses/by-nc-sa/3.0/it/
Acknowledgments

Project of Excellence DiLLS: Digital Humanities applied to Foreign Languages and Literatures, 2018-2022, University of Verona. Project code: CUP n. B31I18000250006. Funding organization: Ministero dell’Università e della Ricerca.
Project PRIN 2015 “TREiL” (Technologies for Research and Education in Linguistics), 2017-2020, University of Trento (CIMeC/DIPSCO). Project code: Prot. 2015MNX5ZE - CUP n. E72F16001620001. Funding organization: Ministero dell’Università e della Ricerca.
Project AThEME (Advancing the European Multilingual Experience). Project grant: n. 613465. Funding organization: European Seventh Framework Programme for Research, Technological Development and Demonstration (EU).

Reference: Please cite the following Open Access paper if you use this dataset: http://www.kit.gwi.uni-muenchen.de/?p=13739&v=2
Authors: Cordin, Patrizia; Rabanus, Stefan; Alber, Birgit; Mattei, Antonio; Casalicchio, Jan; Tomaselli, Alessandra; Bidese, Ermenegildo; Padovan, Andrea
Title: VinKo, Versione 2
Date of online publication: 2018
Journal: Korpus im Text, Serie A, 13739

2. Abbreviations

Languages

“lldan” = Ampezzano (Ladin)
“lldba” = Badiot (Ladin)
“cim” = Cimbrian
“lldfa” = Fassan (Ladin)
“lldfo” = Fodom (Ladin)
“lldgh” = Gardenese (Ladin)
“mhn” = Mòcheno
“zah” = Saurano
“tre” = Trentino
“tir” = Tyrolean
“vec” = Veneto
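
For scripted work with the corpus, the codes above can be kept in a small lookup table. The sketch below copies the codes from the list; the helper `language_group`, which collapses the five Ladin sub-varieties (all sharing the prefix `lld`) into a single "Ladin" group, is a hypothetical convenience and not part of the corpus itself.

```python
# Language codes copied from the abbreviation list above.
LANGUAGE_CODES = {
    "lldan": "Ampezzano (Ladin)",
    "lldba": "Badiot (Ladin)",
    "cim": "Cimbrian",
    "lldfa": "Fassan (Ladin)",
    "lldfo": "Fodom (Ladin)",
    "lldgh": "Gardenese (Ladin)",
    "mhn": "Mòcheno",
    "zah": "Saurano",
    "tre": "Trentino",
    "tir": "Tyrolean",
    "vec": "Veneto",
}


def language_group(code: str) -> str:
    """Return 'Ladin' for the five 'lld*' codes, else the variety's own name.

    Hypothetical helper: the grouping is an assumption made for convenience.
    """
    if code not in LANGUAGE_CODES:
        raise KeyError(f"unknown language code: {code!r}")
    return "Ladin" if code.startswith("lld") else LANGUAGE_CODES[code]
```

For example, `language_group("lldfa")` returns `"Ladin"`, while `language_group("cim")` returns `"Cimbrian"`.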

Phonological phenomena

“obstr” = obstruent consonants
“sch” = /s/ retraction
“r” = realization of /r/

3. Data Structure

The folder of each language variety has the same internal structure. The corpus is organized as follows:

VINKO
│   README.md
│
├─── Audio folders
│     ├─── cim
│     │     │   S0027_cim_U0056.flac
│     │     │   S0034_cim_U0056.flac
│     │     │   T0101_cim_U0056.flac
│     │     │   T0101_cim_U0140.flac
│     │     │   W0099_cim_U0056.flac
│     │     │   W0102_cim_U0140.flac
│     │     │   ...
│     │
│     ├─── lldan (equivalent to "cim")
│     ├─── lldba (equivalent to "cim")
│     ├─── lldfa (equivalent to "cim")
│     ├─── lldfo (equivalent to "cim")
│     ├─── lldgh (equivalent to "cim")
│     ├─── mhn (equivalent to "cim")
│     ├─── zah (equivalent to "cim")
│     ├─── tre (equivalent to "cim")
│     ├─── tir (equivalent to "cim")
│     └─── vec (equivalent to "cim")
│
├─── Metadata
│     ├─── Users.ods
│     ├─── Sentences.ods
│     ├─── Tales.ods
│     └─── Words.ods
│
└─── Images
      │   001_image.png
      │   002_image_IT.png
      │   002_image_DE.png
      │   ...

As can be seen, the VinKo Corpus consists of three main folders: the audio folder, containing raw audio recordings collected from speakers; the metadata folder, containing tables with relevant linguistic information as well as sociolinguistic information about speakers; the images folder, containing the pictures employed as visual context for the morphology section. In addition, there is this readme file with the main information about the corpus and the VinKo project.
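
As a quick sanity check after downloading, the number of recordings per language variety can be counted by walking the audio folder. This is a minimal sketch that assumes one sub-folder per language code containing `.flac` files, as in the tree above; the function name `count_recordings` and the path passed to it are hypothetical and should be adapted to the local copy.

```python
from pathlib import Path


def count_recordings(audio_root):
    """Count the .flac files in each language sub-folder of `audio_root`."""
    counts = {}
    for variety_dir in sorted(Path(audio_root).iterdir()):
        if variety_dir.is_dir():
            # Only files ending in .flac are counted; anything else is ignored.
            counts[variety_dir.name] = sum(1 for _ in variety_dir.glob("*.flac"))
    return counts
```

Calling `count_recordings("VINKO/Audio folders")` (path assumed) returns a dictionary mapping each language code to its number of recordings, which also makes empty folders easy to spot.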

The audio files in this repository were collected with two successive versions of the VinKo web platform: version 1 was used in 2017 and 2018, and version 2 from 2019 to 2021.

Audio folders

Audio recordings are available for six of the seven language varieties (there are no recordings of Saurano to date) and cover three levels of linguistic analysis: phonology, morphology, and syntax. Each linguistic domain is investigated with a different type of stimulus: single words for phonology, short stories with a visual context for morphology, and sentences for syntax. All three domains are covered in the different language varieties, and the data are comparable across varieties.

Each audio file name consists of the stimulus ID, the language variety and the user ID, separated by underscores (e.g., S0027_cim_U0056). The first letter of the stimulus ID indicates the linguistic domain under investigation:

“S” = Syntax (sentences)
“T” = Morphology (tales)
“W” = Phonology (words)

Some speakers recorded more than one audio file for the same stimulus. These files are distinguished by a letter appended to the user ID, e.g. S001_cim_U0056a, S001_cim_U0056b.
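
The naming convention described in this section can be parsed with a short script. The following is a sketch, not an official tool: the names `Recording` and `parse_name` are hypothetical, and the pattern assumes that IDs are a letter followed by digits (without fixing the number of digits) and that repeated recordings carry a single lowercase letter after the user ID.

```python
import re
from typing import NamedTuple

# First letter of the stimulus ID -> linguistic domain (from the list above).
DOMAINS = {"S": "syntax", "T": "morphology", "W": "phonology"}

# Assumed pattern: <stimulus>_<variety>_<user>[take letter]
_NAME = re.compile(
    r"^(?P<stimulus>[STW]\d+)_(?P<variety>[a-z]+)_(?P<user>U\d+)(?P<take>[a-z]?)$"
)


class Recording(NamedTuple):
    stimulus_id: str
    domain: str
    variety: str
    user_id: str
    take: str  # "" when there is no letter suffix


def parse_name(filename: str) -> Recording:
    """Split a VinKo audio file name (with or without .flac) into its parts."""
    stem = filename[:-5] if filename.endswith(".flac") else filename
    match = _NAME.match(stem)
    if match is None:
        raise ValueError(f"not a VinKo audio file name: {filename!r}")
    stimulus = match["stimulus"]
    return Recording(
        stimulus, DOMAINS[stimulus[0]], match["variety"], match["user"], match["take"]
    )
```

For instance, `parse_name("S0027_cim_U0056.flac")` yields the stimulus ID `S0027`, the domain `syntax`, the variety `cim` and the user ID `U0056`, with an empty `take` field.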

Phonology

This section investigates three main phonological phenomena across language varieties:

the obstruent consonants
/s/ retraction before consonant
the realization of /r/

Each phenomenon is investigated in specific contexts, which are listed in the corresponding table in the metadata folder (Words.ods).

Morphology

This section investigates the following morphological phenomena:

Case syncretism, especially in personal pronouns and articles
Realization of subject and object clitics

The specific linguistic variables investigated are reported in the corresponding table in the metadata folder (Tales.ods).

Syntax

This section investigates the following syntactic topics:

Syntax of the adjective within DP
Syntax of clitics
Negative concord
Pro drop
Complementizers
Locative particles
Auxiliary selection
Pronouns
Proper name syntax

Each of these broad topics includes the analysis of different linguistic variables which are reported in the corresponding table in the metadata folder (Sentences.ods).

Metadata folder

This folder contains four tables with the relevant information about the speakers and the linguistic stimuli:

Users

The speaker information includes:

USER ID (included in the audio file name)
Language variety
Geographic location
Personal information (gender, age)
Linguistic profile: language proficiency, frequency and contexts of use
VinKo questionnaire version
Period of data collection

Words

The metadata for the phonology section includes the following information:

STIMULUS ID (included in the audio file name)
Language variety
Graphical rendering of item (Graphy)
Gloss in Standard German and/or Standard Italian
Investigated phenomenon
Word context (e.g., initial, medial, final position)
Target phoneme/specific context under investigation
VinKo questionnaire version

Sentences

The metadata for the syntax section includes the following information:

STIMULUS ID (included in the audio file name)
Sentence in Standard German and/or Standard Italian
Syntactic topic
Linguistic variable under investigation
VinKo questionnaire version
Code (for internal use only)

Tales

The metadata for the morphology section includes the following information:

STIMULUS ID (included in the audio file name)
Image used as visual context
Text in Standard German and/or in Standard Italian
Task (Translation; guided free-speech Production)
Linguistic variable under investigation
VinKo questionnaire version
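
To look up the metadata of a recording, the first letter of its stimulus ID identifies the table to open. The sketch below maps that letter to the table names listed in this section and reads a table with pandas. It assumes that pandas and odfpy are installed; the function names are hypothetical, and the column headers inside the .ods files are not reproduced here, so they should be checked in the files before filtering or joining.

```python
from pathlib import Path

# Stimulus-ID prefix -> metadata table, following the descriptions above.
TABLE_BY_PREFIX = {"S": "Sentences.ods", "T": "Tales.ods", "W": "Words.ods"}


def table_for(stimulus_id: str) -> str:
    """Return the name of the metadata table that describes `stimulus_id`."""
    try:
        return TABLE_BY_PREFIX[stimulus_id[0]]
    except (KeyError, IndexError):
        raise ValueError(f"unknown stimulus ID: {stimulus_id!r}") from None


def load_table(metadata_dir, table_name):
    """Read one .ods table into a DataFrame (needs pandas and odfpy)."""
    import pandas as pd  # imported lazily so table_for works without pandas

    return pd.read_excel(Path(metadata_dir) / table_name, engine="odf")
```

For example, `load_table("VINKO/Metadata", table_for("W0099"))` (path assumed) would open Words.ods, and `load_table("VINKO/Metadata", "Users.ods")` the speaker table.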

4. Additional information

Websites

VinKo Atlas

https://www.vinko.it/listen-explore.php

Scientific Publications

Cordin, P., Rabanus, S., Alber, B., Mattei, A., Casalicchio, J., Tomaselli, A., Bidese, E., & Padovan, A. (2018). VinKo, Versione 2. Korpus im Text, Serie A, 13739. http://www.kit.gwi.uni-muenchen.de/?p=13739&v=2

Dissemination

Vita Trentina (06/2017) https://www.vitatrentina.it/2017/06/29/parli-dialetto-fatti-sentire/
La Usc di Ladins (04/2021) https://www.lausc.it/valedes-ladines/fascia/15356-vinko-n-projet-per-valoriser-i-lengac-mendres
Di Sait vo Lusérn (04/2021) http://mediateca.istitutocimbro.it/applications/webwork/site_bibliocim/media/sait_vo_lusern/2021_04_02.pdf
Zimbar Earde (27/03/2021 - min 7:00) https://www.youtube.com/watch?v=U8rBYsAlpa0

5. Error reporting

The collected files are raw audio data, and some may be missing or empty. If you spot any inconsistency, error, or corrupted recording, please contact us at vinko@ateneo.univr.it.