VinKo (Varieties in Contact) Corpus v1.2

Description

VinKo is a spoken corpus of crowd-sourced audio recordings, designed to provide linguistic information about the minority languages and dialects spoken in the area between Innsbruck and the Po Valley. The corpus contains audio recordings of local languages and varieties spoken in the regions Trentino-Alto Adige/Südtirol, Veneto, and Friuli-Venezia Giulia, with a particular focus on language contact between Germanic varieties (Cimbrian, Mòcheno, Tyrolean, Saurano, and Sappadino) and Romance varieties (Ladin, Trentino, and Veneto dialects). Data collection took place from June 2017 to May 2023.

Some of the recordings can be explored in the open-access “Listen and Explore” section of the AlpiLinK project web page, which is open to participants and to anyone interested in the collected data.

URL: https://alpilink.it/vinko/
Contact: vinko@ateneo.univr.it

Authors

Stefan Rabanus (University of Verona)
Anne Kruijt (University of Verona)
Marta Tagliani (University of Verona)
Alessandra Tomaselli (University of Verona)
Andrea Padovan (University of Verona)
Birgit Alber (Free University of Bozen-Bolzano)
Patrizia Cordin (University of Trento)
Roberto Zamparelli (University of Trento)
Barbara Maria Vogt (University of L’Aquila)

Readme structure

General
Abbreviations
Data Structure
Additional Information
Error Reporting
Updates

1. General

Creator Anne Kruijt
Date of creation v1.0 2021-07
Date of creation v1.1 2022-05
Date of creation v1.2 2023-08
Last updated 2023-08
Size 189679 audio files; 189752 tokens
License Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/

Acknowledgments

PROJECT OF EXCELLENCE DiLLS: Digital Humanities applied to Foreign Languages and Literatures 2018-2022, University of Verona Project code: CUP n. B31I18000250006 Funding organization: Ministero dell’Università e della Ricerca;
Project PRIN 2015 “TREiL” (Technologies for Research and Education in Linguistics) 2017-2020, University of Trento (CIMeC/DIPSCO) Project code: Prot. 2015MNX5ZE - CUP n. E72F16001620001 Funding organization: Ministero dell’Università e della Ricerca;
Project AThEME (Advancing the European Multilingual Experience) Project grant: n. 613465 Funding organization: European Seventh Framework Programme for Research, Technological Development and Demonstration (EU)

How to cite

Reference: Please cite the following paper if you use this dataset:
Authors: Kruijt, Anne; Rabanus, Stefan; Tagliani, Marta
Title: The VinKo-Corpus: Oral data from Romance and Germanic local varieties of Northern Italy.
Date: 2023
Book Title: Neue Entwicklungen in der Korpuslandschaft der Germanistik: Beiträge zur IDS-Methodenmesse 2022
Editors: Kupietz, Marc; Schmidt, Thomas
Series: Korpuslinguistik und Interdisziplinäre Perspektiven auf Sprache - Corpus linguistics and Interdisciplinary perspectives on Language (CLIP) 11
Publisher: Narr Francke Attempto
Location: Tübingen
Pages: 203-212

2. Abbreviations

Languages

“cim” = Cimbrian
“lldan” = Ampezzano (Ladin)
“lldba” = Badiot (Ladin)
“lldfa” = Fassan (Ladin)
“lldfo” = Fodom (Ladin)
“lldgh” = Gardenese (Ladin)
“mhn” = Mòcheno
“plo” = Sappadino
“tir” = Tyrolean
“tre” = Trentino
“vec” = Veneto
“zah” = Saurano

Phonological phenomena

“obstr” = obstruent consonants
“r” = realization of /r/
“sch” = /s/ retraction

3. Data Structure

File structure under each language variety is identical and organized as follows:

          VINKO
        |--- README.txt
        ¦ 
        +--- cim_mhn_plo_zah
        ¦    +--- cim
        ¦         |--- S0027_cim_U0056.flac
        ¦         |--- W0098_cim_U0056.flac
        ¦         |--- T0101_cim_U0056.flac  
        ¦         |...
        ¦    +--- plo
        ¦         |--- ... equivalent to "cim"
        ¦    +--- mhn
        ¦         |--- ... equivalent to "cim"
        ¦    +--- zah
        ¦         |--- ... equivalent to "cim"
        ¦       
        +--- lldan_lldfa_lldfo
        ¦    +--- lldan
        ¦         |--- S0027_lldan_U0056.flac
        ¦         |--- W0098_lldan_U0056.flac
        ¦         |--- T0101_lldan_U0056.flac  
        ¦         |...
        ¦    +--- lldfa
        ¦         |--- ... equivalent to "lldan"
        ¦    +--- lldfo
        ¦         |--- ... equivalent to "lldan"
        ¦      
        +--- lldba_lldgh
        ¦    +--- lldba
        ¦         |--- S0027_lldba_U0056.flac
        ¦         |--- W0098_lldba_U0056.flac
        ¦         |--- T0101_lldba_U0056.flac  
        ¦         |...
        ¦    +--- lldgh
        ¦         |--- ... equivalent to "lldba"
        ¦      
        +--- tir_S01 (from S0001 to S0089)
        ¦    +--- tir
        ¦    	 |--- S0001_tir_U0392.flac
        ¦    	 |--- S0001_tir_U0360.flac
        ¦    	 |--- S0089_tir_U0391.flac
        ¦     	 |...
        +--- tir_S02 (equivalent to "tir_S01", from S0090 to S0159)
        +--- tir_T01 (from T0101 to T0313)
        ¦    +--- tir
        ¦    	 |--- T0101_tir_U0392.flac
        ¦    	 |--- T0101_tir_U0360.flac
        ¦    	 |--- T0308_tir_U0391.flac
        ¦   	 |...
        +--- tir_T02 (equivalent to "tir_T01", from T0314 to T0403)
        +--- tir_W01 (from W0394 to W0449)
        ¦    +--- tir
        ¦   	  |--- W0394_tir_U0392.flac
        ¦   	  |--- W0394_tir_U0360.flac
        ¦   	  |--- W0448_tir_U0391.flac
        ¦   	  |...
        +--- tir_W02 (equivalent to "tir_W01", from W0450 to W0464)
        ¦ 
        +--- tre_S_T
        ¦    +--- tre
        ¦    	 |--- S0115_tre_U0077.flac
        ¦    	 |--- S0115_tre_U0169.flac
        ¦     	 |--- T0316_tre_U0607.flac
        ¦    	 |...
        +--- tre_W
        ¦    +--- tre
        ¦    	 |--- W0001_tre_U0077.flac
        ¦    	 |--- W0001_tre_U0169.flac
        ¦     	 |--- W0025_tre_U0607.flac
        ¦    	 |...
        ¦ 
        +--- vec_S01 (from S0004 to S0037)
        ¦    +--- vec
        ¦    	 |--- S0004_vec_U0307.flac
        ¦    	 |--- S0004_vec_U0515.flac
        ¦    	 |--- S0006_vec_U0449.flac
        ¦    	 |...
        +--- vec_S02 (equivalent to "vec_S01", from S0037 to S0051)
        +--- vec_S03 (equivalent to "vec_S01", from S0052 to S0070)
        +--- vec_S04 (equivalent to "vec_S01", from S0071 to S0079)
        +--- vec_S05 (equivalent to "vec_S01", from S0080 to S0088)
        +--- vec_S06 (equivalent to "vec_S01", from S0098 to S0111)
        +--- vec_S07 (equivalent to "vec_S01", from S0112 to S0119)
        +--- vec_S08 (equivalent to "vec_S01", from S0120 to S0129)
        +--- vec_S09 (equivalent to "vec_S01", from S0130 to S0140)
        +--- vec_S10 (equivalent to "vec_S01", from S0141 to S0147)
        +--- vec_S11 (equivalent to "vec_S01", from S0148 to S0153)
        +--- vec_T01 (from T0101 to T0104)
        ¦    +--- vec
        ¦    	 |--- T0101_vec_U0307.flac
        ¦    	 |--- T0101_vec_U0515.flac
        ¦    	 |--- T0102_vec_U0449.flac
        ¦    	 |...
        +--- vec_T02 (equivalent to "vec_T01", from T0105 to T0110)
        +--- vec_T03 (equivalent to "vec_T01", from T0111 to T0115)
        +--- vec_T04 (equivalent to "vec_T01", from T0116 to T0202)
        +--- vec_T05 (equivalent to "vec_T01", from T0203 to T0302)
        +--- vec_T06 (equivalent to "vec_T01", from T0302 to T0305)
        +--- vec_T07 (equivalent to "vec_T01", from T0306 to T0309)
        +--- vec_T08 (equivalent to "vec_T01", from T0310 to T0314)
        +--- vec_T09 (equivalent to "vec_T01", from T0315 to T0403)
        +--- vec_W01 (from W0465 to W0474)
        ¦    +--- vec
        ¦    	 |--- W0465_vec_U0307.flac
        ¦    	 |--- W0466_vec_U0515.flac
        ¦    	 |--- W0467_vec_U0449.flac
        ¦    	 |...
        +--- vec_W02 (equivalent to "vec_W01", from W0475 to W0484)
        +--- vec_W03 (equivalent to "vec_W01", from W0485 to W0493)
        +--- vec_W04 (equivalent to "vec_W01", from W0494 to W0501)
        ¦
        +--- Metadata
        ¦    |--- Users.ods
        ¦    |--- Sentences.ods 
        ¦    |--- Tales.ods
        ¦    |--- Words.ods
        ¦
        +--- Images 
             |--- 001_image.png
             |--- 002_image_IT.png
             |--- 002_image_DE.png
             |...

As can be seen, the VinKo Corpus consists of:

1 readme file: contains the main information about the corpus and the VinKo project;
1 Metadata folder: contains tables with relevant linguistic information as well as sociolinguistic information about speakers;
1 Images folder: contains the pictures employed as visual context for the morphology section;
34 Audio folders: contain the raw audio recordings collected from speakers. The audio files are organized by language variety and stimulus. For ease of downloading, multiple folders (max size 2GB) were created for Veneto and Tyrolean. In these subfolders, the audio files are split along stimulus IDs (see Sentences, Tales, and Words tables).
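
As an illustration of the layout above, the following minimal Python sketch counts the recordings per language variety. It assumes that the corpus has been unpacked so that every audio file sits at <root>/<download folder>/<variety>/<file>.flac, as in the tree; the function name count_recordings_by_variety and the root path used in the usage note are hypothetical and not part of the corpus.

```python
from collections import Counter
from pathlib import Path


def count_recordings_by_variety(corpus_root):
    """Count .flac files per language-variety folder below corpus_root.

    Assumed layout (taken from the tree in this readme):
    <root>/<download folder>/<variety>/<file>.flac
    """
    counts = Counter()
    for flac in Path(corpus_root).glob("*/*/*.flac"):
        # The direct parent folder is the variety code, e.g. "cim" or "tir".
        counts[flac.parent.name] += 1
    return counts
```

Calling count_recordings_by_variety("VINKO") would then return one total per variety code. Because the key is the variety folder, the Tyrolean and Veneto recordings that are spread over several download folders are summed under "tir" and "vec". The Metadata and Images folders contain no .flac files and are therefore ignored.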

The audio files collected in this repository come from two versions of the VinKo web platform: version 1 was in use in 2017 and 2018, and version 2 from 2019 to 2023.

There are audio recordings in eight language varieties for three different levels of linguistic analysis: phonology, morphology, and syntax. Each linguistic domain is investigated with a different type of stimulus: single words for phonology, short stories with a visual context for morphology, and sentences for syntax. All linguistic domains are available in the different language varieties and are comparable across them.

The audio file name always consists of the stimulus ID (e.g., S0027), followed by the abbreviation of the language variety (e.g., cim) and the user ID (e.g., U0056). This means that audio file S0027_cim_U0056 is a Cimbrian translation of stimulus S0027 by speaker U0056. The first letter of the stimulus ID indicates the linguistic domain under investigation:

“S” = Syntax (sentences)
“T” = Morphology (tales)
“W” = Phonology (words)

Some speakers recorded more than one audio file for the same stimulus. These files are reported in the corpus as follows: S001_cim_U0056a, S001_cim_U0056b.
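
The naming convention described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names Recording, NAME_PATTERN, and parse_recording_name are hypothetical, the pattern accepts any number of digits in the IDs, and the letter marking repeated recordings is assumed to come directly after the user ID and before the ".flac" extension.

```python
import re
from typing import NamedTuple

# First letter of the stimulus ID -> linguistic domain (from this readme).
DOMAINS = {"S": "syntax", "T": "morphology", "W": "phonology"}

# Stimulus ID, variety code, user ID, optional letter for repeated recordings.
NAME_PATTERN = re.compile(r"^([STW]\d+)_([a-z]+)_(U\d+)([a-z]?)$")


class Recording(NamedTuple):
    stimulus_id: str
    domain: str
    variety: str
    user_id: str
    repeat: str  # "" when there is no repeat letter in the file name


def parse_recording_name(name):
    """Split a file name such as 'S0027_cim_U0056.flac' into its parts."""
    stem = name[:-5] if name.endswith(".flac") else name
    match = NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    stimulus_id, variety, user_id, repeat = match.groups()
    return Recording(stimulus_id, DOMAINS[stimulus_id[0]], variety, user_id, repeat)
```

For example, parse_recording_name("S0027_cim_U0056.flac") yields the stimulus ID "S0027", the domain "syntax", the variety "cim", and the user ID "U0056". File names that do not follow the convention raise a ValueError, which makes irregular names easy to spot.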

Phonology

This section investigates three main phonological phenomena across language varieties:

the obstruent consonants
/s/ retraction before consonant
the realization of /r/

These phenomena have been investigated within specific contexts. This information is reported in the corresponding table in the metadata folder (Words.ods).

Morphology

This section investigates the following morphological phenomena:

Case syncretism, especially in personal pronouns and articles
Realization of subject and object clitics

The specific linguistic variables investigated are reported in the corresponding table in the metadata folder (Tales.ods).

Syntax

This section investigates the following syntactic topics:

Syntax of the adjective within DP
Syntax of clitics
Negative concord
Pro drop
Complementizers
Locative particles
Auxiliary selection
Pronouns
Proper name syntax

Each of these broad topics includes the analysis of different linguistic variables which are reported in the corresponding table in the metadata folder (Sentences.ods).

Metadata folder

This folder contains four tables with the relevant information about the speakers and the linguistic stimuli:

Users

The speaker information includes:

USER ID (included in the audio file name)
Language variety
Geographic location (name of municipality, GID, and coordinates): The GID codes can be used to import data into the GIS system of the REDE (regionalsprache.de) project of the University of Marburg, a collaborating partner of the VinKo project. The coordinates column gives the centre of the municipality's shapefile in "latitude, longitude" format.
Personal information (gender, age)
Linguistic profile: language proficiency, frequency and contexts of use
VinKo questionnaire version
Period of data collection
Note: comments on participants and questionnaire administration
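
The coordinates column can be split into numeric values with a few lines of Python. The sketch below assumes that each cell is a text string with a decimal point and a comma between the two values (e.g., "45.44, 10.99", an illustrative value rather than one taken from Users.ods); the function name parse_coordinates is hypothetical.

```python
def parse_coordinates(cell):
    """Split a "latitude, longitude" string into a (lat, lon) pair of floats."""
    lat_text, lon_text = cell.split(",")
    lat, lon = float(lat_text), float(lon_text)  # float() ignores surrounding spaces
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {cell!r}")
    return lat, lon
```

Note that latitude comes first. Many GIS tools expect the order longitude, latitude, so the pair may need to be swapped before import.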

Words

The metadata for the phonology section includes the following information:

STIMULUS ID (included in the audio file name)
Language variety
Graphical rendering of item (Graphy)
Gloss in Standard German and/or Standard Italian
Investigated phenomenon
Word context (e.g., initial, medial, final position)
Target phoneme/specific context under investigation
VinKo questionnaire version

Sentences

The metadata for the syntax section includes the following information:

STIMULUS ID (included in the audio file name)
Sentence in Standard German and/or Standard Italian
Syntactic topic
Linguistic variable under investigation
VinKo questionnaire version
Code (for internal use only)

Tales

The metadata for the morphology section includes the following information:

STIMULUS ID (included in the audio file name)
Image used as visual context
Text in Standard German and/or in Standard Italian
Task (Translation; guided free-speech Production)
Linguistic variable under investigation
VinKo questionnaire version
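
Since the stimulus ID links each audio file to one of the three stimulus tables, a recording can be matched with its metadata as sketched below. This is a minimal sketch, assuming that the table rows have already been loaded as dictionaries (for instance with pandas.read_excel and engine="odf", which requires the odfpy package) and that the ID column is headed "STIMULUS ID"; the exact header and the function names are assumptions.

```python
from collections import defaultdict

# First letter of the stimulus ID -> metadata table (from this readme).
METADATA_TABLES = {"S": "Sentences.ods", "T": "Tales.ods", "W": "Words.ods"}


def metadata_table_for(stimulus_id):
    """Return the name of the table that describes the given stimulus."""
    return METADATA_TABLES[stimulus_id[0]]


def group_rows_by_stimulus(rows, key="STIMULUS ID"):
    """Group table rows by stimulus ID.

    A list is kept per ID because a table may hold several rows for one
    stimulus (e.g., one row per language variety in the Words table).
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return dict(grouped)
```

With these helpers, metadata_table_for("W0098") points to Words.ods, and the grouped rows of that table can then be looked up with the same ID.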

4. Additional information

Websites

VinKo Atlas
https://alpilink.it/en/ascoltaeesplora/

Scientific Publications

Casalicchio, Jan, and Patrizia Cordin (2020). Grammar of Central Trentino: A Romance Dialect from North-East Italy. Grammars and Sketches of the World’s Languages, Romance Languages 13. Leiden/Boston: Brill. https://brill.com/view/title/55777.
Cordin, Patrizia, Stefan Rabanus, Birgit Alber, Antonio Mattei, Jan Casalicchio, Alessandra Tomaselli, Ermenegildo Bidese, and Andrea Padovan (2018). Vinko, Versione 2. Korpus im Text, Serie A, 13739. http://www.kit.gwi.uni-muenchen.de/?p=13739&v=2.
Kruijt, Anne (2022). Crowdsourcing Language Contact: Pronoun and Article Morphology in Trentino-South Tyrol and Veneto. PhD Thesis, Verona: University of Verona.
Kruijt, Anne, Stefan Rabanus, and Marta Tagliani (2023). “The VinKo-Corpus: Oral data from Romance and Germanic local varieties of Northern Italy.” In Neue Entwicklungen in der Korpuslandschaft der Germanistik: Beiträge zur IDS-Methodenmesse 2022, edited by Marc Kupietz and Thomas Schmidt. Korpuslinguistik und Interdisziplinäre Perspektiven auf Sprache - Corpus linguistics and Interdisciplinary perspectives on Language (CLIP) 11. Tübingen: Narr Francke Attempto, 203-212.
Kruijt, Anne, Patrizia Cordin, and Stefan Rabanus (2023). “On the Validity of Crowdsourced Data.” In Corpus Dialectology, edited by Elissa Pustka, Carmen Quijada Van den Berghe, and Verena Weiland. Studies in Corpus Linguistics. John Benjamins Publishing Company, 10-33. Accessed May 30, 2023. https://benjamins.com/catalog/scl.110.01kru.
Rabanus, Stefan (2023a). “Two modes of contact-induced change in minority languages: phonology and syntax vs. inflectional morphology.” In Speakers and structures in contact-related change and variation: understudied issues and pluralistic approaches, edited by Barbara Hans-Bianchi, Barbara Vogt, and Chiara Truppi. Berlin/Boston: De Gruyter.
Rabanus, Stefan (2023b). “Nome di battesimo e articolo espletivo – crowd-sourcing e cartografia linguistica nello studio della variazione linguistica in Trentino-Alto Adige e Veneto.” In Neue Ansätze und Perspektiven zur sprachlichen Raumkonzeption und Geolinguistik, edited by Roger Schöntag and Laura Linzmeier.
Tomaselli, Alessandra, and Ermenegildo Bidese (2023). “Fortune and Decay of Lexical Expletives in Germanic and Romance along the Adige River.” Languages 8 (1): 44. https://doi.org/10.3390/languages8010044.

Dissemination

Vita Trentina (06/2017) https://www.vitatrentina.it/2017/06/29/parli-dialetto-fatti-sentire/
La Usc di Ladins (04/2021) https://www.lausc.it/valedes-ladines/fascia/15356-vinko-n-projet-per-valoriser-i-lengac-mendres
Di Sait vo Lusérn (04/2021) http://mediateca.istitutocimbro.it/applications/webwork/site_bibliocim/media/sait_vo_lusern/2021_04_02.pdf
Zimbar Earde (27/03/2021 - min 7:00) https://www.youtube.com/watch?v=U8rBYsAlpa0
RAI Radio1 Interview - Alla sorgente del sapere (10/2021) http://www.urly.it/3gva6

5. Error reporting

The collected files are raw audio data and some may be missing or empty. If you spot any inconsistency, error, or corrupted recording please contact us at vinko@ateneo.univr.it.
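
Empty recordings can be located before reporting them. The following minimal Python sketch lists all .flac files of zero bytes below a corpus root; the function name find_empty_recordings is hypothetical, and the check does not detect files that are missing or that contain data but are corrupted.

```python
from pathlib import Path


def find_empty_recordings(corpus_root):
    """Return the paths of all zero-byte .flac files below corpus_root, sorted."""
    return sorted(
        path
        for path in Path(corpus_root).rglob("*.flac")
        if path.stat().st_size == 0
    )
```

The resulting list of paths can be included in an error report so that the affected stimulus and user IDs are easy to identify.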

6. Updates

Changes v1.1 > v1.2

Upload of the audio data collected from January 2022 to May 2023. The total number of audio files increased from 63863 to 189679.
Metadata folder > Users: new participants included (U0901-U1903)
Metadata folder > Sentences: fixed a typo for item S0031 (him > ihm)
Casalicchio/Cordin 2020, Kruijt 2022, Kruijt/Rabanus/Tagliani 2023, Kruijt/Cordin/Rabanus 2023, Rabanus 2023a, Rabanus 2023b, Tomaselli/Bidese 2023 added to 4. Additional information > Scientific publications.
ReadMe file updated to reflect changes in project URL (Description and 4. Additional Information) and Reference in 1. General, and changes in the folder structure of the corpus (3. Data structure).

Changes v1.0 > v1.1

Addition of Sappadino to the language varieties under investigation
Metadata folder > Words: phonology section now includes new items for Sappadino. For each word, the glosses are provided both in Standard German and Standard Italian.
Metadata folder > Sentences: fixed a typo for item S0031 (him > ihm)
Upload of the audio data collected from June to December 2021. The total number of audio files increased from 37806 to 63863.
Audio data split into multiple folders for Veneto and Tyrolean due to the large amount of data (see 3. Data Structure for an explanation of the split)
Metadata folder > Users: new participants included
Metadata folder > Users: column ‘note’ included - here you can find comments on participants and questionnaire administration, e.g., if the participant selected the wrong language variety
User U0648 has been removed from Metadata folder > Users as it contains old linguistic data without precise information about the speaker. The corresponding audio files can still be accessed in Audio folder > lldgh (Gardenese)
RAI Radio1 Interview added to 4. Additional information > Dissemination