VinKo (Varieties in Contact) Corpus v1.2

Description

VinKo is a spoken corpus of crowd-sourced audio recordings, designed to provide linguistic information about the minority languages and dialects spoken in the area between Innsbruck and the Po Valley. The corpus contains audio recordings of local languages and varieties spoken in the regions of Trentino-Alto Adige/Südtirol, Veneto, and Friuli-Venezia Giulia, with a particular focus on language contact between Germanic varieties (Cimbrian, Mòcheno, Tyrolean, Saurano, and Sappadino) and Romance varieties (Ladin, Trentino, and Veneto dialects). Data collection took place from June 2017 to May 2023.

Some of the recordings can be explored in the open-access “Listen and Explore” section of the AlpiLinK project web page, which is available to participants and to anyone interested in the collected data.

URL: https://alpilink.it/vinko/
Contact: vinko@ateneo.univr.it

Authors

Readme structure

  1. General
  2. Abbreviations
  3. Data Structure
  4. Additional Information
  5. Error Reporting
  6. Updates

1. General

Acknowledgments

How-to cite

2. Abbreviations

Languages

Phonological phenomena

3. Data Structure

The file structure under each language variety is identical and is organized as follows:

          VINKO
        |--- README.txt
        ¦ 
        +--- cim_mhn_plo_zah
        ¦    +--- cim
        ¦         |--- S0027_cim_U0056.flac
        ¦         |--- W0098_cim_U0056.flac
        ¦         |--- T0101_cim_U0056.flac  
        ¦         |...
        ¦    +--- plo
        ¦         |--- ... equivalent to "cim"
        ¦    +--- mhn
        ¦         |--- ... equivalent to "cim"
        ¦    +--- zah
        ¦         |--- ... equivalent to "cim"
        ¦       
        +--- lldan_lldfa_lldfo
        ¦    +--- lldan
        ¦         |--- S0027_lldan_U0056.flac
        ¦         |--- W0098_lldan_U0056.flac
        ¦         |--- T0101_lldan_U0056.flac  
        ¦         |...
        ¦    +--- lldfa
        ¦         |--- ... equivalent to "lldan"
        ¦    +--- lldfo
        ¦         |--- ... equivalent to "lldan"
        ¦      
        +--- lldba_lldgh
        ¦    +--- lldba
        ¦         |--- S0027_lldba_U0056.flac
        ¦         |--- W0098_lldba_U0056.flac
        ¦         |--- T0101_lldba_U0056.flac  
        ¦         |...
        ¦    +--- lldgh
        ¦         |--- ... equivalent to "lldba"
        ¦      
        +--- tir_S01 (from S0001 to S0089)
        ¦    +--- tir
        ¦    	 |--- S0001_tir_U0392.flac
        ¦    	 |--- S0001_tir_U0360.flac
        ¦    	 |--- S0089_tir_U0391.flac
        ¦     	 |...
        +--- tir_S02 (equivalent to "tir_S01", from S0090 to S0159)
        +--- tir_T01 (from T0101 to T0313)
        ¦    +--- tir
        ¦    	 |--- T0101_tir_U0392.flac
        ¦    	 |--- T0101_tir_U0360.flac
        ¦    	 |--- T0308_tir_U0391.flac
        ¦   	 |...
        +--- tir_T02 (equivalent to "tir_T01", from T0314 to T0403)
        +--- tir_W01 (from W0394 to W0449)
        ¦    +--- tir
        ¦   	  |--- W0394_tir_U0392.flac
        ¦   	  |--- W0394_tir_U0360.flac
        ¦   	  |--- W0448_tir_U0391.flac
        ¦   	  |...
        +--- tir_W02 (equivalent to "tir_W01", from W0450 to W0464)
        ¦ 
        +--- tre_S_T
        ¦    +--- tre
        ¦    	 |--- S0115_tre_U0077.flac
        ¦    	 |--- S0115_tre_U0169.flac
        ¦     	 |--- T0316_tre_U0607.flac
        ¦    	 |...
        +--- tre_W
        ¦    +--- tre
        ¦    	 |--- W0001_tre_U0077.flac
        ¦    	 |--- W0001_tre_U0169.flac
        ¦     	 |--- W0025_tre_U0607.flac
        ¦    	 |...
        ¦ 
        +--- vec_S01 (from S0004 to S0037)
        ¦    +--- vec
        ¦    	 |--- S0004_vec_U0307.flac
        ¦    	 |--- S0004_vec_U0515.flac
        ¦    	 |--- S0006_vec_U0449.flac
        ¦    	 |...
        +--- vec_S02 (equivalent to "vec_S01", from S0037 to S0051)
        +--- vec_S03 (equivalent to "vec_S01", from S0052 to S0070)
        +--- vec_S04 (equivalent to "vec_S01", from S0071 to S0079)
        +--- vec_S05 (equivalent to "vec_S01", from S0080 to S0088)
        +--- vec_S06 (equivalent to "vec_S01", from S0098 to S0111)
        +--- vec_S07 (equivalent to "vec_S01", from S0112 to S0119)
        +--- vec_S08 (equivalent to "vec_S01", from S0120 to S0129)
        +--- vec_S09 (equivalent to "vec_S01", from S0130 to S0140)
        +--- vec_S10 (equivalent to "vec_S01", from S0141 to S0147)
        +--- vec_S11 (equivalent to "vec_S01", from S0148 to S0153)
        +--- vec_T01 (from T0101 to T0104)
        ¦    +--- vec
        ¦    	 |--- T0101_vec_U0307.flac
        ¦    	 |--- T0101_vec_U0515.flac
        ¦    	 |--- T0102_vec_U0449.flac
        ¦    	 |...
        +--- vec_T02 (equivalent to "vec_T01", from T0105 to T0110)
        +--- vec_T03 (equivalent to "vec_T01", from T0111 to T0115)
        +--- vec_T04 (equivalent to "vec_T01", from T0116 to T0202)
        +--- vec_T05 (equivalent to "vec_T01", from T0203 to T0302)
        +--- vec_T06 (equivalent to "vec_T01", from T0302 to T0305)
        +--- vec_T07 (equivalent to "vec_T01", from T0306 to T0309)
        +--- vec_T08 (equivalent to "vec_T01", from T0310 to T0314)
        +--- vec_T09 (equivalent to "vec_T01", from T0315 to T0403)
        +--- vec_W01 (from W0465 to W0474)
        ¦    +--- vec
        ¦    	 |--- W0465_vec_U0307.flac
        ¦    	 |--- W0466_vec_U0515.flac
        ¦    	 |--- W0467_vec_U0449.flac
        ¦    	 |...
        +--- vec_W02 (equivalent to "vec_W01", from W0475 to W0484)
        +--- vec_W03 (equivalent to "vec_W01", from W0485 to W0493)
        +--- vec_W04 (equivalent to "vec_W01", from W0494 to W0501)
        ¦
        +--- Metadata
        ¦    |--- Users.ods
        ¦    |--- Sentences.ods 
        ¦    |--- Tales.ods
        ¦    |--- Words.ods
        ¦
        +--- Images 
             |--- 001_image.png
             |--- 002_image_IT.png
             |--- 002_image_DE.png
             |...

As can be seen, the VinKo Corpus consists of:

The audio files collected in this repository come from two versions of the VinKo web platform that were implemented over the years: version 1 covers 2017 and 2018, whereas version 2 covers 2019 to 2023.

There are audio recordings in eight language varieties for three levels of linguistic analysis: phonology, morphology, and syntax. Each linguistic domain is investigated with a different type of stimulus: single words for phonology, short stories with a visual context for morphology, and sentences for syntax. All linguistic domains are available in the different language varieties and are comparable across them.
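As a quick way to see how many recordings each variety contributes, the following minimal sketch counts the .flac files per variety folder. It assumes the layout shown in the tree above, where every audio file sits in a folder named after its variety code (cim, tir, vec, ...); the function name count_recordings is hypothetical and not part of the corpus.

```python
from collections import Counter
from pathlib import Path


def count_recordings(corpus_root):
    """Count .flac files per variety folder.

    Assumption: the variety code is the name of the folder that directly
    contains each audio file, as in the directory tree of this README.
    """
    counts = Counter()
    for audio in Path(corpus_root).rglob("*.flac"):
        counts[audio.parent.name] += 1
    return counts


if __name__ == "__main__":
    # "VINKO" is the root folder name used in the tree above.
    for variety, total in sorted(count_recordings("VINKO").items()):
        print(f"{variety}\t{total}")
```

Because varieties such as tir and vec are split over several top-level folders (tir_S01, tir_T01, ...), counting by the inner folder name merges those parts back into a single total per variety code.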

The audio file name always begins with the stimulus ID (e.g., S0027), followed by the abbreviation of the language variety (e.g., cim), and ends with the user ID (e.g., U0056). For example, the audio file S0027_cim_U0056 is a Cimbrian translation of stimulus S0027 by speaker U0056. The first letter of the stimulus ID indicates the linguistic domain under investigation:

Some speakers recorded more than one audio file for the same stimulus. These files are distinguished in the corpus by a letter appended to the user ID, as follows: S001_cim_U0056a, S001_cim_U0056b.
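The naming scheme described above can be split into its parts with a short script. This is a minimal sketch, not an official tool: the regular expression is inferred from the example file names in this README, the names Recording and parse_recording_name are hypothetical, and the mapping from the first letter to a linguistic domain (W = phonology, T = morphology, S = syntax) is an inference from the metadata table names (Words, Tales, Sentences) that should be checked against the metadata.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from examples such as S0027_cim_U0056.flac and
# S001_cim_U0056a: stimulus ID, variety code, user ID, optional take letter.
FILENAME_RE = re.compile(
    r"^(?P<stimulus>[STW]\d+)_(?P<variety>[a-z]+)_(?P<user>U\d+)"
    r"(?P<take>[a-z]?)(?:\.flac)?$"
)

# Assumption: letter-to-domain mapping inferred from the metadata tables
# Words.ods (phonology), Tales.ods (morphology), Sentences.ods (syntax).
DOMAIN_BY_PREFIX = {"W": "phonology", "T": "morphology", "S": "syntax"}


class Recording(NamedTuple):
    stimulus_id: str
    variety: str
    user_id: str
    take: str  # "" when the speaker recorded the stimulus only once
    domain: str


def parse_recording_name(name: str) -> Optional[Recording]:
    """Split a recording name into its parts; return None if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    stimulus = match.group("stimulus")
    return Recording(
        stimulus_id=stimulus,
        variety=match.group("variety"),
        user_id=match.group("user"),
        take=match.group("take"),
        domain=DOMAIN_BY_PREFIX[stimulus[0]],
    )
```

Keeping the take letter as a separate field makes it easy to group repeated recordings by the same (stimulus, variety, user) triple and to choose which take to analyse.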

Phonology

This section investigates three main phonological phenomena across language varieties:

These phenomena have been investigated within specific contexts. This information is reported in the corresponding table in the metadata folder (Words.ods).

Morphology

This section investigates the following morphological phenomena:

The specific linguistic variables investigated are reported in the corresponding table in the metadata folder (Tales.ods).

Syntax

This section investigates the following syntactic topics:

Each of these broad topics covers several linguistic variables, which are reported in the corresponding table in the metadata folder (Sentences.ods).

Metadata folder

This folder contains four tables with the relevant information about the speakers and the linguistic stimuli:

Users

The speaker information includes:

Words

The metadata for the phonology section includes the following information:

Sentences

The metadata for the syntax section includes the following information:

Tales

The metadata for the morphology section includes the following information:

4. Additional Information

Websites

Scientific Publications

Dissemination

5. Error Reporting

The collected files are raw audio data, and some may be missing or empty. If you spot any inconsistency, error, or corrupted recording, please contact us at vinko@ateneo.univr.it.
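Before reporting a problem, it may help to list the affected files. The following minimal sketch flags .flac files that are empty or that do not begin with the four-byte "fLaC" marker that opens a native FLAC stream. The function name find_suspect_files is hypothetical, and the check is only a heuristic: a file that passes can still be damaged further in, and a file with a tag block placed before the marker would be flagged even if some players can read it.

```python
from pathlib import Path

# A native FLAC stream starts with these four bytes.
FLAC_MAGIC = b"fLaC"


def find_suspect_files(corpus_root):
    """Return (path, reason) pairs for .flac files that look empty or invalid."""
    suspects = []
    for audio in sorted(Path(corpus_root).rglob("*.flac")):
        if audio.stat().st_size == 0:
            suspects.append((audio, "empty file"))
            continue
        with audio.open("rb") as handle:
            if handle.read(4) != FLAC_MAGIC:
                suspects.append((audio, "missing fLaC marker"))
    return suspects


if __name__ == "__main__":
    # "VINKO" is the root folder name used in the tree above.
    for path, reason in find_suspect_files("VINKO"):
        print(f"{path}: {reason}")
```

The printed list of paths and reasons can be pasted directly into an error report.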

6. Updates

** Changes v1.1 > v1.2**

** Changes v1.0 > v1.1**