VinKo (Varieties in Contact) Corpus v1.1
========================================

Description
-----------

VinKo is a spoken corpus of crowd-sourced audio recordings, designed to
provide linguistic information about the minority languages and dialects
spoken in the area between Innsbruck and the Po Valley. The corpus
contains recordings of local languages and varieties spoken in the
regions Trentino-Alto Adige/Südtirol, Veneto and Friuli-Venezia Giulia,
with a particular focus on language contact between Germanic varieties
(Cimbrian, Mòcheno, Tyrolean, Saurano, and Sappadino) and Romance
varieties (Ladin, Trentino and Veneto dialects). Data collection took
place from June 2017 to December 2021.

Some of the recordings can be explored in the open-access "Listen and
Explore" section of the project web page, which is open to participants
and to anyone else interested in the collected data.

**URL:** <https://www.vinko.it/index.php>\
**Contact:** <vinko@ateneo.univr.it>

Authors
-------

-   Stefan Rabanus (University of Verona)
-   Anne Kruijt (University of Verona)
-   Marta Tagliani (University of Verona)
-   Alessandra Tomaselli (University of Verona)
-   Andrea Padovan (University of Verona)
-   Birgit Alber (Free University of Bozen-Bolzano)
-   Patrizia Cordin (University of Trento)
-   Roberto Zamparelli (University of Trento)
-   Barbara Maria Vogt (University of L'Aquila)

**Readme structure**
====================

1.  [General](#general)
2.  [Abbreviations](#abbreviations)
3.  [Data Structure](#data-structure)
4.  [Additional information](#additional-information)
5.  [Error reporting](#error-reporting)
6.  [Updates](#updates)

1. General
----------

-   **Creator** Anne Kruijt
-   **Date of creation v1.0** 2021-07
-   **Date of creation v1.1** 2022-05
-   **Last updated** 2022-05
-   **Size** 63863 audio files; 63935 tokens
-   **License** Creative Commons Attribution-NonCommercial-ShareAlike
    3.0 Italy (CC BY-NC-SA 3.0 IT)
    <https://creativecommons.org/licenses/by-nc-sa/3.0/it/>

### **Acknowledgments**

-   *PROJECT OF EXCELLENCE DiLLS: Digital Humanities applied to Foreign
    Languages and Literatures* 2018-2022, University of Verona\
    Project code: CUP n. B31I18000250006\
    Funding organization: Ministero dell'Università e della Ricerca;

-   *Project PRIN 2015 "TREiL" (Technologies for Research and Education
    in Linguistics)* 2017-2020, University of Trento (CIMeC/DIPSCO)\
    Project code: Prot. 2015MNX5ZE - CUP n. E72F16001620001\
    Funding organization: Ministero dell'Università e della Ricerca;

-   *Project AThEME (Advancing the European Multilingual Experience)*\
    Project grant: n. 613465\
    Funding organization: European Seventh Framework Programme for
    Research, Technological Development and Demonstration (EU)

### **How to cite**

-   **Reference**: Please cite the following Open Access paper if you
    use this dataset: <http://www.kit.gwi.uni-muenchen.de/?p=13739&v=2>\
    *Authors*: Cordin, Patrizia; Rabanus, Stefan; Alber, Birgit; Mattei,
    Antonio; Casalicchio, Jan; Tomaselli, Alessandra; Bidese,
    Ermenegildo; Padovan, Andrea\
    *Title*: VinKo, Versione 2\
    *Date of online publication*: 2018\
    *Journal*: Korpus im Text, Serie A, 13739

2. Abbreviations
----------------

### **Languages**

-   "lldan" = Ampezzano (Ladin)
-   "lldba" = Badiot (Ladin)
-   "cim" = Cimbrian
-   "lldfa" = Fassan (Ladin)
-   "lldfo" = Fodom (Ladin)
-   "lldgh" = Gardenese (Ladin)
-   "mhn" = Mòcheno
-   "plo" = Sappadino
-   "zah" = Saurano
-   "tre" = Trentino
-   "tir" = Tyrolean
-   "vec" = Veneto

### **Phonological phenomena**

-   "obstr" = obstruent consonants
-   "sch" = /s/ retraction
-   "r" = realization of /r/

3. Data Structure
-----------------

File structure under each language variety is identical and organized as
follows:

        VINKO
        |--- README.txt
        ¦ 
        +--- cim_lldfa_lldfo_lldan_lldgh_lldba_mhn_zah
        ¦    +--- cim
        ¦         |--- S0027_cim_U0056.flac
        ¦         |--- W0098_cim_U0056.flac
        ¦         |--- T0101_cim_U0056.flac  
        ¦         |...
        ¦    +--- lldfa
        ¦         |--- ... equivalent to "cim"
        ¦    +--- lldfo
        ¦         |--- ... equivalent to "cim"
        ¦    +--- lldan
        ¦         |--- ... equivalent to "cim"
        ¦    +--- lldgh
        ¦         |--- ... equivalent to "cim"
        ¦    +--- lldba
        ¦         |--- ... equivalent to "cim"
        ¦    +--- mhn
        ¦         |--- ... equivalent to "cim"
        ¦    +--- zah
        ¦         |--- ... equivalent to "cim"
        ¦       
        +--- tir_01 (from ID U0048 to ID U0429)
        ¦    |--- T0312_tir_U0392.flac
        ¦    |--- W0460_tir_U0360.flac
        ¦    |--- S0104_tir_U0391.flac
        ¦    |...
        +--- tir_02 (equivalent to "tir_01", from ID U0430 to ID U0489)
        +--- tir_03 (equivalent to "tir_01", from ID U0490 to ID U0819)
        +--- tre
        ¦    |--- W0001_tre_U0077.flac
        ¦    |--- S0115_tre_U0169.flac
        ¦    |--- T0316_tre_U0607.flac
        ¦    |...
        +--- vec_01 (from ID U0298 to ID U0529)
        ¦    |--- S0088_vec_U0307.flac
        ¦    |--- W0480_vec_U0515.flac
        ¦    |--- T0403_vec_U0449.flac
        ¦    |...
        +--- vec_02 (equivalent to "vec_01", from ID U0539 to ID U0691)
        +--- vec_03 (equivalent to "vec_01", from ID U0692 to ID U0769)
        +--- vec_04 (equivalent to "vec_01", from ID U0770 to ID U0817)
        +--- vec_05 (equivalent to "vec_01", from ID U0820 to ID U0873)
        +--- vec_06 (equivalent to "vec_01", from ID U0874 to ID U0900)
        ¦
        +--- Metadata
        ¦    |--- Users.ods
        ¦    |--- Sentences.ods 
        ¦    |--- Tales.ods
        ¦    |--- Words.ods
        ¦
        +--- Images 
             |--- 001_image.png
             |--- 002_image_IT.png
             |--- 002_image_DE.png
             |...

As shown above, the VinKo Corpus consists of:

-   1 readme file: contains the main information about the corpus and
    the VinKo project;

-   1 Metadata folder: contains tables with relevant linguistic
    information as well as sociolinguistic information about speakers;

-   1 Images folder: contains the pictures employed as visual context
    for the morphology section;

-   11 Audio folders: contain the raw audio recordings collected from
    speakers. The audio files are organized by language variety. For
    ease of downloading, the Veneto and Tyrolean recordings are split
    across multiple folders (max size 2 GB each). Within these folders,
    the audio files are split by USER ID (see the **Users** table).
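
Because Veneto and Tyrolean are spread over several folders, per-variety totals have to merge the split folders again. The following is a minimal sketch of one way to do that, assuming the corpus has been unpacked with the folder layout shown in the tree above. The function names `variety_of` and `count_recordings` are illustrative and not part of the corpus.

```python
import re
from collections import Counter
from pathlib import Path


def variety_of(folder_name):
    """Drop a trailing split index such as '_01' from a folder name."""
    return re.sub(r"_\d+$", "", folder_name)


def count_recordings(corpus_root):
    """Count .flac files per variety, merging split folders (tir_01, ...)."""
    counts = Counter()
    for path in Path(corpus_root).rglob("*.flac"):
        # The folder that directly contains the file names the variety,
        # e.g. ".../cim/" or ".../vec_03/".
        counts[variety_of(path.parent.name)] += 1
    return counts
```

Calling `count_recordings("VINKO")` on an unpacked copy returns a `Counter` keyed by the abbreviations listed in section 2, for example `"tir"` for all three Tyrolean folders.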

The audio files in this repository were collected with two successive
versions of the VinKo web platform: version 1 was in use in 2017 and
2018, version 2 from 2019 to 2021.

Recordings are available for seven of the eight language varieties (no
Sappadino recordings have been collected to date) and cover three
levels of linguistic analysis: phonology, morphology, and syntax. Each
linguistic domain is investigated with a different type of stimulus:
single words for phonology, short stories with a visual context for
morphology, and sentences for syntax. All three domains are available
for the recorded language varieties, and the stimuli are comparable
across varieties.

Each audio file name consists of the stimulus ID, the language variety
and the USER ID, separated by underscores (e.g., `S0027_cim_U0056`). The
first letter of the stimulus ID indicates the linguistic domain under
investigation:

-   "S" = Syntax (sentences)
-   "T" = Morphology (tales)
-   "W" = Phonology (words)

Some speakers recorded more than one audio file for the same stimulus.
These files are distinguished by a lowercase letter appended to the
USER ID: `S001_cim_U0056a`, `S001_cim_U0056b`.
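
The naming convention above can be turned into a small parser. This is a sketch under stated assumptions: the regular expression accepts any number of digits in the IDs (the examples in this readme use both three and four), and the names `Recording`, `DOMAINS` and `parse_recording_name` are illustrative, not part of the corpus.

```python
import re
from typing import NamedTuple, Optional

# First letter of the stimulus ID -> linguistic domain (see list above).
DOMAINS = {"S": "syntax", "T": "morphology", "W": "phonology"}

# <stimulus>_<variety>_<user>, with an optional letter for repeated takes.
FILE_NAME = re.compile(
    r"^(?P<stimulus>[STW]\d+)_(?P<variety>[a-z]+)_(?P<user>U\d+)(?P<take>[a-z]?)$"
)


class Recording(NamedTuple):
    stimulus_id: str
    domain: str
    variety: str
    user_id: str
    take: str  # "" for a single recording, otherwise "a", "b", ...


def parse_recording_name(name: str) -> Optional[Recording]:
    """Split a file name into its parts; return None if it does not match."""
    stem = name[:-5] if name.endswith(".flac") else name
    match = FILE_NAME.match(stem)
    if match is None:
        return None
    stimulus = match["stimulus"]
    return Recording(
        stimulus, DOMAINS[stimulus[0]], match["variety"], match["user"], match["take"]
    )
```

For instance, `parse_recording_name("S0027_cim_U0056.flac")` yields a `Recording` whose `domain` is `"syntax"` and whose `take` is the empty string, while names that do not follow the pattern return `None`.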

#### **Phonology**

This section investigates three main phonological phenomena across
language varieties:

-   the obstruent consonants
-   /s/ retraction before consonant
-   the realization of /r/

These phenomena have been investigated within specific contexts. This
information is reported in the corresponding table in the metadata
folder (Words.ods).

#### **Morphology**

This section investigates the following morphological phenomena:

-   Case syncretism, especially in personal pronouns and articles
-   Realization of subject and object clitics

The specific linguistic variables investigated are reported in the
corresponding table in the metadata folder (Tales.ods).

#### **Syntax**

This section investigates the following syntactic topics:

-   Syntax of the adjective within DP
-   Syntax of clitics
-   Negative concord
-   Pro drop
-   Complementizers
-   Locative particles
-   Auxiliary selection
-   Pronouns
-   Proper name syntax

Each of these broad topics includes the analysis of different linguistic
variables which are reported in the corresponding table in the metadata
folder (Sentences.ods).

### **Metadata folder**

This folder contains four tables with the relevant information about the
speakers and the linguistic stimuli:

#### **Users**

The speaker information includes:

-   USER ID (included in the audio file name)
-   Language variety
-   Geographic location
-   Personal information (gender, age)
-   Linguistic profile: language proficiency, frequency and contexts of
    use
-   VinKo questionnaire version
-   Period of data collection
-   Note: comments on participants and questionnaire administration

#### **Words**

The metadata for the phonology section includes the following
information:

-   STIMULUS ID (included in the audio file name)
-   Language variety
-   Graphical rendering of item (Graphy)
-   Gloss in Standard German and/or Standard Italian
-   Investigated phenomenon
-   Word context (e.g., initial, medial, final position)
-   Target phoneme/specific context under investigation
-   VinKo questionnaire version

#### **Sentences**

The metadata for the syntax section includes the following information:

-   STIMULUS ID (included in the audio file name)
-   Sentence in Standard German and/or Standard Italian
-   Syntactic topic
-   Linguistic variable under investigation
-   VinKo questionnaire version
-   Code (for internal use only)

#### **Tales**

The metadata for the morphology section includes the following
information:

-   STIMULUS ID (included in the audio file name)
-   Image used as visual context
-   Text in Standard German and/or in Standard Italian
-   Task (Translation; guided free-speech Production)
-   Linguistic variable under investigation
-   VinKo questionnaire version
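
Since the USER ID and STIMULUS ID in each file name also appear in the metadata tables, recordings can be linked to their metadata by a simple lookup. The sketch below assumes that `Users.ods` has been exported to CSV first. The column names `user_id` and `variety` and the two sample rows are hypothetical stand-ins, so check the real table headers before reusing this.

```python
import csv
import io


def index_by(rows, key):
    """Index metadata rows (dicts) by the value of one column."""
    return {row[key]: row for row in rows}


def user_id_of(file_name):
    """Extract the USER ID from a file name such as 'S0027_cim_U0056.flac'."""
    # Third underscore-separated field, without extension or take letter.
    field = file_name.split(".")[0].split("_")[2]
    return field.rstrip("abcdefghijklmnopqrstuvwxyz")


def speaker_for(file_name, users):
    """Return the Users row for a recording, or None if there is none."""
    return users.get(user_id_of(file_name))


# Hypothetical CSV export of Users.ods; real column names may differ.
USERS_CSV = """user_id,variety
U0056,cim
U0392,tir
"""

users = index_by(csv.DictReader(io.StringIO(USERS_CSV)), "user_id")
print(speaker_for("S0027_cim_U0056.flac", users))
print(speaker_for("S0104_lldgh_U0648.flac", users))
```

The lookup returns `None` rather than raising an error, because some recordings have no matching row in the Users table (see the note on user U0648 in section 6).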

4. Additional information
-------------------------

### **Websites**

-   VinKo Atlas\
    <https://www.vinko.it/listen-explore.php>

### **Scientific Publications**

-   Cordin, P., Rabanus, S., Alber, B., Mattei, A., Casalicchio, J.,
    Tomaselli, A., Bidese, E., & Padovan, A. (2018). VinKo, Versione 2.
    Korpus im Text, Serie A, 13739
    <http://www.kit.gwi.uni-muenchen.de/?p=13739&v=2>

### **Dissemination**

-   Vita Trentina (06/2017)
    <https://www.vitatrentina.it/2017/06/29/parli-dialetto-fatti-sentire/>

-   La Usc di Ladins (04/2021)
    <https://www.lausc.it/valedes-ladines/fascia/15356-vinko-n-projet-per-valoriser-i-lengac-mendres>

-   Di Sait vo Lusérn (04/2021)
    <http://mediateca.istitutocimbro.it/applications/webwork/site_bibliocim/media/sait_vo_lusern/2021_04_02.pdf>

-   Zimbar Earde (27/03/2021 - min 7:00)
    <https://www.youtube.com/watch?v=U8rBYsAlpa0>

-   RAI Radio1 Interview - Alla sorgente del sapere (10/2021)
    <http://www.urly.it/3gva6>

5. Error reporting
------------------

The collected files are raw audio data and some may be missing or empty.
If you spot any inconsistency, error, or corrupted recording please
contact us at <vinko@ateneo.univr.it>.
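
Before reporting, a quick scan can list candidate files. The following sketch treats "empty" as a zero-byte file, which is an assumption: a file that has a valid header but contains no audible speech would not be detected this way. The function name `find_empty_recordings` is illustrative.

```python
from pathlib import Path


def find_empty_recordings(corpus_root):
    """Return the sorted paths of all zero-byte .flac files."""
    return sorted(
        path
        for path in Path(corpus_root).rglob("*.flac")
        if path.stat().st_size == 0
    )
```

The returned paths include the folder names, so the affected language variety and USER ID can be quoted directly in an error report.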

6. Updates
----------

### **Changes v1.0 \> v1.1**

-   Addition of Sappadino to the language varieties under investigation

-   Metadata folder \> Words: phonology section now includes new items
    for Sappadino. For each word, the glosses are provided both in
    Standard German and Standard Italian.

-   Metadata folder \> Sentences: fixed a typo for item S0031 (him \>
    ihm)

-   Upload of the audio data collected from June to December 2021. The
    total number of audio files increased from 37806 to 63863.

-   Audio data split into multiple folders for Veneto and Tyrolean due
    to the large amount of data (see 3. Data Structure for details on
    the split)

-   Metadata folder \> Users: new participants included

-   Metadata folder \> Users: column 'note' included - here you can find
    comments on participants and questionnaire administration, e.g., if
    the participant selected the wrong language variety

-   User U0648 has been removed from Metadata folder \> Users because
    the entry refers to old linguistic data without precise information
    about the speaker. The corresponding audio files can still be
    accessed in Audio folder \> lldgh (Gardenese)

-   RAI Radio1 Interview added to 4. Additional information \>
    Dissemination