AThEME Verona-Trento Corpus


The AThEME Verona-Trento Corpus is a spoken corpus composed of data collected during the AThEME (Advancing the European Multilingual Experience, 2014–2019) project in Work Package 2 ‘Regional Languages’ by the units of Verona (Prof. Birgit Alber, Prof. Andrea Padovan, Prof. Stefan Rabanus, Prof. Alessandra Tomaselli) and Trento (Prof. Ermenegildo Bidese, Prof. Jan Casalicchio, Prof. Patrizia Cordin). The AThEME project was a large-scale European project that took “an integrated approach towards the study of multilingualism in Europe by incorporating and combining linguistic, cognitive and sociological perspectives; by studying multilingualism in Europe at three different levels of societal magnitude, viz. the individual multilingual citizen, the multilingual group, and the multilingual society; by using a palate of research methodologies, ranging from fieldwork methods to various experimental techniques and advanced EEG/ERP technologies” (see the project description). The contribution of the Trento/Verona units was titled ‘Germanic-Romance language contact in the Southern-Central Alps’, and the data was collected via linguistic fieldwork on location. This corpus is composed of the resulting audio files and transcriptions.

The corpus contains the responses to two questionnaires: a phonological questionnaire and a morpho-syntactical questionnaire. The phonological questionnaire targeted the obstruent inventory, final devoicing, /s/ retraction, and the realization of /r/. The morpho-syntactical questionnaire targeted adjectives (position and inflection of attributive, predicative and adverbial adjectives; comparatives and superlatives), pronouns (personal-pronoun paradigm for case, number, and gender marking), nouns and articles (gender; proper names), the formation of movement verbs (prefix vs. locative particle), the syntax of subject and object pronouns and clitics (enclisis/proclisis), negative concord, pro-drop, complementizers, auxiliary selection, and restructuring. The corpus also contains data on the Germanic minority languages of Timau and Sauris. The Timau data was collected during the PRIN 2017 ‘Models of language variation and change: new evidence from language contact’ project in the unit located at the University of Verona (Prof. Alessandra Tomaselli, Dott. Francesco Zuin). The Sauris data was collected in 2017 by Prof. Alessandra Tomaselli (University of Verona) and Prof. Ermenegildo Bidese (University of Trento).



README Structure

  1. General
  2. Abbreviations
  3. Data Structure
  4. Additional Information

1. General

2. Abbreviations


Phonological phenomena

Syntactic phenomena

3. Data Structure

The file structure under each language variety is identical and is organized as follows:

AThEME Verona-Trento Corpus 
¦   ReadMe.txt
+--- Audio folders
¦    +--- cim
¦         |--- F0190_cim_U11.flac
¦         |--- F0194_cim_U11.flac
¦         |--- S026_cim_U11.flac
¦         |--- S085_cim_U11.flac
¦         |--- S085_cim_U12.flac
¦         | ...
¦    +--- lldfa 
¦         | ...(equivalent to "cim")
¦    +--- lldfo
¦         | ...(equivalent to "cim")
¦    +--- mhn
¦         | ...(equivalent to "cim")
¦    +--- tre
¦         | ...(equivalent to "cim")
¦    +--- tir
¦         | ...(equivalent to "cim")
¦    +--- vec
¦         | ...(equivalent to "cim")
¦    +--- tis
¦         | ...(equivalent to "cim")
¦    +--- zah
¦         | ...(equivalent to "cim")
+--- Metadata
     |--- participants.ods
     |--- syntax.ods 
     |--- phonology.ods
     |--- syntax_transcription.ods

As can be seen, the AThEME Verona-Trento Corpus consists of two main folders. The audio folder contains the segmented audio recordings collected from speakers. The metadata folder contains tables with relevant linguistic information and sociolinguistic information about the speakers, as well as the transcriptions of the audio files where available. In addition, this readme file provides the main information about the corpus.
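The layout described above can be checked with a short script. The following is a minimal Python sketch, not part of the corpus distribution: it assumes that the directory passed in directly contains one sub-folder per language variety (cim, vec, etc.), as in the listing above, and the function name count_audio_files is hypothetical.

```python
from pathlib import Path
from typing import Dict


def count_audio_files(audio_root: Path) -> Dict[str, int]:
    """Count the .flac files in each language-variety sub-folder.

    Assumption: audio_root directly contains one folder per variety,
    as shown in the file listing of this README.
    """
    counts: Dict[str, int] = {}
    for variety_dir in sorted(p for p in audio_root.iterdir() if p.is_dir()):
        # Only direct children are counted; the listing shows no nesting.
        counts[variety_dir.name] = sum(1 for _ in variety_dir.glob("*.flac"))
    return counts
```

Calling count_audio_files(Path("path/to/audio")) returns a dictionary mapping each variety code to its number of recordings; the path is a placeholder to be replaced with the local location of the audio folder.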

Audio folders

There are audio recordings in nine language varieties for two different levels of linguistic analysis: phonology and syntax. The investigation of each linguistic domain involves a different type of linguistic stimulus: single words for phonology and entire sentences for syntax.

The audio file name always consists of the stimulus ID, the language variety, and the user ID (e.g., S026_cim_U12). The first letter of the stimulus ID indicates the linguistic domain under investigation: F for phonology and S for syntax (e.g., F0190 and S026 in the file listing above).

The following numbers indicate the associated stimulus word/sentence which can be found in the files ‘phonology.ods’ and ‘syntax.ods’ located in the metadata folder (e.g., S026 is stimulus 26 of the syntactical questionnaire).

The second part of the ID is the code for the specific language variety (e.g., cim indicates that the language variety is Cimbrian). The codes are identical to the folder names shown in the file listing above.

The last part of the audio ID is the user ID (e.g., U16); the corresponding sociolinguistic information can be found in the file ‘participants.ods’, located in the metadata folder.

Some speakers recorded more than one audio file for the same stimulus. These files are reported in the corpus as follows: S042_cim_U13a, S042_cim_U13b, etc.
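The naming convention above can be parsed mechanically. The following is a minimal Python sketch based only on the examples given in this README (e.g., F0190_cim_U11, S042_cim_U13a); the regular expression, the AudioName type and the function parse_audio_name are illustrative assumptions, not an official tool of the corpus.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the README examples: domain letter + stimulus number,
# variety code, user ID, and an optional lowercase letter for repeated takes.
AUDIO_NAME = re.compile(
    r"^(?P<domain>[A-Z])(?P<number>\d+)"
    r"_(?P<variety>[a-z]+)"
    r"_(?P<user>U\d+)(?P<take>[a-z]?)$"
)


class AudioName(NamedTuple):
    stimulus_id: str      # e.g. "S042"
    number: int           # e.g. 42
    variety: str          # e.g. "cim"
    user_id: str          # e.g. "U13"
    take: Optional[str]   # e.g. "a", or None if there is a single recording


def parse_audio_name(name: str) -> AudioName:
    """Split an audio file name (with or without .flac) into its parts."""
    stem = name[:-5] if name.lower().endswith(".flac") else name
    match = AUDIO_NAME.match(stem)
    if match is None:
        raise ValueError(f"Unexpected audio file name: {name!r}")
    return AudioName(
        stimulus_id=match["domain"] + match["number"],
        number=int(match["number"]),
        variety=match["variety"],
        user_id=match["user"],
        take=match["take"] or None,
    )
```

The stimulus_id field can then be looked up in ‘phonology.ods’ or ‘syntax.ods’, and the user_id field in ‘participants.ods’.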

Metadata folder

This folder contains four tables with the relevant information about the speakers and the linguistic stimuli: participants.ods, phonology.ods, syntax.ods, and syntax_transcription.ods.


The speaker information includes:
- USER ID (included in the audio file name)
- Language variety
- Geographic location: name of community and GeoNames code
- Personal information: gender and age of participants
- Year of data collection
- Linguistic profile: language proficiency, frequency of use, and contexts of use (whether they use the language with family and/or friends)


The phonological questionnaire investigates three main phonological phenomena across language varieties:
- the obstruent consonants
- /s/ retraction before a consonant
- the realization of /r/

These phenomena have been investigated within specific contexts.

The metadata for the phonology section includes the following information:
- STIMULUS ID (included in the audio file name)
- Language variety
- Graphical rendering of the item (Graphy)
- Gloss in Standard German and/or Standard Italian
- Investigated phenomenon
- Word context (e.g., word-initial, medial, or final position)
- Target phoneme/specific context under investigation
- VinKo Corpus reference for words elicited in both corpora
- Notes: indicates items that were realized differently from the target items and, in the case of Trentino words, the locations in which the item was elicited, since not all items were elicited in every location


The syntactic questionnaire investigates the following syntactic topics:
- Syntax of the adjective within DP
- Syntax of clitics
- Negative concord
- Pro-drop
- Complementizers
- Locative particles
- Auxiliary selection
- Pronouns
- Proper name syntax

The metadata for the syntax section includes the following information:
- STIMULUS ID (included in the audio file name)
- Sentence in Standard German and/or Standard Italian
- Verbal context of the stimulus in Standard German and/or Standard Italian (where applicable)
- Syntactic topic
- Linguistic variable under investigation
- VinKo Corpus ID for sentences elicited in both corpora
- ATHEME ID (for internal use only)
- Questionnaire version/variety for which the sentence was elicited


The transcription file includes:
- Sentence ID (identical to the associated audio file name)
- Transcription of the audio file in local orthography
- Stimulus sentence and verbal context
- Language variety
- VinKo Corpus ID for sentences elicited in both corpora
- ATHEME ID (for internal use only)
- Audio: yes or no. ‘Yes’ indicates that the corpus contains an audio file to go with the transcription; ‘no’ indicates that no audio is available.
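Because the sentence ID is identical to the audio file name, transcriptions can be linked to their recordings. The following is a minimal Python sketch under stated assumptions: the rows are taken to be already loaded from ‘syntax_transcription.ods’ as dictionaries, the keys "sentence_id" and "audio" are hypothetical stand-ins for the actual column headers, and the function name audio_paths is illustrative.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def audio_paths(rows: Iterable[Dict[str, str]], audio_root: Path) -> List[Path]:
    """Return the expected .flac path for every transcription that has audio.

    Assumptions: each row has a "sentence_id" such as "S026_cim_U11" and an
    "audio" value of "yes" or "no"; both key names are hypothetical.
    """
    paths: List[Path] = []
    for row in rows:
        if row["audio"].strip().lower() != "yes":
            continue  # transcription without a recording
        sentence_id = row["sentence_id"]
        # The variety code is the second underscore-separated part of the ID.
        variety = sentence_id.split("_")[1]
        paths.append(audio_root / variety / f"{sentence_id}.flac")
    return paths
```

The returned paths are only the expected locations under the folder layout described in section 3; checking each one with Path.exists() confirms whether the file is actually present.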

4. Additional Information

Scientific Publications