2024-03-28T22:24:55Z http://clarin.eurac.edu/repository/oai/request
oai:clarin.eurac.edu:20.500.12124/3 2023-03-17T15:51:45Z hdl_20.500.12124_35 hdl_20.500.12124_2
PAISÀ Corpus of Italian Web Text
Lyding, Verena
Stemle, Egon
Borghetti, Claudia
Brunello, Marco
Castagnoli, Sara
Dell’Orletta, Felice
Dittmann, Henrik
Lenci, Alessandro
Pirrelli, Vito
web corpus
language learning
The Paisà corpus is a large collection of Italian web texts, licensed under Creative Commons (Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike). It was created in the context of the PAISÀ project.
The documents were selected in two different ways. One part of the corpus was constructed using a method inspired by the WaCky project. We created 50,000 word pairs by randomly combining terms from an Italian basic vocabulary list and used the pairs as queries to the Yahoo! search engine to retrieve candidate pages. We limited hits to pages in Italian with one of the following Creative Commons licenses: CC-Attribution, CC-Attribution-Sharealike, CC-Attribution-Sharealike-Non-commercial, and CC-Attribution-Non-commercial. Pages that were wrongly tagged as CC-licensed were eliminated using a blacklist that was populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system.
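The query-generation step described above can be sketched roughly as follows. This is only an illustration, not the project's actual script: the function name `make_query_pairs`, the fixed seed, the choice to treat pairs as unordered and distinct, and the tiny stand-in vocabulary are all assumptions that do not come from the corpus description.

```python
import random


def make_query_pairs(vocabulary, n_pairs, seed=0):
    """Return n_pairs distinct two-word queries drawn at random from vocabulary.

    Assumption: a pair is unordered, so ("casa", "tempo") and
    ("tempo", "casa") count as the same query.
    """
    words = sorted(set(vocabulary))
    max_pairs = len(words) * (len(words) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("vocabulary too small for the requested number of pairs")
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        first, second = rng.sample(words, 2)  # two different words
        pairs.add(tuple(sorted((first, second))))
    return [" ".join(pair) for pair in sorted(pairs)]


# Hypothetical stand-in for the Italian basic vocabulary list.
sample_vocabulary = ["casa", "tempo", "acqua", "strada", "libro", "mano"]
queries = make_query_pairs(sample_vocabulary, n_pairs=5)
```

With the real vocabulary list, `n_pairs=50000` would correspond to the number of pairs mentioned in the description; each resulting string would then be sent to a search engine as one query.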
The remaining pages in the PAISÀ corpus come from the Italian versions of various Wikimedia Foundation projects, namely: Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, Wikivoyage. The official Wikimedia Foundation dumps were used, extracting text with Wikipedia Extractor.
Once all materials were downloaded, the collection was filtered, discarding empty documents and documents containing fewer than 150 words.
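A minimal sketch of such a length filter is given below. The 150-word threshold comes from the description; counting words by splitting on whitespace and the names `keep_document` and `filter_collection` are assumptions made for illustration, since the original tokenisation is not specified.

```python
MIN_WORDS = 150  # threshold stated in the corpus description


def keep_document(text, min_words=MIN_WORDS):
    """Keep a document only if it has at least min_words whitespace-separated words."""
    # An empty document has zero words and is therefore discarded as well.
    return len(text.split()) >= min_words


def filter_collection(documents, min_words=MIN_WORDS):
    """Return the documents that pass the minimum-length check."""
    return [doc for doc in documents if keep_document(doc, min_words)]
```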
The corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia, approx. 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.
2013-01
corpus
http://hdl.handle.net/20.500.12124/3
ita
http://aclweb.org/anthology/W14-0406
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
PUB
application/gzip
application/gzip
application/gzip
application/gzip
text/plain; charset=utf-8
downloadable_files_count: 4
Institute for Applied Linguistics, Eurac Research
http://www.corpusitaliano.it
oai:clarin.eurac.edu:20.500.12124/7 2023-03-17T16:05:41Z hdl_20.500.12124_35 hdl_20.500.12124_2
DIDI - The DiDi Corpus of South Tyrolean CMC 1.0.0
Frey, Jennifer-Carmen
Glaznieks, Aivars
Stemle, Egon W.
Facebook
Social Media
Computer-mediated Communication
Chat
Status Updates
Comment
Social Networking Sites
Multilingualism
Dialect
South Tyrol
Instant Messaging
CMC
The DiDi corpus has an overall size of around 600,000 tokens gathered from 136 South Tyrolean Facebook users who participated in the DiDi project. It consists of 11,102 Facebook wall posts, 6,507 wall comments and 22,218 private messages. All messages were written by the participants throughout the year 2013. Please read the full description of the corpus for further details. Please also consider the description of the data collection method and the full description of the DiDi project and its research questions.
As every participant could offer either his/her private messages, his/her texts on the wall or both, the corpus comprises wall posts and wall comments from 130 profiles and private messages from 56 profiles; 50 participants granted access to both types of data. The wall posts and comments are freely accessible. Due to privacy issues, access to the private messages is restricted. Access to the private messages can be given for scientific research only, after signing a non-disclosure agreement. If you are interested in the data for scientific reasons, please contact the research team.
All texts were anonymised in order to guarantee that the participants' identity cannot be inferred from the texts. The anonymisation included person names, group names, geographical names and adjectival references, institution names, hyperlinks, mail addresses, phone numbers, numbers of bank accounts, servers, postal codes and other private information. Please read the anonymisation document for the anonymisation keys.
The corpus offers a vast range of research opportunities for linguists who are interested in CMC in general and, more specifically, in multilingual language use, the use of regional varieties, code switching, code shifting and code mixing phenomena, etc.
Access to the DiDi corpus: https://commul.eurac.edu/annis/didi
2019-03-07
corpus
http://hdl.handle.net/20.500.12124/7
deu
ita
eng
lad
https://gitlab.inf.unibz.it/commul/didi/data-bundle/-/tags/v1.0.0
http://www.eurac.edu/en/research/autonomies/commul/Documents/DiDi/NLP4CMC-2015_DiDi_paper.pdf
http://www.eurac.edu/en/research/autonomies/commul/Documents/DiDi/didi_clic-it2016_FINAL.pdf
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
ACA
text/html
text/html
application/zip
application/zip
application/zip
application/zip
text/plain; charset=utf-8
text/plain
text/plain
downloadable_files_count: 6
Institute for Applied Linguistics, Eurac Research
http://www.eurac.edu/didi
oai:clarin.eurac.edu:20.500.12124/5 2023-03-17T15:51:45Z hdl_20.500.12124_1 hdl_20.500.12124_4
MERLIN Written Learner Corpus for Czech, German, Italian 1.0
Wisniewski, Katrin
Abel, Andrea
Vodičková, Kateřina
Plassmann, Sybille
Meurers, Detmar
Woldt, Claudia
Schöne, Karin
Blaschitz, Verena
Lyding, Verena
Nicolas, Lionel
Vettori, Chiara
Pečený, Pavel
Hana, Jirka
Čurdová, Veronika
Štindlová, Barbora
Klein, Gudrun
Lauppe, Louise
Boyd, Adriane
Bykh, Serhiy
Krivanek, Julia
CEFR
language learning
learner corpus
The MERLIN corpus is a written learner corpus for Czech, German, and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that provide researchers with concrete examples of learner performance and progress across multiple proficiency levels.
2014-12
corpus
http://hdl.handle.net/20.500.12124/5
ces
deu
ita
info:eu-repo/grantAgreement/EC/FP7/200250
https://gitlab.inf.unibz.it/commul/merlin-platform/data-bundle/-/tags/v1.0
http://www.lrec-conf.org/proceedings/lrec2014/summaries/606.html
http://hdl.handle.net/20.500.12124/6
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
PUB
text/html
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
text/plain; charset=utf-8
text/plain
downloadable_files_count: 10
Institute for Applied Linguistics, Eurac Research
https://merlin-platform.eu
oai:clarin.eurac.edu:20.500.12124/10 2023-03-17T17:04:03Z hdl_20.500.12124_1 hdl_20.500.12124_4
KoKo German L1 Learner Corpus v1
Abel, Andrea
Glaznieks, Aivars
Culy, Chris
learner corpus
German varieties
students in secondary school
argumentative essays
The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It has been created with the aim to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms.
The corpus building process was guided by two goals:
1. describe writing skills at the transition from secondary school to university,
2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.).
The pupils were selected from three different German-speaking areas:
- North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany).
Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%).
Person-related metadata provides information about:
- writer's L1
- writer's gender
- type of school the essay comes from
- location of the school the essay comes from
- grade attended at data collection
2012-12
corpus
http://hdl.handle.net/20.500.12124/10
deu
https://gitlab.inf.unibz.it/commul/koko/data/bundle/-/tags/v1
http://apples.jyu.fi/article/abstract/305
http://hdl.handle.net/20.500.12124/11
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
ACA
text/plain; charset=utf-8
text/plain
text/plain
application/zip
text/html
text/html
application/zip
downloadable_files_count: 4
Institute for Applied Linguistics, Eurac Research
http://www.korpus-suedtirol.it/KoKo.html
oai:clarin.eurac.edu:20.500.12124/6 2023-03-17T15:51:45Z hdl_20.500.12124_1 hdl_20.500.12124_39 hdl_20.500.12124_4 hdl_20.500.12124_43
MERLIN Written Learner Corpus for Czech, German, Italian 1.1
Wisniewski, Katrin
Abel, Andrea
Vodičková, Kateřina
Plassmann, Sybille
Meurers, Detmar
Woldt, Claudia
Schöne, Karin
Blaschitz, Verena
Lyding, Verena
Nicolas, Lionel
Vettori, Chiara
Pečený, Pavel
Hana, Jirka
Čurdová, Veronika
Štindlová, Barbora
Klein, Gudrun
Lauppe, Louise
Boyd, Adriane
Bykh, Serhiy
Krivanek, Julia
CEFR
language learning
learner corpus
The MERLIN corpus is a written learner corpus for Czech, German, and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that provide researchers with concrete examples of learner performance and progress across multiple proficiency levels.
2018-08-24
corpus
http://hdl.handle.net/20.500.12124/6
ces
deu
ita
info:eu-repo/grantAgreement/EC/FP7/200250
https://gitlab.inf.unibz.it/commul/merlin-platform/data-bundle/-/tags/v1.1
http://www.lrec-conf.org/proceedings/lrec2014/summaries/606.html
http://hdl.handle.net/20.500.12124/5
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
PUB
text/html
text/html
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
text/plain; charset=utf-8
text/plain
text/plain
downloadable_files_count: 10
Institute for Applied Linguistics, Eurac Research
https://merlin-platform.eu
oai:clarin.eurac.edu:20.500.12124/34 2023-02-03T12:41:47Z hdl_20.500.12124_1 hdl_20.500.12124_4
WIP: LEONIDE PoS training
Schmalz, Veronica
Frey, Jennifer-Carmen
Stemle, Egon W.
CMC
Unfinished Draft
2021-08-05
corpus
http://hdl.handle.net/20.500.12124/34
ita
downloadable_files_count: 0
Learner Language
oai:clarin.eurac.edu:20.500.12124/25 2023-03-17T16:40:50Z hdl_20.500.12124_1 hdl_20.500.12124_39 hdl_20.500.12124_4 hdl_20.500.12124_42 hdl_20.500.12124_43
LEONIDE - Longitudinal Learner Corpus in Italiano, Deutsch and English 1.1
Glaznieks, Aivars
Frey, Jennifer-Carmen
Stopfner, Maria
Zanasi, Lorenzo
Nicolas, Lionel
multilingualism
evaluation
language competences
learner corpus
L1
L2
student essays
picture story
opinion texts
argumentative essay
LEONIDE is a longitudinal corpus of student essays documenting the language competences and writing development of lower secondary school students in three different languages.
The corpus contains 2,512 texts from 163 pupils who participated in the project “One school, many languages”, conducted in eight schools in the officially multilingual Italian province of South Tyrol / Alto Adige (Zanasi & Stopfner, 2018). The aim of the project was to document the development of the pupils' plurilingual linguistic and communicative skills by collecting oral and written language samples in Italian, German and English, in order to obtain a global view of their individual linguistic repertoire.
LEONIDE contains all the texts written by the participating students during the course of the project; the overall size of the corpus amounts to ca. 240,000 tokens. The texts were collected over the span of 3 consecutive years (2015-2018) in public middle schools (i.e. lower secondary school, grade 6 to grade 8). The pupils were 11 years old at the beginning of the data collection and 13 years old at the end. In each grade, two written texts were collected that differ with respect to genre: the first text was elicited using a picture story re-telling task; the second text is an opinion text on different aspects related to the pupils’ life and public discourse. For each genre and each grade, the corpus provides texts in the three languages German, Italian and English. In order to reflect the school system of the Province of South Tyrol / Alto Adige, about half of the texts were collected in four schools in which German is the main language of teaching and Italian is taught as L2. The other half of the texts were collected in four schools in which Italian is the main language of teaching and German is taught as L2. In all schools, English is taught as L3 (i.e. as a foreign language at school). Subdivided by language, the corpus contains 844 Italian, 833 German and 835 English texts.
Manual annotation:
The corpus is fully anonymised and annotated with target hypotheses correcting orthography errors in the text as well as annotations on structural elements (paragraphs, line breaks, bullet points, symbols or emoticons etc.), foreign word insertions and transcript surface features (e.g. deletions, corrections or insertions of the student, unreadable or ambiguous items).
Automatic annotation:
Automatic linguistic annotation included sentence splitting, tokenisation, lemmatisation and part-of-speech-tagging.
Text metadata:
The corpus provides a series of relevant person-related metadata (e.g. age, gender, first language(s), school and possible special needs of the students) as well as task-related metadata (e.g. task year, text genre, etc.).
Usage:
As the corpus documents the development of plurilingual competences of individual learners over a period of three years, it will allow both quantitative research on the characteristics of young learners’ language over a relatively long period and investigations of the development of individuals, taking into account a wide range of person-related metadata. In addition, it allows contrastive analyses of the young learners’ progress in their L1, L2 and L3.
Availability:
The corpus will be available for corpus queries via an ANNIS search interface and as download for academic purposes (ACA-BY-NC-NORED 1.0) on the Eurac Research Clarin Centre by the end of 2020.
References:
Zanasi, L. & Stopfner, M. (2018). Rilevare, osservare, consultare. Metodi e strumenti per l’analisi del plurilinguismo nella scuola secondaria di primo grado. In C. M. Coonan, A. Bier & E. Ballarin (Eds.), La didattica delle lingue nel nuovo millennio. Le sfide dell’internazionalizzazione (pp. 135-148). Edizioni Ca’Foscari. http://doi.org/10.30687/978-88-6969-227-7/009
Glaznieks, A., Frey, J.-C., Stopfner, M., Zanasi, L. & Nicolas, L. (accepted): LEONIDE: A longitudinal trilingual corpus of young learners of Italian, German and English. In: International Journal of Learner Corpus Linguistics.
2020-12-18
corpus
http://hdl.handle.net/20.500.12124/25
deu
ita
eng
https://gitlab.inf.unibz.it/commul/leonide/data/bundle/-/tags/v1.1
https://doi.org/10.1075/ijlcr.21004.gla
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
ACA
text/html
text/html
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
text/plain; charset=utf-8
text/plain
text/plain
downloadable_files_count: 9
Institute for Applied Linguistics, Eurac Research
http://sms-project.eurac.edu/
oai:clarin.eurac.edu:20.500.12124/8 2023-03-17T15:51:45Z hdl_20.500.12124_35 hdl_20.500.12124_2
KrdWrd CANOLA Corpus 1.0
Stemle, Egon W.
Steger, Johannes M.
boiler plate removal
web page cleaning
WaC
Web as Corpus
training data
manual annotation
The CANOLA Corpus is a visually annotated English web corpus for training classification engines to remove boilerplate from unseen web pages. It was harvested, annotated and evaluated using the tools and infrastructure of the KrdWrd Project.
2010-09-10
corpus
http://hdl.handle.net/20.500.12124/8
eng
https://github.com/krdwrd/data/releases/tag/v1.0
https://www.sigwac.org.uk/raw-attachment/wiki/WAC5/WAC5_proceedings.pdf
http://hdl.handle.net/20.500.12124/9
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
PUB
application/gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Institute for Applied Linguistics, Eurac Research
https://krdwrd.github.io
oai:clarin.eurac.edu:20.500.12124/9 2023-03-17T15:51:45Z hdl_20.500.12124_35 hdl_20.500.12124_2
KrdWrd CANOLA Corpus 1.1
Stemle, Egon W.
Steger, Johannes M.
boiler plate removal
web page cleaning
WaC
Web as Corpus
training data
manual annotation
The CANOLA Corpus is a visually annotated English web corpus for training classification engines to remove boilerplate from unseen web pages. It was harvested, annotated and evaluated using the tools and infrastructure of the KrdWrd Project.
2010-11-25
corpus
http://hdl.handle.net/20.500.12124/9
eng
https://github.com/krdwrd/data/releases/tag/v1.1
https://github.com/krdwrd/doc_CANOLA/releases/tag/v1.1
https://www.sigwac.org.uk/raw-attachment/wiki/WAC5/WAC5_proceedings.pdf
http://hdl.handle.net/20.500.12124/8
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
PUB
application/pdf
application/gzip
text/plain; charset=utf-8
text/plain
downloadable_files_count: 2
Institute for Applied Linguistics, Eurac Research
https://krdwrd.github.io
oai:clarin.eurac.edu:20.500.12124/11 2023-03-17T16:07:21Z hdl_20.500.12124_1 hdl_20.500.12124_4
KoKo German L1 Learner Corpus v2
Abel, Andrea
Glaznieks, Aivars
Culy, Chris
learner corpus
German varieties
students in secondary school
argumentative essays
The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It has been created with the aim to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms.
The corpus consists of 1503 argumentative essays which contain manually performed transcription annotations and linguistic error annotations. Error annotation relates to the orthographic level only. Transcription annotations reflect surface features of the text, such as the graphical arrangement and self-corrections.
The corpus building process was guided by two goals:
1. describe writing skills at the transition from secondary school to university,
2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.).
The pupils were selected from three different German-speaking areas:
- North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany).
Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%).
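The stratified sampling of classes described above could look roughly like the sketch below. It is not the project's actual procedure: the field names `city_size` and `school_type`, the equal allocation of `per_stratum` classes to every stratum, and the fixed seed are assumptions introduced only to make the idea concrete.

```python
import random
from collections import defaultdict


def stratified_sample(classes, per_stratum, seed=0):
    """Randomly draw up to per_stratum classes from every (city size, school type) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for school_class in classes:
        key = (school_class["city_size"], school_class["school_type"])
        strata[key].append(school_class)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # A stratum with fewer classes than requested contributes all of them.
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Hypothetical sampling frame: two classes per stratum.
frame = [
    {"id": f"{size}-{kind}-{n}", "city_size": size, "school_type": kind}
    for size in ("small", "medium", "big")
    for kind in ("general", "vocational")
    for n in (1, 2)
]
chosen = stratified_sample(frame, per_stratum=1)
```

Here the three city sizes and two school types yield six strata, so `chosen` holds one class from each combination.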
Person-related metadata provides information about:
- writer's L1
- writer's gender
- type of school the essay comes from
- location of the school the essay comes from
- grade attended at data collection
In addition, the corpus is automatically annotated, including tokenisation, sentence splitting, POS-tagging and lemmatization.
2012-12
corpus
http://hdl.handle.net/20.500.12124/11
deu
https://gitlab.inf.unibz.it/commul/koko/data/bundle/-/tags/v2
http://apples.jyu.fi/article/abstract/305
http://www.lrec-conf.org/proceedings/lrec2014/pdf/934_Paper.pdf
http://hdl.handle.net/20.500.12124/10
http://hdl.handle.net/20.500.12124/12
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
ACA
text/html
text/html
application/zip
application/zip
application/zip
text/plain; charset=utf-8
text/plain
text/plain
downloadable_files_count: 5
Institute for Applied Linguistics, Eurac Research
http://www.korpus-suedtirol.it/KoKo.html
oai:clarin.eurac.edu:20.500.12124/12 2023-03-17T16:39:47Z hdl_20.500.12124_1 hdl_20.500.12124_39 hdl_20.500.12124_4 hdl_20.500.12124_43
KoKo German L1 Learner Corpus v3
Abel, Andrea
Glaznieks, Aivars
Culy, Chris
learner corpus
German varieties
students in secondary school
argumentative essays
The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It has been created with the aim to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms.
The corpus consists of 1503 argumentative essays which contain manually performed transcription annotations and linguistic error annotations. Transcription annotations reflect surface features of the text, such as the graphical arrangement and self-corrections. Error annotations relate to the orthographic level (including punctuation errors), and a selection of the texts (n=597) also contain error annotations on the grammatical level.
The corpus building process was guided by two goals:
1. describe writing skills at the transition from secondary school to university,
2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.).
The pupils were selected from three different German-speaking areas:
- North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany).
Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%).
Person-related metadata provides information about:
- writer's L1
- writer's gender
- type of school the essay comes from
- location of the school the essay comes from
- grade attended at data collection
In addition, the corpus is automatically annotated, including tokenisation, sentence splitting, POS-tagging and lemmatization.
2014-12
corpus
http://hdl.handle.net/20.500.12124/12
deu
https://gitlab.inf.unibz.it/commul/koko/data/bundle/-/tags/v3
http://www.lrec-conf.org/proceedings/lrec2014/pdf/934_Paper.pdf
http://hdl.handle.net/20.500.12124/11
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
ACA
text/plain; charset=utf-8
text/plain
text/plain
text/html
text/html
application/zip
application/zip
application/zip
application/zip
application/zip
downloadable_files_count: 7
Institute for Applied Linguistics, Eurac Research
http://www.korpus-suedtirol.it/KoKo.html
oai:clarin.eurac.edu:20.500.12124/15 2023-10-27T10:44:22Z hdl_20.500.12124_1 hdl_20.500.12124_16
Beldeko Summary Corpus v1.0.0
Strobl, Carola
academic writing
L1 Dutch
L2 German
learner corpus
The Beldeko (Belgisches Deutschkorpus) Summary Corpus is a learner corpus that consists of summaries written by advanced L2 German learners (CEF level B2-C1) with L1 Dutch. It has been created with the aim of investigating the academic writing skills in L2 German of third-year students of two bachelor programmes in Applied Linguistics and Linguistics and Literature, respectively.
The corpus consists of 301 summaries (70774 tokens) written by 115 students of three intact classes (convenience sampling). The texts were collected at Ghent University (in 2013 and in 2014) and University College of Ghent (in 2013) as pre- and posttests of an intervention study on collaborative writing carried out by Carola Strobl in the context of her PhD research (Strobl, C. (2015). Affordances of online technologies for academic writing instruction in a foreign language. Ghent University, unpublished doctoral dissertation). 82 students produced three summaries each (pretest, posttest immediately after the three-week intervention, delayed posttest six weeks after the intervention; missing data are indicated as n.a. in the metadata file) and 33 students produced two summaries each (pretest and posttest; missing data are indicated as n.a. in the metadata file).
The metadata file (Beldeko_Summary_1.0.0_metadata.xlsx) provides information about:
• Institution of data collection (HG= University College of Ghent, UG= Ghent University)
• Year of data collection (2013, 2014)
• Participants' gender (f, m)
• Number of texts written and number of tokens in each text (T1, T2, T3)
The individual file names of the corpus reveal institution, year, unique ID of participant (per institution per year), text number, in the given order.
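A small parser for such file names might look like the sketch below. The description only states which pieces of information the names contain and in what order; the underscore separator, the `T1`/`T2`/`T3` text labels, the optional file extension, and the example name `UG_2014_017_T2.txt` are assumptions and may not match the actual files.

```python
import re
from typing import NamedTuple


class SummaryFile(NamedTuple):
    institution: str      # "HG" or "UG"
    year: int             # 2013 or 2014
    participant_id: str   # unique per institution and year
    text_number: int      # 1, 2 or 3


# Assumed layout: INSTITUTION_YEAR_PARTICIPANT_Tn, optionally followed by an extension.
_NAME_PATTERN = re.compile(r"^(HG|UG)_(2013|2014)_([A-Za-z0-9]+)_T([123])(?:\.\w+)?$")


def parse_name(filename):
    """Split a file name into institution, year, participant ID and text number."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    institution, year, participant_id, text_number = match.groups()
    return SummaryFile(institution, int(year), participant_id, int(text_number))


example = parse_name("UG_2014_017_T2.txt")  # hypothetical file name
```

Keeping the participant ID as a string preserves any leading zeros, which matters if IDs are later matched against the metadata file.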
The summaries contain between 37 and 330 words each, with a mean of 230 words (the targeted word count was between 220 and 250 words). Outliers regarding text length were unfinished texts produced by students who struggled with the time restriction of 60 minutes. The texts were written in class, on computers. Students were allowed to use online aids such as dictionaries. Each time, the task consisted of summarizing two texts (fragments of newspaper articles, interviews or websites) about a topic related to language variation in German (Kiezdeutsch, Mundartdebatte in der Schweiz, Viadrinisch, Varianten-Wörterbuch des Deutschen; see also word files provided in metadata). More specifically, the topics were distributed as follows:
Kiezdeutsch: HG_2013_T1, UG_2013_T1, UG_2014_T1
Mundartdebatte in der Schweiz: HG_2013_T2, UG_2013_T2, UG_2014_T2
Viadrinisch: HG_2013_T3,
Varianten-Wörterbuch des Deutschen: UG_2014_T3
2020-02-17
corpus
http://hdl.handle.net/20.500.12124/15
deu
http://hdl.handle.net/1854/LU-6940356
http://hdl.handle.net/20.500.12124/68
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
https://creativecommons.org/licenses/by-nc-nd/4.0/
PUB
text/plain; charset=utf-8
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/pdf
application/pdf
application/pdf
application/pdf
text/plain
text/plain
text/plain
text/plain
downloadable_files_count: 13
Ghent University
oai:clarin.eurac.edu:20.500.12124/24 2023-06-18T18:36:47Z hdl_20.500.12084_71 hdl_20.500.12084_73
ACTER (Annotated Corpora for Term Extraction Research) v1.3
Rigouts Terryn, Ayla
terminology
automatic term extraction
term extraction
comparable corpora
named entities
ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised comparable corpora covering 3 languages (English, French, and Dutch) and 4 domains (corruption, dressage, heart failure, and wind energy).
2019-12-17
corpus
http://hdl.handle.net/20.500.12124/24
eng
fra
nld
https://github.com/AylaRT/ACTER/releases/tag/v1.3
https://doi.org/10.1007/s10579-019-09453-9
http://hdl.handle.net/20.500.12124/38
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
PUB
https://creativecommons.org/licenses/by-nc-sa/4.0/
text/html
application/zip
text/plain; charset=utf-8
text/plain
downloadable_files_count: 2
Ghent University
LT3 Language and Translation Technology Team
https://github.com/AylaRT/ACTER
oai:clarin.eurac.edu:20.500.12124/18 2023-02-03T12:41:47Z hdl_20.500.12124_1 hdl_20.500.12124_4
LEKO Corpus v1
Abel, Andrea
learner corpus
The LEKO corpus is a written learner corpus and constitutes a subcorpus of the KOLIPSI-I Corpus, using 290 texts written in Italian as L2 by German-L1 pupils.
The aim of the LEKO project was to describe the use of phrasemes in these texts. In addition to the annotations available in the KOLIPSI-I Corpus, the LEKO Corpus contains manual annotations including the phraseme category, errors, morpho-syntactic features and error explanations.
2020
corpus
http://hdl.handle.net/20.500.12124/18
ita
downloadable_files_count: 0
Institute for Applied Linguistics, Eurac Research
oai:clarin.eurac.edu:20.500.12124/26 2023-03-17T15:49:40Z hdl_20.500.12124_1 hdl_20.500.12124_39 hdl_20.500.12124_4 hdl_20.500.12124_43
Kolipsi-1 Corpus v1.0
Glaznieks, Aivars
Frey, Jennifer-Carmen
Abel, Andrea
Vettori, Chiara
Nicolas, Lionel
L2
Learner corpora
South Tyrol
argumentative essay
students
high school
upper secondary school
picture story
opinion text
The Kolipsi-1 L2 is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”. In addition, data from L1 pupils were collected exclusively for the creation of a native speaker reference corpus.
The data collection took place in autumn 2007 and is based on two standardized tests for written productions. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket based on a picture story (narrative text genre) and (2) writing a letter to a friend discussing holiday plans (argumentative text genre). For both tasks a time limit of 30 minutes was fixed and no additional reference material was allowed.
CEFR levels have been assigned to all L2 learner texts, providing a holistic score as well as evaluations of coherence, lexis, grammar and sociolinguistic appropriateness.
Person-related metadata provides information about:
- the writer's language background, including L1(s), the L1(s) of mother and father, and a self-declared language group affiliation
- the writer's age, gender and socio-economic status
- the writer's district of residence and whether they live in an urban or rural environment
- the language, location and type of school the writer attended
- whether the writer passed the local bilinguality exam or not
- an anonymous identifier for the writer's school class and L2 teacher to account for class effects
All texts have been transcribed manually adding transcription annotations that reflect surface features of the text, such as the graphical arrangement, and include error annotation on the orthographic level.
In addition to that, all texts were automatically annotated, adding tokenisation, sentence splitting, POS-tagging and lemmatization using an orthographically corrected target version of the corpus.
Kolipsi-1 L2 belongs to the Kolipsi Corpus Family, a series of related learner corpora collected in South Tyrolean upper secondary schools. The corpora of the Kolipsi Corpus Family contain Italian and German learner texts that were collected in the course of the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). The aim of both corpus studies was to analyse the second language competences of South-Tyrolean pupils from upper secondary schools (aged 16 to 18), and to contextualize the results of this investigation by commenting on crucial sociolinguistic and psychosocial aspects that influence it. The results of the follow-up study should be compared to the results of the original KOLIPSI project.
2021-05-05
corpus
http://hdl.handle.net/20.500.12124/26
deu
ita
http://hdl.handle.net/20.500.12124/64
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
ACA
text/html
text/html
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
text/plain; charset=utf-8
text/plain
text/plain
downloadable_files_count: 8
Institute for Applied Linguistics, Eurac Research
https://www.porta.eurac.edu/lci/kolipsi-family/
oai:clarin.eurac.edu:20.500.12124/30 2023-03-17T16:02:03Z hdl_20.500.12124_1 hdl_20.500.12124_39 hdl_20.500.12124_4 hdl_20.500.12124_43
Kolipsi-2 Corpus v1.0
Glaznieks, Aivars
Frey, Jennifer-Carmen
Nicolas, Lionel
Abel, Andrea
Vettori, Chiara
L2 corpora
learner corpus
student essay
argumentative essay
picture story
South Tyrol
The Kolipsi-2 Corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI II project, a replication study of the KOLIPSI project on “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation” that was conducted 7 years after the original study.
The data collection for this second edition took place in spring 2014 and is based on two standardized tests for written production that were aligned with the original tasks for the KOLIPSI study. However, while the first task remained the same for both editions, the second task was slightly adapted. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket based on a picture story (narrative text genre) and (2) writing an e-mail about negative aspects of social-media chats prompted by a letter to the editor in a youth magazine (argumentative text genre). For both tasks, a time limit of 25 minutes was set and no additional reference material was allowed.
CEFR levels have been assigned to all L2 learner texts, providing a holistic score as well as evaluations of coherence, sociolinguistic appropriateness, lexical accuracy, lexical diversity, grammar and orthography.
Person-related metadata provides information about:
- the writer's language background, including L1(s), the L1(s) of mother and father, and a self-declared language group affiliation as well as the predominant language spoken in the area where the writer resides
- the writer's results from an additional language test in the L2 (DIALANG test)
- the writer's competence in the local German dialect (for students with L1 Italian only)
- the writer's age, gender and socio-economic status
- whether the writer lives in an urban or rural environment
- the language, location and type of school the writer attended
- an anonymous identifier for the writer's school class to account for class effects
All texts have been transcribed manually adding transcription annotations that reflect surface features of the text, such as the graphical arrangement, and include error annotation on the orthographic level.
In addition to that, all texts were automatically annotated, adding tokenisation, sentence splitting, POS-tagging and lemmatization using an orthographically corrected target version of the corpus.
Kolipsi-2 belongs to the Kolipsi Corpus Family, a series of related learner corpora collected in South Tyrolean upper secondary schools. The corpora of the Kolipsi Corpus Family contain Italian and German learner texts that were collected in the course of the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). The aim of both corpus studies was to analyse the second language competences of South-Tyrolean pupils from upper secondary schools (aged 16 to 18), and to contextualize the results of this investigation by commenting on crucial sociolinguistic and psychosocial aspects that influence it. The results of the follow-up study should be compared to the results of the original KOLIPSI project.
2021-05-05
corpus
http://hdl.handle.net/20.500.12124/30
ita
deu
http://hdl.handle.net/20.500.12124/66
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
ACA
text/plain; charset=utf-8
text/plain
text/plain
text/html
text/html
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
downloadable_files_count: 8
Institute for Applied Linguistics, Eurac Research
https://www.porta.eurac.edu/lci/kolipsi-family/
oai:clarin.eurac.edu:20.500.12124/32 2023-03-17T15:51:44Z hdl_20.500.12124_36 hdl_20.500.12124_37
VinKo (Varieties in Contact) Corpus v1.0
Rabanus, Stefan
Tomaselli, Alessandra
Padovan, Andrea
Kruijt, Anne
Alber, Birgit
Cordin, Patrizia
Zamparelli, Roberto
Vogt, Barbara Maria
multilingualism
crowdsourcing
German dialects
Italian dialects
Ladin
Cimbrian
Mòcheno
Saurano
language contact
minority languages
VINKO is a spoken corpus based on crowdsourced audio recordings that has been designed to provide relevant linguistic information about the minority languages and dialects spoken in the area between Innsbruck and the Po Valley. The corpus contains audio recordings from local languages and varieties spoken in the regions Trentino-Alto Adige/Südtirol, Veneto, and Friuli-Venezia Giulia, with particular focus on the so-called 'language contact' between Germanic (Cimbrian, Mòcheno, Tyrolean, Saurano) and Romance (Ladin, Trentino, and Veneto dialects). The data collection took place from June 2017 to May 2021.
2021-08-24
corpus
http://hdl.handle.net/20.500.12124/32
ita
deu
cim
lld
Trentino
Tyrolean
Mòcheno
Saurano
Veneto
info:eu-repo/grantAgreement/EC/FP7/613465
http://www.kit.gwi.uni-muenchen.de/?p=13739&v=2
http://hdl.handle.net/20.500.12124/46
Attribution-NonCommercial-ShareAlike 3.0 Italy (CC BY-NC-SA 3.0 IT)
https://creativecommons.org/licenses/by-nc-sa/3.0/it/deed.en
PUB
text/plain; charset=utf-8
text/html
text/plain
application/zip
application/zip
application/zip
downloadable_files_count: 5
University of Verona
https://www.vinko.it
oai:clarin.eurac.edu:20.500.12124/33 2023-11-23T21:34:21Z hdl_20.500.12124_1 hdl_20.500.12124_39 hdl_20.500.12124_4 hdl_20.500.12124_43
LEKO v1.0
Abel, Andrea
Zanasi, Lorenzo
Nicolas, Lionel
Konecny, Christine
Autelli, Erica
Phraseology
Phrasemes
Lexical combinations
learner language
student writing
non-standard language
The LEKO corpora LEKO_Kolipsi and LEKO_Merlin provide lexical annotations for phraseological elements in Italian L2 writing on the basis of a subset of the texts of the Kolipsi-1 corpus and the Merlin corpus respectively. The annotations were jointly created by the University of Innsbruck (Austria) and Eurac Research Bolzano (Italy) within the project LEKO, whose aim was to describe the use of phrasemes in these texts. There are manual annotations for phraseme category, lexical errors, morpho-syntactic features and error explanations.
LEKO_Kolipsi contains about 55 000 tokens in 282 texts from 141 pupils of the final year of upper secondary school, representing two different text types (email and letter, narrative and argumentative genre) as described in the Kolipsi-1 documentation.
LEKO_Merlin contains about 9 000 tokens in 50 texts from 50 examinees, who took part in an official language test (TELC) for Italian.
The documents have been transcribed according to the Kolipsi-1 and Merlin Transcription guidelines. Annotation guidelines for the lexical annotations can be found here.
Note: The LEKO corpora do not contain manual annotations for non-lexical errors, foreign word insertions, target language transcriptions, ambiguous writings or other annotations available in the base corpora Kolipsi-1 and Merlin. In order to retrieve any of those annotations and/or full target versions of the student writings please consult the base corpora directly.
2021
corpus
http://hdl.handle.net/20.500.12124/33
ita
http://hdl.handle.net/10863/7683
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
ACA
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
text/html
text/html
application/zip
application/zip
application/zip
application/zip
application/zip
text/plain; charset=utf-8
downloadable_files_count: 7
Institute for Applied Linguistics, Eurac Research
oai:clarin.eurac.edu:20.500.12124/38 2023-06-18T18:36:41Z hdl_20.500.12084_71 hdl_20.500.12084_73
ACTER (Annotated Corpora for Term Extraction Research) v1.4
Rigouts Terryn, Ayla
terminology
automatic term extraction
term extraction
comparable corpora
named entities
ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised comparable corpora covering 3 languages (English, French, and Dutch) and 4 domains (corruption, dressage, heart failure, and wind energy).
2020-07-15
corpus
http://hdl.handle.net/20.500.12124/38
eng
fra
nld
https://github.com/AylaRT/ACTER/releases/tag/v1.4
https://doi.org/10.1007/s10579-019-09453-9
https://aclanthology.org/2020.computerm-1.12
http://hdl.handle.net/20.500.12124/47
http://hdl.handle.net/20.500.12124/24
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
PUB
https://creativecommons.org/licenses/by-nc-sa/4.0/
text/html
application/zip
text/plain; charset=utf-8
text/plain
downloadable_files_count: 2
Ghent University
LT3 Language and Translation Technology Team
https://github.com/AylaRT/ACTER
oai:clarin.eurac.edu:20.500.12124/45 2023-03-17T15:49:39Z hdl_20.500.12124_36 hdl_20.500.12124_37
Code preference in OLL of accommodation in Palma
Bruyèl-Olmedo, Antonio
Linguistic landscape
Online linguistic landscape
Multilingualism
minority languages
tourism
The file consists of a database in .SAV format (SPSS) of language choice and preference as reflected in the websites of accommodation establishments in the city of Palma de Mallorca (Spain). The database comprises identifying data of all 245 establishments as well as multilingualism information on code choice and preference. The main variables considered are: Post code, Accommodation type, Ownership, Name, Rating, presence of Catalan, L1, L2, L3, L4, L5, L6, Ln, type of multiwriting and Type of Multilingualism.
Code preference includes positions from L1 through L6.
2022-01-12
lexical conceptual resource
http://hdl.handle.net/20.500.12124/45
eng
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
https://creativecommons.org/licenses/by-nc-nd/4.0/
PUB
application/octet-stream
text/plain; charset=utf-8
downloadable_files_count: 1
Escuela Universitaria de Turismo 'Felipe Moreno' (affiliated with Universitat de les Illes Balears)
oai:clarin.eurac.edu:20.500.12124/46 2023-11-03T13:49:13Z hdl_20.500.12124_36 hdl_20.500.12124_37
VinKo (Varieties in Contact) Corpus v1.1
Rabanus, Stefan
Kruijt, Anne
Tagliani, Marta
Tomaselli, Alessandra
Padovan, Andrea
Alber, Birgit
Cordin, Patrizia
Zamparelli, Roberto
Vogt, Barbara Maria
multilingualism
crowdsourcing
German dialects
Italian dialects
Ladin
Cimbrian
Mòcheno
Saurano
language contact
minority languages
Sappadino
VINKO is a spoken corpus based on crowd-sourced audio recordings that has been designed to provide relevant linguistic information about the minority languages and dialects spoken in the area between Innsbruck and the Po Valley. The corpus contains audio recordings from local languages and varieties spoken in the regions Trentino-Alto Adige/Südtirol, Veneto, and Friuli-Venezia Giulia, with particular focus on the so-called 'language contact' between Germanic (Cimbrian, Mòcheno, Tyrolean, Saurano, and Sappadino) and Romance (Ladin, Trentino and Veneto dialects). The data collection took place from June 2017 to December 2021.
2022
corpus
http://hdl.handle.net/20.500.12124/46
ita
deu
cim
lld
Trentino
Tyrolean
Mòcheno
Saurano
Veneto
Sappadino
info:eu-repo/grantAgreement/EC/FP7/613465
http://www.kit.gwi.uni-muenchen.de/?p=13739&v=2
http://hdl.handle.net/20.500.12124/32
http://hdl.handle.net/20.500.12124/74
Attribution-NonCommercial-ShareAlike 3.0 Italy (CC BY-NC-SA 3.0 IT)
https://creativecommons.org/licenses/by-nc-sa/3.0/it/deed.en
PUB
text/plain
text/plain
text/html
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
text/plain; charset=utf-8
downloadable_files_count: 16
University of Verona
https://www.vinko.it
oai:clarin.eurac.edu:20.500.12124/47 2023-06-18T18:36:34Z hdl_20.500.12084_71 hdl_20.500.12084_73
ACTER (Annotated Corpora for Term Extraction Research) v1.5
Rigouts Terryn, Ayla
terminology
automatic term extraction
term extraction
comparable corpora
named entities
ACTER (Annotated Corpora for Term Extraction Research) is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains (corruption, dressage, heart failure, and wind energy).
2022-04-08
corpus
http://hdl.handle.net/20.500.12124/47
eng
fra
nld
https://github.com/AylaRT/ACTER/releases/tag/v1.5
https://doi.org/10.1007/s10579-019-09453-9
https://aclanthology.org/2020.computerm-1.12
http://hdl.handle.net/20.500.12124/38
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
PUB
https://creativecommons.org/licenses/by-nc-sa/4.0/
text/plain
text/plain
text/html
application/zip
text/plain; charset=utf-8
downloadable_files_count: 4
Ghent University
LT3 Language and Translation Technology Team
https://github.com/AylaRT/ACTER
oai:clarin.eurac.edu:20.500.12124/53 2023-03-17T15:51:45Z hdl_20.500.12124_36 hdl_20.500.12124_37
AThEME Verona-Trento Corpus
Tomaselli, Alessandra
Kruijt, Anne
Alber, Birgit
Bidese, Ermenegildo
Casalicchio, Jan
Cordin, Patrizia
Kokkelmans, Joachim
Padovan, Andrea
Rabanus, Stefan
Zuin, Francesco
multilingualism
German dialects
Italian dialects
Fodom Ladin
Fassan Ladin
Mòcheno
Cimbrian
Saurano
Tyrolean
Trentino
Venetan
language contact
minority languages
phonology
syntax
The AThEME Verona-Trento Corpus is a spoken corpus composed of data collected during the AThEME project in Work Package 2 ‘Regional Languages’ by the units of Verona and Trento for minority languages and dialects spoken in the area between Innsbruck and the Po Valley (Tyrolean, Trentino, Fodom Ladin, Fassan Ladin, Mòcheno, Cimbrian, and Venetan). The corpus also contains data on the Germanic minority languages Timavese (PRIN 2017) and Saurano. The corpus contains audio recordings and partial transcriptions of the responses to a phonological questionnaire (topics: obstruents, final devoicing, s-retraction, realization of /r/) and a morpho-syntactical questionnaire (topics: adjectives, pronouns, auxiliary selection, pro-drop, complementizers). The data collection was done via linguistic fieldwork interviews and took place between 2014 and 2019.
2022-12-21
corpus
http://hdl.handle.net/20.500.12124/53
ita
deu
cim
Mòcheno
Trentino
bar
vec
Saurano
Timavese
Tyrolean
lld
info:eu-repo/grantAgreement/EC/FP7/613465
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
PUB
text/plain; charset=utf-8
text/plain
text/plain
text/html
application/zip
application/zip
application/zip
downloadable_files_count: 6
University of Verona
https://cordis.europa.eu/project/id/613465
oai:clarin.eurac.edu:20.500.12124/60 2023-06-18T18:38:20Z hdl_20.500.12084_71 hdl_20.500.12084_72
MT@BZ translation corpus v1.0
De Camillis, Flavia
Chiocchetti, Elena
Stemle, Egon W.
machine translation
annotation
translation errors
accuracy
fluency
Italian
German
South Tyrolean German
legal language
MT@BZ is a translation corpus that consists of 52 decrees published by the Autonomous Province of Bolzano (South Tyrol), aligned with their machine-translated versions. More precisely, it consists of 26 decrees in German and the same 26 in Italian in their official versions, machine translated by the project team into Italian and into German, respectively. 10 of them are COVID-19-related decrees, while 16 are miscellaneous. Overall, they consist of around 130,000 words. The machine translation was carried out with a customized version of ModernMT. The corpus was then uploaded to the annotation platform WebAnno and later transferred to INCEpTION. Four annotators annotated the translation errors made by the machine according to an ad hoc error taxonomy for quality assessment. Finally, the annotations were curated to create a gold standard corpus.
2023-06-13
corpus
http://hdl.handle.net/20.500.12124/60
ita
deu
https://gitlab.inf.unibz.it/commul/mt-bz/data/bundle/-/tags/v1.0
https://events.tuni.fi/uploads/2023/06/11678752-proceedings-eamt2023.pdf
Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/
PUB
text/html
text/html
application/zip
application/zip
application/zip
text/plain; charset=utf-8
downloadable_files_count: 5
Institute for Applied Linguistics, Eurac Research
https://www.eurac.edu/it/institutes-centers/istituto-di-linguistica-applicata/projects/mtbz
oai:clarin.eurac.edu:20.500.12124/61 2023-05-03T22:31:16Z hdl_20.500.12124_1 hdl_20.500.12124_16
Core Metadata [Schema] for Learner Corpora Draft 1.0
Granger, Sylviane
Paquot, Magali
metadata
learner corpus
standardisation
First proposal towards a "Core Metadata [Schema] for Learner Corpora", presented at the "CLARIN workshop on Interoperability of Second Language Resources and Tools", Gothenburg, Sweden, 06-08/12/2017 <https://sweclarin.se/swe/workshop-interoperability-l2-resources-and-tools>. It was circulated as part of the invited talk "Towards standardization of metadata for L2 corpora" that took stock of a range of metadata sets and made suggestions for minimal and maximal design principles, but it was never published (or part of a publication).
2017-12-15
data management resource
http://hdl.handle.net/20.500.12124/61
eng
https://sweclarin.se/sites/sweclarin.se/files/event_atachements/Granger_Paquot_Metadata_G%C3%B6teborg_final.pdf
https://doi.org/10.14428/DVN/4CDX3P
CC0-No Rights Reserved
https://creativecommons.org/publicdomain/zero/1.0/
PUB
application/pdf
text/plain; charset=utf-8
downloadable_files_count: 1
Institute for Applied Linguistics, Eurac Research
oai:clarin.eurac.edu:20.500.12124/62 2023-06-18T15:29:13Z hdl_20.500.12084_71 hdl_20.500.12124_36 hdl_20.500.12084_72 hdl_20.500.12124_37
MT@BZ annotation guidelines v1.0
Chiocchetti, Elena
De Camillis, Flavia
annotation guidelines
machine translation
quality assessment
legal language
Italian
German
accuracy
fluency
South Tyrolean German
The MT@BZ annotation guidelines are guidelines for legal Italian-German machine translation quality assessment. In particular, they cover the South Tyrolean German variety. They are based on version 1.3.3 of the Annotation Guidelines for English-Dutch Machine Translation Quality Assessment (https://www.lt3.ugent.be/publications/annotation-guidelines-for-english-dutch-machine-tr/). The guidelines also include specific instructions on how to annotate errors in WebAnno/INCEpTION and which sources to consult when assessing the correctness of a translation.
2022-05-31
annotation guidelines
http://hdl.handle.net/20.500.12124/62
eng
ita
deu
https://gitlab.inf.unibz.it/commul/mt-bz/guidelines/-/tags/v1.0
http://hdl.handle.net/20.500.12124/60
Creative Commons - Attribution 4.0 International (CC BY 4.0)
PUB
https://creativecommons.org/licenses/by/4.0/
text/html
text/html
application/pdf
text/plain; charset=utf-8
downloadable_files_count: 3
Institute for Applied Linguistics, Eurac Research
https://www.eurac.edu/it/institutes-centers/istituto-di-linguistica-applicata/projects/mtbz
oai:clarin.eurac.edu:20.500.12124/64 2023-03-17T15:49:40Z hdl_20.500.12124_1 hdl_20.500.12124_39 hdl_20.500.12124_4 hdl_20.500.12124_43 hdl_20.500.12124_31
Kolipsi-1 Corpus v1.1
Glaznieks, Aivars
Frey, Jennifer-Carmen
Abel, Andrea
Vettori, Chiara
Nicolas, Lionel
L2
Learner corpora
South Tyrol
argumentative essay
students
high school
upper secondary school
picture story
opinion text
The Kolipsi-1 L2 is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”. In addition, data from L1 pupils were collected exclusively for the creation of a native speaker reference corpus.
The data collection took place in autumn 2007 and is based on two standardized tests for written production. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket based on a picture story (narrative text genre) and (2) writing a letter to a friend discussing holiday plans (argumentative text genre). For both tasks, a time limit of 30 minutes was set and no additional reference material was allowed.
CEFR levels have been assigned to all L2 learner texts, providing a holistic score as well as evaluations of coherence, lexis, grammar and sociolinguistic appropriateness.
Person-related metadata provides information about:
- the writer's language background, including L1(s), the L1(s) of mother and father, and a self-declared language group affiliation
- the writer's age, gender and socio-economic status
- the writer's district of residence and whether they live in an urban or rural environment
- the language, location and type of school the writer attended
- whether the writer passed the local bilinguality exam or not
- an anonymous identifier for the writer's school class and L2 teacher to account for class effects
All texts have been transcribed manually adding transcription annotations that reflect surface features of the text, such as the graphical arrangement, and include error annotation on the orthographic level.
In addition to that, all texts were automatically annotated, adding tokenisation, sentence splitting, POS-tagging and lemmatization using an orthographically corrected target version of the corpus.
Kolipsi-1 L2 belongs to the Kolipsi Corpus Family, a series of related learner corpora collected in South Tyrolean upper secondary schools. The corpora of the Kolipsi Corpus Family contain Italian and German learner texts that were collected in the course of the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). The aim of both corpus studies was to analyse the second language competences of South-Tyrolean pupils from upper secondary schools (aged 16 to 18), and to contextualize the results of this investigation by commenting on crucial sociolinguistic and psychosocial aspects that influence it. The results of the follow-up study should be compared to the results of the original KOLIPSI project.
2023-02-15
corpus
http://hdl.handle.net/20.500.12124/64
deu
ita
http://hdl.handle.net/20.500.12124/26
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
ACA
text/plain; charset=utf-8
text/html
text/html
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
downloadable_files_count: 8
Institute for Applied Linguistics, Eurac Research
https://www.porta.eurac.edu/lci/kolipsi-family/
oai:clarin.eurac.edu:20.500.12124/66 2023-03-17T16:01:25Z hdl_20.500.12124_1 hdl_20.500.12124_39 hdl_20.500.12124_4 hdl_20.500.12124_43 hdl_20.500.12124_31
Kolipsi-2 Corpus v1.1
Glaznieks, Aivars
Frey, Jennifer-Carmen
Nicolas, Lionel
Abel, Andrea
Vettori, Chiara
L2 corpora
learner corpus
student essay
argumentative essay
picture story
South Tyrol
The Kolipsi-2 Corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI II project, a replication study of the KOLIPSI project on “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation” that was conducted 7 years after the original study.
The data collection for this second edition took place in spring 2014 and is based on two standardized tests for written production that were aligned with the original tasks for the KOLIPSI study. However, while the first task remained the same for both editions, the second task was slightly adapted. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket based on a picture story (narrative text genre) and (2) writing an e-mail about negative aspects of social-media chats prompted by a letter to the editor in a youth magazine (argumentative text genre). For both tasks, a time limit of 25 minutes was set and no additional reference material was allowed.
CEFR levels have been assigned to all L2 learner texts, providing a holistic score as well as evaluations of coherence, sociolinguistic appropriateness, lexical accuracy, lexical diversity, grammar and orthography.
Person-related metadata provides information about:
- the writer's language background, including L1(s), the L1(s) of mother and father, and a self-declared language group affiliation as well as the predominant language spoken in the area where the writer resides
- the writer's results from an additional language test in the L2 (DIALANG test)
- the writer's competence in the local German dialect (for students with L1 Italian only)
- the writer's age, gender and socio-economic status
- whether the writer lives in an urban or rural environment
- the language, location and type of school the writer attended
- an anonymous identifier for the writer's school class to account for class effects
All texts have been transcribed manually adding transcription annotations that reflect surface features of the text, such as the graphical arrangement, and include error annotation on the orthographic level.
In addition to that, all texts were automatically annotated, adding tokenisation, sentence splitting, POS-tagging and lemmatization using an orthographically corrected target version of the corpus.
Kolipsi-2 belongs to the Kolipsi Corpus Family, a series of related learner corpora collected in South Tyrolean upper secondary schools. The corpora of the Kolipsi Corpus Family contain Italian and German learner texts that were collected in the course of the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). The aim of both corpus studies was to analyse the second language competences of South-Tyrolean pupils from upper secondary schools (aged 16 to 18), and to contextualize the results of this investigation by commenting on crucial sociolinguistic and psychosocial aspects that influence it. The results of the follow-up study should be compared to the results of the original KOLIPSI project.
2021-02-28
corpus
http://hdl.handle.net/20.500.12124/66
ita
deu
http://hdl.handle.net/20.500.12124/30
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
ACA
text/html
text/html
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
text/plain; charset=utf-8
downloadable_files_count: 8
Institute for Applied Linguistics, Eurac Research
https://www.porta.eurac.edu/lci/kolipsi-family/
oai:clarin.eurac.edu:20.500.12124/68 2023-10-27T10:43:27Z hdl_20.500.12124_1 hdl_20.500.12124_16
Beldeko Summary Corpus v1.1.0
Strobl, Carola
Wedig, Helena
academic writing
L1 Dutch
L2 German
learner corpus
The Beldeko (Belgisches Deutschkorpus) Summary Corpus is a learner corpus that consists of summaries written by advanced L2 German learners (CEFR level B2-C1) with L1 Dutch. It has been created with the aim of investigating the academic writing skills in L2 German of third-year students of two bachelor programmes in Applied Linguistics and Linguistics and Literature, respectively.
The corpus consists of 301 summaries (70,774 tokens) written by 115 students of three intact classes (convenience sampling). The texts were collected at Ghent University (in 2013 and in 2014) and University College of Ghent (in 2013) as pre- and posttests of an intervention study on collaborative writing carried out by Carola Strobl in the context of her PhD research (Strobl, C. (2015). Affordances of online technologies for academic writing instruction in a foreign language. Ghent University, unpublished doctoral dissertation). 82 students produced three summaries each (pretest, posttest immediately after the three-week intervention, delayed posttest six weeks after the intervention; missing data are indicated as n.a. in the metadata file) and 33 students produced two summaries each (pretest and posttest; missing data are indicated as n.a. in the metadata file).
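The counts in the paragraph above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it reads the 82 and 33 students as the planned three-text and two-text groups, so the difference from the reported 301 summaries is interpreted as the texts marked n.a. in the metadata file. That interpretation is an assumption, not something the record states explicitly, and the variable names are hypothetical.

```python
# Figures taken from the corpus description; variable names are illustrative.
students_three_texts = 82   # pretest, posttest, delayed posttest
students_two_texts = 33     # pretest and posttest only
reported_students = 115
reported_summaries = 301

# Number of students implied by the two groups.
total_students = students_three_texts + students_two_texts

# Upper bound on summaries if no text were missing.
max_summaries = students_three_texts * 3 + students_two_texts * 2

# Assumption: the gap corresponds to texts marked "n.a." in the metadata.
missing_summaries = max_summaries - reported_summaries

print(total_students == reported_students)
print(max_summaries, missing_summaries)
```

Under this reading, the two groups add up to the reported 115 students, and a complete design would yield 312 summaries, 11 more than the corpus contains.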
The metadata file (Beldeko_Summary_1.1.0_metadata.xlsx) provides information about:
• Institution of data collection (HG= University College of Ghent, UG= Ghent University)
• Year of data collection (2013, 2014)
• Participants' gender (f, m)
• Number of texts written and number of tokens in each text (T1, T2, T3)
The individual file names of the corpus reveal institution, year, unique ID of participant (per institution per year), text number, in the given order.
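As a minimal sketch of how the file-name convention above could be used, the following parser splits a name into its four components. The exact separator and ID format are not given in this record, so the underscore-separated layout `<institution>_<year>_<participant id>_T<text number>` and the example name `UG_2014_012_T2` are assumptions made for illustration; `BeldekoName` and `parse_name` are hypothetical helpers, not part of the corpus distribution.

```python
import re
from typing import NamedTuple


class BeldekoName(NamedTuple):
    institution: str      # "HG" or "UG", as listed in the metadata description
    year: int             # 2013 or 2014
    participant_id: str   # unique per institution per year
    text_number: int      # 1, 2 or 3 (T1, T2, T3)


# Assumed layout: <institution>_<year>_<participant id>_T<text number>
_PATTERN = re.compile(r"^(HG|UG)_(2013|2014)_([^_]+)_T([123])$")


def parse_name(stem: str) -> BeldekoName:
    """Split a file name (without extension) into its components."""
    match = _PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {stem!r}")
    institution, year, participant_id, text_number = match.groups()
    return BeldekoName(institution, int(year), participant_id, int(text_number))


print(parse_name("UG_2014_012_T2"))
```

If the real file names use a different separator or ID format, only the regular expression needs to change; the order of the components follows the description.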
The summaries contain between 37 and 330 words each, with a mean of 230 words (the targeted word count was between 220 and 250 words). Outliers regarding text length were unfinished texts produced by students who struggled with the time restriction of 60 minutes. The texts were written in class, on computers. Students were allowed to use online aids such as dictionaries. Each task consisted of summarizing two texts (fragments of newspaper articles, interviews or websites) about a topic related to language variation in German (Kiezdeutsch, Mundartdebatte in der Schweiz, Viadrinisch, Varianten-Wörterbuch des Deutschen; see also Word files provided in metadata). More specifically, the topics were distributed as follows:
Kiezdeutsch: HG_2013_T1, UG_2013_T1, UG_2014_T1
Mundartdebatte in der Schweiz: HG_2013_T2, UG_2013_T2, UG_2014_T2
Viadrinisch: HG_2013_T3,
Varianten-Wörterbuch des Deutschen: UG_2014_T3
The new version of the corpus (Beldeko 1.1.0) contains the manual annotations of the texts with token id, sentence id, source text form, target form, POS (STTS) and simple UPOS part-of-speech tag.
2023-03-01
corpus
http://hdl.handle.net/20.500.12124/68
deu
http://hdl.handle.net/20.500.12124/15
Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/
PUB
text/plain; charset=utf-8
application/zip
application/zip
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/pdf
application/pdf
application/pdf
application/pdf
downloadable_files_count: 7
Ghent University
University of Antwerp
oai:clarin.eurac.edu:20.500.12124/74
2023-12-21T07:20:32Z
hdl_20.500.12124_36
hdl_20.500.12124_37
VinKo (Varieties in Contact) Corpus v1.2
Rabanus, Stefan
Kruijt, Anne
Tagliani, Marta
Tomaselli, Alessandra
Padovan, Andrea
Alber, Birgit
Cordin, Patrizia
Zamparelli, Roberto
Vogt, Barbara Maria
multilingualism
crowdsourcing
German dialects
Italian dialects
Ladin
Cimbrian
Mòcheno
Saurano
language contact
minority languages
Sappadino
VinKo is a spoken corpus of crowd-sourced audio recordings, designed to provide linguistic information about the minority languages and dialects spoken in the area between Innsbruck and the Po Valley. The corpus contains audio recordings of local languages and varieties spoken in the regions Trentino-Alto Adige/Südtirol, Veneto, and Friuli-Venezia Giulia, with a particular focus on language contact between Germanic varieties (Cimbrian, Mòcheno, Tyrolean, Saurano, and Sappadino) and Romance varieties (Ladin, Trentino, and Veneto dialects). Data collection took place from June 2017 to May 2023.
2023
corpus
http://hdl.handle.net/20.500.12124/74
ita
deu
cim
lld
Trentino
Tyrolean
Mòcheno
Saurano
Veneto
Sappadino
info:eu-repo/grantAgreement/EC/FP7/613465
https://hdl.handle.net/11562/1095869
http://hdl.handle.net/20.500.12124/46
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
https://creativecommons.org/licenses/by-nc-nd/4.0/
PUB
text/plain; charset=utf-8
text/html
text/plain
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
application/zip
downloadable_files_count: 39
University of Verona
https://alpilink.it/vinko/
oai:clarin.eurac.edu:20.500.12124/75
2023-12-20T12:19:27Z
hdl_20.500.12124_36
hdl_20.500.12124_37
e-LIS: Electronic Bilingual Dictionary Italian Sign Language (LIS) – Italian v1.0
Vettori, Chiara
Zanoni, Claudio
Felice, Mauro
Stanizzi, Isabella
Baj, Claudio
Battagin, Alessandra
Consolati, Marco
Valente, Maddalena
sign language
italian
italian sign language
dictionary
electronic dictionary
sign language dictionary
bilingual dictionary
visual language
deaf studies
sign language parameters
stokoe
stokoe-based notations
sign language search engine
Legacy files of the former Electronic Bilingual Dictionary Italian Sign Language (LIS) - Italian, the first prototype of an online Italian Sign Language reference dictionary (2004-2008). The data include 2677 videos with definitions and examples for 294 Italian lemmas.
2006
corpus
http://hdl.handle.net/20.500.12124/75
ita
ise
https://gitlab.inf.unibz.it/commul/elis/data/-/tags/v1.0
https://doi.org/10.1007/978-3-540-77010-7_41
https://hdl.handle.net/10863/8888
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
PUB
text/html
application/zip
text/plain; charset=utf-8
downloadable_files_count: 2
Institute for Applied Linguistics, Eurac Research
https://web.archive.org/web/20220519084454/http://elis.eurac.edu/index_it.html