2024-03-28T12:15:38Z
http://clarin.eurac.edu/repository/oai/request
oai:clarin.eurac.edu:20.500.12124/5
2023-03-17T15:51:45Z
hdl_20.500.12124_1
hdl_20.500.12124_4
MERLIN Written Learner Corpus for Czech, German, Italian 1.0
2018-09-03T13:40:44Z
http://hdl.handle.net/20.500.12124/5
Wisniewski, Katrin
Abel, Andrea
Vodičková, Kateřina
Plassmann, Sybille
Meurers, Detmar
Woldt, Claudia
Schöne, Karin
Blaschitz, Verena
Lyding, Verena
Nicolas, Lionel
Vettori, Chiara
Pečený, Pavel
Hana, Jirka
Čurdová, Veronika
Štindlová, Barbora
Klein, Gudrun
Lauppe, Louise
Boyd, Adriane
Bykh, Serhiy
Krivanek, Julia
2018-09-03T13:40:44Z
The MERLIN corpus is a written learner corpus for Czech, German, and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that provide researchers with concrete examples of learner performance and progress across multiple proficiency levels.
http://hdl.handle.net/20.500.12124/5
http://hdl.handle.net/20.500.12124/6
Institute for Applied Linguistics, Eurac Research
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
CEFR
language learning
learner corpus
corpus
Text
oai:clarin.eurac.edu:20.500.12124/10
2023-03-17T17:04:03Z
hdl_20.500.12124_1
hdl_20.500.12124_4
KoKo German L1 Learner Corpus v1
2019-09-19T13:04:44Z
http://hdl.handle.net/20.500.12124/10
Abel, Andrea
Glaznieks, Aivars
Culy, Chris
2019-09-19T13:04:44Z
The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It has been created with the aim to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms.
The corpus building process was guided by two goals:
1. describe writing skills at the transition from secondary school to university,
2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.).
The pupils were selected from three different German-speaking areas:
- North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany).
Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%).
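The stratified random sampling of classes described above can be illustrated with a minimal sketch. The class list, the stratum labels, the number of classes drawn per stratum and the function name `sample_classes` are all hypothetical; this record does not give the actual sampling frame or sample sizes used for KoKo.

```python
import random
from collections import defaultdict

# Hypothetical sampling frame: (class_id, city_size, school_type).
# The real frame used for KoKo is not part of this record.
CLASSES = [
    ("c01", "small", "general"), ("c02", "small", "general"),
    ("c03", "small", "vocational"), ("c04", "small", "vocational"),
    ("c05", "medium", "general"), ("c06", "medium", "general"),
    ("c07", "medium", "vocational"), ("c08", "medium", "vocational"),
    ("c09", "big", "general"), ("c10", "big", "general"),
    ("c11", "big", "vocational"), ("c12", "big", "vocational"),
]


def sample_classes(classes, per_stratum, seed=0):
    """Draw `per_stratum` classes at random from each (city size, school type) stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for class_id, city_size, school_type in classes:
        strata[(city_size, school_type)].append(class_id)
    sample = {}
    for stratum, members in sorted(strata.items()):
        k = min(per_stratum, len(members))  # never ask for more than the stratum holds
        sample[stratum] = sorted(rng.sample(members, k))
    return sample


selected = sample_classes(CLASSES, per_stratum=1)
for stratum, class_ids in selected.items():
    print(stratum, class_ids)
```

Sampling within each combination of city size and school type keeps every combination represented, which a simple random draw over all classes would not guarantee.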
Person-related metadata provides information about:
- writer's L1
- writer's gender
- type of school the essay comes from
- location of the school the essay comes from
- grade attended at data collection
http://hdl.handle.net/20.500.12124/10
http://hdl.handle.net/20.500.12124/11
Institute for Applied Linguistics, Eurac Research
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
learner corpus
German varieties
students in secondary school
argumentative essays
corpus
Text
oai:clarin.eurac.edu:20.500.12124/6
2023-03-17T15:51:45Z
hdl_20.500.12124_1
hdl_20.500.12124_39
hdl_20.500.12124_4
hdl_20.500.12124_43
MERLIN Written Learner Corpus for Czech, German, Italian 1.1
2018-09-03T16:28:45Z
http://hdl.handle.net/20.500.12124/6
Wisniewski, Katrin
Abel, Andrea
Vodičková, Kateřina
Plassmann, Sybille
Meurers, Detmar
Woldt, Claudia
Schöne, Karin
Blaschitz, Verena
Lyding, Verena
Nicolas, Lionel
Vettori, Chiara
Pečený, Pavel
Hana, Jirka
Čurdová, Veronika
Štindlová, Barbora
Klein, Gudrun
Lauppe, Louise
Boyd, Adriane
Bykh, Serhiy
Krivanek, Julia
2018-09-03T16:28:45Z
The MERLIN corpus is a written learner corpus for Czech, German, and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that provide researchers with concrete examples of learner performance and progress across multiple proficiency levels.
http://hdl.handle.net/20.500.12124/6
Institute for Applied Linguistics, Eurac Research
http://hdl.handle.net/20.500.12124/5
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
CEFR
language learning
learner corpus
corpus
Text
oai:clarin.eurac.edu:20.500.12124/34
2023-02-03T12:41:47Z
hdl_20.500.12124_1
hdl_20.500.12124_4
WIP: LEONIDE PoS training
2021-08-05T08:42:28Z
http://hdl.handle.net/20.500.12124/34
Schmalz, Veronica
Frey, Jennifer-Carmen
Stemle, Egon W.
2021-08-05T08:42:28Z
Unfinished Draft
http://hdl.handle.net/20.500.12124/34
Learner Language
CMC
corpus
Text
oai:clarin.eurac.edu:20.500.12124/25
2023-03-17T16:40:50Z
hdl_20.500.12124_1
hdl_20.500.12124_39
hdl_20.500.12124_4
hdl_20.500.12124_42
hdl_20.500.12124_43
LEONIDE - Longitudinal Learner Corpus in Italiano, Deutsch and English 1.1
2020-07-06T10:24:27Z
http://hdl.handle.net/20.500.12124/25
Glaznieks, Aivars
Frey, Jennifer-Carmen
Stopfner, Maria
Zanasi, Lorenzo
Nicolas, Lionel
2020-07-06T10:24:27Z
LEONIDE is a longitudinal corpus of student essays documenting the language competences and writing development of lower secondary school students in three different languages.
The corpus contains 2,512 texts from 163 pupils, who participated in the project “One school, many languages” conducted in eight schools in the officially multilingual Italian province of South Tyrol / Alto Adige (Zanasi & Stopfner, 2018). The aim of the project was to document the development of the pupils' plurilingual linguistic and communicative skills by collecting oral and written language samples in Italian, German and English, in order to obtain a global view of their individual linguistic repertoire.
LEONIDE contains all the texts written by the participating students during the course of the project; the overall size of the corpus amounts to ca. 240,000 tokens. The texts were collected over the span of 3 consecutive years (2015-2018) in public middle schools (i.e. lower secondary school, grade 6 to grade 8). The pupils were 11 years old at the beginning of the data collection and 13 years old at the end. In each grade, two written texts were collected that differ with respect to genre: the first text was elicited using a picture story re-telling task; the second text is an opinion text on different aspects related to the pupils’ life and public discourse. For each genre and each grade, the corpus provides texts in the three languages German, Italian and English. In order to reflect the school system of the Province of South Tyrol / Alto Adige, about half of the texts were collected in four schools in which German is the main language of teaching and Italian is taught as L2. The other half of the texts were collected in four schools in which Italian is the main language of teaching and German is taught as L2. In all schools, English is taught as L3 (i.e. as a foreign language at school). Subdivided by language, the corpus contains 844 Italian, 833 German and 835 English texts.
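The per-language counts can be checked against the stated corpus size with a few lines of arithmetic. The sketch below uses only figures from this description; the upper bound at the end rests on an assumption that is not stated in the record, namely that every pupil would have written every text in every grade, genre and language.

```python
# Text counts per language, as stated in the description.
texts_per_language = {"Italian": 844, "German": 833, "English": 835}

total_texts = sum(texts_per_language.values())
print(total_texts)  # 2512, matching the stated corpus size

# Collection design: 3 grades x 2 genres x 3 languages per pupil.
pupils, grades, genres, languages = 163, 3, 2, 3

# Assumption (not in the record): every pupil takes part in every task.
max_texts = pupils * grades * genres * languages
print(max_texts)  # 2934

# Share of that hypothetical maximum actually present in the corpus.
coverage = total_texts / max_texts
print(f"{coverage:.1%}")
```

The gap between the two figures would correspond to texts that were not produced or not included, which the record does not break down further.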
Manual annotation:
The corpus is fully anonymised and annotated with target hypotheses correcting orthography errors in the text as well as annotations on structural elements (paragraphs, line breaks, bullet points, symbols or emoticons etc.), foreign word insertions and transcript surface features (e.g. deletions, corrections or insertions of the student, unreadable or ambiguous items).
Automatic annotation:
Automatic linguistic annotation included sentence splitting, tokenisation, lemmatisation and part-of-speech-tagging.
Text metadata:
The corpus provides a series of relevant person-related metadata (e.g. age, gender, first language(s), school and possible special needs of the students) as well as task-related metadata (e.g. task year, text genre, etc.).
Usage:
As the corpus documents the development of plurilingual competences of individual learners over a period of three years, it allows both quantitative research on the characteristics of young learners’ language over a relatively long period and investigations of the development of individuals, taking into account a wide range of person-related metadata. In addition, it allows contrastive analyses of the young learners’ progress in their L1, L2 and L3.
Availability:
The corpus will be available for corpus queries via an ANNIS search interface and as download for academic purposes (ACA-BY-NC-NORED 1.0) on the Eurac Research Clarin Centre by the end of 2020.
References:
Zanasi, L. & Stopfner, M. (2018). Rilevare, osservare, consultare. Metodi e strumenti per l’analisi del plurilinguismo nella scuola secondaria di primo grado. In C. M. Coonan, A. Bier & E. Ballarin (Eds.), La didattica delle lingue nel nuovo millennio. Le sfide dell’internazionalizzazione (pp. 135-148). Edizioni Ca’Foscari. http://doi.org/10.30687/978-88-6969-227-7/009
Glaznieks, A., Frey, J.-C., Stopfner, M., Zanasi, L. & Nicolas, L. (accepted): LEONIDE: A longitudinal trilingual corpus of young learners of Italian, German and English. In: International Journal of Learner Corpus Linguistics.
1.1
http://hdl.handle.net/20.500.12124/25
Institute for Applied Linguistics, Eurac Research
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
multilingualism
evaluation
language competences
learner corpus
L1
L2
student essays
picture story
opinion texts
argumentative essay
corpus
Text
oai:clarin.eurac.edu:20.500.12124/11
2023-03-17T16:07:21Z
hdl_20.500.12124_1
hdl_20.500.12124_4
KoKo German L1 Learner Corpus v2
2019-09-19T13:21:30Z
http://hdl.handle.net/20.500.12124/11
Abel, Andrea
Glaznieks, Aivars
Culy, Chris
2019-09-19T13:21:30Z
The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It has been created with the aim to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms.
The corpus consists of 1503 argumentative essays which contain manually performed transcription annotations and linguistic error annotations. Error annotation relates to the orthographic level only. Transcription annotations reflect surface features of the text, such as the graphical arrangement and self-corrections.
The corpus building process was guided by two goals:
1. describe writing skills at the transition from secondary school to university,
2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.).
The pupils were selected from three different German-speaking areas:
- North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany).
Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%).
Person-related metadata provides information about:
- writer's L1
- writer's gender
- type of school the essay comes from
- location of the school the essay comes from
- grade attended at data collection
In addition, the corpus is automatically annotated, including tokenisation, sentence splitting, POS-tagging and lemmatization.
http://hdl.handle.net/20.500.12124/11
http://hdl.handle.net/20.500.12124/12
Institute for Applied Linguistics, Eurac Research
http://hdl.handle.net/20.500.12124/10
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
learner corpus
German varieties
students in secondary school
argumentative essays
corpus
Text
oai:clarin.eurac.edu:20.500.12124/12
2023-03-17T16:39:47Z
hdl_20.500.12124_1
hdl_20.500.12124_39
hdl_20.500.12124_4
hdl_20.500.12124_43
KoKo German L1 Learner Corpus v3
2019-09-19T14:27:45Z
http://hdl.handle.net/20.500.12124/12
Abel, Andrea
Glaznieks, Aivars
Culy, Chris
2019-09-19T14:27:45Z
The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It has been created with the aim to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms.
The corpus consists of 1503 argumentative essays which contain manually performed transcription annotations and linguistic error annotations. Transcription annotations reflect surface features of the text, such as the graphical arrangement and self-corrections. Error annotations relate to the orthographic level (including punctuation errors), and a selection of the texts (n=597) also contain error annotations on the grammatical level.
The corpus building process was guided by two goals:
1. describe writing skills at the transition from secondary school to university,
2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.).
The pupils were selected from three different German-speaking areas:
- North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany).
Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%).
Person-related metadata provides information about:
- writer's L1
- writer's gender
- type of school the essay comes from
- location of the school the essay comes from
- grade attended at data collection
In addition, the corpus is automatically annotated, including tokenisation, sentence splitting, POS-tagging and lemmatization.
http://hdl.handle.net/20.500.12124/12
Institute for Applied Linguistics, Eurac Research
http://hdl.handle.net/20.500.12124/11
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
learner corpus
German varieties
students in secondary school
argumentative essays
corpus
Text
oai:clarin.eurac.edu:20.500.12124/18
2023-02-03T12:41:47Z
hdl_20.500.12124_1
hdl_20.500.12124_4
LEKO Corpus v1
2020-05-29T08:17:19Z
http://hdl.handle.net/20.500.12124/18
Abel, Andrea
2020-05-29T08:17:19Z
The LEKO corpus is a written learner corpus and constitutes a subcorpus of the KOLIPSI-I Corpus, using 290 texts written in Italian as L2 by pupils with German as L1.
The aim of the LEKO project was to describe the use of phrasemes in these texts. In addition to the annotations available in the KOLIPSI-I Corpus, the LEKO Corpus contains manual annotations including the phraseme category, errors, morpho-syntactic features and error explanations.
http://hdl.handle.net/20.500.12124/18
Institute for Applied Linguistics, Eurac Research
learner corpus
corpus
Text
oai:clarin.eurac.edu:20.500.12124/26
2023-03-17T15:49:40Z
hdl_20.500.12124_1
hdl_20.500.12124_39
hdl_20.500.12124_4
hdl_20.500.12124_43
Kolipsi-1 Corpus v1.0
2021-05-05T15:52:10Z
http://hdl.handle.net/20.500.12124/26
Glaznieks, Aivars
Frey, Jennifer-Carmen
Abel, Andrea
Vettori, Chiara
Nicolas, Lionel
2021-05-05T15:52:10Z
The Kolipsi-1 L2 is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”. In addition, data from L1 pupils were collected exclusively for the creation of a native speaker reference corpus.
The data collection took place in autumn 2007 and is based on two standardized tests for written productions. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket based on a picture story (narrative text genre) and (2) writing a letter to a friend discussing holiday plans (argumentative text genre). For both tasks a time limit of 30 minutes was fixed and no additional reference material was allowed.
CEFR levels have been assigned to all L2 learner texts, providing a holistic score as well as evaluations of coherence, lexis, grammar and sociolinguistic appropriateness.
Person-related metadata provides information about:
- the writer's language background, including L1(s), the L1(s) of mother and father, and a self-declared language group affiliation
- the writer's age, gender and socio-economic status
- the writer's district of residence and whether they live in an urban or rural environment
- the language, location and type of school the writer attended
- whether the writer passed the local bilinguality exam or not
- an anonymous identifier for the writer's school class and L2 teacher to account for class effects
All texts have been transcribed manually adding transcription annotations that reflect surface features of the text, such as the graphical arrangement, and include error annotation on the orthographic level.
In addition to that, all texts were automatically annotated, adding tokenisation, sentence splitting, POS-tagging and lemmatization using an orthographically corrected target version of the corpus.
Kolipsi-1 L2 belongs to the Kolipsi Corpus Family, a series of related learner corpora collected in South Tyrolean upper secondary schools. The corpora of the Kolipsi Corpus Family contain Italian and German learner texts that were collected in the course of the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). The aim of both corpus studies was to analyse the second language competences of South-Tyrolean pupils from upper secondary schools (aged 16 to 18), and to contextualize the results by commenting on crucial sociolinguistic and psychosocial aspects that influence these competences. The results of the follow-up study should be compared to the results of the original KOLIPSI project.
http://hdl.handle.net/20.500.12124/26
http://hdl.handle.net/20.500.12124/64
Institute for Applied Linguistics, Eurac Research
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
L2
Learner corpora
South Tyrol
argumentative essay
students
high school
upper secondary school
picture story
opinion text
corpus
Text
oai:clarin.eurac.edu:20.500.12124/30
2023-03-17T16:02:03Z
hdl_20.500.12124_1
hdl_20.500.12124_39
hdl_20.500.12124_4
hdl_20.500.12124_43
Kolipsi-2 Corpus v1.0
2021-05-05T15:33:00Z
http://hdl.handle.net/20.500.12124/30
Glaznieks, Aivars
Frey, Jennifer-Carmen
Nicolas, Lionel
Abel, Andrea
Vettori, Chiara
2021-05-05T15:33:00Z
The Kolipsi-2 Corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI II project, a replication study of the KOLIPSI project on “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation” that was conducted 7 years after the original study.
The data collection for this second edition took place in spring 2014 and is based on two standardized tests for written productions, that were aligned with the original tasks for the KOLIPSI study. However, while the first task remained the same for both editions, the second task was slightly adapted. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket based on a picture story (narrative text genre) and (2) writing an e-mail about negative aspects of social-media chats prompted by a letter to the editor in a youth magazine (argumentative text genre). For both tasks a time limit of 25 minutes was fixed and no additional reference material was allowed.
CEFR levels have been assigned to all L2 learner texts, providing a holistic score as well as evaluations of coherence, sociolinguistic appropriateness, lexical accuracy, lexical diversity, grammar and orthography.
Person-related metadata provides information about:
- the writer's language background, including L1(s), the L1(s) of mother and father, and a self-declared language group affiliation as well as the predominant language spoken in the area where the writer resides
- the writer's results from an additional language test in the L2 (DIALANG test)
- the writer's competence in the local German dialect (for students with L1 Italian only)
- the writer's age, gender and socio-economic status
- whether the writer lives in an urban or rural environment
- the language, location and type of school the writer attended
- an anonymous identifier for the writer's school class to account for class effects
All texts have been transcribed manually adding transcription annotations that reflect surface features of the text, such as the graphical arrangement, and include error annotation on the orthographic level.
In addition to that, all texts were automatically annotated, adding tokenisation, sentence splitting, POS-tagging and lemmatization using an orthographically corrected target version of the corpus.
Kolipsi-2 belongs to the Kolipsi Corpus Family, a series of related learner corpora collected in South Tyrolean upper secondary schools. The corpora of the Kolipsi Corpus Family contain Italian and German learner texts that were collected in the course of the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). The aim of both corpus studies was to analyse the second language competences of South-Tyrolean pupils from upper secondary schools (aged 16 to 18), and to contextualize the results by commenting on crucial sociolinguistic and psychosocial aspects that influence these competences. The results of the follow-up study should be compared to the results of the original KOLIPSI project.
http://hdl.handle.net/20.500.12124/30
http://hdl.handle.net/20.500.12124/66
Institute for Applied Linguistics, Eurac Research
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
L2 corpora
learner corpus
student essay
argumentative essay
picture story
South Tyrol
corpus
Text
oai:clarin.eurac.edu:20.500.12124/33
2023-11-23T21:34:21Z
hdl_20.500.12124_1
hdl_20.500.12124_39
hdl_20.500.12124_4
hdl_20.500.12124_43
LEKO v1.0
2023-02-22T09:50:42Z
http://hdl.handle.net/20.500.12124/33
Abel, Andrea
Zanasi, Lorenzo
Nicolas, Lionel
Konecny, Christine
Autelli, Erica
2023-02-22T09:50:42Z
The LEKO corpora LEKO_Kolipsi and LEKO_Merlin provide lexical annotations for phraseological elements in Italian L2 writing on the basis of a subset of the texts of the Kolipsi-1 corpus and the Merlin corpus respectively. The annotations were jointly created by the University of Innsbruck (Austria) and Eurac Research Bolzano (Italy) within the project LEKO, whose aim was to describe the use of phrasemes in these texts. There are manual annotations for phraseme category, lexical errors, morpho-syntactic features and error explanations.
LEKO_Kolipsi contains about 55 000 tokens in 282 texts from 141 pupils of the final year of upper secondary school, representing two different text types (email and letter, narrative and argumentative genre) as described in the Kolipsi-1 documentation.
LEKO_Merlin contains about 9 000 tokens in 50 texts from 50 examinees, who took part in an official language test (TELC) for Italian.
The documents have been transcribed according to the Kolipsi-1 and Merlin Transcription guidelines. Annotation guidelines for the lexical annotations can be found here.
Note: The LEKO corpora do not contain manual annotations for non-lexical errors, foreign word insertions, target language transcriptions, ambiguous writings or other annotations available in the base corpora Kolipsi-1 and Merlin. In order to retrieve any of those annotations and/or full target versions of the student writings please consult the base corpora directly.
http://hdl.handle.net/20.500.12124/33
Institute for Applied Linguistics, Eurac Research
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
Phraseology
Phrasemes
Lexical combinations
learner language
student writing
non-standard language
corpus
Text
oai:clarin.eurac.edu:20.500.12124/64
2023-03-17T15:49:40Z
hdl_20.500.12124_1
hdl_20.500.12124_39
hdl_20.500.12124_4
hdl_20.500.12124_43
hdl_20.500.12124_31
Kolipsi-1 Corpus v1.1
2023-02-15T09:07:59Z
http://hdl.handle.net/20.500.12124/64
Glaznieks, Aivars
Frey, Jennifer-Carmen
Abel, Andrea
Vettori, Chiara
Nicolas, Lionel
2023-02-15T09:07:59Z
The Kolipsi-1 L2 is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”. In addition, data from L1 pupils were collected exclusively for the creation of a native speaker reference corpus.
The data collection took place in autumn 2007 and is based on two standardized tests for written productions. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket based on a picture story (narrative text genre) and (2) writing a letter to a friend discussing holiday plans (argumentative text genre). For both tasks a time limit of 30 minutes was fixed and no additional reference material was allowed.
CEFR levels have been assigned to all L2 learner texts, providing a holistic score as well as evaluations of coherence, lexis, grammar and sociolinguistic appropriateness.
Person-related metadata provides information about:
- the writer's language background, including L1(s), the L1(s) of mother and father, and a self-declared language group affiliation
- the writer's age, gender and socio-economic status
- the writer's district of residence and whether they live in an urban or rural environment
- the language, location and type of school the writer attended
- whether the writer passed the local bilinguality exam or not
- an anonymous identifier for the writer's school class and L2 teacher to account for class effects
All texts have been transcribed manually adding transcription annotations that reflect surface features of the text, such as the graphical arrangement, and include error annotation on the orthographic level.
In addition to that, all texts were automatically annotated, adding tokenisation, sentence splitting, POS-tagging and lemmatization using an orthographically corrected target version of the corpus.
Kolipsi-1 L2 belongs to the Kolipsi Corpus Family, a series of related learner corpora collected in South Tyrolean upper secondary schools. The corpora of the Kolipsi Corpus Family contain Italian and German learner texts that were collected in the course of the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). The aim of both corpus studies was to analyse the second language competences of South-Tyrolean pupils from upper secondary schools (aged 16 to 18), and to contextualize the results by commenting on crucial sociolinguistic and psychosocial aspects that influence these competences. The results of the follow-up study should be compared to the results of the original KOLIPSI project.
http://hdl.handle.net/20.500.12124/64
Institute for Applied Linguistics, Eurac Research
http://hdl.handle.net/20.500.12124/26
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
L2
Learner corpora
South Tyrol
argumentative essay
students
high school
upper secondary school
picture story
opinion text
corpus
Text
oai:clarin.eurac.edu:20.500.12124/66
2023-03-17T16:01:25Z
hdl_20.500.12124_1
hdl_20.500.12124_39
hdl_20.500.12124_4
hdl_20.500.12124_43
hdl_20.500.12124_31
Kolipsi-2 Corpus v1.1
2023-02-17T07:53:20Z
http://hdl.handle.net/20.500.12124/66
Glaznieks, Aivars
Frey, Jennifer-Carmen
Nicolas, Lionel
Abel, Andrea
Vettori, Chiara
2023-02-17T07:53:20Z
The Kolipsi-2 Corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI II project, a replication study of the KOLIPSI project on “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation” that was conducted 7 years after the original study.
The data collection for this second edition took place in spring 2014 and is based on two standardized tests for written productions, that were aligned with the original tasks for the KOLIPSI study. However, while the first task remained the same for both editions, the second task was slightly adapted. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket based on a picture story (narrative text genre) and (2) writing an e-mail about negative aspects of social-media chats prompted by a letter to the editor in a youth magazine (argumentative text genre). For both tasks a time limit of 25 minutes was fixed and no additional reference material was allowed.
CEFR levels have been assigned to all L2 learner texts, providing a holistic score as well as evaluations of coherence, sociolinguistic appropriateness, lexical accuracy, lexical diversity, grammar and orthography.
Person-related metadata provides information about:
- the writer's language background, including L1(s), the L1(s) of mother and father, and a self-declared language group affiliation as well as the predominant language spoken in the area where the writer resides
- the writer's results from an additional language test in the L2 (DIALANG test)
- the writer's competence in the local German dialect (for students with L1 Italian only)
- the writer's age, gender and socio-economic status
- whether the writer lives in an urban or rural environment
- the language, location and type of school the writer attended
- an anonymous identifier for the writer's school class to account for class effects
All texts have been transcribed manually adding transcription annotations that reflect surface features of the text, such as the graphical arrangement, and include error annotation on the orthographic level.
In addition to that, all texts were automatically annotated, adding tokenisation, sentence splitting, POS-tagging and lemmatization using an orthographically corrected target version of the corpus.
Kolipsi-2 belongs to the Kolipsi Corpus Family, a series of related learner corpora collected in South Tyrolean upper secondary schools. The corpora of the Kolipsi Corpus Family contain Italian and German learner texts that were collected in the course of the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). The aim of both corpus studies was to analyse the second language competences of South-Tyrolean pupils from upper secondary schools (aged 16 to 18), and to contextualize the results by commenting on crucial sociolinguistic and psychosocial aspects that influence these competences. The results of the follow-up study should be compared to the results of the original KOLIPSI project.
http://hdl.handle.net/20.500.12124/66
Institute for Applied Linguistics, Eurac Research
http://hdl.handle.net/20.500.12124/30
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
L2 corpora
learner corpus
student essay
argumentative essay
picture story
South Tyrol
corpus
Text