KoKo Corpus v1
Description
The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It was created to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms.
The corpus-building process was guided by two goals:
describe writing skills at the transition from secondary school to university,
determine external factors that may influence the distribution of writing skills, such as region, sociolinguistic factors (gender, age), socio-economic factors, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.).
The pupils were selected from three different German-speaking areas:
North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany).
Classes were sampled randomly, stratified by the size of the city in which the school was located (small vs. medium vs. large) and by the type of school (general education vs. education specific to a particular profession). Since data were collected during regular courses, the typical composition of secondary-school classes in the three regions is reflected across the corpus. Most of the participants are native speakers of German (n=1319, 82.7%).
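The stratified sampling described above can be illustrated with a minimal sketch. It is not the procedure actually used for KoKo: the sampling frame, the field names (`city_size`, `school_type`), the stratum labels, and the fixed number of classes drawn per stratum are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(classes, per_stratum, seed=0):
    """Draw up to `per_stratum` classes at random from each
    (city_size, school_type) stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the sampling frame by stratum.
    strata = defaultdict(list)
    for c in classes:
        strata[(c["city_size"], c["school_type"])].append(c)

    # Sample without replacement inside every stratum.
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical sampling frame: 5 classes in each of the 3 x 2 strata.
classes = [
    {"id": f"{size}-{kind}-{i}", "city_size": size, "school_type": kind}
    for size in ("small", "medium", "large")
    for kind in ("general", "vocational")
    for i in range(5)
]

selected = stratified_sample(classes, per_stratum=2)
```

Drawing inside each stratum, rather than from the pooled list, guarantees that every combination of city size and school type is present in the sample.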
Person-related metadata provides information about:
writer’s L1
writer’s gender
type of school the essay comes from
location of the school the essay comes from
grade attended at the time of data collection
References:
Glaznieks, Aivars, Lionel Nicolas, Egon W. Stemle, Andrea Abel & Verena Lyding (2014): Establishing a Standardised Procedure for Building Learner Corpora. In: Apples - Journal of Applied Language Studies 8 (3), 5-20, Special Issue on Learner Language, Learner Corpora: From corpus compilation to data analysis, Jarmo Harri Jantunen, Sisko Brunni & Marianne Spoelmann (eds). http://apples.jyu.fi/issue/view/15
Files
The KoKo Corpus is available from
Eurac Research CLARIN Centre (ERCC)
On-premise GitLab installation
and also ready-to-search in ANNIS from
Eurac Research ANNIS installation.
For further information visit http://www.korpus-suedtirol.it/koko/ or write to linguistics@eurac.edu.
The following file bundles are available:
xmlmind-v1.zip contains the corpus in the KoKo XML format, transcribed with XMLmind. [ERCC download] [GitLab download] [Source code repository]
docs-v1.zip contains documentation. [ERCC download] [GitLab download] [Source code repository]
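After downloading a bundle, a first step is often to check what it contains. The following is a minimal sketch using only the Python standard library. It assumes that the transcriptions in xmlmind-v1.zip are stored as files ending in `.xml`. The helper name `list_xml_members` is hypothetical, and nothing here depends on the internal structure of the KoKo XML format.

```python
import zipfile


def list_xml_members(archive):
    """Return the sorted names of all .xml members in a zip archive.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return sorted(
            name for name in zf.namelist()
            if name.lower().endswith(".xml")
        )


# Example call, assuming the bundle was saved to the working directory:
# names = list_xml_members("xmlmind-v1.zip")
# print(len(names), "XML files found")
```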
LICENSE
The KoKo Corpus is available under the CLARIN ACADEMIC END-USER LICENCE ACA-BY-NC-NORED:
Text file
PDF file
Any code or scripts in the repositories are licensed under their respective LICENSE files.