Kolipsi-1 v1.0
Description
The Kolipsi-1 corpus is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”. In addition, data from L1 pupils were collected exclusively for the creation of a native speaker reference corpus.
The data collection took place in autumn 2007 and is based on two standardized tests of written production. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket based on a picture story (narrative text genre) and (2) writing a letter to a friend discussing holiday plans (argumentative text genre). For both tasks, a time limit of 30 minutes was set and no additional reference material was allowed.
The Kolipsi-1 corpus contains 2063 Italian and 700 German texts from 1037 and 358 L2 writers respectively. The corpus has an overall size of about 0.5 million tokens. CEFR levels have been assigned to all texts, providing a holistic score as well as evaluations of coherence, lexis, grammar and sociolinguistic appropriateness.
Person-related metadata provides information about:
the writer’s language background, including L1(s), the L1(s) of mother and father, and a self-declared language group affiliation
the writer’s age, gender and socio-economic status
the writer’s district of residence and whether they live in an urban or rural environment
the language, location and type of school the writer attended
whether the writer passed the local bilingualism exam or not
an anonymous identifier for the writer’s school class and L2 teacher to account for class effects
Additionally, the download package contains a reference corpus of L1 texts for comparison. The L1 text corpus contains 82 Italian texts and 365 German texts from 44 and 184 writers respectively (in total about 90 000 tokens) and provides only a restricted set of metadata (unique identifier for the writer, school and language background).
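The text and writer counts quoted above can be cross-checked with a few lines of Python. The sketch below uses only the figures stated in this description. Since two writing tasks were set, a ratio close to two texts per writer is what one would expect, on the assumption (not stated explicitly here) that most writers completed both tasks.

```python
# (texts, writers) per sub-corpus, copied from the figures in this description.
SUBCORPORA = {
    ("L2", "Italian"): (2063, 1037),
    ("L2", "German"): (700, 358),
    ("L1", "Italian"): (82, 44),
    ("L1", "German"): (365, 184),
}


def texts_per_writer(texts, writers):
    """Average number of texts contributed by each writer."""
    return texts / writers


# Ratio per sub-corpus, rounded to two decimals.
ratios = {
    key: round(texts_per_writer(*counts), 2)
    for key, counts in SUBCORPORA.items()
}

# Totals for the learner (L2) corpus and the reference (L1) corpus.
l2_texts = sum(
    texts for (group, _lang), (texts, _writers) in SUBCORPORA.items() if group == "L2"
)
l1_texts = sum(
    texts for (group, _lang), (texts, _writers) in SUBCORPORA.items() if group == "L1"
)

for (group, lang), ratio in ratios.items():
    print(f"{group} {lang}: {ratio} texts per writer")
print(f"L2 texts in total: {l2_texts}, L1 texts in total: {l1_texts}")
```

All four ratios come out slightly below two, which suggests that a small number of writers contributed only one text.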
Like all sub-corpora of the Kolipsi Corpus Family, Kolipsi-1 (L1 and L2) contains manually created transcription annotations. These annotations reflect surface features of the text, such as its graphical arrangement, and include error annotation on the orthographic level. In addition, all texts were annotated automatically with tokenisation, sentence splitting, POS tagging and lemmatisation, using an orthographically corrected target version of the corpus.
References:
Abel, A., Vettori, C., & Wisniewski, K. (2012). KOLIPSI: Gli studenti altoatesini e la seconda lingua; Indagine linguistica e psicosociale = KOLIPSI: Die Südtiroler SchülerInnen und die Zweitsprache; Eine linguistische und sozialpsychologische Untersuchung. Eurac Research.
Glaznieks, A., Frey, J.-C., Nicolas, L., Abel, A., & Vettori, C. (in preparation). The Kolipsi Corpus Family. A collection of Italian and German L2 learner texts from secondary school pupils.
Files
Kolipsi-1 is available from
Eurac Research CLARIN Centre (ERCC)
On-premise GitLab installation
and also ready-to-search in ANNIS from
Eurac Research ANNIS installation.
For further information visit https://www.porta.eurac.edu/?page_id=492 or write to porta@eurac.edu.
The following file bundles are available:
docs-v1.0.zip contains documentation on the corpus such as transcription guidelines, annotation guidelines and task instructions or proficiency level descriptors. [ERCC download] [GitLab download] [Source code repository]
metadata-v1.0.zip contains metadata on the corpus, the texts, tasks and authors in tab-separated format. [ERCC download] [GitLab download] [Source code repository]
xmlmind-v1.0.zip contains the transcribed corpus in a custom XML format. [ERCC download] [GitLab download] [Source code repository]
annis-v1.0.zip contains the complete corpus in ANNIS format with all metadata and annotation. [ERCC download] [GitLab download] [Source code repository]
mmax2-v1.0.zip contains a corpus version with stand-off annotations produced using the annotation tool MMAX2. [ERCC download] [GitLab download] [Source code repository]
txt-v1.0.zip contains the original and corrected plain text versions of the corpus. [ERCC download] [GitLab download] [Source code repository]
pdf-v1.0.zip contains the original scans as PDF. [ERCC download] [GitLab download] [Source code repository]
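Because the metadata bundle is distributed in tab-separated format, it can be read with Python's standard csv module. The following is a minimal sketch only: the column names (author_id, text_lang, cefr) and the sample rows are hypothetical placeholders, not the actual layout of the files in metadata-v1.0.zip, so check the header line of the real files before adapting it.

```python
import csv
import io
from collections import Counter


def read_tsv(handle):
    """Read a tab-separated file into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def count_by(rows, column):
    """Count how often each value occurs in the given column."""
    return Counter(row[column] for row in rows)


# Hypothetical sample data standing in for a real metadata file.
# The column names and values below are assumptions for illustration only.
SAMPLE = (
    "author_id\ttext_lang\tcefr\n"
    "A001\tDE\tB1\n"
    "A002\tIT\tB2\n"
    "A003\tIT\tB1\n"
)

rows = read_tsv(io.StringIO(SAMPLE))
cefr_counts = count_by(rows, "cefr")

for level, n in sorted(cefr_counts.items()):
    print(f"{level}: {n} text(s)")
```

To apply the same functions to an extracted file, pass an open file handle instead of the in-memory sample, for example open(path, newline="", encoding="utf-8"). The UTF-8 encoding is likewise an assumption and should be verified against the files.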
LICENSE
Kolipsi-1 is available under the CLARIN ACADEMIC END-USER LICENCE ACA-BY-NC-NORED
Text file
Pdf file
Any code or scripts in the repositories are licensed under their respective LICENSE files.