ERCC Open: Learner Language

Core Metadata Schema for Learner Corpora (LC-meta) v2

Fri, 31 May 2024 00:00:00 GMT

Paquot, Magali; König, Alexander; Stemle, Egon; Frey, Jennifer-Carmen

This document contains a list of metadata fields that can be used to describe learner corpus data. The core metadata schema is structured around 8 metadata types:
- Administrative metadata
- Corpus design metadata
- Learner
- Text (language sample)
- Situational and task characteristics
- Annotation
- Annotator
- Transcriber

Core Metadata Schema for Learner Corpora (LC-meta) v1

Fri, 31 May 2024 00:00:00 GMT

Paquot, Magali; König, Alexander; Stemle, Egon; Frey, Jennifer-Carmen

The Core Metadata Schema for Learner Corpora is an extensive revision of Granger & Paquot's (2017) Core Metadata [Schema] for Learner Corpora Draft 1.0, a proposal in the field of learner corpus research. The original proposal was presented as a draft at the CLARIN workshop on Interoperability of Second Language Resources and Tools (University of Gothenburg, Sweden, 6-7 December 2017, https://sweclarin.se/swe/workshop-interoperability-l2-resources-and-tools). This document contains version 1 of the Core Metadata Schema for Learner Corpora, as shared with the community in 2023-2024 to collect feedback.

Beldeko Summary Corpus v1.1.0

Wed, 01 Mar 2023 00:00:00 GMT

Strobl, Carola; Wedig, Helena

The Beldeko (Belgisches Deutschkorpus) Summary Corpus is a learner corpus that consists of summaries written by advanced L2 German learners (CEFR level B2-C1) with L1 Dutch. It was created to investigate the academic writing skills in L2 German of third-year students of two bachelor programmes, in Applied Linguistics and in Linguistics and Literature, respectively. The corpus consists of 301 summaries (70774 tokens) written by 115 students from three intact classes (convenience sampling). The texts were collected at Ghent University (in 2013 and 2014) and University College of Ghent (in 2013) as pre- and posttests of an intervention study on collaborative writing carried out by Carola Strobl in the context of her PhD research (Strobl, C. (2015). Affordances of online technologies for academic writing instruction in a foreign language. Ghent University, unpublished doctoral dissertation). 82 students produced three summaries each (pretest, posttest immediately after the three-week intervention, delayed posttest six weeks after the intervention) and 33 students produced two summaries each (pretest and posttest); in both cases, missing data are indicated as n.a. in the metadata file.

The metadata file (Beldeko_Summary_1.1.0_metadata.xlsx) provides information about:
• Institution of data collection (HG = University College of Ghent, UG = Ghent University)
• Year of data collection (2013, 2014)
• Participants' gender (f, m)
• Number of texts written and number of tokens in each text (T1, T2, T3)

The individual file names of the corpus reveal, in this order, the institution, the year, the unique ID of the participant (per institution per year), and the text number. The summaries contain between 37 and 330 words each, with a mean of 230 words (the targeted word count was 220-250 words). Outliers regarding text length were unfinished texts produced by students who struggled with the time limit of 60 minutes. The texts were written in class, on computers. Students were allowed to use online aids such as dictionaries. Each task consisted of summarizing two texts (fragments of newspaper articles, interviews, or websites) about a topic related to language variation in German (Kiezdeutsch, Mundartdebatte in der Schweiz, Viadrinisch, Varianten-Wörterbuch des Deutschen; see also the Word files provided in the metadata). More specifically, the topics were distributed as follows:
• Kiezdeutsch: HG_2013_T1, UG_2013_T1, UG_2014_T1
• Mundartdebatte in der Schweiz: HG_2013_T2, UG_2013_T2, UG_2014_T2
• Viadrinisch: HG_2013_T3
• Varianten-Wörterbuch des Deutschen: UG_2014_T3

The new version of the corpus (Beldeko 1.1.0) contains the manual annotations of the texts with token id, sentence id, source text form, target form, POS (STTS) and simple UPOS part-of-speech tag.
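The file-naming convention described above (institution, year, participant ID, text number) could be decoded roughly as follows. This is only a sketch: the exact file-name layout is not given in the description, so the underscore separator, the participant ID format, the optional extension, and the example name `UG_2014_07_T3.txt` are all assumptions, as are the names `NAME_PATTERN`, `TOPICS`, `SummaryFile`, and `parse_name`. The topic table reproduces only the combinations listed in the description.

```python
import re
from typing import NamedTuple, Optional

# ASSUMED layout: <institution>_<year>_<participant id>_T<n>[.<ext>]
# The real corpus files may use a different separator or ID format.
NAME_PATTERN = re.compile(
    r"^(?P<institution>HG|UG)_(?P<year>2013|2014)_"
    r"(?P<participant>[^_]+)_T(?P<text>[123])(?:\.\w+)?$"
)

# Topic per (institution, year, text number), as listed in the description.
TOPICS = {
    ("HG", 2013, 1): "Kiezdeutsch",
    ("UG", 2013, 1): "Kiezdeutsch",
    ("UG", 2014, 1): "Kiezdeutsch",
    ("HG", 2013, 2): "Mundartdebatte in der Schweiz",
    ("UG", 2013, 2): "Mundartdebatte in der Schweiz",
    ("UG", 2014, 2): "Mundartdebatte in der Schweiz",
    ("HG", 2013, 3): "Viadrinisch",
    ("UG", 2014, 3): "Varianten-Wörterbuch des Deutschen",
}


class SummaryFile(NamedTuple):
    institution: str      # "HG" or "UG"
    year: int             # 2013 or 2014
    participant: str      # unique per institution per year
    text_number: int      # 1, 2, or 3 (T1, T2, T3)
    topic: Optional[str]  # None if the combination is not listed above


def parse_name(name: str) -> SummaryFile:
    """Split a file name into its metadata parts (hypothetical layout)."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    institution = match["institution"]
    year = int(match["year"])
    text_number = int(match["text"])
    topic = TOPICS.get((institution, year, text_number))
    return SummaryFile(institution, year, match["participant"], text_number, topic)


if __name__ == "__main__":
    print(parse_name("UG_2014_07_T3.txt"))
```

Keeping the participant ID as a string avoids losing leading zeros, and returning `None` for unlisted combinations avoids guessing a topic that the description does not state.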

Core Metadata [Schema] for Learner Corpora Draft 1.0

Fri, 15 Dec 2017 00:00:00 GMT

Granger, Sylviane; Paquot, Magali

First proposal towards a "Core Metadata [Schema] for Learner Corpora", presented at the "CLARIN workshop on Interoperability of Second Language Resources and Tools", Gothenburg, Sweden, 06-08/12/2017. It was circulated as part of the invited talk "Towards standardization of metadata for L2 corpora", which took stock of a range of metadata sets and made suggestions for minimal and maximal design principles, but it was never published (or included in a publication).

Beldeko Summary Corpus v1.0.0

Mon, 17 Feb 2020 00:00:00 GMT

Strobl, Carola

The Beldeko (Belgisches Deutschkorpus) Summary Corpus is a learner corpus that consists of summaries written by advanced L2 German learners (CEFR level B2-C1) with L1 Dutch. It was created to investigate the academic writing skills in L2 German of third-year students of two bachelor programmes, in Applied Linguistics and in Linguistics and Literature, respectively. The corpus consists of 301 summaries (70774 tokens) written by 115 students from three intact classes (convenience sampling). The texts were collected at Ghent University (in 2013 and 2014) and University College of Ghent (in 2013) as pre- and posttests of an intervention study on collaborative writing carried out by Carola Strobl in the context of her PhD research (Strobl, C. (2015). Affordances of online technologies for academic writing instruction in a foreign language. Ghent University, unpublished doctoral dissertation). 82 students produced three summaries each (pretest, posttest immediately after the three-week intervention, delayed posttest six weeks after the intervention) and 33 students produced two summaries each (pretest and posttest); in both cases, missing data are indicated as n.a. in the metadata file.

The metadata file (Beldeko_Summary_1.0.0_metadata.xlsx) provides information about:
• Institution of data collection (HG = University College of Ghent, UG = Ghent University)
• Year of data collection (2013, 2014)
• Participants' gender (f, m)
• Number of texts written and number of tokens in each text (T1, T2, T3)

The individual file names of the corpus reveal, in this order, the institution, the year, the unique ID of the participant (per institution per year), and the text number. The summaries contain between 37 and 330 words each, with a mean of 230 words (the targeted word count was 220-250 words). Outliers regarding text length were unfinished texts produced by students who struggled with the time limit of 60 minutes. The texts were written in class, on computers. Students were allowed to use online aids such as dictionaries. Each task consisted of summarizing two texts (fragments of newspaper articles, interviews, or websites) about a topic related to language variation in German (Kiezdeutsch, Mundartdebatte in der Schweiz, Viadrinisch, Varianten-Wörterbuch des Deutschen; see also the Word files provided in the metadata). More specifically, the topics were distributed as follows:
• Kiezdeutsch: HG_2013_T1, UG_2013_T1, UG_2014_T1
• Mundartdebatte in der Schweiz: HG_2013_T2, UG_2013_T2, UG_2014_T2
• Viadrinisch: HG_2013_T3
• Varianten-Wörterbuch des Deutschen: UG_2014_T3
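The participant and text counts quoted in the description can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated above. Reading the gap between the planned and the collected number of summaries as the texts marked n.a. is an assumption, not something the description states outright. The variable names are illustrative.

```python
# Figures quoted in the corpus description.
students_three_texts = 82   # pretest, posttest, delayed posttest
students_two_texts = 33     # pretest and posttest
summaries_in_corpus = 301
tokens_in_corpus = 70774

# Total number of participants.
total_students = students_three_texts + students_two_texts

# Number of summaries if every planned text had been handed in.
max_summaries = students_three_texts * 3 + students_two_texts * 2

# ASSUMPTION: the shortfall corresponds to the texts marked "n.a."
# in the metadata file.
missing_summaries = max_summaries - summaries_in_corpus

# Mean tokens per summary; tokens are not the same unit as the
# "words" used for the reported mean length of 230 words.
mean_tokens = tokens_in_corpus / summaries_in_corpus

print(total_students)         # 115
print(max_summaries)          # 312
print(missing_summaries)      # 11
print(round(mean_tokens, 1))  # 235.1
```

The participant total matches the 115 students reported. The mean of about 235 tokens per summary is in the same range as the reported mean of 230 words; the description does not explain the difference between the two counts.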