DiDi Corpus v1.0.0
Description
The DiDi Corpus is a corpus of South Tyrolean computer-mediated communication (CMC) data. The corpus comprises around 370,000 tokens from Facebook wall posts and comments on wall posts, as well as socio-demographic data about the participants. All data was automatically annotated with language information (DE, IT, EN and others) and manually normalised and anonymised. In addition, semi-automatic token-level annotations cover part-of-speech and CMC phenomena (e.g. emoticons, emojis, and iteration of graphemes and punctuation).
The anonymised corpus without the private messages is freely available for researchers.
Files
The DiDi Corpus is available from the Eurac Research Clarin Centre (ERCC) and from the on-premise GitLab installation. It is also ready to search in ANNIS on the Eurac Research ANNIS installation.
For further information visit http://www.eurac.edu/didi or write to linguistics@eurac.edu.
The following file bundles are available:
data-annis-v1.0.0.zip contains the complete corpus in ANNIS format with all metadata and annotation. [ERCC download] [GitLab download] [Source code repository]
data-didijson-v1.0.0.zip contains the complete corpus as didijson dumps with all metadata and annotation. [ERCC download] [GitLab download] [Source code repository]
data-didixml-v1.0.0.zip contains the complete corpus in didixml format with all metadata and annotation. [ERCC download] [GitLab download] [Source code repository]
data-docs-v1.0.0.zip contains documentation (in German, DE, and English, EN) about annotation layers, anonymisation, and metadata. [ERCC download] [GitLab download] [Source code repository]
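As a rough starting point for working with a downloaded bundle, the sketch below lists the files inside a zip archive and parses any JSON members it finds. It rests on assumptions this page does not confirm: that the bundle has been saved locally under its published name, and that the didijson dumps are ordinary UTF-8 JSON files ending in .json. The helper names list_members and iter_json are hypothetical, and nothing is assumed about the keys inside the dumps; the documentation bundle remains the reference for the actual layout.

```python
import json
import zipfile
from pathlib import Path

# Assumed local path of a bundle downloaded from ERCC or GitLab.
BUNDLE = Path("data-didijson-v1.0.0.zip")


def list_members(bundle):
    """Return the sorted names of all files (not directories) in the zip bundle."""
    with zipfile.ZipFile(bundle) as archive:
        return sorted(n for n in archive.namelist() if not n.endswith("/"))


def iter_json(bundle):
    """Yield (member_name, parsed_object) for every member ending in .json.

    Assumption: the dumps are plain JSON files. Their internal structure is
    not described here, so the parsed objects are returned untouched.
    """
    with zipfile.ZipFile(bundle) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith(".json"):
                with archive.open(name) as handle:
                    yield name, json.load(handle)


# Only runs when the bundle is actually present in the working directory.
if __name__ == "__main__" and BUNDLE.exists():
    for member in list_members(BUNDLE):
        print(member)
```

Listing the members first is a cheap way to check how a bundle is organised before loading anything into memory.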
LICENSE
The DiDi Corpus is available under the CLARIN ACADEMIC END-USER LICENCE ACA-BY-NC-NORED
Text file
PDF file
Any code or scripts in the repositories are licensed under their respective LICENSE files.