OAI-PMH response from http://clarin.eurac.edu/repository/oai/request (response date 2024-03-28T10:58:06Z)
Record oai:clarin.eurac.edu:20.500.12124/3 (datestamp 2023-03-17T15:51:45Z; sets hdl_20.500.12124_35, hdl_20.500.12124_2)
PAISÀ Corpus of Italian Web Text
Lyding, Verena
Stemle, Egon
Borghetti, Claudia
Brunello, Marco
Castagnoli, Sara
Dell’Orletta, Felice
Dittmann, Henrik
Lenci, Alessandro
Pirrelli, Vito
web corpus
language learning
The PAISÀ corpus is a large collection of Italian web texts licensed under Creative Commons (Attribution-ShareAlike and Attribution-NonCommercial-ShareAlike). It was created in the context of the PAISÀ project.
Documents were selected in two ways. Part of the corpus was constructed using a method inspired by the WaCky project: we created 50,000 word pairs by randomly combining terms from an Italian basic vocabulary list and used the pairs as queries to the Yahoo! search engine in order to retrieve candidate pages. Hits were limited to pages in Italian with a Creative Commons license of type CC Attribution, CC Attribution-ShareAlike, CC Attribution-NonCommercial-ShareAlike, or CC Attribution-NonCommercial. Pages wrongly tagged as CC-licensed were eliminated using a blacklist populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system.
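The word-pair query generation described above can be sketched as follows. This is a minimal illustration only: the function name, the seed, and the miniature vocabulary are assumptions for the example, not part of the original pipeline.

```python
import random

def make_query_pairs(vocabulary, n_pairs, seed=42):
    """Randomly combine basic-vocabulary terms into two-word search queries."""
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(vocabulary, 2)  # two distinct terms per query
        pairs.add((a, b))
    return [f"{a} {b}" for a, b in pairs]

# Hypothetical miniature vocabulary; the real list covered Italian basic vocabulary
# and produced 50,000 pairs.
vocab = ["casa", "libro", "acqua", "tempo", "strada", "lavoro", "scuola", "città"]
queries = make_query_pairs(vocab, n_pairs=10)
```

Each resulting string is one search-engine query; deduplication via the set ensures the requested number of distinct pairs.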
The remaining pages in the PAISÀ corpus come from the Italian versions of various Wikimedia Foundation projects, namely Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, and Wikivoyage. The official Wikimedia Foundation dumps were used, with text extracted using Wikipedia Extractor.
Once all materials were downloaded, the collection was filtered to discard empty documents and documents containing fewer than 150 words.
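The length filter amounts to a simple predicate over the word count. A minimal sketch, assuming whitespace tokenisation (the actual pipeline may have counted words differently):

```python
def keep_document(text, min_words=150):
    """Keep a document only if it has at least `min_words` words.

    Empty documents have zero words and are discarded automatically.
    """
    return len(text.split()) >= min_words

# Illustrative inputs: an empty document, a too-short one, and a long one.
docs = ["", "troppo corto", "parola " * 200]
kept = [d for d in docs if keep_document(d)]
```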
The corpus contains approximately 380,000 documents from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia and about 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.
2013-01
corpus
http://hdl.handle.net/20.500.12124/3
ita
http://aclweb.org/anthology/W14-0406
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
PUB
application/gzip
application/gzip
application/gzip
application/gzip
text/plain; charset=utf-8
downloadable_files_count: 4
Institute for Applied Linguistics, Eurac Research
http://www.corpusitaliano.it
Record oai:clarin.eurac.edu:20.500.12124/7 (datestamp 2023-03-17T16:05:41Z; sets hdl_20.500.12124_35, hdl_20.500.12124_2)
DIDI - The DiDi Corpus of South Tyrolean CMC 1.0.0
Frey, Jennifer-Carmen
Glaznieks, Aivars
Stemle, Egon W.
Facebook
Social Media
Computer-mediated Communication
Chat
Status Updates
Comment
Social Networking Sites
Multilingualism
Dialect
South Tyrol
Instant Messaging
CMC
The DiDi corpus has an overall size of around 600,000 tokens gathered from 136 South Tyrolean Facebook users who participated in the DiDi project. It consists of 11,102 Facebook wall posts, 6,507 wall comments, and 22,218 private messages, all written by the participants during 2013. Please read the full description of the corpus for further details, and consider also the description of the data collection method and the full description of the DiDi project and its research questions.
As every participant could offer their private messages, their texts on the wall, or both, the corpus comprises wall posts and wall comments from 130 profiles and private messages from 56 profiles; 50 participants granted access to both types of data. The wall posts and comments are freely accessible. Due to privacy concerns, access to the private messages is restricted: it can be granted for scientific research only, after signing a non-disclosure agreement. If you are interested in the data for scientific purposes, please contact the research team.
All texts were anonymised in order to guarantee that the participants' identities cannot be inferred from the texts. The anonymisation covered person names, group names, geographical names and adjectival references, institution names, hyperlinks, e-mail addresses, phone numbers, bank account numbers, server names, postal codes, and other private information. Please read the anonymisation document for the anonymisation keys.
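A simplified illustration of this kind of pattern-based pseudonymisation follows. The placeholder labels and regular expressions here are invented for the example and do not reflect the DiDi project's actual anonymisation keys or procedure (which also covered names and other categories that regexes alone cannot catch):

```python
import re

# Hypothetical patterns and placeholders, for illustration only.
PATTERNS = [
    (re.compile(r"https?://\S+"), "[hyperlink]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[email]"),
    (re.compile(r"\+?\d[\d /-]{6,}\d"), "[phone]"),
]

def anonymise(text):
    """Replace private information with placeholder tokens."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

msg = "Scrivimi a mario.rossi@example.com o chiama +39 0471 055 055"
clean = anonymise(msg)
```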
The corpus offers a wide range of research opportunities for linguists interested in CMC in general and, more specifically, in multilingual language use, the use of regional varieties, and code-switching, code-shifting, and code-mixing phenomena.
Access to the DiDi corpus: https://commul.eurac.edu/annis/didi
2019-03-07
corpus
http://hdl.handle.net/20.500.12124/7
deu
ita
eng
lad
https://gitlab.inf.unibz.it/commul/didi/data-bundle/-/tags/v1.0.0
http://www.eurac.edu/en/research/autonomies/commul/Documents/DiDi/NLP4CMC-2015_DiDi_paper.pdf
http://www.eurac.edu/en/research/autonomies/commul/Documents/DiDi/didi_clic-it2016_FINAL.pdf
CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)
https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md
ACA
text/html
text/html
application/zip
application/zip
application/zip
application/zip
text/plain; charset=utf-8
text/plain
text/plain
downloadable_files_count: 6
Institute for Applied Linguistics, Eurac Research
http://www.eurac.edu/didi
Record oai:clarin.eurac.edu:20.500.12124/8 (datestamp 2023-03-17T15:51:45Z; sets hdl_20.500.12124_35, hdl_20.500.12124_2)
KrdWrd CANOLA Corpus 1.0
Stemle, Egon W.
Steger, Johannes M.
boilerplate removal
web page cleaning
WaC
Web as Corpus
training data
manual annotation
The CANOLA Corpus is a visually annotated English web corpus for training classification engines to remove boilerplate from unseen web pages. It was harvested, annotated, and evaluated with the tools and infrastructure of the KrdWrd project.
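Classifiers trained on such data typically score text blocks with simple surface features. The following is a generic sketch of that idea; the features, names, and thresholds are illustrative assumptions, not the KrdWrd engine's actual model:

```python
def block_features(text_block, n_links):
    """Surface features commonly used for boilerplate detection."""
    words = text_block.split()
    n_words = len(words)
    return {
        "n_words": n_words,
        # Fraction of words that are link anchors; navigation bars score high.
        "link_density": n_links / max(n_words, 1),
        "avg_word_len": sum(len(w) for w in words) / max(n_words, 1),
    }

def looks_like_content(text_block, n_links, min_words=20, max_link_density=0.3):
    """Heuristic decision: long, link-sparse blocks are likely main content."""
    f = block_features(text_block, n_links)
    return f["n_words"] >= min_words and f["link_density"] <= max_link_density

nav = "Home About Contact Login"              # short, link-heavy block
body = "Questo è il testo principale " * 10   # longer running text, no links
```

A trained classifier replaces the hand-set thresholds with weights learned from annotated pages such as those in CANOLA.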
2010-09-10
corpus
http://hdl.handle.net/20.500.12124/8
eng
https://github.com/krdwrd/data/releases/tag/v1.0
https://www.sigwac.org.uk/raw-attachment/wiki/WAC5/WAC5_proceedings.pdf
http://hdl.handle.net/20.500.12124/9
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
PUB
application/gzip
text/plain; charset=utf-8
downloadable_files_count: 1
Institute for Applied Linguistics, Eurac Research
https://krdwrd.github.io
Record oai:clarin.eurac.edu:20.500.12124/9 (datestamp 2023-03-17T15:51:45Z; sets hdl_20.500.12124_35, hdl_20.500.12124_2)
KrdWrd CANOLA Corpus 1.1
Stemle, Egon W.
Steger, Johannes M.
boilerplate removal
web page cleaning
WaC
Web as Corpus
training data
manual annotation
The CANOLA Corpus is a visually annotated English web corpus for training classification engines to remove boilerplate from unseen web pages. It was harvested, annotated, and evaluated with the tools and infrastructure of the KrdWrd project.
2010-11-25
corpus
http://hdl.handle.net/20.500.12124/9
eng
https://github.com/krdwrd/data/releases/tag/v1.1
https://github.com/krdwrd/doc_CANOLA/releases/tag/v1.1
https://www.sigwac.org.uk/raw-attachment/wiki/WAC5/WAC5_proceedings.pdf
http://hdl.handle.net/20.500.12124/8
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
PUB
application/pdf
application/gzip
text/plain; charset=utf-8
text/plain
downloadable_files_count: 2
Institute for Applied Linguistics, Eurac Research
https://krdwrd.github.io