PAISÀ Corpus of Italian Web Text

Name: PAISÀ Corpus of Italian Web Text
License: https://creativecommons.org/licenses/by-nc-sa/4.0/

Lyding, Verena; Stemle, Egon; Borghetti, Claudia; Brunello, Marco; Castagnoli, Sara; Dell’Orletta, Felice; Dittmann, Henrik; Lenci, Alessandro; Pirrelli, Vito

dc.contributor.author	Lyding, Verena
dc.contributor.author	Stemle, Egon
dc.contributor.author	Borghetti, Claudia
dc.contributor.author	Brunello, Marco
dc.contributor.author	Castagnoli, Sara
dc.contributor.author	Dell’Orletta, Felice
dc.contributor.author	Dittmann, Henrik
dc.contributor.author	Lenci, Alessandro
dc.contributor.author	Pirrelli, Vito
dc.date.accessioned	2018-05-29T11:06:34Z
dc.date.available	2018-05-29T11:06:34Z
dc.date.issued	2013-01
dc.identifier.uri	http://hdl.handle.net/20.500.12124/3
dc.description	The PAISÀ corpus is a large collection of Italian web texts, licensed under Creative Commons (Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike). It was created in the context of the PAISÀ project. Documents were selected in two ways. One part of the corpus was constructed using a method inspired by the WaCky project: we created 50,000 word pairs by randomly combining terms from an Italian basic vocabulary list and used the pairs as queries to the Yahoo! search engine to retrieve candidate pages. We limited hits to pages in Italian with one of the following Creative Commons licenses: CC-Attribution, CC-Attribution-Sharealike, CC-Attribution-Sharealike-Non-commercial, or CC-Attribution-Non-commercial. Pages wrongly tagged as CC-licensed were eliminated using a blacklist populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system. The remaining pages come from the Italian versions of several Wikimedia Foundation projects: Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, and Wikivoyage. The official Wikimedia Foundation dumps were used, with text extracted by Wikipedia Extractor. Once all materials were downloaded, the collection was filtered by discarding empty documents and documents containing fewer than 150 words. The corpus contains approximately 380,000 documents from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia and approximately 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 come from blog services.
dc.language.iso	ita
dc.publisher	Institute for Applied Linguistics, Eurac Research
dc.relation.isreferencedby	http://aclweb.org/anthology/W14-0406
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label	PUB
dc.source.uri	http://www.corpusitaliano.it
dc.subject	web corpus
dc.subject	language learning
dc.title	PAISÀ Corpus of Italian Web Text
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
hidden	false
hasMetadata	false
has.files	yes
branding	CMC & WaC
contact.person	Corpus Manager clarin@eurac.edu Eurac Research CLARIN Centre (ERCC)
sponsor	Ministero dell'Istruzione, dell'Università e della Ricerca (MIUR) N/A Fondo per gli Investimenti della Ricerca di Base (FIRB) nationalFunds
size.info	380000 pages
size.info	250M words
files.size	2538447018
files.count	4
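
The description above mentions two mechanical steps: building search queries from random word pairs and discarding documents shorter than 150 words. The following is a minimal sketch of both steps, not the project's actual code. The function names, the fixed random seed, the choice of unordered distinct pairs, and whitespace tokenization for word counting are all assumptions made for illustration.

```python
import random


def make_query_pairs(vocabulary, n_pairs, seed=0):
    """Draw n_pairs distinct two-word queries by randomly combining terms.

    Assumption: pairs are unordered and never repeat; the source only says
    that terms were "randomly combined".
    """
    rng = random.Random(seed)
    terms = sorted(set(vocabulary))
    max_pairs = len(terms) * (len(terms) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("vocabulary too small for the requested number of pairs")
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(terms, 2)  # two different terms
        pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


def keep_document(text, min_words=150):
    """Keep a document only if it has at least min_words tokens.

    Assumption: words are counted by splitting on whitespace. Empty
    documents have zero tokens and are therefore discarded as well.
    """
    return len(text.split()) >= min_words
```

With a real basic vocabulary list, `make_query_pairs(vocabulary, 50000)` would yield the 50,000 queries mentioned in the description, and `keep_document` would be applied to each cleaned page before it enters the collection.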

Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: paisa.raw.utf8.gz
Size: 521.58 MB
Format: application/gzip
Description: raw cleaned web texts
MD5: d7804d4d9af31ddaec5bfa7409926f2e

Name: paisa.annotated.CoNLL.utf8.gz
Size: 1.84 GB
Format: application/gzip
Description: cleaned and linguistically annotated web texts in CoNLL format
MD5: 9d49fd1e86c9e6de3a6cb67a6c10a2f2

Name: lemma-WITHOUTnumberssymbols-frequencies-paisa.txt.gz
Size: 6.94 MB
Format: application/gzip
Description: lemma frequencies (only composed of letters and the following three symbols: . - ' )
MD5: 6d3959478ad4c5fecfe9c9cc305c68af

Name: lemma-frequencies-paisa.txt.gz
Size: 9.53 MB
Format: application/gzip
Description: lemma frequencies
MD5: ea27fe186efc59410d5ea39c4130315b
