Frequently Asked Questions



What is the repository?

It is like a library for linguistic data and tools. Using it, you can:

  • Search for data and tools and easily download them.
  • Deposit your data and be sure that it is safely stored and that everyone can find it, use it, and cite it correctly (giving you credit).

What does the repository contain?

It contains descriptions of linguistic data collections and tools in the form of metadata, together with the resources themselves, e.g. corpora, NLP tools, etc. The ERCC contains mainly German and Italian language resources.

How do I cite a submission?

See our citing policy.

How do I get the most out of my searches?

Unlike most search engines, this one uses OR as the default operator; the examples below illustrate this. If you are not satisfied with your results, you can go beyond plain-text searches: restrict the search to certain fields, use negation, boost (add weight to) some parts of the query, and much more. The search engine is Solr, so use its query syntax if you know it, or look it up in the Solr documentation.

Example Queries

PDT wordnet vs PDT AND wordnet
The default operator is OR, i.e. the first query searches for PDT OR wordnet in all text fields, whereas the second returns only items that contain both terms.
dc.title:P?T && -dc.title:WordNet
Returns all items with P?T in the title and without WordNet in the title; ? stands for any single character (e.g. PDT).
dc.title:"Czech WordNet"
Use double quotes (") for exact matches and multiword expressions
author:(Bojar && -Tamchyna) && (dc.language.iso:(ces AND eng) OR language:(czech AND english))
Searches for items by one author (Bojar) but not the other (Tamchyna); only items covering both Czech and English are of interest.
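
If you build queries like the ones above programmatically, terms taken from user input may contain characters that the Solr query syntax treats as operators. The following Python sketch shows one way to handle this; it is not part of the repository, the helper names escape_term and field_query are hypothetical, and the field name dc.title is simply taken from the examples above.

```python
import re

# Characters and operators with special meaning in the Solr/Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_term(term: str) -> str:
    """Backslash-escape query syntax so the term is matched literally."""
    return _SPECIAL.sub(r"\\\1", term)


def field_query(field: str, value: str) -> str:
    """Build field:"value", an exact phrase match restricted to one field."""
    # Inside double quotes only the backslash and the quote need escaping.
    quoted = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{quoted}"'


print(field_query("dc.title", "Czech WordNet"))  # dc.title:"Czech WordNet"
print(escape_term("P?T"))                        # P\?T
```

Note that escaping turns ? into a literal question mark, so only escape terms in which you do not want wildcards to apply.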

Do I need to create an account to download and/or make a submission?

  • You can download data and tools with a license that allows free sharing without registration. Just read the license and download. This applies to all data under Creative Commons licenses and all tools under open source licenses.
  • To download data and tools that require you to sign a license, you need to log in. To make a submission, you also need to log in. However, if you are from the academic world, you probably don't need any new account.
  • Just click "Login" and search for your academic institution. To sign in, you can use any account with an Identity Provider that is a member of the EduGAIN federation.
  • If you cannot find your home organisation in the list of organisations in the Login dialog, register at clarin.eu and authenticate using "clarin.eu website account".

I see an error logging in

If you have any trouble logging in, please let us know through our Help Desk.

Occasionally (usually when you are the first person to log in from your home institution) you might see an error stating "The authentication was successful; however, your identity provider did provide neither your email, eppn nor targeted id." This means your home institution did not send us enough data about you to operate our service; institutions withhold this data to protect your personal information. We only require an email address, and we follow the Data Protection Code of Conduct, which helps us assure the institution that we won't misuse your data.

If you have accounts with multiple providers and log in with a different one each time, you might see an error stating "Your email is already associated with a different user.". Please try to use the same provider each time. If that is not possible, let us know and we'll change the default one.

What submissions do we accept?

In general, we are open to considering any linguistic data or tools for inclusion. We especially encourage deposits of data that fit our fields of speciality (Learner Corpora, Corpora of Computer-mediated Communication (CMC), or terminological and lexicographical data) or that have a strong connection to South Tyrol.

You can submit data in two ways:

  • Upload the metadata together with the data itself.
  • Create a metadata-only record, if required.

Although we do not strictly require you to upload the data, the first option is the one we favor. We also support online license signing for immediate availability of restricted resources. If you would like to submit something to our repository, please get in touch with us so that we can evaluate your case and either give you the OK to include your data or explain why we cannot include it.

When uploading language resources, please try to use one of the recommended formats mentioned in LRT Standards.

Why should I submit my data into your repository?

  • It is free and safe.
  • We respect your license. We encourage Free Data and believe it benefits not only users, but also the data providers. However, we also accept more restricted data, and we can make users sign a license before downloading your data if that is what you need.
  • The data is visible, giving you maximal credit for your work (Google, VLO, DataCite, OLAC, Data Citation Index, arXive).
  • The data is easy to cite. We provide ready-to-use one-click citations in BibTeX, RIS, and other popular reference formats. All the citations include permanent links created from persistent identifiers (we use handles for PIDs). These PIDs are future-proof.
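
To illustrate what a citation with a permanent link might look like, here is a minimal Python sketch that formats a BibTeX entry whose url field carries the PID link. It is only an illustration: the function name bibtex_entry, the choice of the @misc entry type and the selection of fields are assumptions, and the citations the repository actually generates may differ.

```python
def bibtex_entry(key, authors, title, year, publisher, pid_url):
    """Format a @misc BibTeX entry whose url field is the persistent link."""
    fields = {
        "author": " and ".join(authors),  # BibTeX separates authors with "and"
        "title": title,
        "year": str(year),
        "publisher": publisher,
        "url": pid_url,
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}"


# All values below are made-up placeholders, not a real record.
print(bibtex_entry(
    key="example2020",
    authors=["Doe, Jane", "Roe, Richard"],
    title="An Example Corpus",
    year=2020,
    publisher="Example Repository",
    pid_url="https://hdl.handle.net/12345/example-1",
))
```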

Why should I submit my tools?

  • See "Why should I submit my data into your repository?". Everything applies to software tools too.
  • You can just link your version control system (svn, git), if it is publicly accessible. You can also link your project page, or demo site.

I don't want to or cannot make the data publicly available, or I want to make them available only after a specific date. Would you still archive them for me?

In line with the position of the research infrastructures and the general move towards Open Access, we strongly encourage data producers to be as open as possible. However, where this is not possible, we will archive your data even if they will not be publicly available. Please contact our Help Desk before completing the submission.

What license should I pick for my data/tool?

We encourage using a free license. A representative selection of free licenses as well as CC licenses (more appropriate for data) is available directly during submission. The OPEN License Selector can guide you through selecting an appropriate license.
If you need a different license, Contact Us.

Where can I find more information about supported licenses?

The list of licenses currently supported is here. However, do not hesitate to Contact Us in case you need a specific license. The licenses can be accompanied by various requirements, e.g. restricting access to logged-in users or asking users to fill in additional details (such as the purpose of use).

How safe is my data, if I store it with you?

Quite safe, probably much safer than on your own computer. Our storage plan:

  • All the data in the repository have an on-site backup copy.
  • There is another off-site copy, so even complete destruction of our building does not destroy your data.
  • We check all the copies regularly and should any of them become corrupted, we delete it and recreate it.
  • We keep at least three copies, one of them off-site, at all times.
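
The regular integrity check described above can be pictured with the following Python sketch. It is not the repository's actual procedure: the function names, the use of SHA-256 checksums and the idea of restoring from the first intact copy are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_repair(copies: list[Path], expected: str) -> list[Path]:
    """Recreate every copy whose checksum differs from `expected`.

    Returns the list of copies that had to be recreated.
    """
    good = [c for c in copies if c.exists() and sha256_of(c) == expected]
    if not good:
        raise RuntimeError("no intact copy left to restore from")
    repaired = []
    for copy in copies:
        if copy not in good:
            copy.unlink(missing_ok=True)    # delete the corrupted copy
            shutil.copyfile(good[0], copy)  # recreate it from an intact one
            repaired.append(copy)
    return repaired
```

With three copies, a single corrupted or missing file can always be restored from one of the other two, which is why the checksum recorded at deposit time (the expected argument here) is enough to decide which copies to trust.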

What is the actual depositing/archiving procedure?

During the submission of digital language resources to the repository, the data undergo a curation process in order to ensure quality and consistency. We assist you in meeting the necessary requirements for sustainable resource archiving. Data have to be provided with metadata in standard formats accepted in the respective communities, persistent identifiers (PIDs) have to be assigned, IPR issues have to be resolved, and clear statements on licensing and permitted use of the resources have to be made. The depositor is also required to electronically sign a deposition agreement acknowledging that they are the holder of rights to the data and that they have the right to grant the rights contained in this licence. Once the data have been deposited in the repository, they are assigned a PID for stable reference.

Why do we strongly prefer real authors to institutions?

It is not about contact, it is about citations, credit and trust. This is why we have separate metadata fields for authors and for the contact person. Filling in a non-personal address as Contact, e.g. a helpdesk, is perfectly fine. Not acknowledging the authors of a scholarly work is not, which is why you should always enter real people in the author field. We support the direct citation of data (https://www.force11.org/datacitation), which is why we also give data PIDs, create formatted citations, etc. We want proper authors so that they get cited and other scientists know whose work they rely on.

What is the PID (handle) good for?

It is a special permanent URL: a link that will resolve correctly even if the data is moved at some point in the future, for example if we restructure things internally, or even if the URL of the repository itself changes. You should therefore always use the PID link as the URL in citations.
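
As an illustration, a handle can be turned into a citable link by prefixing it with the address of the global Handle proxy, hdl.handle.net. The Python sketch below assumes handles of the form PREFIX/SUFFIX; the function name handle_to_url and the handle in the example are made up.

```python
HANDLE_PROXY = "https://hdl.handle.net/"


def handle_to_url(handle: str) -> str:
    """Turn a handle such as 'hdl:PREFIX/SUFFIX' into a resolvable URL."""
    handle = handle.strip()
    # Accept a bare handle, an 'hdl:' URI, or an already-resolvable link.
    for prefix in ("hdl:", "http://hdl.handle.net/", HANDLE_PROXY):
        if handle.startswith(prefix):
            handle = handle[len(prefix):]
            break
    if "/" not in handle:
        raise ValueError(f"not a handle (expected PREFIX/SUFFIX): {handle!r}")
    return HANDLE_PROXY + handle


# The handle below is a made-up placeholder, not a real record.
print(handle_to_url("hdl:12345/example-1"))  # https://hdl.handle.net/12345/example-1
```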

What if I want/need to update the archived data?

Every change to the resources and metadata should be stored as a new version with a new PID. However, if the changes are minimal (e.g. typos or clear mistakes), contact our Help Desk with the submission PID and the changes that should be made. It is up to the reviewer to decide whether these changes should result in a new version or not. If you need or want to create a new submission, see creating a new version.

What if I want to withdraw the resources in the future? Can I delete the data?

Yes. In this case, contact our Help Desk with the submission PID and the reason. However, we need to keep a record that the data was in our repository (because a persistent identifier was issued), so the administrative metadata will be retained, indicating that the data itself was removed.

We have started our own repository, can we somehow move records submitted to the ERCC collection?

We can create a tombstone page for the moved record and add a notice to that page saying the resource is now at a new location. The submission is effectively hidden from search, browse and harvesting (OAI-PMH), but the PID still resolves; instead of the actual data, we show a link to the item in your repository. Please contact the Help Desk for more details.

Where can I report problems with the repository?

If you encounter any problems while using the ERCC repository, you can always send us an email. You can also report bugs or typos in our Issue Tracker on GitLab (if you already have access).