ACTER Annotated Corpora for Term Extraction Research, version 1.4

ACTER is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains (corruption, dressage, heart failure, and wind energy).

Readme structure:

1. General
2. Abbreviations
3. Data Structure
4. Annotations
5. Additional Information
6. Updates
7. Error Reporting
8. License

1. General

2. Abbreviations

Languages and domains:

* “en” = English
* “fr” = French
* “nl” = Dutch
* “corp” = corruption
* “equi” = equitation (dressage)
* “htfl” = heart failure
* “wind” = wind energy

Types of terms/annotations:

* “Spec” or “Specific”: Specific Terms
* “Com” or “Common”: Common Terms
* “OOD”: Out-of-Domain Terms
* “NE(s)”: Named Entities

3. Data Structure

The file structure under each language folder (“en”, “fr”, and “nl”) is identical:

ACTER
│   README.md
│   sources.txt
│
└───en
│   └───corp
│   │   └───annotations
│   │   │   │   corp_en_terms.ann
│   │   │   │   corp_en_terms_nes.ann
│   │   │
│   │   └───texts
│   │       └───annotated
│   │       │   corp_en_01.txt
│   │       │   corp_en_02.txt
│   │       │   ...
│   │       │
│   │       └───unannotated
│   │           │   corp_en_03.txt
│   │           │   ...
│   │
│   └───equi (equivalent to "corp")
│   │
│   └───htfl (equivalent to "corp")
│   │
│   └───wind (equivalent to "corp")
│
└───fr (equivalent to "en")
└───nl (equivalent to "en")

As can be seen, there are corpora in three languages and four domains. All domains are available in all languages and the corpora are comparable across these languages, meaning that they not only cover the same domain, but also have a similar style and size. However, they are not parallel corpora, so they cannot be aligned (not even at document level). The file names always mention the domain, language, and a unique id (e.g. corp_en_01.txt).

For each part of the corpus, both the plain text files and the annotations are included. There are two annotation files: one with only the term annotations (Specific Terms, Common Terms, and OOD Terms), and one with both term and Named Entity annotations. The labels are mentioned for each annotation (see also section 4).

The plain text files are split into those that have been annotated and those that have not. This means that all annotations were found in the parts of the corpora labelled as “annotated”, and that the “unannotated” parts of the corpora may contain many more terms which are not (yet) in the gold standard. Currently, around 50k words per corpus (i.e. per language/domain combination) have been manually annotated.

There is a single case where a text has been only partially annotated: wind_fr_06. The text has therefore been split, and the unannotated part is called wind_fr_06bis.
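As an illustration of how this split can be used, the following Python sketch (the helper name and example paths are ours, not part of the dataset) collects the annotated and unannotated text files of one language/domain combination, following the folder layout above:

```python
from pathlib import Path

def load_texts(acter_root, language, domain):
    """Collect the annotated and unannotated plain-text files of one corpus
    (one language/domain combination), following the ACTER folder layout."""
    base = Path(acter_root) / language / domain / "texts"
    annotated = {p.name: p.read_text(encoding="utf-8")
                 for p in sorted((base / "annotated").glob("*.txt"))}
    unannotated = {p.name: p.read_text(encoding="utf-8")
                   for p in sorted((base / "unannotated").glob("*.txt"))}
    return annotated, unannotated

# Example (illustrative):
# annotated, unannotated = load_texts("ACTER", "en", "corp")
# print(len(annotated), "annotated and", len(unannotated), "unannotated texts")
```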

In addition, there is this readme file and a .txt file (sources.txt) listing the sources of all corpora.

4. Annotations

4.1 Format

The annotations are provided in simple UTF-8 encoded plain text files, with one annotation per line.

The term annotation files include all the term annotations (Specific, Common, and OOD Terms combined). A separate file (terms_nes) includes both terms and Named Entities. Since version 1.2, the label is included with each annotation.

This means that the “terms.ann” and “terms_nes.ann” files now contain two types of information per line: the annotation (lowercased, unlemmatised; see further), followed by a tab and the label of this annotation (“Specific_Term”, “Common_Term”, “OOD_Term”, or “Named_Entity”). In cases where a single annotation received different labels depending on the context, the most frequently assigned label is provided.
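For illustration, a minimal Python sketch for parsing such a file, assuming exactly one tab-separated annotation–label pair per line (the helper name and example path are ours):

```python
def read_annotations(ann_path):
    """Read a *_terms.ann or *_terms_nes.ann file into a dict that maps each
    lowercased, unlemmatised annotation to its label
    (Specific_Term, Common_Term, OOD_Term, or Named_Entity)."""
    annotations = {}
    with open(ann_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            annotation, label = line.split("\t")
            annotations[annotation] = label
    return annotations

# Example (path is illustrative):
# gold = read_annotations("ACTER/en/corp/annotations/corp_en_terms.ann")
```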

4.2 Casing, POS-tagging, and Lemmatisation

True-casing, POS-tagging, and lemmatisation are non-trivial tasks, but they are not the focus of this edition of TermEval. Therefore, all data are lowercased and non-lemmatised, with only one entry per term.

For example, the English corpus on dressage contains the term “bent” (verb, past tense of “to bend”), but also “Bent” (proper noun, a person name). Although both capitalisation and POS differ, and “bent” is not the lemmatised form, there is only one entry in the gold standard: “bent” (lowercased). Other full forms of the verb “to bend” have separate entries if they are present and annotated in the corpus.
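Because the gold standard is lowercased, extracted candidate terms should be lowercased as well before comparison. The following Python sketch shows one straightforward way to score candidates against the gold list; it is only an illustration, not the official TermEval evaluation script:

```python
def score(candidates, gold_terms):
    """Compare extracted candidate terms against the gold standard.
    Both sides are lowercased so that, e.g., "Bent" and "bent" both match
    the single gold entry "bent"."""
    candidate_set = {c.lower() for c in candidates}
    gold_set = {g.lower() for g in gold_terms}  # already lowercased in ACTER
    true_positives = candidate_set & gold_set
    precision = len(true_positives) / len(candidate_set) if candidate_set else 0.0
    recall = len(true_positives) / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```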

5. Additional Information

Websites:

* For more information about the TermEval shared task, visit: https://termeval.ugent.be
* For more information about the CompuTerm workshop, visit: https://sites.google.com/view/computerm2020/
* For more information about the annotation guidelines, visit: http://hdl.handle.net/1854/LU-8503113

Publications:

* Rigouts Terryn, A., Hoste, V., & Lefever, E. (2018). A Gold Standard for Multilingual Automatic Term Extraction from Comparable Corpora: Term Structure and Translation Equivalents. Proceedings of LREC 2018.
* Rigouts Terryn, A., Hoste, V., & Lefever, E. (2019). In No Uncertain Terms: A Dataset for Monolingual and Multilingual Automatic Term Extraction from Comparable Corpora. Language Resources and Evaluation, 54(2), 385–418. https://doi.org/10.1007/s10579-019-09453-9
* Rigouts Terryn, A., Hoste, V., Drouin, P., & Lefever, E. (2020). TermEval 2020: Shared Task on Automatic Term Extraction Using the Annotated Corpora for Term Extraction Research (ACTER) Dataset. Proceedings of the 6th International Workshop on Computational Terminology (COMPUTERM 2020), 85–94.

The dataset has been updated since the publication of the first two papers. These papers also discuss aspects of the data which have not been made available yet, such as cross-lingual annotations and information on the span of the annotations.

Number of annotations per corpus:

| Domain | Language | # term annotations | # term + Named Entity annotations | # Specific Terms | # Common Terms | # OOD Terms | # Named Entities |
|--------|----------|-------------------:|----------------------------------:|-----------------:|----------------:|------------:|-----------------:|
| corp   | en       | 927                | 1173                               | 278              | 642             | 6           | 247              |
| equi   | en       | 1155               | 1575                               | 777              | 309             | 69          | 420              |
| htfl   | en       | 2361               | 2585                               | 1883             | 319             | 157         | 226              |
| wind   | en       | 1091               | 1534                               | 781              | 296             | 14          | 443              |
| corp   | fr       | 979                | 1207                               | 298              | 675             | 5           | 229              |
| equi   | fr       | 961                | 1181                               | 701              | 234             | 26          | 220              |
| htfl   | fr       | 2228               | 2374                               | 1684             | 487             | 57          | 146              |
| wind   | fr       | 773                | 968                                | 444              | 308             | 21          | 195              |
| corp   | nl       | 1047               | 1295                               | 310              | 730             | 6           | 249              |
| equi   | nl       | 1393               | 1544                               | 1022             | 330             | 41          | 151              |
| htfl   | nl       | 2074               | 2254                               | 1559             | 449             | 66          | 180              |
| wind   | nl       | 940                | 1245                               | 577              | 342             | 21          | 305              |

Normalisation:

The following normalisation procedures are applied to both the original text files and the annotations (a Python sketch follows below):

* unicodedata.normalize("NFC", text)
* normalising all dashes to "-", all single quotes to "'", and all double quotes to '"'
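A minimal Python sketch of these steps; the exact sets of dash and quote characters that are mapped are our assumption, not the script that was used to produce the dataset:

```python
import re
import unicodedata

def normalise(text):
    """Approximate the normalisation described above: NFC Unicode
    normalisation, plus mapping dash and quote variants to a single
    character each. The character sets below are an assumption."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"[\u2010-\u2015]", "-", text)              # hyphen/dash variants
    text = re.sub(r"[\u2018\u2019\u201A\u201B]", "'", text)   # single quote variants
    text = re.sub(r"[\u201C\u201D\u201E\u201F]", '"', text)   # double quote variants
    return text
```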

6. Updates

Changes version 1.0 > version 1.1

Changes version 1.1 > version 1.2

* Labels (“Specific_Term”, “Common_Term”, “OOD_Term”, “Named_Entity”) added to each annotation in the .ann files (see section 4.1)

Changes version 1.2 > version 1.3

Changes version 1.3 > version 1.4

7. Error Reporting

The ACTER dataset is an ongoing project, so we are always looking to improve the data. Any questions or issues regarding this dataset may be reported via the GitHub repository at https://github.com/AylaRT/ACTER and will be addressed as soon as possible.

8. License

The data can be freely used and adapted for non-commercial purposes, provided that the above-mentioned paper is cited and any changes made to the data are clearly stated.