LegISTyr

Marlies Alber, Elena Chiocchetti, Natascia Ralli, Isabella Stanizzi

Institute for Applied Linguistics, Eurac Research

{name.surname}@eurac.edu

LegISTyr is a machine translation test set for evaluating legal terminology translation quality in the language combination Italian to South Tyrolean German.

South Tyrolean German is the standard variety of German used in South Tyrol (i.e., the Autonomous Province of Bozen/Bolzano in Northern Italy), where it is a co-official language. It is used by the local legislature, administration and judicature. Two factors influence legal terminology in South Tyrolean German:

  1. the system-bound nature of legal terminology causes terminological variation across legal systems using the same natural language

  2. the official standardisation process by the South Tyrolean Terminology Commission establishes mandatory correspondences between terms designating the same legal concept in Italian and German to reduce terminological variation within South Tyrol

Machine translation systems often fail to accurately translate legal terminology into this minor (lesser-used) standard variety of German. However, terminological correctness and consistency are of paramount importance in a high-stakes domain like the legal domain.

The curated LegISTyr test set provides sentence examples containing Italian legal terms. For each legal term, it provides the officially standardised or simply current translation into South Tyrolean German. Where available, it also provides local term variants (i.e. other acceptable or common terms) and the equivalents used in other, major German-speaking legal systems (i.e. Austria, Germany, Switzerland, occasionally EU law).

The test set is divided into eight subsets, which focus on:

  1. standardised terminology from various legal subdomains

  2. terms related to occupational health and safety

  3. terms related to subsidised housing

  4. terms related to family law

  5. terms related to criminal and criminal procedure law

  6. homonyms that are translated differently depending on the context or legal subdomain

  7. abbreviated forms (initialisms, acronyms and abbreviations), a well-known challenge in legal translation

  8. gender-inclusive variants (e.g. split forms, neutralisations, forms with neomorphemes) that should be translated using equally inclusive forms in accordance with local Provincial Law no. 5/2010

Subsets 1 to 7 include five examples for each selected term. Subset 8 presents similar sentences that use different strategies for inclusive writing, with at least 20 examples per strategy. Each subset consists of at least 250 examples. The test set contains 2067 examples in total.

The test set was compiled by professional South Tyrolean terminologists with very good competencies in both languages and later checked by a legal expert. It was compiled following these suggestions:

Each example comprises at least a) a source sentence in Italian, b) a source term in Italian and c) a target term in South Tyrolean German. Many examples also provide d) other (less) acceptable South Tyrolean German variants expressing the same concept and, where available in bistro, e) equivalents expressing comparable legal concepts in one or more other German-speaking legal systems. This allows LegISTyr users to assess to what extent interference from better-represented legal systems affects machine translation into South Tyrolean German.

Below is an example row from the LegISTyr test set (subsets 1 to 6):

TERM NUMBER | IT EXAMPLE | LEGAL DOMAIN | IT TERM | STANDARDISED/RECOMMENDED | TARGET HYPOTHESIS (DE SOUTH TYROL) | OTHER TERMS SOUTH TYROL (CSV) | TERMS FROM OTHER LEGAL SYTEMS (CSV)
37 | Alla morte dell'usufruttuario è necessario cancellare l'usufrutto sul bene. | civil law | usufrutto | yes | Fruchtgenuss | Fruchtgenussrecht | Fruchtnießung, Nießbrauch, Niessbrauch

TERM NUMBER = Sequential number assigned to each term. There are 5 examples for each term in subsets 1 to 6 (i.e. five lines with the same number).

IT EXAMPLE = Example sentence in Italian that contains the term.

LEGAL DOMAIN = Domain or subdomain of law (e.g. civil law, inheritance law, criminal procedure law, subsidised housing).

IT TERM = Term in Italian contained in the example sentence. The term is listed in its basic form.

STANDARDISED/RECOMMENDED = Information on whether the South Tyrolean German term has been officially standardised by the Terminology Commission or recommended for use in South Tyrol. The column contains only two values: yes or no.

TARGET HYPOTHESIS (DE SOUTH TYROL) = The South Tyrolean German term that is either the standardised/recommended target hypothesis or the expected translation (based on its use by South Tyrolean bodies like the Provincial administration).

OTHER TERMS SOUTH TYROL (CSV) = Other terms (e.g. synonyms, short forms, spelling variants) that are used or may be acceptable for use in South Tyrol. When more than one term is present in the cell, they are separated by a comma. The column can be empty.

TERMS FROM OTHER LEGAL SYTEMS (CSV) = German terms generally used in other legal systems (e.g. Austria, Germany, Switzerland). When more than one term is present in the cell, they are separated by a comma. The column can be empty.
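As a minimal sketch of how a row with these columns might be used, the following Python snippet splits the comma-separated cells into lists and reports which kind of term appears in a machine translation output. The function names, the dictionary representation of a row, the sample output sentence and the matching strategy (case-insensitive substring search, longest term first) are assumptions made for illustration and are not part of LegISTyr. Column names are spelled as in this document. A realistic evaluation would also need to handle German inflection and compounding.

```python
def split_cell(cell):
    """Split a comma-separated cell into a list of terms; an empty cell gives []."""
    return [term.strip() for term in cell.split(",") if term.strip()]


def classify_output(mt_output, row):
    """Return 'target', 'local_variant', 'other_legal_system' or 'none'.

    Longer terms are checked first so that a compound such as
    'Fruchtgenussrecht' is not mistaken for the shorter 'Fruchtgenuss'.
    """
    candidates = [(row["TARGET HYPOTHESIS (DE SOUTH TYROL)"].strip(), "target")]
    candidates += [(term, "local_variant")
                   for term in split_cell(row["OTHER TERMS SOUTH TYROL (CSV)"])]
    candidates += [(term, "other_legal_system")
                   for term in split_cell(row["TERMS FROM OTHER LEGAL SYTEMS (CSV)"])]

    text = mt_output.lower()
    for term, label in sorted(candidates, key=lambda c: len(c[0]), reverse=True):
        if term.lower() in text:
            return label
    return "none"


# Example row 37 from this document, represented as a dictionary (assumed layout).
example_row = {
    "IT TERM": "usufrutto",
    "TARGET HYPOTHESIS (DE SOUTH TYROL)": "Fruchtgenuss",
    "OTHER TERMS SOUTH TYROL (CSV)": "Fruchtgenussrecht",
    "TERMS FROM OTHER LEGAL SYTEMS (CSV)": "Fruchtnießung, Nießbrauch, Niessbrauch",
}

# Hypothetical machine translation output, invented for this sketch.
print(classify_output("Der Nießbrauch an der Sache ist zu löschen.", example_row))
```

Counting the returned labels over all examples of a subset would give a rough picture of how often a system produces the South Tyrolean term rather than a term from another legal system.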

Subset 7 on abbreviated forms has a different structure:

TERM NUMBER | TYPE OF ABBREVIATED FORM | IT EXAMPLE | IT TERM | STANDARDISED/RECOMMENDED FULL FORM | TARGET HYPOTHESIS (DE SOUTH TYROL) | OTHER TERMS SOUTH TYROL (CSV) | TERMS FROM OTHER LEGAL SYTEMS (CSV)
24 | initialism | Un eventuale provvedimento di rigetto dell'istanza per la nomina del CTU deve essere adeguatamente motivato. | CTU | yes | Amtssachverständiger | Sachverständiger, Gerichtssachverständiger | gerichtlicher Sachverständiger

TYPE OF ABBREVIATED FORM = Abbreviated forms can be abbreviations (e.g. art.), acronyms (e.g. DUVRI) or initialisms (e.g. S.P.P.). Regardless of whether they are pronounced as acronyms or initialisms, spelling variants with or without full stops are possible. Some examples also contain a combination of initialism/acronym + full form, since this might help to translate the abbreviated form correctly. There are 5 examples per spelling variant or combination (e.g. initialism + full form) in the subset.

STANDARDISED/RECOMMENDED FULL FORM = Information on whether the full form of the term is standardised/recommended for use in South Tyrol. Note that the Terminology Commission does not standardise abbreviated forms.
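Because the same abbreviated form can occur in several spelling variants, it may be convenient to group variants before computing per-term results. The following Python sketch does this by removing full stops and spaces and ignoring capitalisation. The function names and the variant spellings in the usage example are hypothetical, and this normalisation is only one possible choice: it does not merge Italian plural spellings such as art. and artt.

```python
def normalise_abbreviation(form):
    """Remove full stops and spaces and upper-case the abbreviated form."""
    return form.replace(".", "").replace(" ", "").upper()


def group_variants(forms):
    """Group spelling variants that share the same normalised form."""
    groups = {}
    for form in forms:
        groups.setdefault(normalise_abbreviation(form), []).append(form)
    return groups


# Hypothetical spelling variants, invented for this sketch.
variants = ["CTU", "C.T.U.", "c.t.u.", "S.P.P.", "SPP"]
for key, members in group_variants(variants).items():
    print(key, members)
```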

Subset 8 on gender-inclusive writing also has a different structure:

EXAMPLE NUMBER | TYPE OF GENDER STRATEGY | IT EXAMPLE | FOCUS
48 | full split form with slash | Nessun avvocato/Nessuna avvocata può percepire compensi forensi superiori all’ammontare del proprio trattamento economico complessivo a tempo pieno, tenuto conto di tutti gli elementi retributivi spettanti. | Nessun avvocato/Nessuna avvocata

TYPE OF GENDER STRATEGY = Indicates the type of strategy used for (binary/non-binary) inclusive writing.

The following strategies are considered:

  1. full split form, e.g.: il collaboratore o la collaboratrice, scalatrici e scalatori

  2. full split form with an omitted part of a term, e.g.: la datrice o il datore di lavoro, tecnico o tecnica di cantiere

  3. full split form with slash, e.g.: al revisore unico/alla revisora unica, le lavoratrici / i lavoratori

  4. full split form of an article or preposition before an epicene term with slash, e.g.: una/un geometra, il/la titolare

  5. full split form of an article or preposition before an epicene term, e.g.: del o della ricorrente, la o il giudice

  6. contracted split form, e.g.: tecnico/a, fisioterapisti/e

  7. split forms of pronouns and clitics, e.g.: lui/lei, assegnatole/assegnatogli

  8. epicene terms (the noun is invariable but takes different articles, adjectives etc.), e.g.: giudice, giornalista

  9. invariable terms, e.g.: guardia giurata, persona di riferimento

  10. collective terms, e.g.: corpo docente, equipaggio

FOCUS = Part of the sentence that is of interest.


Notes on each subset

  1. Standardised terminology: Terms from various legal subdomains that have been standardised by the South Tyrolean Terminology Commission. Most are present in bistro; some are present only in the lists of standardised terms. Other subsets may also contain standardised terms. The subset does not contain any short forms like initialisms or abbreviations, since the Terminology Commission never standardises them.

  2. Terms from occupational health and safety (OHS): Terminology used within Italian texts dealing with occupational health and safety. The terms may originally belong to other subdomains (e.g. datore di lavoro, employer, is part of labour law) but are relevant and frequently used in OHS. Since abbreviations and initialisms are very frequent in this subdomain, several concepts are listed both with their full form and their abbreviated form. This subset was part-financed by the Autonomous Province of Bolzano within the project SSL-Laien (see below).

  3. Terms from subsidised housing: Terminology used within South Tyrolean texts dealing with subsidised housing. The terms may originally belong to other subdomains but are relevant and frequently used in texts on subsidised housing. The list also contains frequently used abbreviations and initialisms.

  4. Terms from family law: Terminology used within Italian texts dealing with family law. Many are standardised terms.

  5. Terms from criminal and criminal procedure law: Terminology used within Italian texts dealing with criminal law and/or criminal procedure law. Many are standardised terms.

  6. Homonyms: Terms that should be translated differently depending on the specific context or legal subdomain. A human expert should have no difficulties in disambiguating the meaning of each term within the context of the example.

  7. Abbreviated forms (initialisms and acronyms): There is considerable spelling variation for initialisms and acronyms (use of full stops, capitalisation). Abbreviated forms can be ambiguous and are a well-known challenge in legal translation. The subset contains common examples of abbreviated forms from different legal subdomains that can be found with different spellings in texts. In Italian, abbreviated forms can have a plural spelling (e.g. art. vs. artt., s.m.i. vs. ss.mm.ii.). Not all abbreviated forms in Italian correspond to an abbreviated form in German; sometimes the full form is or should be used in the target language (e.g. Sicherheitssprecher for RLS). The Terminology Commission never standardises abbreviated forms. The information in the column on standardisation therefore refers to the full form.

  8. Gender-inclusive variants of legal agentives: The subset includes a range of standard forms (e.g. split forms) and substandard forms (e.g. forms with the neomorpheme schwa or an asterisk) that should be translated using equally inclusive forms (with similar or different strategies) in accordance with local Provincial Law no. 5/2010.

Note on original sources

The examples are copied, and often adapted (e.g. shortened or otherwise amended, names and locations changed), from different publicly available sources on the web. Given that they are small fragments from a large number of different texts, we do not provide the source for each example. However, an aggregated list of reference sources is available. The dataset is intended for research purposes and does not aim to replace or reproduce the original texts in their entirety. Any rights to the original content remain with their respective owners.

Note on financing

Dataset creation was partly financed by the Autonomous Province of Bozen/Bolzano–South Tyrol within the project SSL-Laien “Optimising expert-lay communication. Case study: e-learning modules of the Autonomous Province of Bolzano – South Tyrol” awarded via the Programme Agreement for Eurac Research for 2022-2024. This financing concerns only the subset on occupational health and safety.

All other subsets were financed by Eurac Research.

The Autonomous Province of Bolzano and Eurac Research both have an open access policy.

Note on papers using the test set

LegISTyr is presented in one of the papers at the TermTrends25 workshop “Bridging the Gap between Terminological Resources and Large Language Models”, held in September and co-located with LDK 2025 – 5th Conference on Language, Data and Knowledge:

Di Natale, Paolo, Egon W. Stemle, Elena Chiocchetti, Marlies Alber, Natascia Ralli, Isabella Stanizzi, Elena Benini (2025) “The LegISTyr test set: Investigating off-the-shelf instruction-tuned LLMs for terminology-constrained translation in a low-resource language”. In Proceedings of the Workshop TermTrends25 “Bridging the Gap between Terminological Resources and Large Language Models”, Naples, 9 September 2025.

Bolzano/Bozen, 7 July 2025

Elena Chiocchetti