Mikhail Mikhailov and Julia Souma (Tampere University, Finland)

MLCCA: a Finnish-Russian mixed type corpus

Parallel corpora are essential for developing translation technologies of all kinds as well as for comparative language studies. The field is developing rapidly, but data are still scarce, especially for smaller languages (see e.g. Mikhailov & Cooper 2016, Doval & Sánchez Nieto 2019).

In our talk, we present a new Finnish-Russian corpus. The Multilingual Corpus of Contracts and Agreements (MLCCA), compiled at Tampere University, includes both parallel and comparable texts.

The contract is a ubiquitous text type: communities of all sorts as well as individuals constantly have to make agreements to guarantee progress and stability. The documents can be very simple or very complicated, and they can be drawn up in two or more languages or in just one.

MLCCA contains different types of contracts and agreements: treaties, agreements between states, ministries, cities, companies, and universities, as well as lower-level agreements and contracts between small enterprises or natural persons. Most of the texts in the corpus are real documents concluded between real organizations; the others are document templates. With rare exceptions, the texts were collected from open sources on the Internet. They date from the early 1990s to the present.

The parallel part of the corpus contains 56 Finland-Russia treaties, 13 lower-level agreements, and 10 license agreements.

The comparable part contains 300 contracts and agreements in Finnish and 340 in Russian.

The whole corpus currently comprises 590 thousand running words in Russian and 350 thousand running words in Finnish.

The main challenges in compiling MLCCA were:

a) certain types of agreements, such as agreements between companies, are usually confidential and cannot be made available;

b) some types of agreements, such as cooperation agreements between sister cities, are not confidential but have never been published either in print or on the web;

c) some partnership agreements are drawn up in English only, or in Russian and English;

d) copyright: one possible source is books of templates, which are subject to intellectual property rights (IPR);

e) it is sometimes difficult to collect comparable documents in Finnish and in Russian: some contracts that are very typical in Russia are almost never drawn up in Finland;

f) it is difficult to balance the sizes of the Finnish and Russian monolingual parts: Finnish contracts are often arranged as tables or fill-in forms, while Russian contracts are regular cohesive texts. As a result, the Russian and Finnish subcorpora differ considerably in word count, although they contain similar numbers of texts.

The corpus can be further extended with bilateral agreements, license agreements, and comparable documents.

MLCCA can be used for contrastive studies of language for special purposes (LSP) as well as for the practical work of translators and copy-editors. The corpus will be available online via Tampere University and the Language Bank of Finland.

MLCCA was compiled with financial support from the FIN-CLARIN consortium.

References

Doval, Irene & Sánchez Nieto, Maria Teresa (eds.) (2019). Parallel Corpora: Creation and Applications. Benjamins.

Mikhailov, Mikhail & Cooper, Robert (2016). Corpus Linguistics for Translation and Contrastive Studies: A guide for research. London and New York: Routledge.