Maria Skeppstedt (a), Elina Kangas (a), Peter Ljunglöf (b), Magnus Ahltorp (a), Gunnar Eriksson (a), and Rickard Domeij (a)

(a) The Institute for Language and Folklore, Sweden

(b) The University of Gothenburg, Sweden/Chalmers University of Technology, Sweden

Plans for using texts from public authorities for creating a partly parallel Meänkieli corpus

A parallel corpus consisting of texts from Swedish public authorities has previously been made freely available within the National Language Bank of Sweden. We plan to gradually extend this corpus by adding texts written in the national minority languages of Sweden [4], i.e. all varieties of Meänkieli, Finnish, Romani Chib, Yiddish and Sami that are spoken in Sweden. As a first step, we will collect texts published in Meänkieli on the web pages of public agencies and local municipalities. Where available, we will also collect Finnish, English and Swedish texts that are parallel to the retrieved Meänkieli texts. These texts are often originally written in Swedish and then translated into other languages.

Meänkieli texts published by Swedish authorities are usually (i) thematically narrow, (ii) limited in size, and (iii) written by a small number of authors and translators. As these texts have typically been translated from Swedish, they could also be considered as belonging to a somewhat constructed genre, which might be influenced by the source language.

However, although there are corpus collections that include Meänkieli texts [3], language resources and language tools for Meänkieli are generally scarce [1]. Therefore, despite its limitations, the planned corpus could form a useful resource, e.g. for providing more examples of how the language is used, and as input for creating some types of language processing tools. In particular, we expect the resource to be valuable because it will be freely available and because it will contain parallel and semi-parallel texts.

Using Meänkieli as the example language, we plan to create a semi-automatic pipeline for collecting parallel corpora with texts written in the minority languages of Sweden. The pipeline will consist of the following steps: (i) manual selection of the HTML web pages with Meänkieli texts to download, (ii) automatic extraction of the actual texts, (iii) automatic sentence alignment of those texts that have been (manually) estimated to be parallel enough for this to be possible, (iv) manual word alignment for a small subset of the parallel texts, and finally, (v) manual quality control of the automatic text extraction and sentence alignment. For step (iii), we plan to use the Bitextor tool [2], which we will provide with a bilingual resource in the form of a Swedish-Meänkieli dictionary for performing the alignment.
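As an illustration of step (ii), the following is a minimal sketch of how running text could be extracted from a downloaded HTML page using only the Python standard library. It is not the extraction method of the planned pipeline; the class name `TextExtractor`, the function `extract_text` and the choice of tags to skip are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping scripts, styles and navigation.

    The tag sets below are illustrative assumptions; real agency pages
    would probably need site-specific rules.
    """

    SKIP = {"script", "style", "nav", "header", "footer"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._parts.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        raw = "".join(self._parts)
        # Normalise whitespace within each line and drop empty lines.
        lines = [" ".join(line.split()) for line in raw.splitlines()]
        return "\n".join(line for line in lines if line)


def extract_text(html):
    """Return the visible text of an HTML document, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p{}</style></head><body><nav>Menu</nav>"
    "<h1>Title</h1><p>First  paragraph.</p><script>x=1</script>"
    "<p>Second.</p></body></html>"
)
print(extract_text(page))
```

Emitting one text block per line keeps paragraph boundaries visible, which should make the manual quality control in step (v) easier.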

After creating a first version of the corpus, we plan to explore to what extent more steps in the pipeline can be fully or partly automated. For instance, we will investigate (i) whether relevant Meänkieli texts and parallel texts in other languages can be automatically identified with the use of automatic language identification and automatic alignment methods, and (ii) whether manual word alignment can be simplified by providing automatic pre-alignments that the user can correct.
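To make the idea of automatic pre-alignments in (ii) concrete, the sketch below proposes word links by looking up each source token in a bilingual dictionary and linking it to the first unused target token that is listed as a translation. This is one simple heuristic assumed for illustration, not the method we have decided on; the function name `pre_align` and the two toy dictionary entries are hypothetical.

```python
def pre_align(source_tokens, target_tokens, dictionary):
    """Propose word links (source_index, target_index) by dictionary lookup.

    `dictionary` maps a lower-cased source word to a set of lower-cased
    target words. Each target token is linked at most once. Tokens without
    a dictionary match are left unaligned for the annotator to handle.
    """
    links = []
    used_targets = set()
    for i, src in enumerate(source_tokens):
        candidates = dictionary.get(src.lower(), set())
        for j, tgt in enumerate(target_tokens):
            if j not in used_targets and tgt.lower() in candidates:
                links.append((i, j))
                used_targets.add(j)
                break  # keep only the first match for this source token
    return links


# Toy Swedish-to-Meänkieli entries, for illustration only.
toy_dictionary = {"vårt": {"meän"}, "språk": {"kieli"}}

print(pre_align(["Vårt", "språk"], ["Meän", "kieli"], toy_dictionary))
# prints [(0, 0), (1, 1)]
```

Because such a lookup ignores inflection, many tokens would remain unlinked in practice, and the proposed links would still need to be corrected by the user.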

Using the knowledge gained from the construction of a corpus-creating pipeline for Meänkieli, we plan to continue by creating similar corpora for the other national minority languages of Sweden.

  1. Domeij, R., Karlsson, O., Moshagen, S., Trosterud, T.: Enhancing information accessibility and digital literacy for minorities using language technology — the example of Sami and other national minority languages in Sweden. In: Perspectives on Indigenous Writing and Literacies. Brill (2019)

  2. Esplà-Gomis, M.: Bitextor: a Free/Open-source Software to Harvest Translation Memories from Multilingual Websites. In: Proceedings of MT Summit XII. Association for Machine Translation in the Americas, Ottawa, Canada (2009)

  3. Jauhiainen, T., Jauhiainen, H., Lindén, K.: The Finno-Ugric languages and the Internet project. In: Proceedings of the First International Workshop on Computational Linguistics for Uralic Languages (2015)

  4. Kulturdepartementet: Lag (2009:724) om nationella minoriteter och minoritetsspråk. (2009)