Parallel Corpora as Digital Resources and Their Applications

DHN2020
March 17, 2020
Riga, Latvia

Schedule

Session 1

9.00–9.10 Opening
9.10–10.10 Natalia Levshina (invited talk): Parallel corpora and big questions in linguistics
10.10–10.40 Olga Lyashevskaya: Belarusian-Russian and Lithuanian-Russian parallel treebanks: three practical tasks, three dozen dependency relations, and an indefinite number of language-specific constructions
10.40–11.10 Coffee break

Session 2

11.10–11.40 Maria Kunilovskaya: Types of translationese and register variation in English-to-Russian professional translation
11.40–12.10 Edyta Jurkiewicz-Rohrbacher, Elżbieta Kaczmarska, Alexandr Rosen: Parallel corpus as functional context of aspectual interpretation – the case of Slavic biaspectual verbs in the comparative context of Finnish
12.10–12.40 Liubov Nesterenko, Anastasia Bonch-Osmolovskaya: Computational methods meet parallel data: approaches for comparative analysis of passives in European languages
12.40–13.40 Lunch

Session 3

13.40–14.10 Atle Grønn: The RuN-Euro corpus and its applications
14.10–14.40 Kirill Semenov, Sonia Durneva, Yulia Kuznetsova: The Russian-Chinese parallel corpus in Ruscorpora: achievements and challenges
14.40–15.10 Mikhail Mikhailov, Julia Souma: MLCCA: a Finnish-Russian mixed type corpus
15.10–15.40 Coffee break

Session 4

15.40–16.10 Marina Akimova, Anastasia Belousova, Igor Pilshchikov, Vera Polilova: CPCL: A multilingual parallel corpus of poetic texts and new perspectives for comparative literary studies
16.10–16.40 Federico Aurora: Bibliotheca Polyglotta
16.40–17.00 Maria Skeppstedt, Elina Kangas, Peter Ljunglöf, Magnus Ahltorp, Gunnar Eriksson, Rickard Domeij: Plans for using texts from public authorities for creating a partly parallel Meänkieli corpus
17.00–17.50 Short break followed by a concluding discussion

Format of the workshop

A full-day session will include one invited lecture and 10 slots for 10- or 20-minute talks, plus time for questions and discussion. The session will be followed by a general round-table discussion.

How to submit

Abstracts (up to 500 words without references) should be submitted to parallelcorporadhn2020@gmail.com by February 10 (extended!), 2020. Notifications of acceptance will be sent by February 17.

Registration for the conference is open here. The fee for workshop-only participation is 15 euros (coffee breaks included) and is not expected to change after the early-bird registration deadline.

In order to accept more presentations, we are adapting the schedule to the main conference format, which consists of short and long papers. Please indicate your preference (short or long presentation) when you submit your abstract. Unfortunately, because of the workshop's time restrictions, we cannot guarantee that every request for a long presentation will receive the desired slot.

Who should submit and/or attend

The aim of the workshop is to bring together specialists who develop parallel corpora or work with data from such corpora, and to share knowledge about the different areas of their application and the variety of methods used in studies based on parallel texts. The major focus of the workshop will be parallel corpora of particular relevance for Northern Europe, though participants from other countries or those working with other languages are also very welcome to contribute by presenting their research.

Why parallel and not just monolingual corpora

A parallel corpus "consists of the same documents in a number of languages, that is a set of texts and their translations" (Baker et al. 2006). This type of corpus is widely used in linguistics, more specifically in cross-linguistic comparison and typology. Semantic correspondences (on both the lexical and the grammatical level) can be extracted more easily from parallel texts, including massive multilingual corpora (cf. Cysouw, Wälchli (eds.) 2007; Christodouloupoulos, Steedman 2015), than from monolingual corpora. The latter lack semantic annotation, and without a prior cross-linguistic comparison the categories to annotate often do not suggest themselves. Existing work on the topic includes different techniques for the analysis (cluster and regression analysis, multidimensional scaling, collocation analysis) and visualization of linguistic data extracted from parallel corpora. ParTy, an open-source parallel corpus designed for typological research, is being compiled by Natalia Levshina.

Parallel corpora and, more generally, parallel texts have always been and still remain an important resource type for training and evaluating natural language processing tools, most particularly in machine translation. The hot topic of transfer learning (Ruder 2019), which, among other things, focuses on different techniques of transferring models from high-resource languages to low-resource ones, makes parallel texts an important indirect object of study. For instance, parallel texts are highly relevant for building cross-lingual parsers or word embeddings (Yarowsky et al. 2001; McDonald et al. 2011; Tiedemann 2015; Agić et al. 2016; Søgaard et al. 2019), and can also be used for more linguistically informed experiments (Östling 2015; Östling & Tiedemann 2017).

Many language pairs, however, remain scarcely represented in the domain of parallel corpora beyond some specific genres such as legalese or religious texts, and so cannot serve as representative large-scale data collections. Even parallel corpora for high-resource languages are typically composed of literary and non-fiction texts. Access to such genres as business correspondence, contracts, letters, etc. is limited. Other genres, e.g., user manuals, tourist guides and web pages, often contain low-quality data, sometimes even machine-translated text. The digitization of accessible texts and the creation of corpora for new language pairs is still an important task per se.

Relevance of parallel corpora in a wider context of digital humanities

We welcome participants working with data that can be seen as a special type of parallel corpus, even though such data are not purely linguistic and rather concern phenomena relevant for the wider area of digital humanities. For instance, many publications with texts in the Circum-Baltic languages can primarily serve as a source of folklore texts and oral history in general. Nevertheless, they have usually been published with translations into higher-resourced languages, with German and Russian being particularly relevant for the region, e.g., the comprehensive bilingual collection of Latvian fairytales and legends compiled by Pēteris Šmits (Šmits 1925-1937; pasakas.lfk.lv). For the minor Finnic languages, Finnish and Estonian have often been used in translations. A corpus of international treaties compiled at Tampere University (Mikhailov et al. 2019) can be used as a source on the history of Finnish-Russian relations as reflected in the structure, language and pragmatics of the treaties.

Texts represented in parallel corpora can also be treated from the perspective of cultural heritage and its representation (as emphasised in Derzhanski & Siruk 2013; Giouli et al. 2009). This is strongly related to the issue of the representativeness of particular corpora: for instance, which texts are selected by the creators of the corpora and what wider cultural perspective is reflected in such texts, which might be of particular relevance for low-resource languages.

Parallel texts and/or corpora are used in language learning (Doval et al. 2019) and as a translation-assisting tool, including the collection of translation memories. They are one of the best resources for studying the regularities of translation and the literary, cultural and social context of interlanguage translation (Zanettin 2014). For example, multiple translations of the same text reflect the evolution of the cultural techniques used to present the text to its readership.

The use of parallel corpora in the study of multimodal non-verbal communication is also significant. There exists a parallel corpus of different stage versions of the same play (MultiPARC within the Russian National Corpus, ruscorpora.ru), and corpora of signed languages are beginning to emerge (e.g., Morrissey et al. 2010). Studies based on parallel corpora are also relevant for the domain of oral interpreting (Fantinuoli 2017).

The Nordic context

The Nordic and Circum-Baltic countries are home to different parallel corpora involving Nordic languages (e.g., the English-Swedish parallel corpus ESPC, the Finnish-Swedish parallel corpus KOTUS, the Lithuanian-Latvian corpus LiLa) and larger projects such as the open-source parallel corpus OPUS developed by Jörg Tiedemann (University of Helsinki), cf. (Tiedemann 2012, Skadiņš et al. 2014) for the latter. Notable centres focusing on corpus creation and corpus-based research, including parallel corpora, are Lund, Gothenburg, Oslo, Stockholm, Helsinki, Tampere, Tartu and other Nordic universities. A project on bilingual parallel corpora featuring various Circum-Baltic languages is run within the Russian National Corpus (cf. Sitchinava, Perkova 2019).

Research topics

We welcome contributions focusing on particular cases of linguistic, historical, anthropological, pedagogical and other applications of parallel corpora, as well as more general papers discussing methods of extracting, building, aligning and annotating parallel texts for the purposes of digital humanities.

Possible topics for talks may relate to (but are not restricted to) the following:

  • creation of new parallel corpora featuring Nordic and other languages;
  • studies of phenomena characteristic of the Circum-Baltic languages (Dahl, Koptjevskaja-Tamm (eds.) 2001) based on parallel corpora;
  • automatic and manual annotation of linguistic data in parallel corpora on different levels, including lemmatization and grammatical tagging;
  • using parallel corpora in literary and cultural studies and other humanities;
  • parallel corpora in translation studies: historical, cultural, social and other aspects of translation with regard to parallel texts;
  • multimodal corpora, including interpreting and sign language corpora, and studies related to them.

Contact

For all correspondence concerning the workshop, please contact the organizers at:
parallelcorporadhn2020@gmail.com

Organized by:
Natalia Perkova (Stockholm University / Uppsala University)
Dmitri Sitchinava (Institute of the Russian Language / Higher School of Economics)

Please feel free to share this call with all your colleagues who might be interested in the workshop!

References

  • Željko Agić, Anders Johannsen, Barbara Plank, Héctor Martínez Alonso, Natalie Schluter, and Anders Søgaard. 2016. Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics, 4:301–312.
  • Paul Baker, Andrew Hardie, Tony McEnery. A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press, 2006.
  • Christos Christodouloupoulos and Mark Steedman. A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation 2015; 49(2): 375–395.
  • Michael Cysouw, Bernhard Wälchli (eds.). Parallel Texts. Using Translational Equivalents in Linguistic Typology. Theme issue in Sprachtypologie & Universalienforschung STUF 60.2, 2007.
  • Östen Dahl, Maria Koptjevskaja-Tamm (eds.) Circum-Baltic languages. Typology and contact. Vol. 1-2, Amsterdam—Philadelphia: Benjamins, 2001.
  • Ivan Derzhanski, Olena Siruk. Linguistic Corpora as International Cultural Heritage: The Corpus of Bulgarian and Ukrainian Parallel Texts. In: Digital Presentation and Preservation of Cultural and Scientific Heritage, 2013, 3, 91–98.
  • Irene Doval, Santiago Fernández Lanza, Tomás Jiménez Juliá, Elsa Liste Lamas and Barbara Lübke. Corpus PaGeS: A multifunctional resource for language learning, translation and cross-linguistic research. In: Parallel Corpora for Contrastive and Translation Studies. Amsterdam: Benjamins. 2019, 103–121.
  • Claudio Fantinuoli. Computerlinguistik in der Dolmetschpraxis unter besonderer Berücksichtigung der Korpusanalyse. In: Silvia Hansen-Schirra, Stella Neumann & Oliver Čulo (Hrsg.), Annotation, exploitation and evaluation of parallel corpora. Berlin: Language Science Press, 2017, 111–146.
  • Voula Giouli, Nikos Glaros, Kiril Simov, Petya Osenova. A web-enabled and speech-enhanced parallel corpus of Greek - Bulgarian cultural texts. In: Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education –LaTeCH – SHELT&R 2009, 35–42
  • Ryan McDonald, Slav Petrov, Keith Hall. Multi-Source Transfer of Delexicalized Dependency Parsers. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, 62–72
  • Mikhail Mikhailov, Miia Santalahti, Julia Souma. PEST: A parallel electronic corpus of state treaties. In: Parallel Corpora for Contrastive and Translation Studies. Amsterdam: Benjamins. 2019, 183–195.
  • Sara Morrissey, Harold Somers, Robert Smith, Shane Gilchrist and Sandipan Dandapat. Building a Sign Language corpus for use in Machine Translation. In: 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, LREC 2010
  • Sebastian Ruder. 2019. Neural Transfer Learning for Natural Language Processing. PhD thesis. National University of Ireland, Galway.
  • Robert Östling. 2015. Word Order Typology through Multilingual Word Alignment. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Vol. 2 (Short Papers), 205-211.
  • Robert Östling and Jörg Tiedemann. 2017. Continuous multilinguality with language vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, April 2017. Association for Computational Linguistics, 644–649.
  • Dmitri Sitchinava, Natalia Perkova. 2019. Bilingual Parallel Corpora Featuring the Circum-Baltic Languages within the Russian National Corpus. In: Digital Humanities in the Nordic Countries. Proceedings of the Digital Humanities in the Nordic Countries 4th Conference Copenhagen, Denmark, March 5-8, 2019, 495–502.
  • Raivis Skadiņš, Jörg Tiedemann, Roberts Rozis, Daiga Deksne. 2014. Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus. In: Proceedings of the LREC 2014, 1850–1855.
  • Anders Søgaard, Ivan Vulić, Sebastian Ruder, and Manaal Faruqui. 2019. Cross-Lingual Word Embeddings. Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers.
  • Pēteris Šmits. 1925-1937. Latviešu tautas pasakas un teikas (15 volumes) / Lettische Märchen und Sagen [Latvian Fairytales and Legends].
  • Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In: Proceedings of the LREC 2012, Istanbul, 2214-2218
  • Jörg Tiedemann. 2015. Improving the Cross-Lingual Projection of Syntactic Dependencies. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), 191-199
  • Jörg Tiedemann. OPUS – Parallel Corpora for Everyone. 2016. In: Baltic Journal of Modern Computing (BJMC), Vol 4, No. 2, Special Issue: Proceedings of the 19th Annual Conference of the European Association of Machine Translation (EAMT): 384
  • David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the first international conference on Human language technology research, pages 1–8. Association for Computational Linguistics.
  • Federico Zanettin. Corpora in Translation. In: Juliane House (ed.) Translation: A Multidisciplinary Approach. London: Palgrave Macmillan, 2014, 178-199