Olga Lyashevskaya (National Research University Higher School of Economics, V. V. Vinogradov Russian Language Institute RAS, Moscow)

Belarusian-Russian and Lithuanian-Russian parallel treebanks: three practical tasks, three dozen dependency relations, and an indefinite number of language-specific constructions

We present two pilot parallel treebanks for related languages, East Slavic and Baltic, annotated according to the Universal Dependencies (UD) scheme (Nivre et al. 2017). The Belarusian-Russian and Lithuanian-Russian treebanks are lemmatized, annotated with full morphology and dependencies, and aligned at the sentence-to-sentence and word-to-word levels. All annotations are checked manually. The development of our resources draws on the experience of several parallel dependency projects, such as the Prague Czech-English Dependency Treebank (Hajič et al. 2012; Urešová et al. 2019), the LinES English-Swedish treebank (Ahrenberg 2007), ParTUT, a corpus for Italian, English and French (Bosco et al. 2012), and PROIEL, a parallel treebank of the old Indo-European Bible translations (Haug, Jøhndal 2008). Some of these treebanks have been converted to UD (Ahrenberg 2015; Eckhoff et al. 2018), and a multilingual Parallel UD collection was introduced in the CoNLL 2017 shared task (Zeman et al. 2017).

Alignment of texts written in closely related languages is often considered a trivial task. However, the non-matching parts of the trees reveal language-specific constructions (e.g. Be. больш як 15 працэнтаў урачоў - Ru. более 15 процентов врачей 'more than 15 per cent of doctors') and differing preferences between rival patterns and word orders in the two languages. On the one hand, the systematic analysis of such mismatches paves the way towards quantitative token-based typology (Levshina 2019, 2015; Haspelmath et al. 2014; Bjerva et al. 2019; Guzmán Naranjo, Becker 2018) and identifies dimensions for developing multilingual constructicons and FrameNets (Lyngfelt et al. 2018; Boas, Höder 2018; Gilardi, Baker 2018). On the other hand, construction alignment can help ensure the quality of corpus annotation (Ahrenberg 2019).
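The idea of non-matching parts of aligned trees can be illustrated with a minimal Python sketch. It is not the procedure used to build the treebanks: the `Token` structure, the `find_mismatches` function, the one-to-one alignment format, and the heads and relation labels assigned to the example phrase are all assumptions made for illustration and are not taken from the actual annotation.

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int     # 1-based position in the sentence
    form: str    # word form
    head: int    # position of the head token, 0 for the root
    deprel: str  # dependency relation label


def find_mismatches(src, tgt, alignment):
    """Report non-matching parts of two word-aligned dependency trees.

    `alignment` is a list of (source idx, target idx) pairs; the sketch
    assumes a one-to-one word alignment.
    """
    s2t = dict(alignment)
    s2t[0] = 0  # the artificial roots correspond to each other
    aligned_src = set(s2t)
    aligned_tgt = set(s2t.values())
    src_by_idx = {t.idx: t for t in src}
    tgt_by_idx = {t.idx: t for t in tgt}

    report = {
        "unaligned_src": [t.form for t in src if t.idx not in aligned_src],
        "unaligned_tgt": [t.form for t in tgt if t.idx not in aligned_tgt],
        "deprel": [],  # aligned words with different relation labels
        "head": [],    # aligned words whose heads are not aligned to each other
    }
    for s_idx, t_idx in alignment:
        s, t = src_by_idx[s_idx], tgt_by_idx[t_idx]
        if s.deprel != t.deprel:
            report["deprel"].append((s.form, s.deprel, t.form, t.deprel))
        if s2t.get(s.head) != t.head:
            report["head"].append((s.form, t.form))
    return report


# Hypothetical annotation of the example phrase (illustrative only).
be = [
    Token(1, "больш", 0, "root"),
    Token(2, "як", 4, "case"),
    Token(3, "15", 4, "nummod:gov"),
    Token(4, "працэнтаў", 1, "obl"),
    Token(5, "урачоў", 4, "nmod"),
]
ru = [
    Token(1, "более", 0, "root"),
    Token(2, "15", 3, "nummod:gov"),
    Token(3, "процентов", 1, "obl"),
    Token(4, "врачей", 3, "nmod"),
]
links = [(1, 1), (3, 2), (4, 3), (5, 4)]

report = find_mismatches(be, ru, links)
print(report)
```

Under these assumed annotations, the only non-matching part is the unaligned Belarusian word як, which points to the comparative construction больш як as a candidate language-specific pattern.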

Taking this practical perspective, we propose an approach based on parallel annotations to define language-specific UD schemas and to assist annotators who lack extensive expertise in a language that is related to, yet divergent from, their native one. Moreover, parallel tree alignment helps in evaluating the quality of texts (e.g. news, Wikipedia articles) selected for inclusion in the parallel collections.


References

Ahrenberg, Lars. 2007. LinES: An English-Swedish Parallel Treebank. In Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007).

Ahrenberg, Lars. 2015. Converting an English-Swedish parallel treebank to Universal Dependencies. In Third International Conference on Dependency Linguistics (DepLing 2015), Uppsala, Sweden, August 24-26, 2015, pp. 10-19.

Ahrenberg, Lars. 2019. Towards an adequate account of parataxis in Universal Dependencies. In Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019) (pp. 94-100).

Bjerva, Johannes, Robert Östling, Maria Han Veiga, Jörg Tiedemann, and Isabelle Augenstein. 2019. What do language representations really represent? Computational Linguistics, 45(2), pp. 381-389.

Boas, Hans C., and Steffen Höder. 2018. Construction Grammar and language contact. Constructions in contact: Constructional perspectives on contact phenomena in Germanic languages 24: 5.

Bosco, Cristina, Manuela Sanguinetti, and Leonardo Lesmo. 2012. The Parallel-TUT: a multilingual and multiformat treebank. In Eighth International Conference on Language Resources and Evaluation (LREC'12), pp. 1932-1938.

Eckhoff, Hanne, Kristin Bech, Gerlof Bouma, Kristine Eide, Dag Haug, Odd Einar Haugen, and Marius Jøhndal. 2018. The PROIEL treebank family: a standard for early attestations of Indo-European languages. Language Resources and Evaluation 52(1): 29-65.

Gilardi, Luca, and Colin Baker. 2018. Learning to align across languages: Toward multilingual framenet. In Proceedings of the International FrameNet Workshop, pp. 13-22.

Guzmán Naranjo, Matías, and Laura Becker. 2018. Quantitative word order typology with UD. Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018), Issue 155, 91–104. Oslo University, Norway, 13–14 December 2018.

Hajič, Jan, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2012. Announcing Prague Czech-English dependency treebank 2.0. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), pp. 3153–3160.

Haspelmath, Martin, Andreea Calude, Michael Spagnol, Heiko Narrog, and Elif Bamyacı. 2014. Coding causal-noncausal verb alternations: A form-frequency correspondence explanation. Journal of Linguistics 50(3). 587–625.

Haug, Dag T. T., and Marius Jøhndal. 2008. Creating a parallel treebank of the old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), pp. 27-34.

Levshina, Natalia. 2015. European analytic causatives as a comparative concept. Evidence from a parallel corpus of film subtitles. Folia Linguistica 49(2): 487–520.

Levshina, Natalia. 2019. Token-based typology and word order entropy: A study based on Universal Dependencies. Linguistic Typology 23(3): 533-572.

Lyngfelt, Benjamin, et al. (eds.). 2018. Constructicography: Constructicon development across languages. Amsterdam/Philadelphia: Benjamins.

Nivre, Joakim, Željko Agić, Lars Ahrenberg et al. 2017. Universal dependencies 2.0 – CoNLL 2017 shared task development and test data. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University. http://hdl.handle.net/11234/1-2184. See also http://universaldependencies.org/

Urešová, Zdeňka, Eva Fučíková, Eva Hajičová, and Jan Hajič. 2019. Parallel Dependency Treebank Annotated with Interlinked Verbal Synonym Classes and Roles. In Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), pp. 38-50.

Zeman, Daniel, Martin Popel, Milan Straka, Jan Hajič, Joakim Nivre, Filip Ginter, et al. 2017. CoNLL 2017 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. ACL, Vancouver, Canada, pp. 1–19.