Oliver Čulo (International Computer Science Institute) "From word lists to grammar to semantics – a deeper parallelism for parallel corpora"
Wed, September 5, 2012 • 10:00 AM - 11:00 AM • Burdine Hall 214
It is largely thanks to the work of Baker (1993) and Toury (1995) that corpus linguistics has made its way into translation studies. In the early days, word lists and concordancers were the basic tools for performing studies on parallel texts. However, the latest advances in natural language processing for languages other than English and in automatic alignment have facilitated studies based on grammatical notions. These studies can help us detect and quantify translation shifts and study how factors like text type or translation direction affect shift types and frequencies (cf. e.g. Čulo et al. 2011; Hansen-Schirra, Neumann, and Steiner in press). Knowledge of these shifts can be used in the classroom for translator education, but also to improve tasks like machine translation, e.g. by automatic extraction of bilingual valency pattern alignments or dictionaries (cf. e.g. Addaki et al. 2012; Čulo 2011).
Studies like those cited are still too rare to paint a complete picture, even though the benefit of a linguistic and translational view on what we can learn from parallel corpora also extends to domains like machine translation and foreign language teaching. One part of the problem is the lack of accessibility of such annotations. Analyses of parallel texts used for machine translation are inaccessible to linguists and translation scholars unless they have excellent computer skills. There are annotations of parallel corpora accessible through annotation tools, but they often consist of separate annotations linked together by word or sentence alignment. There are only a few tools which allow the display and/or correction of parallel grammatical annotations, such as the Stockholm Tree Aligner (Samuelsson and Volk 2007). Processing pipelines that produce data for such tools, however, are rare as well. One exception is GATE (Bontcheva et al. 2003), which allows processing parallel texts and correcting the alignment, but only for stretches of text, not for structures.
What translation studies lacks even more are parallel corpora with semantic annotation. One such corpus is a subset of the Europarl corpus comprising 1,000 English and 1,000 German sentences annotated with syntactic structures and frame semantics (Padó and Lapata 2009), from which the observations on semantic “parallelism” presented in this talk are mainly drawn.
The talk will present studies on translated texts and how they diverge from each other grammatically. It will also discuss how this divergence interacts with semantics in both directions: sometimes typological restrictions make it harder to express certain semantic content, and sometimes differing conventions in how we say things lead to semantic divergences (cf. e.g. House 1997). In addition to these linguistic and translational insights, the talk will present a project which has recently set up a UIMA-based pipeline for parallel texts. Using the TrEd treebank tool, the pipeline makes it easy to edit the structural annotation and alignment of parallel texts with both phrase structure and semantic annotation.