Publishing Multi-lingual Texts Online

Jonathan Slocum

N.B. The abstract below introduces a paper published on the LRC website as Early Indo-European Languages: Publishing Multi-lingual Texts Online.

April 3, 2007

The Linguistics Research Center (LRC) at the University of Texas is publishing online, in web page format, a series of lessons intended to cover at least one language in each of the major Indo-European families. To date, ca. 650 web pages have been published, with complete sets of lessons in 16 languages representing the Anatolian, Armenian, Baltic, Celtic, Germanic, Hellenic, Indic, Iranian, Italic, and Slavic families; other lessons, now well underway, extend coverage to the Tocharian family, and future efforts may address Albanian in the Balkan group.

Major considerations in this project have included easing text transcription while providing web page output in multiple character sets (e.g. ISO-8859-1 and Unicode) and possibly in multiple formats (e.g. HTML and LaTeX for PDF). Attention is also paid to metadata and to the potential for post-production extensions and revisions, large or small, to lesson content and format. These factors argue for generating published resources via software rather than hand-crafting them. This paper discusses our designs and methodologies related to resource authoring and automated processing. Our implementations have proven highly flexible and capable, generating lesson pages with metadata and enabling the addition of major, unforeseen new components months or even years after lesson authors completed a series.
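
As an illustration of why software generation suits these goals, the following minimal Python sketch renders one transcribed source text as an HTML page in either of two character sets. It is not the LRC's implementation: the transcription codes in TRANSCRIPTION and the function names decode_transcription and render_page are hypothetical, chosen only to show the idea of a single authored source feeding several outputs.

```python
import html

# Hypothetical ASCII transcription codes mapped to Unicode characters.
# These codes are illustrative assumptions, not the LRC's actual scheme.
TRANSCRIPTION = {
    "a_": "\u0101",  # a with macron
    "s^": "\u0161",  # s with caron
    "t~": "\u00fe",  # thorn
}


def decode_transcription(source: str) -> str:
    """Replace each two-character code with its Unicode character."""
    out = []
    i = 0
    while i < len(source):
        pair = source[i:i + 2]
        if pair in TRANSCRIPTION:
            out.append(TRANSCRIPTION[pair])
            i += 2
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


def render_page(title: str, source: str, charset: str) -> bytes:
    """Build a small HTML page and encode it in the requested charset.

    Characters the charset cannot represent are emitted as numeric
    character references, so the same source serves both outputs.
    """
    text = decode_transcription(source)
    page = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
        f"<title>{html.escape(title)}</title></head>"
        f"<body><p>{html.escape(text)}</p></body></html>"
    )
    return page.encode(charset, errors="xmlcharrefreplace")


if __name__ == "__main__":
    for charset in ("iso-8859-1", "utf-8"):
        print(charset, render_page("Lesson 1", "a_ t~", charset))
```

Because the page is produced from the source each time, a later change to the template or the code table would apply to every lesson on the next run, which is the kind of post-production revision described above.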