The University of Texas at Austin; College of Liberal Arts
Hans C. Boas, Director :: PCL 5.556, 1 University Station S5490 :: Austin, TX 78712 :: 512-471-4566

Early Indo-European Languages

Publishing Multi-lingual Texts Online

Jonathan Slocum

Research Scientist
Linguistics Research Center

Abstract

The Linguistics Research Center (LRC) at the University of Texas is publishing online, in web page format, a series of lessons intended to cover at least one language in each of the major Indo-European families. To date, ca. 650 web pages have been published, with complete sets of lessons in 16 languages representing the Anatolian, Armenian, Baltic, Celtic, Germanic, Hellenic, Indic, Iranian, Italic, and Slavic families; other lessons, now well underway, extend coverage to the Tocharian family, and future efforts may address Albanian in the Balkan group.

Major considerations in this project have included means to ease text transcription while, at the same time, providing web page output in multiple character sets (e.g. ISO-8859-1 and Unicode) and possibly in multiple formats (e.g. HTML and LaTeX for PDF). As well, attention is paid to metadata and the potential for post-production, large- or small-scale extensions and revisions to lesson content and format. These factors argue for generating published resources via software, rather than hand-crafting them. This paper discusses our designs and methodologies related to resource authoring and automated processing. Our implementations have proven highly flexible and capable, generating lesson pages with metadata and enabling the addition of major, unforeseen new components months or even years after series completion by lesson authors.

Introduction

It has been noted (Anderson, 2000) that students of Indo-European (IE) languages require access to ancient texts and other publications that are becoming increasingly rare in libraries -- if they are accessible at all -- due to budgetary and other constraints. These factors, coupled with the extended intervals observed in traditional publication of IE resources (Slocum and Simms, 2002), argue that the World Wide Web can and should play an important role in IE linguistics. Indeed, more and more academic journals are becoming available online. Why not extend web techniques to the publication of pedagogical and data resources? It has become even more obvious, since Anderson's "epilogue" some years ago, that this question is purely rhetorical.

With support from the Salus Mundi Foundation and other sponsors, the Linguistics Research Center (LRC) at the University of Texas has been working since 2001 to develop a series of web page lessons for each of a number of early Indo-European languages. The basic rationale for the series, and the content per language family, are detailed elsewhere (Slocum, 2005; Lehmann and Slocum, 2002-2007). Our aims here are to introduce the reader to these online resources and, most especially, to a high-level view of our rationale and designs for software techniques used to create, maintain, and extend them.

Our resources are not made available as, say, Microsoft Word® documents, but in the more widely accessible form of HTML web pages for Internet browsers running on Windows, Macintosh, Linux, or other computing platforms. Lesson authors are spared the trouble of publishing web pages, however, by adopting and adapting our simple but formal conventions for lesson creation. One great advantage of divorcing authors from publication is that each lesson can be generated as several different web pages, each assuming different character set capability; if this were attempted by making available, say, Word documents, each resource would have to be re-created, by hand, by the lesson author -- and, of course, the lessons could not be viewed at all on computing platforms lacking either Word or the font(s) used by the author. These days every computer comes with a web browser, and indeed there are several alternatives from which one may select.

Though it may seem strange to some, it is nonetheless true that not all computers have Adobe Acrobat® installed; furthermore, some tools (e.g. reading machines for the blind) take advantage of metadata markup including language identification and table row/column header tags that may not be available in Acrobat readers. Thus, while we do not rule out generation of resources in PDF format (we have begun to experiment with this), we do not at present make such resources public: we prefer formats that are more accessible to our website visitors, who in recent years have represented almost 85% of the nations on earth (and who, no doubt, are in many cases using truly antiquated hardware and software).

Following some background information, including characteristics of resources in each Early Indo-European Online (EIEOL) series and audience responses to them, which together constitute our "credentials," we describe the basics of web page generation, the "raw" materials authored by lesson developers, computer conversion of these raw materials into language resources (lessons, etc.), and finally the advantages offered by our techniques.

Characteristics of Each Series

A series contains lessons and other resources for a language, or a pair of related languages, belonging to a family (e.g. Old Norse in the Germanic family, or Avestan and Old Persian in the Iranian family). For the most part, a series will comprise ten lessons plus other resources, but sometimes the lessons number fewer than ten if authors so elect. A series will begin with an introduction (e.g. Lehmann and Slocum, 2002) to the language(s) and family, the speakers, their location and historical time frame, important events, etc. Following the introduction -- speaking, of course, as if a reader followed the series from "first" page to "last" page -- come the lessons (e.g. Krause and Slocum, 2003), then any appendices (which are optional; e.g. Thomson and Slocum, 2006). Ancillary resources include a table of contents (e.g. Harvey and Slocum, 2004) and various series glossaries; all of these are created by software. An optional annotated bibliography may be supplied; it can appear in the last few "grammar points" (discussed below), or within the series introduction (above), or as a separate page -- again, depending on the desires of the lesson author(s).

Each lesson in a series follows a prescribed format. First there is an introduction to the text author if known, possibly to the original audience, and to the text being glossed; this "sets the stage" to aid text understanding. Next is the text, with word-by-word glosses; the text may be a complete work, or an extract if the work is too long. (Glossed texts are known to be highly effective tools for language teaching.) This is followed by the text, sans glosses, with appropriate formatting (e.g. for poetry) and possibly line/verse numbers. Then comes a similarly formatted/numbered translation into English. Finally, five grammar points per lesson discuss details of the writing system, sound system, morphology, syntax, etc. The grammar points in a series are organized for progressive tutorial exposure to the language. (Incremental exposition of grammar points is also known to be effective in language teaching.) In any given lesson series, the final grammar points may be used to provide an annotated bibliography for further reading; this is optional, as determined by the lesson author(s).

The table of contents is generated fully automatically -- i.e. not by a lesson author -- by software that identifies grammar point section and subsection headings during lesson processing; a minimal seed, provided once by hand, simply identifies the text glossed in each lesson. Various language glossaries are also generated fully automatically -- this time, with no "seeds." Currently there are three of these, all compiled from the glosses of words in texts: a Master Glossary (e.g. Slocum and Beaulieu, 2003) showing features of the surface forms; a Base-Form Dictionary (e.g. Slocum and Kimball, 2005) showing the gloss components that one might find in an ordinary printed dictionary (e.g. "principal parts" and meaning); and an English Meaning Index (e.g. Slocum and Thomson, 2006) providing a "bi-lingual dictionary" with English words as the source. In all cases, entries in these resources are hot-linked to the lesson area(s) that address them -- e.g. to a lesson page & line, to a grammar (sub)section, possibly to specific examples in grammar points, or to all instances of a given gloss in the lesson texts.
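To make the idea of compiling a glossary from text glosses concrete, the following is a minimal Python sketch of an English Meaning Index. It is not the LRC's Perl software: the record layout (surface form, base form, meaning, lesson, line), the sample Latin data, and the names GLOSSES and english_meaning_index are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical gloss records: (surface form, base form, English meaning, lesson, line).
# The actual layout of the gloss data files is not described in this paper.
GLOSSES = [
    ("arma", "arma", "arms", 1, 1),
    ("virumque", "vir", "man", 1, 1),
    ("cano", "cano", "sing", 1, 1),
    ("viri", "vir", "man", 2, 4),
]

def english_meaning_index(glosses):
    """Map each English meaning to the base forms glossed with it,
    and each base form to the (lesson, line) locations where it occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for _surface, base, meaning, lesson, line in glosses:
        index[meaning][base].append((lesson, line))
    # Sort meanings, base forms, and locations for stable presentation.
    return {
        meaning: {base: sorted(locs) for base, locs in sorted(bases.items())}
        for meaning, bases in sorted(index.items())
    }

INDEX = english_meaning_index(GLOSSES)
```

Each (lesson, line) pair stored here is the kind of information from which a hot-link back to the lesson text could be built.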

Audience Response

We began publishing in 2002; at the present time (2013) we have lessons in the following language families:

It should be noted that our lessons are not oriented to "popular" or grade-school audiences: we use minimal graphics (none for didactic purposes) and no audio/video materials; we avoid "flash." Our aims are educational: we do cater to language students rather than scholars, and to those working in related disciplines such as archaeology, for whom basic knowledge of ancient languages would be of use, e.g. in tracing the spread of technology. As what follows will evidence, the popularity of our resources is significant and increasing.

During the year 2005, page requests ("hits") received by the LRC website as a whole originated in 172 of 240 countries identified by ISO-3166 2-letter codes; in 2006 that number rose to 186; in 2007, to 195; since then, the tally has exceeded 200. This was determined by analysis of the IP addresses making page requests; fewer than 1% of those addresses were unavailable or could not be classified. We have witnessed a history of continuing increases in LRC page requests since statistical tracking began for the week ending September 20, 2003; that week, there were 1,166 requests for EIEOL pages, with lessons available only for Latin, Classical and N.T. Greek, and Old Church Slavonic. In December 2004, with additional lessons covering Armenian, Avestan, and Old Persian, EIEOL hits first exceeded 2,000 within a single week. In May 2005, with the addition of lessons covering Old Norse and Lithuanian, EIEOL hits first exceeded 5,000 within a single week. In January 2006, with the addition of lessons covering Hittite, Gothic, and Rigvedic Sanskrit, EIEOL hits first exceeded 10,000 within a single week; in February, 15,000. These days we routinely exceed such numbers.

We have received unsolicited e-mail feedback from a number of our readers, spanning the gamut from thanks and praise through serious comments and questions about content; included, also, are sometimes hilarious diatribes about one failing or another. Some examples follow.

Thanks and Praise

(Happily, one of Santa's elves very soon granted R.A.W's request, thus answering C.B's question.)

Comments and Questions

(We welcome error reports, and make corrections when called for. We often answer short questions, like that from E.L.S. We do not offer translation services, such as requested by M.H. and others.)

Diatribes

(Three examples should suffice. Needless to say, we don't always reply to these -- though later we may be taken to task for our silence: "I am getting slightly tired trying to communicate with you..." Of course, by not responding we risk missing out when someone tries to share a discovery "equivalent to the Rosetta stone to the ancient Egypt" [sic]!)

Web Page Generation

We describe in this section the basics of our EIEOL web page generation technology. We do not go into excruciating detail, but we do touch upon salient issues of our implementation. We generate web pages by concatenating, formatting, and "filling the blanks" in boilerplate files and intermediate data files. After outlining boilerplate content, we discuss issues of accessibility (e.g. by blind readers) and character set selection. We do not say much here about intermediate data files, as these are but mildly processed and indexed versions of the source data created by our authors; see the Guessing subsection, later, for further details.

Boilerplate

Our boilerplate files come in three varieties: header, metadata, and footer. Details depend on the page implementation (e.g. HTML, vs. LaTeX before it is used to generate, say, PDF); here we describe only HTML boilerplate.

A header announces the document type (i.e. HTML) and provides certain basic metadata (e.g. document title, stylesheet link, character set, and author), page framing (e.g. layout table with navigation bar [on the left] and content area), and ubiquitous content (e.g. background image, standard text like "Linguistics Research Center," and a link to the LRC's home page).

The term "DublinCore metadata" refers to 'DC.' meta tags for title, subject (keywords, LCC and/or DDC, LCSH), description, language (English 'en' and the lesson/text language), coverage (time frame), creator (primary author), (secondary) contributor, publisher, rights (copyright), date, type (text), format (text/html), and identifier (URL).
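As an illustration of what such 'DC.' meta tags look like when emitted by software, here is a minimal Python sketch. It is not the LRC's implementation: the function name dublin_core_tags and the sample field values are assumptions, and only a few of the fields listed above are shown.

```python
from html import escape

def dublin_core_tags(fields):
    """Render (name, content) pairs as 'DC.' meta tags, one tag per line."""
    return "\n".join(
        '<meta name="DC.%s" content="%s">' % (name, escape(content, quote=True))
        for name, content in fields
    )

# Hypothetical values for one lesson page. A field such as language may
# repeat: once for English and once for the lesson language.
FIELDS = [
    ("title", "Old Norse Online: Lesson 1"),
    ("language", "en"),
    ("language", "non"),  # "non" is the ISO 639-2 code for Old Norse
    ("type", "text"),
    ("format", "text/html"),
]

TAGS = dublin_core_tags(FIELDS)
```

Escaping the content attribute matters because titles and descriptions may contain quotation marks or ampersands.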

A footer includes a date/time stamp, contact e-mail address, and URLs for the home pages of the LRC, the College of Liberal Arts, and the University of Texas, all encapsulated within any HTML tags required to complete the page layout and terminate the document.

Accessibility

In the U.S., at least, one use of this term has come to mean something like "[a website] being legible to the handicapped" -- notably, those who are visually impaired. Accessibility is not a trivial issue, and even if it were not mandated by law (as in fact it is, for Texas state agencies), we would wish to make such access possible. If we were creating pages by hand, accessibility might impose a substantial burden. But in our project, conformance and tag entry are automated rather than handled manually, hence reliable and complete. Our boilerplate files provide the basic framework; and for each non-English string, tags declare the language and make style (e.g. font/height) recommendations. Such recommendations may of course be altered, or ignored, by the user's browser. Among many additional considerations, we never use graphics alone to display important information (e.g. text): we act as if a reader is browsing our site with the computer monitor turned off. One implication of this is that invisible link tags point to the previous, next, and home pages in each lesson series; these may be followed without having to select (e.g. click on) specific hot-links.
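The two mechanisms mentioned above, language declarations on non-English strings and invisible navigation link tags, can be sketched in Python as follows. This is an illustrative assumption, not the LRC's code: the function names, the choice of a span element, the file names, and the use of the HTML 4 link types prev, next, and start are ours; the paper does not state which elements or rel values the generated pages use.

```python
from html import escape

def tag_language(text, lang):
    """Wrap a non-English string in an element that declares its language."""
    return '<span lang="%s">%s</span>' % (lang, escape(text))

def navigation_links(prev_url, next_url, home_url):
    """Build invisible <link> tags for the document head.
    A URL of None (e.g. no previous page for lesson 1) is skipped."""
    rels = [("prev", prev_url), ("next", next_url), ("start", home_url)]
    return "\n".join(
        '<link rel="%s" href="%s">' % (rel, escape(url, quote=True))
        for rel, url in rels
        if url
    )

# Hypothetical file names for the first lesson of a series.
LINKS = navigation_links(None, "lesson2.html", "index.html")
WORD = tag_language("ma\u00f0r", "non")  # Old Norse "man", tagged as Old Norse
```

Because these tags are produced in one place by software, every page in every series carries them consistently.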

Character Sets

Our readership spans the globe, including not just developed countries but so-called "third world" countries. We cannot assume that everyone has access to modern computer software! Indeed, it is not even correct to assume that everyone in the "developed countries" has modern software: we know of scholars in U.S. academia who still use antiquated software, though it might be argued that these particular individuals are not part of our target audience. Yet we do not believe that facilities can be much better, on average, in very many places. Clearly, in a global environment, alternatives such as PDF files offer no real solutions.

For reasons such as these, we decided at project onset to publish our pages in HTML, and in multiple character sets. We selected "Unicode 3" for "the best" presentation -- when possible, in native script (e.g. polytonic Greek, Cyrillic, Armenian). We felt that, owing to the relative paucity of large Unicode fonts, a "Unicode 2" option with much-reduced font demands was also reasonable. As a third, lowest-level option, we selected ISO-8859-1. Lesson readers can freely choose from among these display options for any page, except for a few pages (such as our EIEOL home page, or an annotated bibliography) that are fully presentable in ISO-8859-1.

We noted, fairly consistently once record-keeping began in September 2003, that about 27% of our readers selected ISO-8859-1 "Latin-1" pages instead of Unicode -- and this, even though we offer the simplified "Unicode 2" option (accounting for 20% of hits). We have seen slightly increased preference for "Unicode 3" pages over the years; but in 2008, for example, 20% still chose ISO-8859-1 and another 16% chose "Unicode 2." We believe there are good reasons for users to make these "lesser" choices in lieu of "the best" presentation (accounting for 64% of page hits in 2008), and we therefore have high confidence that our decision to offer EIEOL [and other] pages in three different character sets is justified. With software generation of pages in multiple character sets taking a second or two on a 5-year-old PC, and incremental web page storage being all but free, counter-arguments melt away. We might consider, in the future, offering a "Unicode 4" option, once more fonts and supporting software are generally available covering scripts (e.g. Gothic) recently added to Unicode, but we judge that the benefit to our readers would be small.

Authored Materials

In this section we discuss the nature of the materials (plain-text "data files") necessarily created by hand by series authors/editors. Some materials -- global resources -- exist in single copies. Others are created for each series; still others are created for each lesson in a series. Tables of contents and series glossaries are not directly authored: their creation is automated.

Global Resources

Global resources are maintained by the EIEOL series editor. There is a single stylesheet for EIEOL pages; this file, via editing and uploading, enables global changes in web site format with trivial effort and no modification of any other page.

The boilerplate files described earlier are also global resources; they make use of macros to accomplish uniform goals (e.g. insertion of a date stamp) and variable goals (e.g. insertion of a link to a page that varies per language and/or character set). At the price of regenerating every web page, large-scale changes can be effected throughout the EIEOL site by editing boilerplate (not lesson files). Invocation of the necessary page generators is also highly automated; the entire process requires less than a minute per lesson series to execute. Uploading files to the web server requires manual attention for another minute.

Resources per Series

Each series has an introduction, an optional annotated bibliography (generally supplied, though in a location that varies), and optional appendices; these are all created as plain text files by the series author(s). A metadata template, which generally makes heavy use of macros, is created by the series editor. The author supplies a list of feature abbreviations for glossing. The author(s) and editor, working together, define a single Beta Code for ASCII (think "English keyboard") transcription of lesson text, with output in multiple (currently 3) character sets, and possibly a similar Gamma Code for transcribing a tertiary language; we stress that authors create lesson materials, once, which are then replicated in each character set by software, and any later revisions are performed on the original file. Author and editor also define Alpha sort keys (e.g. for series glossaries) for those same output character sets; that is, our software allows different sort orders for glossary entries in each character set.
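The principle of a single ASCII transcription rendered into several character sets can be sketched in Python as follows. Everything specific here is hypothetical: the tiny Greek code table, the sequences "o/" and "s#", the two output names "unicode" and "latin1" (the project generates three outputs), and the function name transcribe. The actual Beta Code definitions are set per series by author and editor and are not given in this paper.

```python
# Hypothetical Beta Code table: each ASCII sequence maps to one output
# string per character set.
BETA_TABLE = {
    "l":  {"unicode": "\u03bb", "latin1": "l"},       # lambda
    "o":  {"unicode": "\u03bf", "latin1": "o"},       # omicron
    "o/": {"unicode": "\u03cc", "latin1": "\u00f3"},  # omicron with acute accent
    "g":  {"unicode": "\u03b3", "latin1": "g"},       # gamma
    "s":  {"unicode": "\u03c3", "latin1": "s"},       # sigma
    "s#": {"unicode": "\u03c2", "latin1": "s"},       # word-final sigma
}

def transcribe(beta, charset):
    """Convert an ASCII Beta Code string into the requested character set,
    always preferring the longest matching code sequence."""
    keys = sorted(BETA_TABLE, key=len, reverse=True)
    out = []
    i = 0
    while i < len(beta):
        for key in keys:
            if beta.startswith(key, i):
                out.append(BETA_TABLE[key][charset])
                i += len(key)
                break
        else:
            out.append(beta[i])  # pass unknown characters through unchanged
            i += 1
    return "".join(out)

NATIVE = transcribe("lo/gos#", "unicode")
ROMAN = transcribe("lo/gos#", "latin1")
```

The same source string yields native script for Unicode pages and a romanization for ISO-8859-1 pages, so a revision to the one authored file propagates to every output.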

Resources per Lesson

Each lesson has its own introduction, a glossed text file (from which the "full" text is automatically assembled), and a file with English translation plus grammar points. These resources are created as plain text files, three per lesson following fixed formats, by the series author(s); the series editor may revise them.

File Processing

The authored materials described above pass through a series of automated steps that perform error checking and intermediate-file/final-page generation functions. We describe the steps here. These tasks are invoked largely by the series editor, who wrote the Perl software, but also by the resident author of several language series. They could in principle be performed by any author with freely-available Perl software and willingness to invoke batch files on a command line (e.g. by typing commands like "pass0", "guessBeg 1", or "pass1 1").

Error Checking

Humans make mistakes; yet we do not wish mistakes to appear in our lessons and, of course, lesson generation software can become confused by human mistakes and make matters worse. Therefore, we have written software to find and point out real or likely mistakes for inspection and possibly correction by the author/editor. We check syntax (we refer here to the formal rules of lesson composition, not to the structure of natural languages like Old Norse or English) and, to some extent, validate semantics (e.g. in gloss abbreviations). Some types of errors, e.g. the use of non-ASCII characters such as é, can be repaired automatically; others must be handled manually.
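One such check, locating non-ASCII characters and repairing those with a known replacement, might look like the following Python sketch. It is an assumption-laden illustration rather than the LRC's Perl checker: the function names, the REPAIRS table, and in particular the replacement sequence "e/" are hypothetical, since the paper does not say what the automatic repair substitutes.

```python
def find_non_ascii(text):
    """Return (line number, column, character) for each non-ASCII character,
    so that the author or editor can inspect it."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                problems.append((lineno, col, ch))
    return problems

# Hypothetical repair table: a non-ASCII character and the ASCII sequence
# assumed (for illustration only) to stand for it.
REPAIRS = {"\u00e9": "e/"}

def repair(text):
    """Replace every character that has a known ASCII equivalent."""
    for ch, code in REPAIRS.items():
        text = text.replace(ch, code)
    return text

SAMPLE = "caf\u00e9\nplain line"
FOUND = find_non_ascii(SAMPLE)
FIXED = repair(SAMPLE)
```

Characters absent from the repair table would still be reported by find_non_ascii after repair, leaving them for manual handling.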

Guessing

After any errors have been resolved, one invokes software that, given a plain text file with minimal formatting hints, "guesses" the necessary formatting commands. For example, text bounded by empty lines above and below constitutes a "block." Any one-line block is treated as a section heading; in the context of grammar points, headings are often numbered -- and, when they are, the "guesser" also enters them into a table for later use by the "table of contents generator." Otherwise, the amount of indenting of the line [measured by tabs, not spaces] determines the "level" of the heading.

A multi-line block is treated as a table if internal "column separator" tabs appear; or as a list of bullet points if only line-initial tabs appear; or else a paragraph if no tabs appear. Within a bullet-point list, [line-initial tabs are deleted and] line-ends separate bullet points; within a table, tabs separate columns and line-ends separate rows. In short, text appearance in a plain-text editor, as on a typewriter, is a reasonably close analog of final text presentation.
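The block rules just described can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the LRC's "guesser": the function names and return shapes are ours, and the special handling of numbered headings (their entry into a table for the table of contents generator) is omitted.

```python
def split_blocks(text):
    """Group consecutive non-empty lines into blocks; empty lines separate blocks."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

def classify(block):
    """Guess the formatting of one block from its line count and its tabs."""
    if len(block) == 1:
        line = block[0]
        level = len(line) - len(line.lstrip("\t"))  # indent measured in tabs
        return ("heading", level)
    if any("\t" in line.lstrip("\t") for line in block):
        # Internal tabs separate columns; line-ends separate rows.
        return ("table", [line.strip("\t").split("\t") for line in block])
    if any(line.startswith("\t") for line in block):
        # Only line-initial tabs: delete them; each line is one bullet point.
        return ("list", [line.lstrip("\t") for line in block])
    return ("paragraph", " ".join(block))

SAMPLE = "Nouns\n\nFirst line\nsecond line\n\n\tone\n\ttwo\n\nnom\trex\ngen\tregis\n"
KINDS = [classify(block)[0] for block in split_blocks(SAMPLE)]
```

The order of the tests matters: a block is checked for internal tabs before line-initial tabs, so an indented table is still recognized as a table.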

Plain-text editors do not support bold face and italics -- not, that is, within text segments -- so one uses single pairs of curly braces "{" and "}" to surround text that is to be italicized. One uses double pairs "{{" and "}}" to signal bold face, and triple pairs "{{{" and "}}}" to signal bold italics.
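A minimal Python sketch of this brace convention follows. The choice of the HTML elements i and b is an assumption (the generated pages might well use stylesheet classes instead), as is the function name braces_to_html; nested or unbalanced braces are not handled.

```python
import re

def braces_to_html(text):
    """Convert {italic}, {{bold}}, and {{{bold italic}}} markers to HTML.
    Triple braces are handled first so that they are not mistaken for
    a double or single pair."""
    text = re.sub(r"\{\{\{([^{}]+)\}\}\}", r"<b><i>\1</i></b>", text)
    text = re.sub(r"\{\{([^{}]+)\}\}", r"<b>\1</b>", text)
    text = re.sub(r"\{([^{}]+)\}", r"<i>\1</i>", text)
    return text

RESULT = braces_to_html("The form {rex} is {{nominative}}, not {{{genitive}}}.")
```

Processing the longest delimiter first is the one design point that matters: reversing the order would turn a triple pair into nested italics with stray braces.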

There are a very few other conventions, such as for signaling poetic line-breaks, or [in limited contexts] line or verse numbers, or the presence in otherwise English text of Beta/Gamma Code. In the final output, of course, all formatting signals disappear, leaving only their effects.

Page Generation

After "guessing" produces intermediate files, lesson and resource pages can be generated. Page generation steps perform: boilerplate insertion (header, metadata, footer); macro expansion (e.g. for series title, date/time stamps); and Beta/Gamma Code interpretation (to display text using the relevant script). These tasks produce series introductions, lesson pages (text introduction, glosses, translation, and grammar points), and resources such as the Table of Contents or Master Glossary.
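Boilerplate insertion and macro expansion can be illustrated with the following Python sketch. It is not the LRC's page generator: the %%NAME%% macro syntax, the macro names, the boilerplate strings, and the function names are all hypothetical, chosen only to show the technique of filling blanks in header, metadata, and footer files.

```python
import re

# Hypothetical boilerplate; the real files are far more extensive.
HEADER = "<html>\n<head>\n<title>%%SERIES_TITLE%%</title>\n%%METADATA%%\n</head>\n<body>"
METADATA = '<meta name="DC.date" content="%%DATE%%">'
FOOTER = "<p>Page generated %%DATE%%</p>\n</body>\n</html>"

def expand_macros(template, values):
    """Replace each %%NAME%% macro with its value; an undefined macro is an error."""
    def lookup(match):
        name = match.group(1)
        if name not in values:
            raise KeyError("undefined macro: " + name)
        return values[name]
    return re.sub(r"%%([A-Z_]+)%%", lookup, template)

def generate_page(body, values):
    """Concatenate header, body, and footer, expanding macros in each part."""
    values = dict(values, METADATA=expand_macros(METADATA, values))
    return "\n".join(expand_macros(part, values) for part in (HEADER, body, FOOTER))

PAGE = generate_page(
    "<h1>%%SERIES_TITLE%%</h1>",
    {"SERIES_TITLE": "Old Norse Online", "DATE": "2013-01-01"},
)
```

Because every page is produced by the same function, editing one boilerplate string and regenerating changes the whole site, which is the property the later sections rely on.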

Advantages of our Techniques

Adherence to accessibility standards is simplified and assured by software generation of output (e.g. HTML). Multiple definitions of Beta/Gamma Code enable the generation of output in alternate character sets for a world-wide audience at insignificant additional cost; automatically generated links on each page enable switching with a single mouse-click or the accessible equivalent. Automated generation of output also:

The use of macros simplifies global layout/content changes. A lesson writer need not know HTML (or LaTeX or whatever...): only simple plain-text editing is required. As a consequence of the latter, with content authored in ASCII, lesson composition is independent of platform: Windows, Macintosh, and Linux have all been used. An online "guesser" web-form allows authors (if they wish) to preview the appearance of page content, e.g. to check table layout or verify script output for Beta Code sequences.

Examples

Let us consider but a few examples highlighting the advantages we claim. Months after our Latin and Greek lessons were completed and posted online, the original glosses were expanded by a student to include principal parts, etc. Neither the original author nor the student knew anything about HTML, nor did they encounter it at any time. Enhanced sets of HTML lesson pages were generated from the edited plain-text files, and posted online.

Almost a year after those updates, our Master Glossary feature was introduced: with no input from lesson authors, master glossaries for Latin and Greek were generated from existing gloss "data files" and posted online; in addition, a new boilerplate macro was added to provide a Master Glossary link on each lesson page, in the same character set as that lesson page, and all-new lesson pages were generated and posted. Around the same time, our Old Church Slavonic lessons were being published -- with links to their automated Master Glossary resources, which resources were then used by the lesson author to revise his glosses for enhanced consistency.

Another year later, following publication of our Armenian and Old Iranian lessons, our Base-Form Dictionary feature was introduced: again with no action by lesson authors, some of whom were no longer available, base-form dictionaries were created from unchanged gloss "data files" and posted online; another boilerplate macro was added to provide an appropriate Base-Form Dictionary link on each lesson page, and all-new lesson pages were generated for every series and posted online.

About four months after that, the English Meaning Index and Table of Contents features were added to all lessons (Latin, Greek, ..., Old Iranian). No lesson authors were involved in the additions of these new features, nor was hand-editing of HTML performed by anyone for any purpose. We omit an exhaustive litany of other advantages of our techniques.

Conclusions

New, "authorless" resources have greatly enhanced the value of our lessons to our readership: more than 20% of all EIEOL page requests are for these resources. It is possible that yet more features, some foreseen (e.g. generation of LaTeX/PDF) and others not, may further enhance our lessons in the future. With over 670 EIEOL lesson and resource pages now published or online in draft form, to say nothing of yet more pages and languages to come, it is clear that, without automation of such "linguistic" tasks as we describe here, extensions would simply not be possible with the human resources we have available; indeed, we would have nothing approaching the present variety of resources to offer.

Indo-European linguistics stands to benefit greatly from resources such as we offer on the World Wide Web. IE texts and teaching materials are becoming freely accessible; some materials, such as our Tocharian series (in progress), have never before been available in any form, to say nothing of English. Server statistics demonstrate ample and growing access to our EIEOL and other pages, and automation of "linguistic" resource creation makes it all possible.

References