The University of Texas at Austin; College of Liberal Arts
Hans C. Boas, Director :: PCL 5.556, 1 University Station S5490 :: Austin, TX 78712 :: 512-471-4566

LRC Blog

October 8, 2010

Jonathan Slocum

Early Indo-European Online and Braille Displays

N.B. The email exchange below may have been edited, e.g. to remove content not essential to the main point(s) or to standardize English spelling/grammar.

Query

I am extremely grateful to have access to this site. I studied linguistics intermittently in the 1970s, and I am able to use the Romanized texts with a braille display with no difficulty.

Braille is generally a binary writing system employing a matrix three dots high by two dots wide. In order to represent the original ASCII character set, the matrix (cell) is expanded to include two extra dots at the bottom, allowing 256 characters. I am trying to determine whether the texts that are Unicode 3.0 compliant are actually in UTF-8. If this is the case, then even non-Roman scripts could theoretically be rendered in braille, as each letter or diacritic is represented by two eight-dot "glyphs."

I would appreciate it very much if someone could tell me whether the texts are actually in UTF-8.

Many thanks, J. M.

Response

We have three versions of each text. Two -- the Unicode versions -- are in UTF-8, while the third -- "Romanized" -- is in ISO-8859-1, which is a simpler character set (think "ASCII plus some accented letters").

In ISO-8859-1, each character is represented by a single 8-bit byte. But in UTF-8 the world is different: each character has a unique digital representation, but that representation requires one, two, three, or four 8-bit bytes, depending on the character.
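A minimal Python sketch of that difference follows. The sample characters are picked only for illustration and are not taken from the texts themselves; the helper name `utf8_length` is likewise hypothetical.

```python
# Count how many bytes a single character occupies in UTF-8.
# The samples below are illustrative picks, not drawn from the EIEOL texts.
def utf8_length(ch: str) -> int:
    return len(ch.encode("utf-8"))

samples = [
    "a",           # U+0061 LATIN SMALL LETTER A
    "\u00e9",      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "\u0101",      # U+0101 LATIN SMALL LETTER A WITH MACRON
    "\u0905",      # U+0905 DEVANAGARI LETTER A
    "\U00010330",  # U+10330 GOTHIC LETTER AHSA
]

for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in UTF-8")

# ISO-8859-1, by contrast, always uses exactly one byte per character
# (for the characters it is able to represent at all).
latin1_length = len("\u00e9".encode("iso-8859-1"))
```

The same accented letter (U+00E9) therefore costs one byte in the "Romanized" encoding and two bytes in UTF-8.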

So you are right (re: UTF-8) that "even non-Roman scripts could theoretically be rendered in braille" but you are wrong in assuming that "each letter or diacritic is represented by two eight-dot 'glyphs'" (i.e. bytes) because, while many characters do require two bytes, others may require only one byte or more than two bytes.
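As a rough illustration only: if one assumes that a display shows each UTF-8 byte as one eight-dot cell, the number of cells per character varies just as the number of bytes does. The sketch below borrows the Unicode "Braille Patterns" block (U+2800 to U+28FF), which has exactly 256 eight-dot patterns, to stand in for the cells. The byte-to-pattern mapping and the function name `bytes_as_cells` are assumptions made for this example; an actual braille display applies its own translation table.

```python
BRAILLE_BASE = 0x2800  # first code point of the Unicode "Braille Patterns" block

def bytes_as_cells(text: str) -> str:
    """Map each UTF-8 byte of `text` to one eight-dot braille pattern.

    Hypothetical mapping: byte value b becomes code point U+2800 + b.
    This is not how any particular braille display is known to behave.
    """
    return "".join(chr(BRAILLE_BASE + b) for b in text.encode("utf-8"))

for ch in ["a", "\u0101", "\u0905"]:
    cells = bytes_as_cells(ch)
    print(f"U+{ord(ch):04X} -> {len(cells)} cell(s): {cells}")
```

Under that assumption a plain 'a' occupies one cell, a+macron two cells, and a Devanagari letter three cells.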

J. S.

P.S. I forgot to mention overstrike diacritics -- accents. While it is often true that the accented letters we need are present in Unicode, and therefore are treated as sketched above, it is sometimes necessary to add separate overstrike diacritics to compose the desired character. For example, to get 'a' + macron + acute, we insert the Unicode a+macron character (2 bytes in UTF-8) then add a separate overstrike acute accent (2 more bytes) -- i.e., requiring 4 bytes for the "single" visible character, which is actually two separate Unicode characters.
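The byte count in that example can be checked with a short Python sketch. The variable names are chosen here for illustration; the two code points are the standard Unicode a+macron (U+0101) and combining acute accent (U+0301).

```python
import unicodedata

a_macron = "\u0101"         # precomposed letter: a with macron
combining_acute = "\u0301"  # separate overstrike (combining) acute accent

# What the reader sees as one accented letter is stored as two characters.
composed = a_macron + combining_acute

for ch in composed:
    size = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {size} bytes")

code_points = len(composed)               # number of Unicode characters
total_bytes = len(composed.encode("utf-8"))  # number of UTF-8 bytes
print(code_points, "characters,", total_bytes, "bytes")
```

Any software that counts "letters" by counting characters or bytes will therefore overcount in cases like this one.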