N.B. The email exchange below may have been edited, e.g. to remove content not essential to the main point(s) or to standardize English spelling/grammar.
...this is about the vanNooten/Holland RV text in electronic format. (I got my copies from an old version of Prof. Witzel's page cached [on the web].) ...You report trouble identifying the character set used in the files. The answer is DOS Code Page 437...
I found no instances of 42, 46 or 174 (and I'm having trouble finding out what a "bolle" even is!). Which leads me to wonder about the relation between the files on Prof. Witzel's pages and the ones you worked on. ...it really seems that we may have dealt with different files...
I am also led to ask: is there some way to get these files without scraping the HTML?
Yours Sincerely, A. R.
Dear Mr. R.,
...I was primarily responsible for... all the software part of our task. You raise a few points that I thought I might address.
First, re: the source of our text --
As explained on our RV Intro page (which talks about the difficulties encountered in deciphering the electronic vN/H), our source was a [DOS] 3 1/2" diskette that was included with the 1994 book Rig Veda: a Metrically Restored Text. Hence we did not acquire the text from any online source, and there is no guarantee -- indeed, no strong reason to suspect -- that our source would agree byte-for-byte with anything found online. I shall therefore decline to speculate about any/all character differences between our version and the version [from "Prof. Witzel's page(s)"] that you used: arbitrary editing may have been performed by one or more persons -- seemingly verified by the fact that you "found no instances of [code points] 42, 46 or 174."
Second, re: our "trouble identifying the character set used in the files" --
CP 437 was, of course, a fairly obvious starting point. The problem was that (1) our vN/H text DEVIATED from CP 437 in some respects, and (2) the [included] Harvard documentation claimed that the diskette files used an "International Codepage for Sanskrit diacritics" without explaining what "Codepage," exactly, this phrase refers to. I tried to find out. As best I could determine, the most likely referent had evolved after 1994 but was apparently NOT, even in 1994, identical to CP 437. And so it was that we had to make guesses and conduct textual analysis to verify or refute same. Like you, we arrived at final conclusions (incl. "accidental artifacts") over a period of time.
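By way of illustration only (this is not the software actually used on the vN/H diskette), the kind of textual analysis described above can be sketched in Python: count how often each byte value occurs, then list what plain CP 437 would make of every non-ASCII byte, so that implausible characters stand out as candidates for deviation. The function names and the sample bytes are assumptions made up for this sketch; only the `cp437` codec name comes from Python's standard library.

```python
from collections import Counter

def byte_census(data):
    """Count how often each byte value (0-255) occurs in the data."""
    return Counter(data)

def high_bytes_as_cp437(data):
    """Map each non-ASCII byte value present to the character that
    standard CP 437 assigns to it. Bytes whose CP 437 character looks
    out of place in the text are candidates for a code page deviation."""
    return {b: bytes([b]).decode("cp437") for b in sorted(set(data)) if b >= 0x80}

# Made-up sample bytes, NOT taken from the vN/H files.
sample = bytes([0x61, 0x8C, 0x2A, 0xAE, 0x8C])

census = byte_census(sample)
for value in (42, 46, 174):
    # A count of 0 means the byte value never occurs in the data.
    print(value, census[value])

print(high_bytes_as_cp437(sample))
```

Running the same census on two copies of a text is also a quick way to see whether they can possibly agree byte-for-byte: if a value such as 174 occurs in one copy and never in the other, they cannot.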
Third, re: the meaning of [punctuation mark] "bolle" --
Our RV Intro page explains that "van Nooten and Holland's editorial bolle" was used to flag changes in the text -- and, elsewhere, we point out that two DIFFERENT vN/H code points were used to indicate the bolle character [for which we used Unicode U+00B0 : Degree Sign in our HTML rendering]. Obviously, no text can use different code points for a single character while conforming to a single code page (see "identifying the character set" above).
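A minimal sketch of how two different source code points might be folded into one Unicode character during conversion follows. It decodes byte by byte with CP 437 as the base and applies an override table first. The two byte values in the table are hypothetical placeholders, since this message does not say which two vN/H code points stood for the bolle; the function name and the table are likewise assumptions for illustration, not our actual conversion code.

```python
BOLLE = "\u00b0"  # Unicode U+00B0, Degree Sign, as used in the HTML rendering

# HYPOTHETICAL placeholder byte values; the real pair is not given here.
HYPOTHETICAL_OVERRIDES = {0xF8: BOLLE, 0xF9: BOLLE}

def decode_with_overrides(data, overrides, base="cp437"):
    """Decode bytes one at a time, using the override table where it
    applies and the base code page for everything else."""
    out = []
    for b in data:
        if b in overrides:
            out.append(overrides[b])
        else:
            out.append(bytes([b]).decode(base))
    return "".join(out)

# Made-up input: two ASCII letters, each followed by a different "bolle" byte.
text = decode_with_overrides(bytes([0x61, 0xF8, 0x62, 0xF9]), HYPOTHETICAL_OVERRIDES)
print(text)
```

Keeping the deviations in a separate table, rather than editing the base code page, leaves a readable record of every place where the text was found to depart from CP 437.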
Fourth, re: "some way to get [our] files without scraping the HTML" --
Alas, not really. K.T. and I have been editing RV files, off & on, for some time. For a while I carefully maintained a "vN/H-equivalent" version of the text, with revisions. But at some point, being in a hurry and seeing no strong reason to maintain the vN/H-equivalent -- especially since we began making other annotations (cf. "the Popular Rigveda") -- I began editing my intermediate Unicode/HTML source used to generate the web pages. With such additional annotations, it is not easily (if at all) possible to go back and update a DOS-like source.
At this point, I should also point out that [the HTML and] OUR revisions are copyrighted by the University of Texas, which paid me to do all this processing & editing. The web pages are freely available for examination and research (e.g., running statistical analyses), but alterations in the text are not to be copied for reproduction elsewhere, except in small portions under the "fair use" doctrine. And of course under the "fair use" doctrine one must cite one's source -- as do we, on every RV page. (And as THEIR source, vN/H cite an earlier text created by the LRC -- so the cycle is complete!)
Thank you for writing... J. S.