The University of Texas at Austin; College of Liberal Arts
Hans C. Boas, Director :: PCL 5.556, 1 University Station S5490 :: Austin, TX 78712 :: 512-471-4566

Building an Indo-European Lexicon

Using Print and Online Dictionaries

Jonathan Slocum

Research Scientist
Linguistics Research Center


We describe our development of an Indo-European Lexicon, comprising our "Pokorny Master" listing of Proto-Indo-European (PIE) etyma and, in principle for each etymon, a list of Indo-European (IE) reflexes derived therefrom (see, e.g., the PIE root ag- 'to drive, lead'). PIE etyma are described by Semantic Fields, and reflexes are (being) indexed by language in sorted lists. We focus chiefly on procedures for automated collection and glossing of reflexes in English and other IE languages using published dictionaries -- some available in electronic form, one (index) scanned, and the rest accessed manually.


The Linguistics Research Center is a self-supporting Organized Research Unit in the College of Liberal Arts; it was founded by Winfred P. Lehmann in 1961.
Our website URL is:
We recorded 3,337,295 web page visits in 2009 (vs. 2,766,361 in 2008) from ca. 80% of the nations on earth; ca. 3 of every 8 page requests originate outside the U.S. (>1 million in 2009).
Our Early Indo-European Online language lessons cover 17 IE languages; lessons for an 18th language (Tocharian B) are in preparation.
Etymon, pl. etyma
We use 'etymon/etyma' to refer to one or more PIE lexical entries.
The term 'reflex' refers to an Indo-European word/morpheme derived from a PIE etymon.


Our Indo-European Lexicon is displayed as static web pages, each made available in [up to] three character sets:

  1. ISO-8859-1 -- for access by old, lowest-common-denominator computer hardware/software (common in certain parts of the world);
  2. Unicode 2 -- for access by any modern computer hardware/software, which (by definition) supports the Unicode character set;
  3. Unicode 3 -- for access by computer hardware/software supporting Unicode and providing at least one very large Unicode font.

When a given page is available in alternate character sets, links are provided in the left margin for easy switching among the different versions.

Pokorny Master

Our Pokorny Master lexicon, available in Unicode 3, Unicode 2, and ISO-8859-1 versions, lists 2,222 PIE etyma; each etymon is glossed in English, and links are available to other resources. These etyma were adapted from [most of] the head-words in Julius Pokorny's Indogermanisches etymologisches Wörterbuch (Bern: Francke, 1959).

Reflex Entries

In principle, each PIE etymon may be associated with IE reflexes derived from that etymon; these resources are under construction, and not all PIE etyma have reflexes listed at this time. But most PIE etyma are associated with reflexes, and these are listed on a web page for the etymon (e.g. for the PIE etymon meaning 'apple'), with a link to it from our Pokorny Master lexicon.

Language Indices

Finding the reflexes of a PIE etymon is easy: locate it in the Pokorny Master lexicon, and click on the associated "IE" link. But how does one discover the PIE etymon from which a given IE reflex is derived? For this, we provide (currently 84) Language Indices, which list -- in an alphabetic order suited to each language family -- the reflexes in a given language; each reflex is linked to the PIE etymon from which it is derived and, on that page, cognates of the original reflex may be found. Of course, a multi-morpheme IE word could be linked to more than one PIE etymon.
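The reverse lookup described above amounts to an inverted index from (language, reflex) to etymon. The following Python fragment is a minimal sketch of that idea only; the function name and the triple layout are hypothetical, and plain `sorted` merely stands in for the family-specific alphabetic orders, which are not specified here.

```python
from collections import defaultdict

def build_language_indices(reflexes):
    """Group (language, word, etymon_id) triples into one sorted list per language.

    Python's default string ordering is a placeholder for the
    language-family-specific alphabetic orders.
    """
    indices = defaultdict(list)
    for language, word, etymon_id in reflexes:
        indices[language].append((word, etymon_id))
    return {language: sorted(entries) for language, entries in indices.items()}

# Sample triples taken from the "abbreviate" example in this document.
sample = [
    ("E", "abbreviate", "P1340"),
    ("ME", "abbreviaten", "P1340"),
    ("LLat", "abbreviatus", "P1340"),
    ("LLat", "abbreviare", "P1340"),
]
indices = build_language_indices(sample)
print(indices["LLat"])
```

A multi-morpheme word would simply appear in more than one triple, once per etymon.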

Semantic Index

Carl Darling Buck's book A Dictionary of Selected Synonyms in the Principal Indo-European Languages (Chicago & London: U. Chicago Press, 1949) defined a hierarchy of "semantic fields," and he used these to categorize Indo-European reflex words. We have used those same semantic fields to index our PIE etyma; our online Semantic Fields index may be used to locate PIE etyma in shared semantic categories, and each etymon having reflex entries is linked to any field(s) categorizing it. Since the semantic fields are all expressed in English, there is no need for separate Unicode representations of the semantic fields pages; however, links from those pages to Pokorny Master and reflex entries are available in the usual three character sets.

Content Sources

We discuss here only our primary content sources.

Bomhard's RPN

We received gracious permission from Allan Bomhard to include, in our Lexicon, IE content extracted from a 2002 edition of his book, Reconstructing Proto-Nostratic (Leiden & Boston: Brill, 2008); to this end he provided us with his electronic source [in 2002], for which we are truly grateful.

Buck's DSS

Our use of Carl Darling Buck's book A Dictionary of Selected Synonyms in the Principal Indo-European Languages is discussed above (see the Semantic Index section).


Our own EIEOL lessons cover one or more languages in every IE family, with the sole current exception of Albanian. Work to link our EIEOL Base-Form Dictionary entries to PIE etyma continues; the majority of our lesson languages already have such links.

Lehmann's GED

Winfred P. Lehmann's book A Gothic Etymological Dictionary (Leiden: E.J. Brill, 1986) was compiled by the LRC in electronic form; the author granted us permission to extract certain content from this source for inclusion in our Lexicon -- taking care, as always, not to violate the publisher's copyright.

Pokorny's IEW

The principal source for our PIE etyma -- in particular, for their spellings -- was Julius Pokorny's Indogermanisches etymologisches Wörterbuch (Bern: Francke, 1959). For various reasons we omitted some of Pokorny's main entries, and we corrected spelling/typographical errors in others. We provided our own English glosses. Where we made use of Pokorny's cross-references to create links to other PIE etyma, we reorganized them in circular chain-reference fashion.

Watkins' AHD

We used, from Calvert Watkins' book The American Heritage Dictionary of Indo-European Roots, 2nd ed. (2000), his index linking English reflexes to [his] PIE etyma, and their cross-references to Pokorny's etyma. We constrained our attention to English words that could be paired directly with entries in W7 (below), and indirectly with Pokorny's etyma.

Merriam's W7

We used, from Webster's Seventh New Collegiate Dictionary (Springfield, Mass: G. & C. Merriam Co., 1963), head-words, etymologies and definition texts -- but only content from the small subset of head-words that could be paired with AHD entries (above). We extracted (or, as necessary, inferred) words in the etymologies, and rephrased definition texts so as to avoid copyright infringement.


Our primary and secondary/other sources are all listed in Appendix A.

Copyright Issues

In our work to assemble lexical content, we strove to avoid copyright infringement. Where an author's permission was acquired, avoidance was all but trivial -- though our data presentation was devised in part to side-step many potential issues (i.e. we have our own unique way of presenting IE Lexicon data). With respect to the dictionary content that we did use (head-words, etymologies, and definition texts), we here describe copyright issues and discuss our avoidance thereof.


We believe the key copyright issue re: head-words to be captured by the phrase "all & only." That is, one's list of dictionary head-words should not be exactly the same as a previously published dictionary's list. We have no problem with this issue, as our "head-words" (however one might try to define them) simply do not equate to anyone else's list. The closest would be Pokorny's IEW, but as mentioned above we omit some of his entries and alter others. W.r.t. Bomhard's RPN we have the advantage of direct permission, but in any case his head-words are Nostratic whereas ours are Proto-Indo-European. W.r.t. Lehmann's GED, his head-words are all Gothic and, in any case, we use only content that can be linked to PIE etyma. W.r.t. Watkins' AHD we use Pokorny's [corrected] etyma spellings, with which Watkins' do not often overlap, and w.r.t. his English reflexes we concentrate on those that can be paired with head-words in W7. W.r.t. W7, we omit most entries from that list of head-words -- concentrating as we do on the subset that can be found in the AHD index.


We are not directly cognizant of copyright issues specific to etymologies, but we assume these to be like copyright as applied to definition texts. In any case our Lexicon content is not "etymological" in structure, so it does not resemble W7's etymological content -- even though we do copy individual words from W7's etymologies (and cite W7 when we do so). Overall we entirely omit most of W7's etymologies, concentrating as we do on English words listed in Watkins' AHD; we directly copy no etymologies, and add significant additional content.

Definition Texts

The key copyright issue re: definition texts is captured by the phrase "5-word sequence" -- that is, a meaning gloss must avoid sharing any sequence of five or more identical words with another dictionary publisher's gloss. We escape this in two ways:

  1. our aim, unlike that of a dictionary publisher, is not to define a word [sense] so that a person who does not know the word can come to understand it, but rather to "remind" a reader (who probably does know the word) of what we mean; in particular, we briefly gloss a word in its oldest (or primary) sense and ignore any newer/secondary senses;
  2. we copy the [single/oldest] W7 definition text, then edit/abbreviate it specifically to avoid identical 5-word sequences -- and, in addition, we cite W7 as our source.
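The "5-word sequence" criterion can be checked mechanically. The Python fragment below is a minimal sketch of such a check, not the software actually used; the function names are hypothetical, and the tokenization (lowercase runs of letters, apostrophes, and hyphens) is an assumption. The two sample strings are the W7 definition text and the drafted gloss for abdicate that are quoted elsewhere in this document.

```python
import re

def word_sequences(text, n=5):
    """Return the set of all n-word sequences in text, ignoring case and punctuation."""
    words = re.findall(r"[a-z'-]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_sequence(gloss, source, n=5):
    """True if gloss and source have at least one n-word sequence in common."""
    return bool(word_sequences(gloss, n) & word_sequences(source, n))

w7_text = "to relinquish (as sovereign power) formally : ‹RENOUNCE›"
drafted = "to renounce, *relinquish (.sovereign power) formally"
print(shares_sequence(drafted, w7_text))  # -> False
```

Because editing marks such as '.' and '/' are treated as word separators here, the check is stricter than one that counts "of.body" as a single new word.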

Avoiding Infringement

Key to avoiding copyright infringement is to add original content. As we work our way through some examples, it will become clearer what original content we add, as well as how much publisher content we eliminate.

Example: W7's "abdicate" (phase 1)

In the 1960's, John Olney at System Development Corporation received permission from G. & C. Merriam Company to make a "keypunch copy" of Webster's Seventh New Collegiate Dictionary (1963), herein referred to as W7. The idea was that computational linguists there and elsewhere, with permission from Merriam, could use this resource in their research; the essential constraint, of course, was that none would publish an English dictionary using W7 as its basis.

In the 1970's, the Linguistics Research Center at the University of Texas acquired from Merriam permission to use a copy of this resource in its own research. At the time, just as in the 1960's, this resource was available in a limited set of 64 characters (uppercase only) with line lengths constrained to 80 columns. The keypunched dictionary entry for the head-word abdicate looked like this:


Over the years, as computer hardware improved (for example, lowercase letters became available), the format of such content could be improved by writing software to perform simple text editing -- e.g., reducing letters to lowercase (except those originally flagged as uppercase), converting '\'-flagged 3-character sequences to something more readable, eliminating the 80-column line limit, etc. Much later, as our IE Lexicon project got underway, more special-purpose software was written to extract and edit content from W7. In particular, we extracted F line (main entry) content, E line (etymology) content, and the content from the first [full] D line (i.e., the definition text for [what W7 claimed to be] the oldest sense).

This software extraction & editing was not entirely straightforward for two reasons: some words in etymologies were omitted by the publisher (i.e., were left for the [human] reader to infer), requiring restoration by software, and -- for our purposes -- definition texts needed to be "decluttered" (e.g., shortened & simplified by deletion or replacement of content and reworking of cross-references) to avoid infringement.

We outline, below, four steps involved in "decluttering."

Delete leading word(s)

We discovered that, for our purposes, certain words & phrases at the beginning of a W7 definition text could simply be deleted; for example, we seek & delete the following initial words/phrases in order (where '|' separates alternatives, parentheses enclose optional content, '*' denotes a word segment, and '...' denotes content to be retained):

  1. "a|an|the" ...
  2. "used as a function word (to indicate)" ...
  3. "used as a direction in music" ...
  4. "quality or state of being" ...
  5. "(of,) relating to, or" ...
  6. "relating to, or *ing," ...
  7. "of, or relating to," ...
  8. "act or process of" ...
  9. "any of several|various" ...
  10. "used esp. as" ...
  11. "any of" ...
  12. "a|an|the" ...
  13. "genus|group of" ...

(N.B. Deleting other words/phrases might result in an article [a|an|the] moving into first place; thus, it is sought and deleted more than once.)
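The ordered list above maps naturally onto a sequence of anchored regular expressions. The Python fragment below is a sketch under stated assumptions: the regular expressions are one possible reading of the thirteen patterns (in particular, '*ing' is read as any word ending in -ing), the function name is hypothetical, and the second sample input is invented for illustration.

```python
import re

# One anchored pattern per rule, in the order listed above.
LEADING_PATTERNS = [
    r"(a|an|the)\s+",
    r"used as a function word\s+(to indicate\s+)?",
    r"used as a direction in music\s+",
    r"quality or state of being\s+",
    r"(of,\s+)?relating to,\s+or\s+",
    r"relating to,\s+or\s+\w+ing,\s+",
    r"of,\s+or relating to,\s+",
    r"act or process of\s+",
    r"any of (several|various)\s+",
    r"used esp\. as\s+",
    r"any of\s+",
    r"(a|an|the)\s+",   # articles are sought a second time, as the N.B. explains
    r"(genus|group) of\s+",
]

def delete_leading(text):
    """Apply each rule once, in order, at the start of the definition text."""
    for pattern in LEADING_PATTERNS:
        text = re.sub("^" + pattern, "", text, count=1)
    return text

print(delete_leading("the part of the body between the thorax and the pelvis"))
print(delete_leading("an act or process of the shortening"))  # invented input
```

The second call shows why the article rule appears twice: once "act or process of" has been removed, a new article stands in first place.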

Delete/replace other word(s)

We discovered that, again for our purposes, certain words & phrases could be deleted (actually, replaced by '.' or '/' to mark editing) wherever they might appear; for example (where double quotation marks surround content to be deleted or, when an = sign follows, to be replaced by the content between single quotation marks):

  1. by|with|under "or as if by|with|under"
  2. "that consist(s) of" = ':.'
  3. "characterized by" = '.involving.'
  4. "(as" = '(.'
  5. "genus («*») of" = 'genus.of'
  6. "a|an|the" = '.'
  7. "or" = '/'
  8. "with" = 'w/'

(N.B. In the rules above, '«' and '»' are explicit characters in the improved/converted definition text resulting from the oldest stage of software processing, alluded to previously; these delimit text that was italicized in W7.)
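These substitutions can likewise be sketched as an ordered table of pattern/replacement pairs. The Python fragment below is illustrative only: the function name is hypothetical, the handling of surrounding spaces is inferred from the drafted glosses quoted in this document, and the treatment of rules 1, 5, and 8 (for which no worked output is shown) is an assumption.

```python
import re

# (pattern, replacement) pairs in the order of the rules above.
REPLACEMENTS = [
    (r"\b(by|with|under) or as if \1\b", r"\1"),      # rule 1: assumed plain deletion
    (r"\s*\bthat consists? of\s+", ":."),
    (r"\s*\bcharacterized by\s+", ".involving."),
    (r"\(as\s+", "(."),
    (r"\bgenus\s+(«[^»]*»\s+)?of\b", "genus.of"),     # rule 5: assumed spacing
    (r"\s+(a|an|the)\s+", "."),
    (r"\s+or\s+", "/"),
    (r"\bwith\s+", "w/"),                             # rule 8: assumed spacing
]

def replace_anywhere(text):
    """Apply every replacement rule, in order, wherever its pattern occurs."""
    for pattern, replacement in REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    return text

print(replace_anywhere("part of the body between the thorax and the pelvis"))
# -> part of.body between.thorax and.pelvis
print(replace_anywhere("to relinquish (as sovereign power) formally"))
# -> to relinquish (.sovereign power) formally
```

Order matters in such a table: the longer phrases are replaced before the bare articles and "or", which would otherwise break them up.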

Rework cross-references

We discovered that so-called synonymous cross-references could be revised and moved about to shorten glosses and/or avoid infringement; for example:

  1. "‹LOWER›, ‹DEPRESS›" = "(to) depress, lower"
  2. "to destroy the self-possession or self-confidence of : ‹DISCONCERT›" =
    "to disconcert, *destroy.self-possession/self-confidence of"

(N.B. In the 2nd rule above, '*' is inserted to mark the place where potential copying of "original W7 text" resumes.)
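The second rule can be sketched in Python as follows. This is an illustration rather than the software described; the function name and the regular expression are hypothetical, and only the case of a single trailing cross-reference is handled (the reordering shown in the first rule is not attempted).

```python
import re

# Optional leading "to", the definition body, then ": ‹TARGET›" at the end.
TRAILING_XREF = re.compile(r"^(to\s+)?(.*?)\s*:\s*‹([^›]+)›\s*$")

def rework_xref(text):
    """Move a trailing synonymous cross-reference to the front of the gloss.

    '*' marks the place where wording taken from the original text resumes.
    """
    match = TRAILING_XREF.match(text)
    if not match:
        return text
    leading_to, body, target = match.groups()
    prefix = "to " if leading_to else ""
    return f"{prefix}{target.lower()}, *{body}"

print(rework_xref("to destroy.self-possession/self-confidence of : ‹DISCONCERT›"))
# -> to disconcert, *destroy.self-possession/self-confidence of
```

The input here is assumed to have passed through the word-replacement step already, which is why it contains '.' and '/' marks.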

Delete punctuation to end

We discovered that certain punctuation or other marks could reliably be used as markers for gloss truncation; for example (where '...' signals arbitrary content also to be deleted, to the end of the line):

  1. "«..." (double-angle quotation mark = italics)
  2. "— ..." (em-dash)
  3. "; ..." (semicolon)
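Truncation at these markers is simple to sketch. In the Python fragment below the function name is hypothetical, and applying the three markers one after another in the listed order is an assumption; the sample input is the first W7 definition text for abdomen quoted elsewhere in this document.

```python
# Markers are tried in the order listed above; "\u2014" is the em dash.
TRUNCATION_MARKS = ["«", "\u2014", ";"]

def truncate(text):
    """Cut the text at the first occurrence of each marker, then trim trailing spaces."""
    for mark in TRUNCATION_MARKS:
        text = text.split(mark, 1)[0]
    return text.rstrip()

w7_text = ("the part of the body between the thorax and the pelvis; «also» : "
           "the cavity of this part of the trunk containing the chief viscera")
print(truncate(w7_text))
# -> the part of the body between the thorax and the pelvis
```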

Example: W7's "abdicate" (phase 2)

We illustrate, here, how we extracted and "decluttered" content from the W7 entry for abdicate.

W7 (full)

We started with [pre-processed] W7 content, which looked like this:

E:L «abdicatus», pp. of «abdicare», fr. «ab-» + «dicare» to proclaim — more at ‹DICTION›
D:0;;;vt;to relinquish (as sovereign power) formally : ‹RENOUNCE›
D:0;;;vi;to renounce a throne, high office, dignity, or function
S:•syn• ‹RENOUNCE›, ‹RESIGN›: ‹ABDICATE› implies a giving up of sovereign power or sometimes an evading of responsibility such as that of a parent; ‹RENOUNCE› may replace it but often implies additionally a sacrifice for a greater end; ‹RESIGN› applies to the giving up of an unexpired office or trust

W7 (extract)

We extracted the F line, E line, and [first] D line:

E:L «abdicatus», pp. of «abdicare», fr. «ab-» + «dicare» to proclaim — more at ‹DICTION›
D:0;;;vt;to relinquish (as sovereign power) formally : ‹RENOUNCE›

Decluttering (E gloss)

Using the [synonymous] cross-reference rule exemplified as #2 above, and the "(as" replacement rule illustrated as #4 above, software drafted the following gloss:

abdicate vb = to renounce, *relinquish (.sovereign power) formally

Notice that there are now no 5-word sequences original to W7, and this was accomplished via software with [as yet] no human intervention.

Example: W7's "abdomen"

We show, here, how we extracted and "decluttered" content from the W7 entry for abdomen.

W7 (full)

We started with [pre-processed] W7 content, which looked like this:

P:'aab-duh-muhn, aab-'doh-muhn
E:MF & L; MF, fr. L
D:1;;;n;the part of the body between the thorax and the pelvis; «also» : the cavity of this part of the trunk containing the chief viscera
D:2;;;n;the posterior section of the body behind the thorax in an arthropod

W7 (extract)

We extracted the F line, E line, and [first] D line:

E:MF & L; MF, fr. L
D:1;;;n;the part of the body between the thorax and the pelvis; «also» : the cavity of this part of the trunk containing the chief viscera

Inferring etymology

The W7 etymology for abdomen gives language labels but not a single word form. The W7 editors had decided that a human reader could [probably] infer what they meant and supply the missing information, and that Merriam could therefore save space by omitting [repetition of] the head-word. Our software, therefore, was written to make the same [two] inferences:

...MF «abdomen», fr. L «abdomen»

(N.B. Lacunae in etymologies are not limited to such simple cases as head-word omission; omissions elsewhere in etymologies also required software inference and restoration.)
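The simplest case, head-word omission, can be sketched in Python as follows. This is a hypothetical illustration that handles only that case: the function name is invented, and it assumes that the derivation chain follows the last "; " and that its steps are separated by ", fr. ".

```python
def restore_headword(etymology, headword):
    """Insert «headword» after each language label that lacks an italicized word."""
    summary, separator, chain = etymology.rpartition("; ")
    steps = chain.split(", fr. ")
    restored = [step if "«" in step else f"{step} «{headword}»" for step in steps]
    return summary + separator + ", fr. ".join(restored)

print(restore_headword("MF & L; MF, fr. L", "abdomen"))
# -> MF & L; MF «abdomen», fr. L «abdomen»
```

An etymology whose steps already carry italicized words passes through unchanged.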

Decluttering (E gloss)

Using the rule deleting 'the' up front, then the rule replacing 'the' with '.' everywhere else, then the rule truncating from '«' to the end of the line, and lastly the rule truncating from ';' to the end of the line, software drafted the following gloss:

abdomen n = part of.body between.thorax and.pelvis

Again notice that there are no 5-word sequences original to W7, and this plus etymology restoration was accomplished via software with [as yet] no human intervention.

Example: "abbreviate"

Here we illustrate more completely the process of bridging the span from AHD (English word) to W7 (with etymology) to Pokorny (PIE etymon) using the verb abbreviate.

  1. AHD (E index): abbreviate = mregh-u-
  2. AHD (W index): mregh-u- = W0732
  3. AHD=IEW (cross-ref): W0732 = [Pokorny mreghu- 750.] = P1340

The AHD index tells us that abbreviate is derived from Watkins' etymon mregh-u-, which we associated with the unique identifier W0732. The AHD text reveals that Watkins' etymon is equivalent to Pokorny's mreghu-, which we associated with the unique identifier P1340. Hence, abbreviate (absent from Pokorny's IEW) is derived from -- is a reflex of -- etymon P1340.
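The three-step bridge amounts to chaining three lookups. In the Python sketch below, the table and function names are hypothetical, and each table holds only the single entry given in the example above.

```python
# Each table holds just the one entry from the example above.
AHD_ENGLISH_INDEX = {"abbreviate": "mregh-u-"}   # English word -> Watkins' etymon
AHD_ETYMON_IDS = {"mregh-u-": "W0732"}           # Watkins' etymon -> W identifier
AHD_TO_POKORNY = {"W0732": "P1340"}              # W identifier -> P identifier

def pokorny_id(english_word):
    """Follow the chain AHD word -> Watkins' etymon -> Pokorny etymon; None if any link is missing."""
    etymon = AHD_ENGLISH_INDEX.get(english_word)
    watkins_id = AHD_ETYMON_IDS.get(etymon)
    return AHD_TO_POKORNY.get(watkins_id)

print(pokorny_id("abbreviate"))  # -> P1340
```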

Key Observation: if Watkins' derivation of an English word from a certain etymon is correct, and if we have properly paired his English word with an entry in W7, and if the etymology of that English word in W7 is correct, then any reflex words in that etymology must also derive from Watkins' etymon, hence Pokorny's equivalent etymon!

W7 (full)

We continued with [pre-processed] W7 content for abbreviate, which looked like this:

E:ME «abbreviaten», fr. LL «abbreviatus», pp. of «abbreviare» — more at ‹ABRIDGE›
D:0;;;vt;to make briefer : ‹SHORTEN›; «specif» : to reduce (as a word or phrase) to a shorter form intended to stand for the whole

W7 (extract)

We extracted the F line, E line, and [first] D line:

E:ME «abbreviaten», fr. LL «abbreviatus», pp. of «abbreviare» — more at ‹ABRIDGE›
D:0;;;vt;to make briefer : ‹SHORTEN›; «specif» : to reduce (as a word or phrase) to a shorter form intended to stand for the whole

Decluttering (E gloss)

Using the [synonymous] cross-reference rule exemplified as #2 above, software drafted the following gloss:

abbreviate vb.trans = to shorten, *make briefer

Once again there are no 5-word sequences original to W7 -- indeed, no 5-word sequence at all -- and this was accomplished via software with [as yet] no human intervention.

Reflex Records

We show here what we do with etymology content, still using abbreviate from W7.

IE Lexicon (pre-ed.)

From abbreviate and its W7 etymology, software created the following four reflex records:

15   t E   w abbreviate   p vt   s W7
15   t ME   w abbreviaten   s W7
17   t LLat   w abbreviatus   s W7   g pp.
17   t LLat   w abbreviare   s W7
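The first step in producing such records, pairing each italicized word in an etymology with its language label, can be sketched in Python as follows. This is an illustration under assumptions, not the software described: the function name is hypothetical, a word without its own label is assumed to inherit the preceding one, the "more at" tail is dropped at the em dash, and neither the leading numeric codes nor the expansion of LL to LLat is modelled.

```python
import re

# An optional capitalized language label, then a word between « and ».
LABELLED_WORD = re.compile(r"(?:\b([A-Z][A-Za-z]*)\s+)?«([^»]+)»")

def etymology_words(etymology):
    """Return (language, word) pairs; an unlabelled word inherits the previous label."""
    pairs, language = [], None
    body = etymology.split("\u2014")[0]   # drop the "more at" tail after the em dash
    for label, word in LABELLED_WORD.findall(body):
        language = label or language
        pairs.append((language, word))
    return pairs

w7_etymology = ("ME «abbreviaten», fr. LL «abbreviatus», pp. of «abbreviare» "
                "\u2014 more at ‹ABRIDGE›")
print(etymology_words(w7_etymology))
# -> [('ME', 'abbreviaten'), ('LL', 'abbreviatus'), ('LL', 'abbreviare')]
```

Together with the English head-word itself, these three pairs correspond to the four records listed above.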

IE Lexicon (mid-ed.)

The four "raw" reflex records were processed by other software (e.g. to gloss ME abbreviaten & E abbreviate) and by "fast" human editing to yield:

15   t ME   w abbreviaten   s W7/LRC   g abbreviate
15   t E   w abbreviate   p vt   s AHD:abbreviate/W7:abbreviate   g to shorten, *make briefer
17   t LLat   w abbreviatus   p ptc   s W7/LRC   g abbreviated
17   t LLat   w "abbrevio, abbreviare"   p vb   s W7/LRC   g abbreviate

IE Lexicon (post-ed.)

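The four "fast-edited" reflex records were post-processed by more software (e.g. to prepend 'to' to glosses for verbs), and were then post-edited by other humans (who, in this case, folded the participle abbreviatus into the principal parts of its verb, leaving three records), to yield: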

15   t ME   w abbreviaten   p vb   s W7/LRC   g to abbreviate
15   t E   w abbreviate   p vb.trans   s AHD:abbreviate/W7:abbreviate/LRC   g to shorten, make brief
17   t LLat   w "abbrevio/abrevio, abbreviare, abbreviavi, abbreviatus"   p vb   s W7/LRC   g to abbreviate

Software then merged the reflex records above with others for etymon P1340. All reflex records are maintained in source form as "plain text" files, which in principle can be edited on any computer (although Macintoshes have a vile tendency to trash "plain text" files, so one must exercise great care when using them).
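Because the records are plain text, they are easy to read back into a program. The Python fragment below is a hypothetical parser for the layout shown in the examples above; it assumes that fields are separated by a tab or by two or more spaces, that each field is a one-letter code followed by its value, and that double quotation marks merely protect values containing commas. The meaning of the leading number and of the one-letter codes is not interpreted.

```python
import re

def parse_record(line):
    """Split one reflex record into a dict keyed by its one-letter field codes."""
    code, *fields = re.split(r"\t|\s{2,}", line.strip())
    record = {"code": code}
    for field in fields:
        key, _, value = field.partition(" ")
        record[key] = value.strip('"')
    return record

record = parse_record('17   t LLat   w "abbrevio, abbreviare"   p vb   s W7/LRC   g abbreviate')
print(record["w"])  # -> abbrevio, abbreviare
```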

Original Content

We summarize, here, the kinds of original content that we add to reflex records.

re: Etyma

To PIE etyma, adapted from Pokorny, we add:

re: Reflexes

To IE reflexes drawn & adapted from numerous sources we add:

Codes & abbreviations in reflex records are explained on the reflex web pages where they appear.

Operational Aspects

We recapitulate, here, our (often highly automated) processing techniques, where our use of past tense signifies a task now complete. We:

Re-run software after subsequent editing

Why do we re-run software after editing?

  1. to identify errors;
  2. to augment Source ID's;
  3. to republish web pages;
  4. to re-index reflexes.

Results to Date

Early in 2010, we can report the following [selected] results.

PIE head-word list is complete
Reflex content is included
1,452 etymon web pages show reflexes
44,359 reflex records are assigned to PIE etyma
171 IE languages have reflexes
84 languages have Reflex Indices (with 10 or more reflexes)

Appendix A: Data Sources

Primary Sources
  1. AHD = Calvert Watkins: The American Heritage Dictionary of Indo-European Roots, 2nd ed. (2000)
  2. DSS = Carl Darling Buck: A Dictionary of Selected Synonyms... (1949)
  3. EIEOL =
  4. GED = Winfred P. Lehmann: A Gothic Etymological Dictionary (1986)
  5. IEW = Julius Pokorny: Indogermanisches etymologisches Wörterbuch (1959)
  6. LRC = Linguistics Research Center, University of Texas, Austin
  7. RPN = Allan R. Bomhard: Reconstructing Proto-Nostratic (2002)
  8. W7 = Webster's Seventh New Collegiate Dictionary (1963)
Secondary Sources
  1. AHW = Rudolf Schützeichel: Althochdeutsches Wörterbuch (1981)
  2. ASD = Joseph Bosworth and T. Northcote Toller: An Anglo-Saxon Dictionary (1898)
  3. CDC = (online): The Century Dictionary and Cyclopedia (1889-1911)
  4. CID = Cassell's Italian Dictionary (1958)
  5. CLD = Cassell's Latin Dictionary (1959, rev. 1968)
  6. DEO = Hermann Vinterberg and C.A. Bodelsen: Dansk-Engelsk Ordbog (1966)
  7. ELD = Charlton T. Lewis: An Elementary Latin Dictionary (1999)
  8. GE = Colin Mark: The Gaelic-English Dictionary (2003)
  9. HH = Heinrich Hübschmann: Armenische Grammatik (1897)
  10. IED = Patrick S. Dinneen: An Irish-English Dictionary (1927)
  11. IEO = Arngrimur Sigurdsson: Íslenzk-Ensk Orðabók (1970)
  12. LD = Bronius Piesarskas and Bronius Svecevicius: Lithuanian Dictionary (1994)
  13. LS = Liddell and Scott: Greek-English Lexicon, 7th ed. (1889, rev. 2001)
  14. NED = Einar Haugen: Norwegian-English Dictionary (1965)
  15. ODE = C.T. Onions: The Oxford Dictionary of English Etymology (1966)
  16. OED = James Murray et al: The Oxford English Dictionary (1933)
  17. R1 = Josette Rey-Debove and Alain Rey, eds. Le Nouveau Petit Robert (1993)
  18. Sal = Diccionario Salamanca de la Lengua Española (1996)
  19. SEO = Norstedts Stora Svensk-Engelska Ordbok, 3rd ed. (2000)
  20. W2I = Webster's New International Dictionary of the English Language, 2nd ed. (1959)
  21. WE = H. Meurig Evans and W.O. Thomas: Welsh-English, English-Welsh Dictionary (1969)
Other/Potential Sources
  1. CAS = John R.C. Hall: A Concise Anglo-Saxon Dictionary, 4th ed. (1960)
  2. CDD = Cassell's English-Dutch, Dutch-English Dictionary (1978)
  3. DRM = Norman Bird: The Distribution of Indo-European Root Morphemes (1982)
  4. EIE = J.P. Mallory and D.Q. Adams, eds. Encyclopedia of Indo-European Culture (1997)
  5. IE2 = T.V. Gamkrelidze and V.V. Ivanov: Indo-European and the Indo-Europeans (1995)
  6. LE = Eizenija Turkina: Latvian-English Dictionary (1973)
  7. MT = Matthias Lexers Mittelhochdeutsches Taschenwörterbuch (1983)
  8. NCG = Langenscheidt's New College German Dictionary (1995)
  9. WAS = Gerhard Köbler: Wörterbuch des Althochdeutschen Sprachschatzes (1993)

Appendix B: EIEOL Languages

Base-Forms Linked to Pokorny
  1. arm = Classical Armenian (Armenian)
  2. eng = Old English (West Germanic)
  3. got = Gothic (East Germanic)
  4. grk = Classical Greek (Hellenic)
  5. iri = Old Irish (Celtic)
  6. lat = Classical Latin (Italic)
  7. lav = Latvian (Baltic)
  8. lit = Lithuanian (Baltic)
  9. nor = Old Norse (North Germanic)
  10. ntg = New Testament Greek (Hellenic)
  11. ocs = Old Church Slavonic (Slavic)
Not Yet Linked
  1. ave = Avestan (Iranian)
  2. hit = Hittite (Anatolian)
  3. ofr = Old French (Italic)
  4. ope = Old Persian (Iranian)
  5. tok = Tocharian A (Tocharian)
  6. txb = Tocharian B (Tocharian)
  7. ved = Rigvedic Sanskrit (Indic)