Arbel Groshaus has just wrapped up his Winter 2024 co-op term as an RA at TLOW, working on bibliographies, etymologies, and OED sources. Here he describes a project to build an etymological web and use it to trace relationships within texts.
Hello Life of Words fans! I would like to introduce a project that I’ve been working on for a few months now. This is an etymological web, showing how millions of terms across thousands of languages are related, derived from English Wiktionary data. In this post, I’ll give an overview of how I got the data and what I’m doing with it.
Data sources
The genesis of the project was an idea by Dr. Williams to create a metric for calculating the “etymological relatedness” of a text — that is, how often etymologically connected words appear close together. The more obscure the connection, the higher the score: thus apple and applesauce (obvious) get a score of zero, whereas nerve, neuron, and sinew (all from Proto-Indo-European *snéh₁wr̥, much more interesting!) would get a very high score.
The first hurdle of the project was figuring out where this kind of etymological data could come from. The first resource available to me was the American Heritage Dictionary of Indo-European Roots. This work started out as an appendix of the famous 1969 American Heritage Dictionary of the English Language and has gone through several editions since then. Although the reconstructions are outdated by current standards (not reflecting the laryngeal theory, for example), this doesn’t make a difference with respect to connecting English words. The bigger problem is that it’s a dictionary of PIE, not English: there are only a few thousand English words listed, a tiny drop in comparison with the sheer (although unknown) size of English’s vocabulary. Poems will often drop in a more obscure word in place of a common one to make a specific (perhaps etymological) point: if the wordlist isn’t large enough, it will be impossible to get good results.
Another source of etymological data is the Oxford English Dictionary, the largest print English dictionary, and one which is renowned for its detailed etymologies. Unfortunately, the OED’s etymologies are problematic for several reasons. First, they rarely give reconstructed ancestors, instead choosing to vaguely hint at “an Indo-European base”, making it somewhat difficult to figure out whether two entries are related. For example, on the number one, which has a very simple etymology, the OED gives only a massive list of cognates, followed by the somewhat misleading summary “A word inherited from Germanic”. All this must be utterly incomprehensible to anyone unversed in historical linguistics, so I’m not sure why they write in this style. Another problem is the insistence on romanizing everything. They even romanize Chinese, which is the one language a dictionary should never romanize! Oddly enough, the only script to escape the clutches of the Latin alphabet is Greek. Maybe the editors of old thought that any dignified gentleman who would be reading the OED ought to know their Greek. But to their credit, the OED is very good with etymology within English.
In the end, I found the Wiktionary data to be the most useful. Although the wiki “anyone-can-edit” philosophy allows errors or vandalism to creep in, I’ve found Wiktionary to be as reliable as, if not more reliable than, many professional dictionaries. Wiktionary tends to attract far fewer vandals than its elder sibling Wikipedia. Wiktionary’s etymology of one begins: “From Middle English oon, on, oan, an, from Old English ān (“one”), from Proto-West Germanic *ain, from Proto-Germanic *ainaz (“one”), from Proto-Indo-European *óynos (“single, one”).” This is something I can work with!
Etymonline is a good source for etymology as well. In fact, Wiktionary appears to rely on it for many entries, meaning that combining the two is probably of limited utility.
Downloading Wiktionary
Wiktionary is a website. How could I get this data onto my computer? Unfortunately, it wasn’t at all easy. The Wikimedia Foundation, to its credit, offers a couple of ways to download the site. The way that most people do it is through the XML dump (note: “XML dump” is a fancy way of saying “very large text file”), which contains the wikitext of every article. Wikitext is a special markup language that editors use to create entries on Wiktionary. However, the wikitext can be very different from what the reader actually sees!
Here’s a (small part of a) large Spanish conjugation table that appears on the entry “ir” (to go).
All this is generated by these three little lines of wikitext:
How does this happen? In this example, all of the complexity comes out of “es-conj”. es-conj (“conj” for conjugation) is a template: a script that takes in some information and returns an output that the reader sees. Needless to say, es-conj is a fiendishly complex template. And the only way to figure out what it will spit out is to run the code — several thousand lines of it — and see what we end up with. Across the site, this template is used over eight thousand times. And then multiply all that by the thousands of templates of similar complexity! Obviously, this approach wasn’t going to be easy. I needed to get the output — the HTML — directly.
If you poke around the Wikimedia Foundation’s site a bit, you’ll find a page claiming to offer “enterprise” HTML dumps. Unfortunately, these dumps are far from enterprise-quality. There are missing entries, outdated revisions, and other wackiness. This problem has been ongoing for over two years, so it was clear that I would have to do it myself.
It turns out that Wiktionary has an API which works very well. An API is essentially how computers use the Internet: you ask for data and you get the data, skipping over all the silly stuff like “making it look nice”. Here’s an example of an API query which gets the HTML contents of a single article, compared to what humans see. So all I had to do was run this query for all eight million pages (of course, getting a list of pages was another API query!). Since the server lets you do 75 queries a second, the whole thing takes just over 29 hours — a very reasonable amount of time.
[click these images to embiggen].
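In code, the whole download boils down to a tiny loop. Here is a minimal sketch (the endpoint shown is the Wikimedia REST API's per-page HTML route; the helper names are mine):

```python
import urllib.parse

# Wikimedia's REST API serves the rendered (Parsoid) HTML of a single page.
REST_HTML = "https://en.wiktionary.org/api/rest_v1/page/html/{}"

def page_url(title):
    # Titles must be percent-encoded so entries like "ġelīċe" survive the trip.
    return REST_HTML.format(urllib.parse.quote(title, safe=""))

def estimate_hours(n_pages, queries_per_second=75):
    # Back-of-envelope runtime for a rate-limited crawl.
    return n_pages / queries_per_second / 3600

print(page_url("ir"))                       # ...rest_v1/page/html/ir
print(round(estimate_hours(8_000_000), 1))  # 29.6
```

Eight million pages at 75 queries per second works out to 8,000,000 ÷ 75 ≈ 106,667 seconds, or about 29.6 hours, matching the figure above.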
Building the web
This data represents all the information you can possibly get from the English Wiktionary,1 and it comes out to over 127 GB in total. Now it was time to analyze all this.
In short, the process is:
- Loop through each line of the big file (which corresponds to a query for an individual page).
- In each line, get the HTML code for that page.
- Within each page, get every language section. For example, “taller” means “more tall” in English and “workshop” in Spanish. We can’t mix these up! However, pages which start with “Reconstruction:” are guaranteed to only have a single language.
- Within each language section, get every etymology section. Each etymology section corresponds to, as the name suggests, an individual etymology. Although most entries have a single etymology section, homographs will have multiple etymology sections, so under saw#English you’ll find “a tool used for cutting”, “a saying or proverb”, and “past tense of see”, which each have completely different origins. Each etymology section corresponds to a single node in the etymology web.
- Within each etymology section, find all of the etymological relationships. Wiktionary entries follow a (mostly) consistent structure, so the algorithm relies on a handful of relatively simple rules to analyze the page. For example: in the entry for Proto-Indo-European *h₂ébōl (“apple”), the algorithm finds the “Derived terms” and “Descendants” sections and picks out all of the links. However, we have to be careful not to grab *ḱun- and *eh₂, which are completely unrelated to *h₂ébōl! We also need to pick out terms from the etymology text itself, which can occasionally be pretty convoluted. Thankfully, there are several templates that are used to mark various etymological relationships. Here is the wikitext used to create the entry (helpfully spell-checked by my browser). The “inh” template indicates that the current entry was inherited from whatever’s inside the template, and the “root” template records a Proto-Indo-European root.2 There are many cases where reading the wikitext gives you a deeper understanding than the visible output (although I don’t recommend actually using the site this way, for your own sanity). Even then, there are many terms which we aren’t really sure what to do with. I’ve developed a sort of pseudo-NLP which splits the text up into sentences and looks for certain words: “from” and “+” are usually good signs, while “parallel” or “compare” probably mean we should throw the terms away. This actually works better than you might expect.
- Now that we have a web of links between different nodes, we’re faced with the problem of homographs. Take a look at this etymology: Seems clear enough, right? Unfortunately, there’s a problem: there are three saws, and the etymology doesn’t specify which one. And for that matter, there are five mills! How can we connect “sawmill” to anything with this kind of ambiguity? It turns out that what we need is bidirectional links. Taking a look under “saw”: since “saw” (etymology 1) links to “sawmill” and “sawmill” (etymology 1) links to “saw”, we can create a link from “saw” (etymology 1) to “sawmill” (etymology 1). Unfortunately, as of writing, “mill” has no link to “sawmill”, and thus we have to throw that link away, as we don’t have enough information to figure out which mill is the right one.
Using this principle, I generated an etymology web with whatever connections I could get. The web is a directed graph — a set of things with arrows connecting them. Each arrow goes from an ancestor to a descendant. In the “strict” mode, we insist that two entries have to connect to one another. In the “full” mode, we use a little bit of gung-ho guesswork to grab a few more connections: if entry A only has a single etymology section, we guess that every incoming link must be referring to that etymology section, and if A doesn’t exist but has incoming links, we treat it like an entry with a single etymology section.
- Now we have a file recording all the connections we just came up with. The strict version is 241 MB, and the full version is 1.99 GB. Wait: why is it so big? That’s… a lot of Finnish. As it turns out, almost every Finnish word can be inflected in dozens of ways, and the script just slurped up everything it could get. Okay — Finnish isn’t the only language like this. But Finnish makes up over 45% of all the terms in the entire web, which is a number so ridiculous I had to triple-check it (mind you, this web is meant to contain every known language). Finnish just has that potent combination of being both highly inflected and simultaneously having a lot of editors creating entries. But if the goal is to find etymological relationships between English words, I doubt we’ll need to consider terms like alueellistuessamme (which Google translates as “as we regionalize”). I took the huge web and created a new web which only includes terms with at least one descendant in English. This comes out to a much more compact 90 MB, and the Finnish percentage is down to 0.03%.
- In principle, we should be done here. But the web was still full of redundant edges and cycles, which slowed my algorithms down to a crawl. To speed things up, I implemented the following optimization: if we have A -> B -> C, then we don’t also need A -> C (since there’s already a path from A to C). I then cut all of the cycles. A cycle is when a group of nodes forms a loop: for example, in the case of A -> B -> C -> A, we can cut either A -> B, B -> C, or C -> A. After some investigation, I decided that the arrow pointing at the most “prominent” term — the one with the most connections to other nodes3 — is the connection most likely to be dubious in some way. So, if B were found to be the most “prominent”, we would cut A -> B. This shrinks the file slightly, to 79 MB, but makes it much easier to work with.
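The pseudo-NLP filter from the etymology-text step can be sketched in a few lines (a toy version using the keywords mentioned above; the real rules are more involved):

```python
import re

POSITIVE = {"from", "+"}            # usually signals a genuine derivation
NEGATIVE = {"parallel", "compare"}  # usually signals a mere cross-reference

def keep_terms(sentence):
    # Tokenize crudely, keeping "+" as a token, then apply the word heuristics.
    tokens = set(re.findall(r"[\w+]+", sentence.lower()))
    return bool(tokens & POSITIVE) and not (tokens & NEGATIVE)

print(keep_terms("From Old English ān."))                 # True
print(keep_terms("Compare Old Norse einn."))              # False
print(keep_terms("From Old French; compare Latin cum."))  # False: vetoed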
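The bidirectional rule from the homograph step looks roughly like this (the data shapes are hypothetical: each node is a (title, etymology number) pair mapped to the bare titles its entry links to):

```python
# A link survives only if some etymology section of the target
# links back to the source's title.
links = {
    ("sawmill", 1): ["saw", "mill"],
    ("saw", 1): ["sawmill"],   # the cutting tool points back
    ("saw", 2): [],            # the proverb does not
    ("mill", 1): [],           # no mill etymology mentions sawmill
}

def resolve(source, target_title):
    # Return the unique target node that links back to the source, if any.
    matches = [node for node, outs in links.items()
               if node[0] == target_title and source[0] in outs]
    return matches[0] if len(matches) == 1 else None

print(resolve(("sawmill", 1), "saw"))   # ('saw', 1): the link is kept
print(resolve(("sawmill", 1), "mill"))  # None: thrown away as ambiguous
```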
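The redundant-edge pruning in the final step is a transitive reduction. Here is a minimal sketch on a tiny acyclic example (in the real web the cycles must be cut, as described above, for this to be safe):

```python
# Toy web as adjacency sets. A -> C is redundant because A -> B -> C exists.
graph = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": set(),
}

def reachable_without(g, u, v):
    # Can we reach v from u without taking the direct edge u -> v?
    stack = [w for w in g[u] if w != v]
    seen = set()
    while stack:
        n = stack.pop()
        if n == v:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(g[n])
    return False

# Drop every edge whose endpoints remain connected without it.
# (A sketch: pruning while iterating is fine here, but a production
# version would compute the reduction on a frozen copy of the graph.)
for u in list(graph):
    for v in list(graph[u]):
        if reachable_without(graph, u, v):
            graph[u].discard(v)

print(graph)  # only A -> B and B -> C survive
```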
The process seems lengthy as I’ve laid it out here, but all of the steps above are done together in a single Python script. Python might seem like a terrible choice for processing “big data”, given that it’s notoriously slow, but it actually wasn’t so bad: the whole run takes around 2 hours. Parsing the HTML with Python’s regex library, which is written in highly optimized C, was very quick. Yes, I know that’s something you’re not supposed to do, but I ran a test to estimate how long it would take to parse the HTML “properly” and it came out to almost an entire year — needless to say, that approach wasn’t going to be on the table. Speed was particularly important, as I had to run this script probably dozens of times. Each run represented some improvement or bugfix (and occasionally, a new bug introduced). With the unholy power granted to me by regex, my output gradually went from complete sassafras to something actually resembling good etymology.
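To give a flavour of the regex approach, here is a toy split of a page into its language sections (the markup is simplified and hypothetical; real Parsoid HTML is far messier, but the technique is the same):

```python
import re

# Toy page in the shape of a Wiktionary entry: one h2 heading per language.
html = (
    '<section><h2 id="English">English</h2><p>more tall</p></section>'
    '<section><h2 id="Spanish">Spanish</h2><p>workshop</p></section>'
)

# Split at language headings; re.split with a capture group keeps the
# captured language names interleaved with the section bodies.
parts = re.split(r'<h2 id="([^"]+)">', html)
sections = dict(zip(parts[1::2], parts[2::2]))

print(list(sections))                     # ['English', 'Spanish']
print("workshop" in sections["Spanish"])  # True
```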
Playing around with it
Now that I had my web, I could start playing around with it. One of the most interesting results came from looking at the terms with the most descendants in English. At the top of the ladder we have Proto-Indo-European *-h₂ with a cool 167,092 descendants. Why so many? Look at the definition: “creates collective nouns”. As it turns out, there are a lot of collective nouns.
After *-h₂ follows a succession of PIE suffixes that create the rest of your nouns, verbs, and adjectives. The first non-PIE term, Proto-Germanic *-ōną, appears surprisingly early: #8, with 76,709 descendants. The first few terms that aren’t suffixes are Proto-Indo-European *ḱe (#11), *ḱóm (#14), and *dʰeh₁- (#15). *ḱóm in particular ranks so high due to two superstar descendants: Proto-Germanic *ga- (#106), which has mostly disappeared from English but is still preserved as the initial a- in words like aware (compare Old English ġewær) and alike (compare Old English ġelīċ and ġelīċe), and Latin cum (#47), ancestral to thousands of words beginning with “co-“. Among them: compute, corrupt, conceal, correct…
I also developed an algorithm to calculate the relatedness of text by generating etymological “groups” of terms which share a common ancestor. For example, the terms listed above all belong to the “con-” group (note: Latin con- comes from cum):
Each group is given a score; these scores are summed to calculate the overall relatedness of the text. The score is based on a few factors: how close the terms are to one another in the text, the size of the group, and the number of steps from each word to its common ancestor. Analyzing the British National Corpus reveals that the average relatedness score is around 32, but with substantial variation across different parts of the text.
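The exact weighting is still being tuned, so the following is only a hypothetical scoring function with made-up weights (the names and the formula are mine, not the project's), illustrating the three factors:

```python
def group_score(positions, depth):
    # positions: word indices of the group's terms in the text.
    # depth: steps up to the common ancestor. Assumed weights: bigger groups,
    # tighter spacing, and deeper (more obscure) connections raise the score;
    # a one-step connection like apple -> applesauce scores zero.
    spread = max(positions) - min(positions) or 1
    return len(positions) * (depth - 1) / spread

def relatedness(groups):
    return sum(group_score(p, d) for p, d in groups)

# nerve at word 3 and sinew at word 7, nine steps up to PIE *snéh₁wr̥:
print(relatedness([([3, 7], 9)]))  # 4.0
# apple and applesauce adjacent, one step apart:
print(relatedness([([3, 4], 1)]))  # 0.0
```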
This histogram was created by sampling 100-word windows from the BNC and calculating the relatedness score of each window. Although most of the scores are between 0 and 75, a handful of windows achieved relatedness scores of well over 100.
I created an interface to visualize the etymological connections between words within text. All you need to do is type some text and related words will automatically get highlighted. It’s actually a lot of fun to try getting the relatedness score to be as high as possible!
So that’s what I’ve got for now. There are definitely improvements to be made, and we’re still figuring out how the relatedness score varies across different corpora. Maybe one of you has ideas for interesting things to do with the data? In the meantime, here’s some etymology to feast your eyes on. Note that the first tree has the ancestor at the top, while the second tree has the descendant on top. Also, for those of you who didn’t know what a soyjak was: I sincerely apologize for what you’re about to see.
Notes
1 Not including the appendices, which are wild and woolly.
2 Actually, the code doesn’t know about “inh” or “root”, but rather looks for certain categories which are generated by those templates.
3 The most important number is actually the number of outbound connections, but the number of inbound connections is also used to break ties.