Tag Archives: HathiTrust

Published: Women’s Words in the OED

Now published in Review of English Studies (Advance Access), an article by me on the ways in which the Oxford English Dictionary has treated texts authored by women in its marshalling of citation evidence for English language lexis, from the first edition (1884-1928) to the current OED3 revision (2000-). The approach I take is driven […]

OED Work on “Writing and Editing” Podcast

I talked with Wayne Jones the other day about my work on the Oxford English Dictionary. The result was this short piece on my plans for an updated OED bibliography and Variorum: 182. Enhancing the Oxford English Dictionary

Gender Shifts in American Names

Lately I’ve been working with several different gender-inference tools, tweaking them here and there to serve my purposes. Since I’m working with a historical dataset of about eight million records, from 1800 to today, one of the packages I’m using is the gender library for R by Lincoln Mullen, which uses historical US census and […]

One last round with metadata from Hathi and Underwood

In “Hathi’s Automatic Genre Classifier” and “Hathi Genre Again – Zero Recall”, I ran a couple of experiments comparing genre categories assigned by human taggers working on the Life of Words OED mark-up project to two sources of genre metadata associated with the HathiTrust Digital Library. The first post looked at data from the automatic […]

Hathi Genre Again – Zero Recall

In “Hathi’s Automatic Genre Classifier” [17.01.06] I compared the consolidated automatic genre metadata for a subset of HathiTrust Digital Library texts (available here) to the genre classifications arrived at for human-inspected works as part of the OED quotation tagging project underway at The Life of Words. My process there was pretty closely supervised, but the […]

Hathi’s Automatic Genre Classifier

The HathiTrust Digital Library is a massive collection of digital books: as of 2017, it contains 5 billion pages from 15 million volumes (7 million titles). About 40% of these are public-domain works, meaning anyone can search and read them. Some of these have been marked for their textual genre. Here I do a little […]