Hathi’s Automatic Genre Classifier

The HathiTrust Digital Library is a massive collection of digital books: As of 2017, it contains 5 billion pages from 15 million volumes (7 million titles). About 40% of these are public-domain works, meaning anyone can search and read them. Some of these have been marked for their textual genre. Here I do a little verification of this mark-up based on my OED quotation dataset.

Some time ago a team led by Ted Underwood started work on automatic classification of the textual genre of HathiTrust books. Their interim report (2014) is worth reading (here), both for its discussion of the problem of genre and for its technical discussion of machine learning approaches. Basically what they did was have humans mark a bunch of pages of text for the genre represented on that page, then collect a bunch of data for each page (word counts and “structural features,” which I assume include things like margin justification and information about the whole volume). Then the machine learning bots got to work, trying to use the feature counts to predict the genre. The results were inspected and cleaned up, and datasets produced for three main categories: poetry, drama, and fiction (downloadable here).
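To make the general idea concrete, here is a minimal sketch of predicting a page's genre from word counts. It is not the Underwood team's method: I have swapped in a toy naive Bayes classifier, it ignores structural features entirely, and the training pages, the `train` and `predict` names, and the genre labels are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(pages):
    """Count words per genre from (text, genre) pairs labelled by humans."""
    word_counts = defaultdict(Counter)  # genre -> word -> count
    page_counts = Counter()             # genre -> number of labelled pages
    vocab = set()
    for text, genre in pages:
        words = text.lower().split()
        word_counts[genre].update(words)
        page_counts[genre] += 1
        vocab.update(words)
    return word_counts, page_counts, vocab

def predict(model, text):
    """Return the genre with the highest naive Bayes log-score for the text."""
    word_counts, page_counts, vocab = model
    total_pages = sum(page_counts.values())
    best_genre, best_score = None, -math.inf
    for genre in page_counts:
        # Prior: share of training pages carrying this genre label.
        score = math.log(page_counts[genre] / total_pages)
        genre_total = sum(word_counts[genre].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing so an unseen word/genre pair is not fatal.
            score += math.log(
                (word_counts[genre][word] + 1) / (genre_total + len(vocab))
            )
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre

# Hypothetical hand-labelled "pages" (far too few for real use).
training = [
    ("the moon and the rose sing of love", "poetry"),
    ("o sweet rose beneath the silver moon", "poetry"),
    ("enter hamlet exit ghost enter horatio", "drama"),
    ("exit king enter queen and attendants", "drama"),
    ("she said that he had walked to the village", "fiction"),
    ("he said the house was cold and she laughed", "fiction"),
]
model = train(training)
print(predict(model, "enter ghost exit"))
print(predict(model, "she said he laughed"))
```

A real system would need thousands of labelled pages, many more features, and held-out data for checking the predictions, which is presumably where the inspection and clean-up stage comes in.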

I also have a biggish dataset: something like 56 million words in 2.44 million quotations, from about 164,000 authors (the number of discrete works is harder to say, because OED’s citation conventions vary so much). A large number of these quotations (about 2.1 million) have been marked up for textual genre by humans, including me and a number of trusted minions. So this ought to be a decent comparison set.
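The sort of check I have in mind can be sketched as follows, assuming each work ends up with one human genre label and one automatic label. The `compare` function and the two label lists are hypothetical stand-ins, not real data from either dataset.

```python
from collections import Counter

def compare(human, automatic):
    """Return the agreement rate and counts of (human, automatic) label pairs."""
    if len(human) != len(automatic):
        raise ValueError("label lists must be the same length")
    pairs = Counter(zip(human, automatic))
    matches = sum(n for (h, a), n in pairs.items() if h == a)
    return matches / len(human), pairs

# Invented labels for eight works, purely for illustration.
human = ["poetry", "poetry", "drama", "drama",
         "fiction", "fiction", "fiction", "poetry"]
automatic = ["poetry", "poetry", "drama", "fiction",
             "fiction", "fiction", "poetry", "poetry"]

agreement, pairs = compare(human, automatic)
print(f"agreement: {agreement:.2f}")
for (h, a), n in sorted(pairs.items()):
    if h != a:
        print(f"human={h}, automatic={a}: {n}")
```

The disagreeing pairs are the interesting part: they show which genres get confused with which, rather than just how often the two sets of labels differ.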


OED Gender Genre

In “Sex in the OED” I ran through some figures on female vs male representation in OED quotation evidence, comparing the original OED1 with the later Supplements that resulted in OED2. Here I look a little closer at what kinds of works by women the two editions tended to cite. Below are two charts breaking […]


Burchfield’s Reach-Backs

The vast majority of the quotation evidence in Robert Burchfield’s OED Supplements comes from after the first (1928) edition was completed. The median date for these is 1944, whereas for the first edition it’s 1742. However, in some circumstances the Supplements did reach back into periods already covered by OED1 — if it could antedate […]


Sex in the OED

Two subprojects concerning OED quotation metadata are now near enough to complete to present some preliminary results. They concern the sex of the authors quoted in the OED, in both the first edition (1928) and the later Supplements (1933, 1972-86). The most focused work on this question so far has been Baigent, Brewer, and Larminie, […]


Entitled Professor

I happen to have an interest and a certain amount of expertise in words that mean their own opposites. You might say I’m qualified to post here on that topic. You might even say I’m entitled to my opinion on a wider range of things in which I’m not necessarily expert. But if you call […]


Guest Post: Strong and Weak Genre Classification

Over the summer we’re featuring guest posts by Research Assistants at The Life of Words. Here Cosmin Dzsurdzsa – a 2nd year undergraduate in English at UW – thinks about moving from human intuition to computer rule-making in textual-genre classification: When trying to automate text classification algorithmically, one has to pay close attention to how […]


Guest Post: A Winter-Evening Conference and the Problem of Genre

Cosmin Dzsurdzsa is well into his first full-time co-op term as a research assistant at The Life of Words. Here he tells us about a case that seemed to challenge every classification rule we developed. What is “genre”? This is a question I constantly find myself asking as an RA here at The Life of […]


Morsels, a kind of poem

Latest in the “A kind of poem” series [previous: here, here, and here], I give you “Morsels”, a kind of poem.


Guest Post: Moving from 2.0 to 3.0

Danielle Griffin recently completed her co-op term as a full-time research assistant at The Life of Words. Here she offers some thoughts about her work on identifying the textual genre of quotations in the Oxford English Dictionary: When I started my job as an RA, Dr. Williams had me tagging quotations five days a week […]


Competition Anthology Published

We’ve published our 2016 Life of Words Anthology, presenting fifteen meritorious poems sent to us in our “Write a Poem about a Word” competition. It’s available here. Congratulations to all!