In my last post I described using HathiTrust’s Solr Proxy API to fetch Hathi genre metadata for OED quotations. But genre is not the only metadata that Hathi sends back down the intertubes when I ask it a question. For most works, I also get a Library of Congress Classification code for the volume. This would be the number you’d use to find your book in your university library.
While shelfmarks aren’t a great proxy for genre in most cases, they do a fairly decent job of describing subject matter, though somewhat arbitrarily: if you were inventing a system today, you would not likely include separate top-level categories for both “military science” (U) and “naval science” (V), and you would not lump together poetry, dictionaries, and books on psycholinguistics in another top-level category called “language and literature” (P).
Even so, working with subclasses and rearranging things a bit when necessary, it’s possible to get a fairly good sense of what kind of book you’re dealing with, and a strong idea about the general domain of knowledge.
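For the curious, here is a minimal sketch of the kind of regrouping I mean. It is an illustration rather than the actual LOWbot code: the function names, the regular expression, and especially the `BROAD_GROUPS` table are assumptions made up for this example (the single-letter LC classes themselves are standard, but how you bundle them into larger colour groups is a judgment call).

```python
import re

# Hypothetical bundling of top-level LC classes into broader groups.
# The letters are real LC classes; the groupings are this sketch's own choice.
BROAD_GROUPS = {
    "Philosophy and religion": "B",
    "History": "CDEF",
    "Social sciences": "GHJKL",
    "Arts": "MN",
    "Language and literature": "P",
    "Science and technology": "QRST",
    "Military and naval science": "UV",
    "General and bibliography": "AZ",
}

# Invert the table so each class letter points to its group name.
CLASS_TO_GROUP = {
    letter: group for group, letters in BROAD_GROUPS.items() for letter in letters
}


def lc_subclass(call_number):
    """Return the leading 1-3 letter LC subclass (e.g. 'BS'), or None."""
    # The letters must be followed by a digit or the end of the string,
    # so ordinary words are not mistaken for call numbers.
    match = re.match(r"\s*([A-Za-z]{1,3})(?=\s*\d|\s*$)", call_number)
    return match.group(1).upper() if match else None


def broad_group(call_number):
    """Map a call number to one of the broad groups above, or None."""
    subclass = lc_subclass(call_number)
    if subclass is None:
        return None
    return CLASS_TO_GROUP.get(subclass[0])


if __name__ == "__main__":
    for example in ["BS185 1611", "PR2807 .A2", "QA76.73", "not a call number"]:
        print(example, "->", lc_subclass(example), "/", broad_group(example))
```

Keeping the subclass (BS, PR, QA) alongside the broad group makes it easy to split out a crowded class like P or Q later without re-parsing anything.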
My diligent Hathi looker-upper is only about 10% of the way through its job with OED texts, and only matches one out of two texts it tries, but that still gives us a fair amount of data to work with (7,000 or so texts), at least for a preliminary glance at what kinds of books have lent the most quotation evidence to the OED.
The first graph here matches Hathi LC-Classes to our Life of Words genre assignments, with LCs grouped together by colour for larger categories (e.g. all the various kinds of history are salmon-coloured; social sciences aqua, etc.):
There are a number of confirmatory things here: all our biblical quotes have landed in B (BS is the LC mark for the Bible – yep, seriously); almost all of our various scientific genres are in the green sciency LC marks; and poetry, drama, and fiction are mostly red, for P. Our untagged (“???”) quotations appear to be fairly evenly distributed, which is probably good. There seems to be a lot of science in our DICtionary and REFerence categories, which may be due to things like Dictionary of Medical Terms, or a disagreement about what counts as scientific. This might be a helpful indicator to integrate into our tagging process in the future.
So far, so good, if a little ho-hum. I’ll wait for LOWbot to churn its way through a few more titles before pursuing this avenue much further, but it may be interesting in the future to see, e.g., what kinds of words have the most and least diversity of quotation evidence, or what kinds of books get used in particular ways by the dictionary.
For now, here’s another preliminary view of OED quotation evidence subject matter, this time based on the period of the work:
There are a few nice tales of historical change we can tell with this chart – the decline of B (Philosophy, Religion – including the Bible) and P (Language and Literature); prominently, the rise in Scientific categories (green), especially Science (Q) and Technology (T); perhaps the brief flourishing of History (C, D, E, F) in the latter half of the 18th C; and perhaps more. This is all intriguing, and merits further exploration. As implied above, it would be especially good to break down P and Q into significant subclasses – something to keep in mind for later.
One last thing is worth a look right now, though. The chart above counts percentages based on works (as given by OED – in reality, one work may have a number of distinct author-title combinations). To see how thoroughly these works were trawled by OED, it’s necessary to look at the number of quotations derived from them.
So here are percentages of OED quotations, by subject class, by period:
Worth investigating, I’d say, are any visible discrepancies, like the top and bottom colour groups in the first column (disproportionately more quotations from T, and disproportionately fewer from B), or the relative stability over time in the percentage of P quotations compared with the declining percentage of P works, which implies that each P work is yielding more quotations, on average, in later periods.
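The difference between the two views comes down to what gets counted. Here is a rough sketch, assuming each matched work has been boiled down to a hypothetical `(period, lc_class, n_quotations)` tuple; the function name and the sample figures are invented for illustration and are not real OED or Hathi data.

```python
from collections import defaultdict


def shares_by_period(works):
    """For each period, return {lc_class: (% of works, % of quotations)}.

    `works` is an iterable of (period, lc_class, n_quotations) tuples,
    one tuple per work.
    """
    work_counts = defaultdict(lambda: defaultdict(int))
    quote_counts = defaultdict(lambda: defaultdict(int))
    for period, lc_class, n_quotations in works:
        work_counts[period][lc_class] += 1               # one more work
        quote_counts[period][lc_class] += n_quotations   # all its quotations

    shares = {}
    for period, per_class in work_counts.items():
        total_works = sum(per_class.values())
        total_quotes = sum(quote_counts[period].values())
        shares[period] = {
            lc_class: (
                100 * n_works / total_works,
                100 * quote_counts[period][lc_class] / total_quotes
                if total_quotes else 0.0,
            )
            for lc_class, n_works in per_class.items()
        }
    return shares


if __name__ == "__main__":
    # Invented numbers, purely to show the shape of the calculation.
    sample = [
        ("1700-1749", "P", 10),
        ("1700-1749", "P", 30),
        ("1700-1749", "B", 10),
        ("1700-1749", "T", 50),
    ]
    for period, per_class in shares_by_period(sample).items():
        for lc_class, (pct_works, pct_quotes) in sorted(per_class.items()):
            print(period, lc_class, f"{pct_works:.0f}% of works,",
                  f"{pct_quotes:.0f}% of quotations")
```

A class whose second number runs well ahead of its first is being mined more heavily per work than its share of titles would suggest, which is exactly the sort of discrepancy flagged above.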
But these are investigations for a future day. Meanwhile LOWbot keeps on keeping on.