In “Hathi’s Automatic Genre Classifier” [17.01.06] I compared the consolidated automatic genre metadata for a subset of HathiTrust Digital Library texts (available here) to the genre classifications arrived at for human-inspected works as part of the OED quotation tagging project underway at The Life of Words. My process there was pretty closely supervised, but the high accuracy of the Hathi genres, compared to my LOW genres, led me to believe that there might be a way to round up some fugitive OED quotes using the data in the Hathi repository.
So I wrote a program to run search queries based on OED quotations through the Hathi Solr Proxy API, which returns an XML file with all the Hathi metadata for any matches. This turned out to be slooooow – anywhere from 8 to 60 seconds per query, which is a lot when you have 6-8 million-odd queries to run. But it has been going for a couple of weeks now, so I have a good amount of data for some preliminary results.
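The post doesn’t show the program itself, so here is a minimal sketch of the parsing half of the job: the Solr Proxy API returns XML metadata for each match, and the task is to pull out fields like title and genre per record. The element and attribute names below (`doc`, `str name="title"`, `arr name="genre"`) follow generic Solr XML response conventions and are assumptions, not the actual Hathi schema; the sample response is invented for illustration.

```python
# Hedged sketch: parse a Solr-style XML response for Hathi metadata.
# Field names are assumed, not taken from the actual Hathi Solr Proxy schema.
import xml.etree.ElementTree as ET

SAMPLE_RESPONSE = """<response>
  <result numFound="1">
    <doc>
      <str name="title">Sartor Resartus</str>
      <arr name="genre"><str>Not fiction</str></arr>
    </doc>
  </result>
</response>"""

def parse_matches(xml_text):
    """Extract (title, genre-list) pairs from a Solr-style XML response."""
    root = ET.fromstring(xml_text)
    matches = []
    for doc in root.iter("doc"):
        title = doc.find("str[@name='title']")
        genres = [g.text for g in doc.findall("arr[@name='genre']/str")]
        matches.append((title.text if title is not None else None, genres))
    return matches

print(parse_matches(SAMPLE_RESPONSE))
```

With the 8–60 seconds per query the author reports, the network round-trip, not the parsing, is the bottleneck, which is why millions of queries take weeks.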
At the time I scooped up these results for analysis the program had queried about 14,500 titles and found about 8,000 matches [see here for LOWbot’s latest status], accounting for around 140,000 OED quotations.
For the 85% or so of these that we LOW humans have assigned a textual genre, I assigned a category (in some cases combining several of our more detailed categories into one), and then compared it to a Hathi category based on Hathi’s genre metadata for the matched text (where that existed).
The results were somewhat less than overwhelming. Below are two tables, based on work- and quotation-counts, respectively (some works have only one quotation, while others have thousands). In rows are the LOW genre categories, while Hathi categories are in columns. The square boxes represent targets for overlapping genres. A perfectly accurate system would have counts only within the boxed-in cells.
The percentages across the bottom represent “accuracy”, or the percentage of texts/quotations for which Hathi agrees with LOW when it chooses a genre category. The percentages down the far right column represent “recall”, or the percentage of all the LOW texts/quotes in each category that Hathi has matched to the same category.
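The two percentages can be made concrete as column-wise and row-wise ratios over the confusion table. This sketch uses invented counts, not the actual table data, purely to show how the bottom-row “accuracy” and right-column “recall” figures are computed from the same grid:

```python
# Illustrative sketch: rows are LOW categories, columns are Hathi categories.
# counts[low][hathi] = number of texts (or quotes) in that cell. Counts invented.
counts = {
    "FICT": {"Fiction": 80, "Not fiction": 40},
    "POET": {"Fiction": 5, "Poetry": 10, "Not fiction": 60},
}

def accuracy(counts, hathi_cat, target_low):
    """Of the texts Hathi put in hathi_cat, the share LOW put in target_low."""
    col_total = sum(row.get(hathi_cat, 0) for row in counts.values())
    agree = counts.get(target_low, {}).get(hathi_cat, 0)
    return agree / col_total if col_total else 0.0

def recall(counts, low_cat, target_hathi):
    """Of all LOW texts in low_cat, the share Hathi matched to target_hathi."""
    row_total = sum(counts[low_cat].values())
    agree = counts[low_cat].get(target_hathi, 0)
    return agree / row_total if row_total else 0.0
```

In standard classification terms the bottom-row figure is closer to per-class precision, but the post’s labels are kept here.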
Accuracy is pretty much okay for most of these. Or, when not okay, it’s understandably so – our REFerence is eating up lots of Hathi’s dictionaries, for instance, almost all due to a number of “cyclopaedias” — Chambers’ prominently — which might look like dictionaries but really aren’t. There’s also a lot of confusion between our letters and Hathi’s expository prose. That’s largely because many of our letters are quoted in biographies, which otherwise would be expository, but there does seem to be some significant Hathi misidentification of collections of correspondence. Drama doesn’t have enough Hathi data to be significant. Ditto fiction and, looking at the big picture, poetry.
Which leads to the real conclusion, which is: Recall is basically useless.
This is what you’d expect if the classifier was cautious and the data limited. Here it appears that the large majority of texts are just getting dumped into the miscellaneous “Not Fiction” category, which is not a very useful genre category at all.
At least, it isn’t for my purposes. I might be able to use this to clean up a few erroneous tags of ours, or maybe get a small number of fiction and poetry volumes out of the untagged portion of our OED corpus. But the low recall means that while those untagged works that Hathi thinks are fiction or poetry may well be that, this method will miss a whole lot of the fiction and poetry that might be there.
Sometimes algorithms are just a waste of time.
Very interesting experiment, following up on your earlier, also interesting one.
Can I ask, what form of Hathi metadata is getting used in this post? I’m not sure I know what exactly gets returned through the Solr API. Are you using Library of Congress subject headings, or …?
Unless I’m confused, I don’t think this is my algorithmically-inferred metadata; I think it’s probably human tagging.
To be honest I’m not too clear on how Hathi populates its “genre” field – it’s one of a dozen or more in the XML file returned by the Solr API:
I’d be surprised if it was all human-coded, however, as nearly every record seems to have a genre value.
I did do something else with LC headings, here: http://thelifeofwords.uwaterloo.ca/oed-subject-matter/
Your comment reminds me that my next experiment in this series was to link up hits to rows in your consolidated metadata sheets and compare results. It might be next week though before that happens, as LOWbot has come down with something and is in need of repairs.
Thanks! I’ll have to check, myself, where the Solr API gets its metadata information. My guess is that that field is *inferred from* metadata that was ultimately created by hand — by human catalogers in a bunch of different libraries. But in mapping that info to Solr, the translation process may, as you suggest, simply dump everything into “Not Fiction” if there’s no human tag.
I’ll also be interested to see what you get with my (algorithmic) genre tags. My data was unfortunately based on a version of the library which is five years old now. Hathi now has lots of volumes that just aren’t included in my genre metadata.
I believe the Solr info comes mostly from character position #33 here:
You’ll recognize most of the categories. It’s ultimately human tagging.
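If “character position #33” refers, as seems likely, to the literary-form byte of the MARC 008 fixed field for books, the mapping from that single character to the genre labels in the tables is straightforward. The codes below are the standard MARC 008/33 values; whether Hathi’s Solr field uses exactly this mapping is an assumption:

```python
# Standard MARC 008/33 literary-form codes for books (a subset).
# Whether Hathi's Solr "genre" field maps exactly from these is assumed.
LITERARY_FORM = {
    "0": "Not fiction",
    "1": "Fiction",
    "d": "Drama",
    "e": "Essay",
    "i": "Letters",
    "j": "Short stories",
    "p": "Poetry",
    "s": "Speeches",
}

def literary_form(field_008):
    """Read the literary-form code at character position 33 of a MARC 008 string."""
    if len(field_008) < 34:
        return "unknown"
    return LITERARY_FORM.get(field_008[33], "unknown")
```

A single human-keyed byte per record would explain both why nearly every record has a genre value and why so much ends up in the catch-all “Not fiction”.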
Oh right okay – that makes sense, and is helpful to know. Well, to paraphrase myself, sometimes humans are just a waste of time (though in the original, tbc, I was referring to my algorithms, not yours).
So anyway, as a highish-accuracy/low-recall set this will be a good third comparator to yours and mine, in those places where our texts overlap.