In “Hathi’s Automatic Genre Classifier” [17.01.06] I compared the consolidated automatic genre metadata for a subset of HathiTrust Digital Library texts (available here) to the genre classifications arrived at for human-inspected works as part of the OED quotation tagging project underway at The Life of Words. My process there was pretty closely supervised, but the high accuracy of the Hathi genres, compared to my LOW genres, led me to believe that there might be a way to round up some fugitive OED quotes using the data in the Hathi repository.
So I wrote a program to run search queries based on OED quotations through the Hathi Solr Proxy API, which returns an XML file with all the Hathi metadata for any matches. This turned out to be slooooow – anywhere from 8 to 60 seconds per query, which is a lot when you have some 6–8 million queries to run. But it has been going for a couple of weeks now, so I have a good amount of data for some preliminary results.
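The lookup loop is simple in outline. Here is a minimal sketch of the kind of query-and-parse step involved; the endpoint URL and the XML field name are placeholders, not the real Hathi Solr Proxy schema, which I won't reproduce from memory.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder endpoint -- the real Solr Proxy URL and parameters differ.
SOLR_PROXY = "http://example.org/solr/select"

def build_query(quotation: str, rows: int = 5) -> str:
    # Search for the OED quotation as an exact phrase, URL-encoded.
    phrase = '"%s"' % quotation.replace('"', ' ')
    params = urllib.parse.urlencode({"q": phrase, "rows": rows})
    return "%s?%s" % (SOLR_PROXY, params)

def fetch_matches(quotation: str):
    # Each response is an XML document; pull out whatever metadata
    # elements the proxy returns ("title" here, purely as illustration).
    with urllib.request.urlopen(build_query(quotation), timeout=60) as resp:
        root = ET.fromstring(resp.read())
    return [el.text for el in root.iter("title")]
```

With each round trip taking 8–60 seconds, even a loop this simple is dominated by network latency, which is why millions of queries stretch into weeks.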
At the time I scooped up these results for analysis, the program had queried about 14,500 titles and found about 8,000 matches [see here for LOWbot’s latest status], accounting for around 140,000 OED quotations.
For the 85% or so of these that have been assigned a textual genre by us LOW humans, I assigned a broad category (in some cases combining our own more detailed categories), and then compared it to a Hathi category derived from Hathi’s genre metadata for the matched text (where that metadata existed).
The results were somewhat less than overwhelming. Below are two tables, based on work- and quotation-counts, respectively (some works have only one quotation, while others have thousands). In rows are the LOW genre categories, while Hathi categories are in columns. The square boxes represent targets for overlapping genres. A perfectly accurate system would have counts only within the boxed-in cells.
The percentages across the bottom represent “accuracy”, or the percentage of texts/quotations for which Hathi agrees with LOW when it chooses a genre category. The percentages down the far right column represent “recall”, or the percentage of all the LOW texts/quotes in each category that Hathi has matched to the same category.
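The two percentages are just column-wise and row-wise ratios over the cross-tabulation. The sketch below shows the arithmetic on an invented toy table (the numbers are made up for illustration, not taken from the real results):

```python
# Toy LOW-by-Hathi cross-tabulation: (LOW category, Hathi category) -> count.
# Invented numbers chosen to illustrate high "accuracy" but low recall.
counts = {
    ("POETRY", "Poetry"): 40, ("POETRY", "Not Fiction"): 160,
    ("FICTION", "Fiction"): 30, ("FICTION", "Not Fiction"): 70,
}

def column_accuracy(counts, hathi_cat, matching_low_cat):
    # "Accuracy": of everything Hathi put in this column, what share
    # falls in the boxed-in (agreeing) LOW row?
    col_total = sum(n for (low, h), n in counts.items() if h == hathi_cat)
    agree = counts.get((matching_low_cat, hathi_cat), 0)
    return agree / col_total if col_total else 0.0

def row_recall(counts, low_cat, matching_hathi_cat):
    # "Recall": of all the LOW texts in this row, how many did Hathi
    # place in the agreeing column?
    row_total = sum(n for (low, h), n in counts.items() if low == low_cat)
    agree = counts.get((low_cat, matching_hathi_cat), 0)
    return agree / row_total if row_total else 0.0

print(column_accuracy(counts, "Poetry", "POETRY"))  # 1.0 -- perfect accuracy
print(row_recall(counts, "POETRY", "Poetry"))       # 0.2 -- poor recall
```

In the toy numbers, everything Hathi calls Poetry really is poetry (accuracy 100%), but four-fifths of the actual poetry ends up in “Not Fiction” (recall 20%) – exactly the pattern the real tables show.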
Accuracy is pretty much okay for most of these, or, when it isn’t okay, it’s understandably so: our REFerence category is eating up lots of Hathi’s dictionaries, for instance, almost all because of a number of “cyclopaedias” (Chambers’s prominently among them) which might look like dictionaries but really aren’t. There’s also a lot of confusion between our letters and Hathi’s expository prose. That’s largely because many of our letters are quoted in biographies, which would otherwise count as expository, but there does also seem to be some significant Hathi misidentification of collections of correspondence. Drama doesn’t have enough Hathi data to be significant; ditto fiction and, in the big picture, poetry.
Which leads to the real conclusion, which is: Recall is basically useless.
This is what you’d expect if the classifier were cautious and the data limited. Here the large majority of texts are simply being dumped into the miscellaneous “Not Fiction” category, which is not a very useful genre category at all.
At least, it isn’t for my purposes. I might be able to use this to clean up a few erroneous tags of ours, or maybe get a small number of fiction and poetry volumes out of the untagged portion of our OED corpus. But the low recall means that while those untagged works that Hathi thinks are fiction or poetry may well be that, this method will miss a whole lot of the fiction and poetry that might be there.
Sometimes algorithms are just a waste of time.