In “Hathi’s Automatic Genre Classifier” and “Hathi Genre Again – Zero Recall”, I ran a couple of experiments comparing genre categories assigned by human taggers working on the Life of Words OED mark-up project to two sources of genre metadata associated with the HathiTrust Digital Library. The first post looked at data from the automatic genre classifier developed by Underwood et al., while the second looked at the genre information gleaned by Hathi from MARC records and included in the XML metadata file associated with each volume. In this post I return to the Underwood et al. classifier datasets with a little more of my own data, using Hathi Solr record numbers to match work-title tokens from OED to Hathi volumes.
At the time I fetched the data, LOWbot had searched for 60,000-odd works using Hathi’s Solr proxy API, finding a match there at a rate of about 50%. Of the 28,000 OED work-title tokens matched, it found a little more than 3,000 corresponding volume-IDs in the three Underwood et al. metadata files I described in the first post. Some filtering of bad matches (i.e. where one field or another did not match from OED to Underwood) reduced this to a little over 2,000.
Because many of those 2,038 work-title tokens match to more than one Hathi record number, the final result was almost 3,000 Hathi records that had been marked for genre by us and by Underwood. Comparing these against each other produced good results, summarized in this agreement table (aka “confusion matrix”):
The highlighted boxes represent those cells where you would want all the numbers to end up if the two systems matched perfectly. Because Underwood et al. can assign multiple genres to a record (e.g. Works of Lord Byron is recorded as containing both poetry and drama), it probably makes sense to include these in the recall calculations in the rightmost column. Our categories are exclusive, but our work-title tokens tend to be more specific, so the bottom-line precision rates are probably best kept exclusive too (though in practice including mixed categories doesn’t change things much). Precision, here, means the percentage of Underwood tags in a category that LOW agrees belong to that category, whereas recall means the percentage of LOW tags in a category that Underwood agrees with.
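To make those two definitions concrete, here is a minimal sketch in Python. The records are invented toy values, not the data behind the table, and the function names are my own for illustration. It assumes each record pairs one exclusive LOW genre with a set of Underwood genres, counts mixed Underwood tags towards recall, and counts only single-genre Underwood tags towards precision, as described above.

```python
# Hypothetical toy records: (LOW genre, set of Underwood genres).
# Invented for illustration only; these are NOT the counts from the table.
records = [
    ("poetry", {"poetry"}),
    ("poetry", {"poetry", "drama"}),  # mixed tag, e.g. a collected works
    ("drama", {"drama"}),
    ("drama", {"fiction"}),           # e.g. a play matched to a novel
    ("fiction", {"fiction"}),
    ("prose", {"drama"}),             # e.g. a preface to a play collection
]


def recall(records, genre):
    """Share of LOW tags for `genre` that Underwood agrees with.

    Inclusive: a mixed Underwood tag counts if it contains `genre`.
    """
    underwood_sets = [uw for low, uw in records if low == genre]
    if not underwood_sets:
        return None
    return sum(genre in uw for uw in underwood_sets) / len(underwood_sets)


def precision(records, genre):
    """Share of Underwood's `genre` tags that LOW agrees with.

    Exclusive: only records tagged with `genre` alone are counted.
    """
    low_tags = [low for low, uw in records if uw == {genre}]
    if not low_tags:
        return None
    return sum(low == genre for low in low_tags) / len(low_tags)


for g in ("poetry", "drama", "fiction"):
    print(g, "recall:", recall(records, g), "precision:", precision(records, g))
```

Switching precision to the inclusive reading would only mean replacing `uw == {genre}` with `genre in uw`.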
Both the recall numbers and the precision numbers are impressive, with recall especially high for fiction, and precision pretty high for poetry (almost as precise, indeed, as our human taggers, who diverge from each other about 2.7% of the time on poetry genre assignments).
Actually the classifier is probably even more accurate than the table suggests, since a number of the mismatches are not true errors on its part, but rather part of the experiment design. Looking more closely at drama, the least accurate genre, I can see, for instance, that Underwood et al. calls Peter Pan fiction, while we have it as drama. But there are two Peter Pans — one a play (1904), one a novel (1911), and in this case my matcher has found the wrong one (the play is not in Underwood’s drama set).
Several experiment-related disagreements on the precision side have to do with Underwood’s inclusive genre assignment: many of their “drama”s we have as expository prose because the OED work-title combo refers to a preface, introduction, or note in a collection of plays. In other cases, an expository work of literary history or criticism will show up as “drama”, presumably because of quotations from dramatic works.
In all, for dramatic works, I found just four or five true errors in recall (of 20 on the chart), and just two or three true errors in precision (of 8 on the chart). That, of course, improves things a lot. And what did the classifier think was a drama that so clearly is not, you ask? Here are two examples, with screen-shots that might go some way towards explaining why (best to look at them fuzzy first, then click to biggen):
|Charles John Smith, Synonyms and Antonyms||William Aiton, Hortus Kewensis|
So the final result, I think, is very positive for this classifier. There’s only one downbeat note, which is the very small number of actual work-title tokens from my OED stack that I was able to match to works in the Underwood dataset(s): just 2,038 out of 60,000 or so that I ran against Hathi in the first place.
And, of those 2,038, there were only 212 that we hadn’t already tagged manually. So the actual utility for rounding up fugitive quotations is fairly limited – my RAs can tag 200 tokens in an hour or two. There may be a few hundred more to come down the pike once LOWbot gets through the remaining 160,000 or so tokens, but right now it looks like the method will be far from a game changer.
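For a rough sense of the attrition, the counts quoted in this post can be lined up as a funnel. This is only a back-of-envelope sketch: the first three figures are rounded from the approximate numbers given above (“60,000-odd”, “28,000”, “a little more than 3,000”), and the stage labels and the `stage_rates` helper are my own for illustration.

```python
# Counts as quoted in the post. The first three are rounded approximations;
# only 2,038 and 212 are given as exact figures.
funnel = [
    ("searched via Hathi Solr", 60_000),
    ("matched on Hathi", 28_000),
    ("found in Underwood files", 3_000),
    ("kept after filtering", 2_038),
    ("not already tagged by hand", 212),
]


def stage_rates(funnel):
    """Return (label, count, share of previous stage, share of first stage)."""
    first = funnel[0][1]
    prev = first
    rows = []
    for label, count in funnel:
        rows.append((label, count, count / prev, count / first))
        prev = count
    return rows


for label, count, of_prev, of_first in stage_rates(funnel):
    print(f"{label:28} {count:>7,} {of_prev:7.1%} {of_first:8.2%}")
```

On these rounded figures, about 3.4% of the searched tokens end up matched to the Underwood data, and roughly a tenth of those were not already tagged.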