Danielle Griffin is a research assistant on her third co-op term at The Life of Words. This is the first of a few posts based on her last work-term report, “Comparative Data Visualizations of Textual Features in the OED and the Life of Words Genre 3.0 Tagging System”. Danielle’s report won the Quarry Integrated Communication Co-op English Award.
During my last work placement as a full-time research assistant at The Life of Words last fall, I began toying with ways to situate our genre categorization system (known to us as G3.0) within genre theory more generally. Drawing on work by David Lee (see his extensive analysis of genre and register characteristics of the BNC corpus) and Ted Underwood’s work with HathiTrust’s corpus, I took a bottom-up approach to genre, describing and characterizing texts rather than categorizing them.
To do this, I identified a set of 22 evenly weighted, (mostly) non-exclusive features, or attributes, that might apply to any text. Each text in my data set, derived from a list of OED sources, was then marked according to these criteria, so that each text has its own 22-feature combination (represented as True or False for each). From there I built some data visualizations to have a look at the relationships between us human taggers and G3.0 tagging conventions, among the G3.0 genre categories themselves, and between the OED and textual genre generally.
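In code, this kind of marking amounts to giving each text a fixed-order boolean vector. Here’s a minimal sketch of the idea; the feature names and values are invented for illustration (my actual 22 attributes aren’t listed here), and the list is shortened to four slots:

```python
# Hypothetical sketch: each text as a fixed-order True/False vector.
# Feature names and values are invented; the real set has 22 features.
FEATURES = ["in_verse", "has_dialogue", "has_stage_directions", "is_narrative"]

texts = {
    191: {"in_verse": True, "has_dialogue": True,
          "has_stage_directions": False, "is_narrative": False},
    42:  {"in_verse": True, "has_dialogue": False,
          "has_stage_directions": False, "is_narrative": False},
}

def to_vector(attrs):
    """Return a fixed-order tuple, one True/False slot per feature."""
    return tuple(attrs[f] for f in FEATURES)

print(to_vector(texts[191]))  # (True, True, False, False)
```

Because every text gets the same fixed-length vector, two texts can be compared simply by counting the features on which they differ.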
Here’s what looks a bit like a family tree, twisted around into a circle. It’s a branched diagram representing the implied “family” relationship among items, based upon their similarities and differences. Really, this is just a made-up hierarchical dendrogram, rotated on the point of the first node for easy reading. It’s unidirectional, clockwise, meaning the end of the tree, at 3 o’clock, is actually the farthest point from the beginning (just below it). [You can click on the image to make it bigger, or download a PDF.]
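For readers curious about the mechanics: a dendrogram like this can be computed from the boolean feature vectors with standard hierarchical clustering. The sketch below uses SciPy with Hamming distance (the share of features on which two texts differ) and average linkage; this is one reasonable recipe, not necessarily the exact settings behind my figure, and the three toy “texts” are made up:

```python
# Sketch of hierarchical clustering over boolean feature vectors.
# Toy data; a plausible recipe, not necessarily the exact one I used.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Rows = texts, columns = True/False features (shortened from 22 to 4).
X = np.array([
    [1, 1, 0, 0],   # text A
    [1, 1, 0, 1],   # text B (differs from A in one feature)
    [0, 0, 1, 1],   # text C (differs from A in all four)
], dtype=bool)

# Hamming distance = fraction of differing features between two texts.
Z = linkage(pdist(X, metric="hamming"), method="average")

# no_plot=True returns the tree structure (leaf order etc.) without drawing.
tree = dendrogram(Z, no_plot=True, labels=["A", "B", "C"])
print(tree["ivl"])  # leaf order around the (unrotated) tree
```

Texts A and B, differing in only one feature, merge first; C joins the tree last, which is exactly the “family resemblance” structure the circular figure displays.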
For now, it’s best to ignore the colours of the branches [see my next post], and focus on the individual endpoints of the dendrogram. Each is one text in the data set, represented by a unique ID number around the circumference. Those IDs are colour coded, with each colour corresponding to the G3.0 tag we assigned to it in the course of tagging OED quotations. For example, the IDs in the pink chunk at about 3 o’clock on the dendrogram have all been tagged in the G3.0 “Verse Drama” category. From 4-5 o’clock on the dendrogram, the chunk of faded blue IDs are all tagged “Poetry.”
However, if you squint, you can see one faded blue guy kicking around within that pink “Verse Drama” chunk (#191, if you’ve got the PDF open). This is because, although we categorized this text as “Poetry” according to G3.0, the attribute combination I assigned to it makes it look more like the texts we categorized as “Verse Drama”.
What the dendrogram reinforces overall is that there is good general correspondence between attribute similarities and G3.0 categories. For example, between 3 o’clock and about 6:30, we can see four clusters of similarly coloured IDs: pink and light blue are “Verse Drama” and “Poetry,” as I mentioned; deep green and brownish are “Non-Verse Drama” and “Fiction,” respectively. These literary genres are largely unfragmented internally, and they’re grouped next to each other. This is good news — if your genre categorization is rational, and your attribute tagging is accurate, and your dendrogram functions properly, this is what you get.
Let’s reshuffle the data and have a closer look. Here is a tree map. Each colour on the tree map represents a G3.0 tag, the same as the dendrogram ID colours. So, for example, the texts tagged by us as “Poetry” are all grouped together in the light blue rectangle. The size of each colour group corresponds to the number of endpoints of that colour, and the size of the smaller rectangles within each colour group corresponds to the number of unique attribute combinations assigned to those works. The more rectangles within a colour, the more distinct attribute combinations exist among texts of the same G3.0 tag.
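The two sizes the tree map encodes are easy to compute from the tagged data: texts per tag for the outer rectangle, and unique attribute combinations per tag for the inner rectangles. A hedged sketch, using invented texts and combinations shortened to three features:

```python
# Sketch of the counts behind the tree map (invented data).
# Outer rectangle area = texts per G3.0 tag;
# inner rectangles     = unique attribute combinations within that tag.
from collections import defaultdict

# (G3.0 tag, attribute combination) pairs; combos shortened to 3 features.
tagged = [
    ("Verse Drama", (True, True, False)),
    ("Verse Drama", (True, True, False)),   # same combo -> one inner rectangle
    ("Poetry",      (True, False, False)),
    ("Poetry",      (True, False, True)),   # two combos -> two inner rectangles
]

by_tag = defaultdict(list)
for tag, combo in tagged:
    by_tag[tag].append(combo)

for tag, combos in by_tag.items():
    print(tag, "texts:", len(combos), "unique combos:", len(set(combos)))
# Verse Drama texts: 2 unique combos: 1
# Poetry texts: 2 unique combos: 2
```

A tag where every text shares one combination produces a single solid rectangle; a tag whose texts all differ fragments into many small ones, which is exactly the contrast the next two paragraphs discuss.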
What this tree map tells me is that there’s a fairly broad range of fragmentation among G3.0 genres. For example, the solid pink rectangle in the middle of the map indicates that I described every text we tagged as “Verse Drama” with exactly the same 22-feature combination. This implies that there most likely is something objectively similar about these texts and that we aren’t implementing some sort of nominalist genre theory.
On the other hand, the more fragmented rectangles tell a different story. The highly fragmented turquoise rectangle, right below the “Verse Drama” rectangle, is composed almost entirely of unique attribute combinations. At first glance it might seem as though these texts are too diverse to plausibly be categorized as the same thing, but in actuality the turquoise box corresponds to a catch-all “Other” tag, used for hard-to-categorize and outlier texts. So in this case the fragmentation is a good thing.
The top left corner rectangle is not only highly fragmented (which signals a variety of attribute combinations), it also displays a secondary clustering of these fragments, shown by the thicker vertical border two-thirds of the way across. Again, the heterogeneous composition of a G3.0 tag isn’t necessarily problematic, and might sometimes be expected. In the case of the pink top left, though, the clustering into two somewhat diverse groups most likely implies that there are really two distinct subsets being captured within one G3.0 tag. Maybe this means that for G4.0 we ought to consider splitting this tag, maybe not.
There’s still plenty of work to do with inter-rater agreement and expanding my data set, but until then here’s some more info on my lonesome little “Poetry” outlier, text #191. Even with my small data set, the hierarchical dendrogram highlights an infuriating problem that we at LOW have bumped up against frequently. While there does seem to be a real difference between what G3.0 calls “Poetry” and “Verse Drama,” and while this might even seem intuitive to a human, there are always nuances and exceptions.
Text #191 is Algernon Charles Swinburne’s “Atalanta in Calydon”, a dramatic poem (say we) with a variety of characters speaking or singing in sequence, including the typical “Chorus”, but with no stage directions, and not written for the stage. Because G3.0 doesn’t include a “closet drama” category, and codes such works mostly as poetry, the pseudo-dramatic characteristics of the work foil the model.
This demonstrates some of the textual features that we need to stay attuned to in order to tag consistently. We could argue all day about “Atalanta in Calydon” and whether it is more poem-y or more drama-y, or whether we ought to establish a category such as “closet drama” to capture the overlap. Even such a solution would only defer the problem, however: what about songs, in verse, meant to be performed, and with characters? Should Elton John’s “Don’t Go Breaking My Heart” be categorized as “Verse Drama” just because there are two distinct speaking parts and he sang it live with Kiki Dee? Or is it poetry? Or do we need yet another genre category? Humans are pretty good at classifying, but we’re also great at finding exceptions.
More to come this summer on how I’ve used data visualizations to explore genre. Any questions are welcome meanwhile.