Guest Post: Moving from 2.0 to 3.0

Danielle Griffin recently completed her co-op term as a full-time research assistant at The Life of Words. Here she offers some thoughts about her work on identifying the textual genre of quotations in the Oxford English Dictionary:

When I started my job as an RA, Dr. Williams had me tagging quotations five days a week for two weeks straight. This was essentially a crash course on genre categorizing. The course consisted of an Excel sheet of 10,000 rows, each containing a slew of data on one particular quotation in the OED. Based on all the surrounding data in the columns (Author, Work Title, Date), I was to hunt down the 10,000 quotations and assign each a genre tag, according to a system that came to be known as “Genre 2.0.”

There were some hard and fast rules. There were also at least as many soft and loose ones. The interrelationship between the denotative dictionary and the connotative, figurative use of language in poetry is of importance to Dr. Williams, so the PO tag went to anything written in verse, with the exception of verse drama, which got its own VD tag (for “Verse Drama”). That’s a hard and fast rule. Distinguishing between PO and VD can be tricky, though, as some poems have speaking parts, and not all plays have obvious stage directions.

During my crash course I quickly saw that there was a hefty scraps bin among the tags: the OR tag (for “Other”), which I internalized as unorganized non-fiction. It contained everything from bibliographies to philosophical treatises to science reports and beyond. There was a massive collection of “other” texts that were essentially genreless. It became obvious that there was more to chip away at in this iceberg of OR texts. There was more to capture than just “other.”

For example, there was no category for travel writing. As a result, a large number of texts that shared lots of features were being classified as DY (for “Diaries and Journals”) or OR, depending on their format. When you can intuitively classify something, Dr. Williams said, you should pay attention to what it is that makes that light go on. These are the indicators of our complicated cognitive system working out what kind of text we’re looking at (mind you, we aren’t the only cognitive systems working at this: there’s LOWBOT, and a number of other programs running behind the scenes).

So, we revised the genre tagging system to reflect our intuitions, splitting up OR and some other tags, and eliminating redundant ones. This information can later inform us about ways in which the OED treats sources as authoritative in quotations. For Travel Writing, for example, it may turn out that quotations illustrate a significant number of foreign and borrowed words. At the same time, the OED uses a variety of genres to provide widespread representation of terms that are not so specialized, and this could be demonstrated with the refined tagging system as well.

In modifying our classification scheme, it emerged that the different genres could be grouped into a smaller number of “registers”: literary, documentary, expository, scientific, journalistic, and pseudo-spontaneous. Organizing genres by register helps to make classification decisions, and it allows for broader-based comparisons. And, since each genre belongs to just one register, it requires no extra tagging!
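To make the "no extra tagging" point concrete, here is a minimal sketch of how a register could be derived from a genre tag with a simple lookup. The tag codes PO, VD, and DY come from the post, but the registers assigned to them below are my own guesses for illustration, and the names `GENRE_TO_REGISTER`, `register_for`, and `count_by_register` are hypothetical, not part of the project's actual tooling.

```python
from collections import Counter

# Hypothetical genre-to-register lookup. The register assigned to each tag
# is an illustrative assumption, not the project's actual scheme.
GENRE_TO_REGISTER = {
    "PO": "literary",     # Poetry (assumed register)
    "VD": "literary",     # Verse Drama (assumed register)
    "DY": "documentary",  # Diaries and Journals (assumed register)
}


def register_for(genre_tag):
    """Return the single register that a genre tag belongs to."""
    return GENRE_TO_REGISTER[genre_tag]


def count_by_register(genre_tags):
    """Tally quotations per register, given only their genre tags."""
    return dict(Counter(register_for(tag) for tag in genre_tags))
```

Because every genre maps to exactly one register, a call such as `count_by_register(["PO", "VD", "DY"])` can produce register-level totals from the genre tags alone, which is what allows the broader comparisons without a second round of tagging.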

