Over the summer we’re featuring guest posts by Research Assistants at The Life of Words. Here Cosmin Dzsurdzsa – a 2nd year undergraduate in English at UW – thinks about moving from human intuition to computer rule-making in textual-genre classification:
When trying to automate text classification algorithmically, one has to pay close attention to how humans recognize textual features, while also being reined in by a computer’s capabilities. It is easy for a human to distinguish a poem from a letter, for instance: it happens almost instantly, as the brain picks up on visual cues and simultaneously accesses memory to compare with previously encountered examples. To understand this and attempt to replicate it algorithmically requires some attention to our own human methods, which as often as not are unconscious. By slowing down and breaking up the decision-making required to arrive at the right classification, it is possible to note some of the reasons we assign a text to one genre rather than another.
There are strong genre problems and weak genre problems. The distinction parallels the one between strong and weak machine intelligence. Basically, a strong problem is only solvable by sentient intelligence (like ours), while a weak problem can be solved by replicating narrow “decision-making” capabilities. Here at LOW, we concentrate on identifying these weak problems and discovering their underlying logic.
For example, given the genre categories “poem” and “verse drama”, could a computer distinguish between these two Shakespearean couplets?
See, where he comes: so please you, step aside;
I’ll know his grievance, or be much denied.
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.
On closer inspection, these couplets contain both strong and weak categorization problems. For the sake of illustration, let’s assume that no corpus is available to cross-reference these works for identification. How would a computer determine algorithmically which couplet is from one of Shakespeare’s verse dramas and which is from his sonnets? The weak features do not help: both are rhyming couplets, and both begin each line with a capital letter (which marks them as belonging to a larger “verse” category). Working from the normalized text, a computer could easily and mistakenly classify both of these quotations as poetry because of those characteristics. Yet as literary-inclined humans, we instinctively know with a reasonable degree of certainty that the first is from a drama while the second is from a sonnet.
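To make the point concrete, here is a minimal sketch of a weak “verse” test built on the capitalization marker alone. The function names and the 60-character width limit are assumptions made for illustration, not anything used at LOW, and rhyme is left out because spelling is an unreliable guide to it (“aside” and “denied” rhyme but do not end in the same letters).

```python
def starts_with_capital(line):
    """True if the first non-space character of the line is an uppercase letter."""
    stripped = line.lstrip()
    return bool(stripped) and stripped[0].isupper()


def looks_like_verse(lines, max_width=60):
    """Weak test: every non-empty line is short and begins with a capital.

    max_width is an arbitrary, hypothetical cut-off for "short" lines.
    """
    lines = [line for line in lines if line.strip()]
    return bool(lines) and all(
        starts_with_capital(line) and len(line.strip()) <= max_width
        for line in lines
    )


DRAMA_COUPLET = [
    "See, where he comes: so please you, step aside;",
    "I'll know his grievance, or be much denied.",
]
SONNET_COUPLET = [
    "So long as men can breathe or eyes can see,",
    "So long lives this, and this gives life to thee.",
]

# Both couplets pass, so this marker cannot tell drama from sonnet.
print(looks_like_verse(DRAMA_COUPLET), looks_like_verse(SONNET_COUPLET))
```

Both couplets pass the test, which is exactly the problem: the marker places them in “verse” but says nothing about which kind.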
But why? This is where the sentience comes in.
In the first quotation we understand the semantics of the couplet. There are multiple subjects (he, you, I), some imperatives (see, step aside), along with other complex grammatical operations and sentence structure. Because we have sentience and familiarity with human interactions, as well as some knowledge of the conventions of literary drama, we can quite easily conclude that this is dialogue. [Of course, the possibility remains that it could be dialogue within a narrative or epic poem, but we know that Shakespeare wrote very little of the sort, compared to massive amounts of dramatic dialogue; this too is outside the computer’s ken.] Because of our strong intelligence we can identify the external world referenced in the first quotation through semantic operations (step aside, where he comes) and identify it as more than likely dramatic.
In the second quotation, semantics similarly guides us in the identification (even if we don’t recognize the lines directly, which many of us will). Since even a human could mistake this quotation for dialogue (it would suit a soliloquy, perhaps), the classification problem becomes even more difficult for a computer. But let’s investigate further: how might we know this is from a poem without additional context? Intuitively, sentence structure and rhythmic regularity might bring us to that conclusion. While the first quotation contains various breaks and pauses, mimicking natural speech, the second quotation has far fewer interruptions in the cadence of speech, leaning more towards musicality than the bumpy rhythms of dialogue. Also, the second example makes broader subject references (men, eyes, this) than the first, indicating a contemplative rather than a dramatic situation. Finally, anybody with a basic grasp of Shakespeare would recognize the famous closing couplet of Sonnet 18. I mean, come on.
As you can see, the strong problems outnumber the weak. But how can we use this information to teach machines to classify texts more accurately? Surely some sort of algorithm can pick up on these semantic and grammatical nuances? Perhaps code could count the apostrophes, colons and semicolons and mark the text as irregular. Or maybe pronouns such as “you, he, she, I” would give a hint to the program to mark something as drama. Although this seems possible for a machine to do, the task looks more daunting once it gets down to the actual coding and the capabilities of a computer. For now, it is inefficient for a computer to solve these strong problems unless we can think of ways to transform them into weak problems. Textual markers can provide the keys to doing so, but the sentience of human eyes is still required to scan over similar situations and make those tough decisions.
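As a rough sketch of what such a counting heuristic might look like, the snippet below tallies the punctuation marks and pronouns mentioned above and makes a guess. Everything in it is an assumption for illustration: the names `marker_counts` and `guess_genre`, the pronoun list, and especially the threshold of 2, which is arbitrary.

```python
import re

# Pronouns taken from the examples above; the list is illustrative, not exhaustive.
DIALOGUE_PRONOUNS = {"you", "he", "she", "i"}


def marker_counts(text):
    """Count the surface markers discussed above in a piece of text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {
        # Count both straight and curly apostrophes.
        "apostrophes": text.count("'") + text.count("\u2019"),
        "colons": text.count(":"),
        "semicolons": text.count(";"),
        "pronouns": sum(1 for word in words if word in DIALOGUE_PRONOUNS),
    }


def guess_genre(text, threshold=2):
    """Guess 'verse drama' when the markers add up to at least the threshold.

    The threshold is a hypothetical value chosen only for this example.
    """
    total = sum(marker_counts(text).values())
    return "verse drama" if total >= threshold else "poem"


DRAMA = (
    "See, where he comes: so please you, step aside;\n"
    "I'll know his grievance, or be much denied."
)
SONNET = (
    "So long as men can breathe or eyes can see,\n"
    "So long lives this, and this gives life to thee."
)

print(marker_counts(DRAMA), guess_genre(DRAMA))
print(marker_counts(SONNET), guess_genre(SONNET))
```

On these two couplets the guess happens to come out right, but only just. The sonnet addresses someone too (“thee”), and a pronoun list that included it would start pulling the second couplet towards drama, which shows how brittle a handful of markers can be.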
That being said, luckily there exist things called corpora, which make computer identification easier. Yet these examples go to show that text is more complicated than we might first assume. Although we have various word processing, word-to-sound, and scanning capabilities, computers are not yet able to grasp those strong identification problems through their narrow lenses. For example, what is a computer to do if it comes across a poem quoted within a letter? A wholly different decision-making process would have to be implemented, with its own special logic for such a scenario. Yet the hope remains that by accumulating and identifying these textual markers (like capitals at the beginning of verse lines) we can push the capabilities of our machines while also seeing the amazing intricacies of our own thoughts.
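For the poem-within-a-letter case, one possible piece of that special logic is to look for runs of consecutive verse-like lines inside a longer text. The sketch below is only that, a sketch: the function name `verse_runs`, the minimum run of two lines, the 60-character width, and the sample letter are all invented for illustration.

```python
def verse_runs(lines, min_run=2, max_width=60):
    """Return (start, end) index pairs for runs of consecutive verse-like lines.

    A line counts as verse-like if it is short and begins with a capital.
    min_run and max_width are arbitrary, hypothetical settings.
    """
    runs = []
    start = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        candidate = (
            bool(stripped)
            and stripped[0].isupper()
            and len(stripped) <= max_width
        )
        if candidate:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a run that reaches the end of the text.
    if start is not None and len(lines) - start >= min_run:
        runs.append((start, len(lines) - 1))
    return runs


# An invented letter, hard-wrapped, with the sonnet couplet quoted inside it.
LETTER = [
    "Dear friend, I was reading last night and came upon these",
    "lines, which I copy out for you:",
    "So long as men can breathe or eyes can see,",
    "So long lives this, and this gives life to thee.",
    "they have stayed with me all week.",
]

print(verse_runs(LETTER))
```

Here the salutation line is capitalized too, but it stands alone and so is ignored, while the two quoted lines form a run. A letter written in short, capitalized lines would fool this at once, which is where human eyes come back in.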