I’ve read Beowulf. Beowulf was a friend of mine. And Senator…
You know, when things sound stupid, they very usually are. And this headline sounds stupid:
[The Upshot (New York Times data blog)]
No one who has ever read Beowulf and heard Ted Cruz (or any other contemporary person) would credit such a stupid-sounding assertion. But of course no one ever has read Beowulf, so how would they know? Here Beowulf is just a stand-in for the idea of linguistic complexity (because everyone knows Beowulf is wicked complex – NPR host Robert Siegel, being told this by the NYT blogger himself, in a rare moment of philistinism had this says-everything reply: “Beowulf!”).
But look, headlines aren’t everything — or anything, most of the time. What the body of the post does is comment on a graph of two measures of the current presidential candidates’ language. The axes plot “complexity” against “positivity”. The data comes from this year’s candidates’ debates. Once each candidate is rated, he or she is matched to a literary text with a similar rating.
This is very stupid, for a bunch of reasons, some of which I will now enumerate (you may well come up with more).
In what world are positivity-negativity and complexity-simplicity the only (or even the two most important, or even two very important) aspects of political speech? And if they are important, why? What is their relation to each other? We are not told, because we are expected to know something already about a narrative that has Republicans peddling basic ideas in simple language [think Reagan and G. W. Bush], and Democrats getting all caught up in their own mess of technicalities [think Clinton, Obama]. The subhead says:
Comparing presidential candidates to works of literature in terms of language complexity and word positivity or negativity yields odd juxtapositions.
Which is exactly what you would expect if the comparison made no sense to begin with. Which it doesn’t.
Let’s begin with this measure of “complexity,” which the post explains like this:
To conduct the analysis, we used an index called the Simple Measure of Gobbledygook, or SMOG for short[…]. The formula is based on the number of words of three syllables or more you use per sentence. This means you’ll tend to get a higher score if your sentences run longer, or if you use a lot of very big words.
Readability scores are inherently suspect, and none are really appropriate to political speech. For thorough debunking of one of these scores applied to the same context, see a series of LLog posts: “More Flesch-Kincaid grade-level nonsense” [23.10.15], “Back to the Bushisms industry?” [17.8.15], “Another dumb Flesch-Kincaid exercise” [26.10.14], and “Real trends in word and sentence length” [31.10.11].
Like Flesch-Kincaid, SMOG uses two features of a text to correlate that text to a grade-level at which comprehension can be expected to be close to 100%. While this may be statistically sound, it bears little relevance to the context of adult political speech. For one thing, while it is true that the more education one receives, the more polysyllabic words one will encounter [“endothermic” one might learn in high school chemistry], it does not follow that the more polysyllabic words in a sentence, the more complex it is, especially in such a specialized context.
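To make concrete how little goes into the score, here is a minimal sketch of the standard published SMOG formula. The function name `smog_grade` is my own, and the constants are the ones usually quoted for SMOG; the Upshot post does not say exactly which variant or sample size it used, so treat this as an illustration rather than a reconstruction of their method.

```python
import math


def smog_grade(polysyllable_count: int, sentence_count: int) -> float:
    """Estimate a SMOG grade level from two raw counts.

    polysyllable_count: number of words with three or more syllables
    sentence_count: number of sentences in the same sample
    """
    if sentence_count <= 0:
        raise ValueError("need at least one sentence")
    # The formula is defined on a 30-sentence sample, so scale the count.
    per_30_sentences = polysyllable_count * 30 / sentence_count
    return 1.0430 * math.sqrt(per_30_sentences) + 3.1291


# Only the ratio of polysyllables to sentences matters:
print(smog_grade(10, 10) == smog_grade(30, 30))
```

Nothing else about the text enters the calculation: not vocabulary, not syntax, not subject matter. Any two texts with the same ratio of long words to sentences get the same grade.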
For example, in the Republican candidates’ debate of 5 August 2015, candidates and moderators spoke a total of about 19,000 words, of which 2,060 (representing 743 different words) contained 3 or more syllables. What were these “complex” words? Here is a list of the most-used:
governor (78), president (57), senator (34), america (33), government (33), every (31), republican (29), obama (25), security (22), unidentified (21), rubio (21), american (21), hillary (21), military (19), huckabee (19), united (18), immigration (18), candidates (17), federal (16), video (15), economic (15), everybody (14), illegal (14), nominee (14), actually (13), washington (13), israel (12), different (12); 10 times: economy, position, education, politicians; 9 times: abortion, retirement, policy, general, enemies, conservative; 8 times: remember, amnesty, florida, americans, exactly, ohio, national, understand, gentlemen, evidence, parenthood, important; 7 times: family, company, another, wisconsin, created, already, recently, families, amendment, election, anything; 6 times: religious, finally, advantage, commercial, century, mexican, businesses, nuclear, political, commander, department; 5 times: whoever, companies, liberty, earlier, somebody, radical, iranian, responsible, supported, secretary, elected, probably, republicans, exception, disagree, medicaid, obviously, candidate, continue, everything, medicare, terrorists, politics.
I’d be willing to say that the vast majority of the primary electorate would be very familiar with all of these words, even the more specialized ones. In almost every case, it is hard to imagine a shorter word that would convey the same or a related idea more simply (America!). Give these to a bunch of fifth graders, and you might have a problem with words such as “nominee”, “medicaid”, “amnesty”, etc., but fifth graders don’t vote, so who cares? The length of the words here doesn’t correspond in any meaningful way to their intelligibility, nor to the complexity of the utterance.
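A tally like the one above can be sketched in a few lines. This is not the script I ran on the transcript: the syllable estimator is a deliberately crude vowel-group heuristic, the names `estimate_syllables` and `polysyllable_tally` are made up for illustration, and the sample sentence is invented rather than taken from the debate.

```python
import re
from collections import Counter

VOWEL_GROUP = re.compile(r"[aeiouy]+")


def estimate_syllables(word: str) -> int:
    """Crude estimate: count vowel groups, minus common silent endings."""
    word = word.lower()
    count = len(VOWEL_GROUP.findall(word))
    if count > 1 and word.endswith("ed") and word[-3] not in "td":
        count -= 1  # silent "-ed", as in "armed"
    elif count > 1 and word.endswith("e") and word[-2] not in "aeiouyl":
        count -= 1  # silent final "e", as in "tore"
    return max(count, 1)


def polysyllable_tally(text: str, min_syllables: int = 3) -> Counter:
    """Count how often each word of min_syllables or more occurs."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if estimate_syllables(w) >= min_syllables)


# Invented sample text, standing in for a debate transcript.
sample = "The governor thanked the senator. Every governor wants security."
tally = polysyllable_tally(sample)
print(sum(tally.values()), "polysyllabic tokens,", len(tally), "different words")
print(tally.most_common())
```

A heuristic like this will miscount plenty of English words, which is one more reason to be wary of fine distinctions between scores.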
But the really dumb thing this post does is to compare these meaningless measures to works of literature. What is the basis for this comparison? Clearly, in order to reinforce poles of complexity and simplicity with the prejudices of a reading public that regards literature as an impenetrable fortress of gobbledygook.
So Trump is at the low end of complexity along with Huckleberry Finn and Hans Christian Andersen’s Fairy Tales (read: his ideas are so basic he has to speak at a grade-school level), whereas Ted Cruz is up at the top, with Sense and Sensibility, The Double Helix, and Beowulf. Beowulf! Because, as the blog notes, he has a degree from Harvard Law School.
The blog doesn’t say what version of Beowulf it has sourced for comparison. So I tried to reproduce their analysis with a translation from The Poetry Foundation’s site, since my intuition was that there couldn’t be that many 3+ syllable words in the poem.
My results were not at all like theirs, so I went and tested four of their titles, coming up with my own raw ratios of polysyllables to sentences [not bothering to calculate SMOG], and also checking a smaller sample of each work at an online SMOG calculator. Here is a table of results:
Although online SMOG and I aren’t totally on the same page (I get identical numbers for Beowulf and Peter Pan, e.g., while they rank them significantly apart), at least we both agree that Beowulf is nowhere near as heavy on syllables as, say, Sense and Sensibility.
You might intuit this if you only just read some:
LO, praise of the prowess of people-kings
of spear-armed Danes, in days long sped,
we have heard, and what honor the athelings won!
Oft Scyld the Scefing from squadroned foes,
from many a tribe, the mead-bench tore,
awing the earls.
How many polysyllables do you hear? Athelings, maybe; but no more. What’s going on? Probably all three measures are counting syllables with different algorithms, and it’s possible my count has missed some of the polysyllabic names (not that these should count, anyway, mind) that Upshot has managed to capture. But I can’t believe this could account for all the difference. More likely, they are following the SMOG rule to treat hyphenated words as one word. This works fine for modern English, where fly-catcher is, I guess, plausibly one word. But it won’t fly for this modern rendering of Beowulf, which adopts a hyphenated form to represent Old English compounds. Would you call people-kings a three-syllable word?
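To see how much the hyphen rule alone could matter, here is a rough sketch that counts polysyllables in the opening lines quoted above two ways: once splitting hyphenated compounds into separate words, once treating each compound as a single word. The syllable estimator is a crude vowel-group heuristic of my own, and the `join_hyphens` switch is hypothetical; I have no idea whether Upshot's code does anything like this.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")

EXCERPT = """LO, praise of the prowess of people-kings
of spear-armed Danes, in days long sped,
we have heard, and what honor the athelings won!
Oft Scyld the Scefing from squadroned foes,
from many a tribe, the mead-bench tore,
awing the earls."""


def estimate_syllables(word: str) -> int:
    """Crude estimate: count vowel groups, minus common silent endings."""
    word = word.lower()
    count = len(VOWEL_GROUP.findall(word))
    if count > 1 and word.endswith("ed") and word[-3] not in "td":
        count -= 1  # silent "-ed", as in "armed"
    elif count > 1 and word.endswith("e") and word[-2] not in "aeiouyl":
        count -= 1  # silent final "e", as in "tore"
    return max(count, 1)


def polysyllables(text: str, join_hyphens: bool) -> list[str]:
    """List words estimated at three or more syllables.

    join_hyphens=True keeps a hyphenated compound as one word;
    join_hyphens=False splits it at the hyphens.
    """
    pattern = r"[a-z]+(?:-[a-z]+)*" if join_hyphens else r"[a-z]+"
    tokens = re.findall(pattern, text.lower())
    return [w for w in tokens if estimate_syllables(w) >= 3]


print("split: ", polysyllables(EXCERPT, join_hyphens=False))
print("joined:", polysyllables(EXCERPT, join_hyphens=True))
```

With this estimator, splitting the compounds leaves athelings as the only polysyllable in the excerpt, while joining them adds people-kings and doubles the count. Six lines prove nothing about the whole poem, but they show how a tokenizing choice can inflate the score of a text full of kennings.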
[I even wonder if they are using a modern English translation at all. Could it be that Cruz’s language is being compared to a highly inflected historical language, in which grammatical properties are marked by the addition of morphological units to words? I don’t know — we get scant information about methodology on the blog — but I wouldn’t put it past them.]
The big pitfall of big data is that analysis often invests too much in the mere fact of correlation, without caring to understand anything about the underlying phenomena. This is especially the case when correlations conform to prejudices (Republicans like it simple… Bernie Sanders is a complex and dire fellow) [see Vivid Unconscious Biases for another example]. Just because you have numbers on everything, and can compare just about anything, doesn’t mean you should. Cruz does not sound like Beowulf. He is not as complex or simple as Beowulf. The claims make no sense. And, to boot, the one empirical claim, that his debate responses contain a similar ratio of polysyllables to sentences, is, I’m pretty sure, wrong.