Claire Cain Miller has a post on the New York Times’s “Upshot” blog on Ben Schmidt’s visualization tool for word frequencies in 14M ratemyprofessors.com reviews, called “Is the Professor Bossy or Brilliant? Much Depends on Gender” [9.2.15]. While in general I find Schmidt’s stuff informative, fun, and provocative [for instance, his “Let’s You” finding, which I’ve discussed], here I want to demonstrate a pitfall, perhaps unavoidable, of putting digital tools in the hands of people too lazy, or innumerate, or unconsciously biased to use them responsibly.
Miller says of Schmidt’s tool:
The chart makes vivid unconscious biases. The implications go well beyond professors and college students, to anyone who gives or receives feedback or performance reviews.
It suggests that people tend to think more highly of men than women in professional settings, praise men for the same things they criticize women for, and are more likely to focus on a woman’s appearance or personality and on a man’s skills and intelligence.
All these things may in fact be true of gendered attitudes in professional settings. But the tool doesn’t show them to be. If anything, the tool–or more precisely, Miller’s use of it–shows how data can be used to confirm and propagate one’s own vivid unconscious biases.
The visualization tool doesn’t show these things about gendered attitudes because it isn’t a systematic study of the language of these online reviews. It is only a convenient way to look at relative word frequencies. But which words are important in revealing gendered attitudes? How would you know?
You would guess. And your guess, instead of being informed by the data, would be informed by your own unconscious biases. Even worse, your guess would itself inform and shape the data you retrieved, so you wouldn’t necessarily know how good your guess was. This is a perfect place to look for confirmation bias.
Miller’s biases are revealed in the word sets she chose to run through the viz tool (perhaps she ran more – these are the ones she reports):
Men are more likely to be described as a star, knowledgeable, awesome or the best professor. Women are more likely to be described as bossy, disorganized, helpful, annoying or as playing favorites. Nice or rude are also more often used to describe women than men.
Men and women seemed equally likely to be thought of as tough or easy, lazy, distracted or inspiring.
Interestingly, women were more likely to be described in reviews as role models.
First, examine the rhetoric. Let me rephrase part of the above: “Women are more likely to be described as helpful, nice, organized, and a role model. Men are more likely to be described as a star, knowledgeable, or the best professor.”
Aside from omitting the negative terms, I only made one change: I substituted organized for disorganized.
And that’s fine, because although it is true that reviews of female professors generally show a higher incidence of “disorganized” than those of men, they also show a higher incidence of “organized.”
Not only that, but women are more likely to be described as “organized” than they are to be described as “disorganized.” For example, the most disorganized female professors are in Education, where the word occurs 156 times per million words of review text (wpm). But “organized” occurs at a rate of 207 wpm in this sample. I’ve reproduced both charts in thumbnail to the right–click to see a larger version.
To take this one step further, female Education professors are not only the most disorganized, they are also the most organized. This suggests that organization is of special importance to Education students. It’s also true that Education shows one of the biggest gender gaps for both “organized” and “disorganized”. That is, female professors are both much more organized than male professors, and much more disorganized, according to these reviews. What does this say about gender biases?
Something, maybe, but I’m still thinking about what exactly that might be. Just to recap some of what the pair of charts show: 1) for every discipline, women receive more instances of “organized” in their ratings than men do; 2) for every discipline, women receive more “disorganized” than men; 3) across the disciplines women tend strongly to receive more “organized” than “disorganized” [hard to assess easily, since the x-scale changes, as well as the order of the y-categories–but eyeballing it, this looks to hold for most if not all disciplines]; 4) across the disciplines men also tend strongly to receive more “organized” than “disorganized.”
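Since the tool only reports relative frequencies, the arithmetic behind the charts is simple. Here is a minimal sketch of the words-per-million calculation, with invented raw counts (I don’t have access to the underlying corpus, so the tallies below are chosen only to reproduce the Education rates mentioned above):

```python
# Sketch of the "words per million" (wpm) rate used in the charts.
# The raw counts here are invented for illustration -- only the
# resulting rates (207 and 156 wpm) come from the visualization.

def wpm(word_count, total_words):
    """Rate of a word per million words of review text."""
    return word_count / total_words * 1_000_000

# Hypothetical tallies for reviews of female Education professors:
total_words = 5_000_000
organized_count = 1035     # -> 207.0 wpm, as in the chart
disorganized_count = 780   # -> 156.0 wpm

print(wpm(organized_count, total_words))     # 207.0
print(wpm(disorganized_count, total_words))  # 156.0
```

The point of spelling this out is that a wpm figure is just a ratio; it tells you nothing about the absolute number of reviews behind it, which matters a great deal below.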
So Miller’s reporting on “disorganized” ratings for female professors is at best incomplete, and at worst utterly misleading. Evidence of bias perhaps, but most conclusively evidence of bias on the part of the reporter. The numbers are actually far more interesting. Can you come up with a coherent hypothesis of why we might be seeing these distributions, and then come up with a set of experiments to test that hypothesis? That’s what a scientist might do. Or anyone really interested in exploring the data.
Another choice of search term that struck me right away as vividly depicting an unconscious bias on the part of the author was “bossy.” No doubt Miller has been influenced here by Sheryl Sandberg’s “Ban Bossy” campaign, which claims that “Words like bossy send a message” to little girls: “don’t raise your hand or speak up.”
The problem is that while “bossy” might be a buzzword right now in the gendered language debates, it comes out of a particular social context: that of elementary school girls. It would strike me as unusual for any subordinate to refer to a superior in this way (e.g. a student to a teacher, a child to a parent, or an employee to a boss), since authority and a degree of assertiveness are already implied by the relation.
Schmidt’s graphs bear me out here. See “bossy” to the right [click for larger]. While it is true that there are, in general, more “bossy”s in the reviews of female profs by discipline, the frequencies are so low that they constitute very weak evidence indeed. The very highest frequencies are in Computer Science [1.97 wpm] and Fine Arts [1.70 wpm].
Compare that to something like “understanding” or “helpful”, two categories in which women also always outscore men, but where the highest incidences are 489 wpm and 2,185 wpm, respectively.
In what discipline are women reported to be “helpful” 2,185 times per million words? Computer Science! That’s a good three orders of magnitude greater than “bossy” in that discipline. This is not to say that the word “bossy” isn’t applied to women more frequently than men, only that the incidence of this is so small as to be meaningless, if what you’re interested in is a real social phenomenon.
Let’s look at those low frequencies. 1.97 wpm is about as frequent as you might expect to find the words “absurdity”, “coexist”, “ecclesiastical”, “vested”, and “tart” in the Corpus of Contemporary American English (COCA). If you restrict it to COCA’s spoken English sub-corpus (perhaps more comparable to online reviews), we’re talking frequencies close to “ambiguity”, “prenatal”, and “cadet”.

The reviews corpus is smaller than COCA – between 25,000 reviews per gender per field for the least well represented, up to 750,000 for the most (Females in English and Males in Math). I don’t have access to the background data, so I’ll just assume a mean of 50 words per review, implying that the biggest dots are measured against something like 37.5M total words, while the smallest would have something like 1.3M. Even at the low end, that’s not terrible for commonish words, but once you get down to a rate of 2 wpm, the data get(s) wonky.

Now, it looks like there might be about 100,000 reviews of female Comp. Sci. profs, and perhaps 75,000 of female Fine Arts profs in the corpus. At (assumed) 50 words per review, this implies that fewer than 7 people called their female Fine Arts teacher “bossy”, while only ten Computer Science students did. In fields where there’s a reasonable amount of data, the wpms are even lower (English = 0.8 “bossy”s per million words – maybe 30 instances in 750,000 reviews with 37.5M words of text).
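That back-of-envelope estimate is easy to verify. In this sketch, the review counts and the 50-words-per-review figure are my stated assumptions, not numbers from the corpus:

```python
# Back-of-envelope check of how few actual "bossy"s these rates imply.
# The per-field review counts and 50-words-per-review mean are the
# assumptions stated above, not figures from the corpus itself.

WORDS_PER_REVIEW = 50  # assumed mean length of a review

def implied_instances(wpm_rate, n_reviews, words_per_review=WORDS_PER_REVIEW):
    """Estimated raw occurrences behind a words-per-million rate."""
    total_words = n_reviews * words_per_review
    return wpm_rate * total_words / 1_000_000

# "bossy" for female Fine Arts profs: 1.70 wpm over ~75,000 reviews
print(implied_instances(1.70, 75_000))   # -> 6.375, i.e. fewer than 7 students
# "bossy" for female Comp Sci profs: 1.97 wpm over ~100,000 reviews
print(implied_instances(1.97, 100_000))  # ~9.9, i.e. about ten students
# "bossy" for female English profs: 0.8 wpm over ~750,000 reviews
print(implied_instances(0.8, 750_000))   # ~30 instances in 37.5M words
```

A handful of occurrences out of tens of thousands of reviews is exactly the regime where a dot on the chart can move dramatically on the strength of a few cranky students.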
Anyone with high school science should be able to spot the flaws with “bossy” right away, even without a socio-linguistic intuition about its relevance. The x-axis shows really low frequencies, there are multiple points at the x-origin (= zero incidence), and the points are scattered about. It’s also suspicious that closely related disciplines (say, Math and Computer Science, or Political Science and History) score far apart. Most of the high-frequency graphs show some grouping along the y-axis, though there’s considerable variation.
If I were interested in using the tool seriously,* instead of to confirm my own vaguely accepted social prejudices, the first thing I would do is check the top 10 or 25 most common adjectives in the corpus. I can’t do that because I don’t have access to the background data.
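For what it’s worth, the first pass of that check is only a few lines of code, assuming one had the raw review texts. The reviews below are invented examples, and real adjective extraction would want part-of-speech tagging rather than a hand-picked stopword list:

```python
# Sketch of the sanity check I'd run first with access to the raw
# reviews: the most common content words in a subcorpus. The sample
# reviews are invented; a serious version would use POS tagging to
# isolate adjectives instead of filtering stopwords by hand.

from collections import Counter
import re

STOPWORDS = {"a", "an", "the", "is", "was", "she", "he", "her", "his",
             "and", "or", "but", "very", "so", "i", "in", "of",
             "class", "professor"}

def top_words(reviews, n=10):
    tokens = []
    for review in reviews:
        tokens += re.findall(r"[a-z']+", review.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

sample = [
    "She is very organized and helpful.",
    "Helpful, knowledgeable, but a bit disorganized.",
    "Organized lectures. Helpful in office hours.",
]
print(top_words(sample, 3))  # [('helpful', 3), ('organized', 2), ...]
```

Starting from the corpus’s own most frequent words, rather than from words you expect to find, is precisely what guards against the confirmation bias described above.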
So, given that limitation, what I would do next is investigate a series of categories of words, trying always to look at relevant contrasting pairs of words when possible. The first two categories, with a couple examples of each, are below. You can have a look at these, and by all means go use Schmidt’s viz tool to explore your own words. But read Schmidt’s FAQ first, and always bear in mind your own biases.
1. Generic evaluative adjectives: (good, bad, best, worst, great, terrible etc.)
2. Milieu-specific adjectives (fair, helpful, easy, brilliant, knowledgeable, intelligent, stupid, dumb, etc.)
*[but actually, I’m not. At least not to confirm gender bias. There are better ways to do that than counting word frequencies, because, you know, the reviews also include numerical ratings!]