1 00:00:03,136 --> 00:00:06,573 [THIS VERSION HAS CAPTIONS WHICH CANNOT BE TURNED OFF. SEE SITE FOR UNCAPTIONED VERSION] Greetings from Toronto, Ontario. 2 00:00:06,573 --> 00:00:10,910 My name is David Williams and I'm [Associate] Professor of 3 00:00:10,910 --> 00:00:15,615 English at the University of Waterloo in Waterloo, Ontario, in Canada. 4 00:00:15,615 --> 00:00:21,721 Today I want to be talking with you about a new dataset based 5 00:00:21,721 --> 00:00:24,357 on the Oxford English Dictionary. 6 00:00:24,357 --> 00:00:30,630 That dataset is a subset of the quotations from the OED, 7 00:00:30,630 --> 00:00:36,836 which has been annotated for the gender of the author of that quotation. 8 00:00:36,836 --> 00:00:45,245 Now, the question of the representation of women authors and 9 00:00:45,245 --> 00:00:49,516 women's language more generally in lexicography is a question that 10 00:00:49,516 --> 00:00:54,154 has had much attention over the years and some quite recently. 11 00:00:54,154 --> 00:01:00,260 I take as a starting point an introductory comment by Lindsay Rose Russell 12 00:01:00,260 --> 00:01:03,496 in her recent monograph, in which she says that 13 00:01:03,496 --> 00:01:11,571 "The scholarly consensus confirms that mainstream lexicography, past and present, is shot with sexism and androcentrism." 14 00:01:11,571 --> 00:01:17,510 I want to bring another perspective on the question of representation 15 00:01:17,510 --> 00:01:20,146 by John Considine, in which he takes on the question of the 16 00:01:20,146 --> 00:01:25,885 representativeness of quotations specifically 17 00:01:25,885 --> 00:01:27,887 in the Oxford English Dictionary. 18 00:01:27,887 --> 00:01:31,057 He writes that "representative sampling was never the business 19 00:01:31,057 --> 00:01:34,928 of the OED.
The makers of a corpus which was meant to provide a representative [sample] 20 00:01:34,928 --> 00:01:39,599 of the written English of a given period would have to decide very carefully what 21 00:01:39,599 --> 00:01:43,069 proportion of poetry or female authored texts was to be included. 22 00:01:43,069 --> 00:01:47,207 This is /not/ a question for lexicographers." 23 00:01:47,207 --> 00:01:52,245 So it seems to me that there are at least two 24 00:01:52,245 --> 00:01:55,715 perspectives on the question of representativeness that are in 25 00:01:55,715 --> 00:01:59,519 tension in the lexicographical context. 26 00:01:59,519 --> 00:02:03,656 If we turn to a representation of the populations of written 27 00:02:03,656 --> 00:02:07,360 language, segmenting them according to their employment 28 00:02:07,360 --> 00:02:13,133 by the gender of the author, we actually have two related issues 29 00:02:13,133 --> 00:02:15,702 of representativeness arising. 30 00:02:15,702 --> 00:02:19,806 The first is the question of OED's coverage of population-specific 31 00:02:19,806 --> 00:02:24,177 or restricted vocabulary. And here we're talking 32 00:02:24,177 --> 00:02:31,684 about this section of the overlapping diagram, the section 33 00:02:31,684 --> 00:02:35,221 that is employed more by female authors than by male authors. 34 00:02:35,221 --> 00:02:39,792 But there's another question here, which I think goes more 35 00:02:39,792 --> 00:02:44,998 to Considine's point, which is the representativeness of this 36 00:02:44,998 --> 00:02:53,873 central shared lexicon, that is to say, the general lexicon, 37 00:02:53,873 --> 00:02:58,044 or at least the lexicon that doesn't display a gender bias 38 00:02:58,044 --> 00:03:02,982 in its employment, and how these two populations, the male 39 00:03:02,982 --> 00:03:08,521 and the female authors, are represented within this segment.
40 00:03:09,389 --> 00:03:15,295 I also need to describe a little bit the data behind this study, 41 00:03:15,295 --> 00:03:20,300 or the experiments, that I want to look at with you. 42 00:03:20,934 --> 00:03:25,905 It begins with a certain number of raw datasets, and those include, 43 00:03:25,905 --> 00:03:32,111 most importantly, the datasets of various editions of the Oxford English Dictionary. 44 00:03:32,111 --> 00:03:36,983 On the other hand, I also have a number of public library catalogues, 45 00:03:36,983 --> 00:03:41,921 and those are all public access, open bibliographies 46 00:03:41,921 --> 00:03:45,525 that contain a certain amount of MARC data. From these raw 47 00:03:45,525 --> 00:03:51,831 datasets one can derive a certain number of discrete datasets, 48 00:03:51,831 --> 00:03:55,635 which I have done. These datasets I then cross-referenced 49 00:03:55,635 --> 00:04:00,873 to try to match OED quotations and OED bibliography 50 00:04:00,873 --> 00:04:06,045 entries to these deduplicated public bibliographies. 51 00:04:06,045 --> 00:04:10,617 In doing so, I can bring the MARC metadata and any other 52 00:04:10,617 --> 00:04:18,424 metadata attached to the public sets into the OED bibliography. 53 00:04:19,492 --> 00:04:24,297 So what does that give us? Well, it produces a corpus of 54 00:04:24,297 --> 00:04:29,302 OED quotations from all three editions with the full author 55 00:04:29,302 --> 00:04:35,108 name and gender for 2.3 million quotations. 56 00:04:35,108 --> 00:04:40,313 That works out to 91% of all of the quotations in OED3 at the 57 00:04:40,313 --> 00:04:45,051 moment that have an author associated with them, 58 00:04:45,084 --> 00:04:51,424 are not part of our larger edition, and are published from a.1700 or later. 59 00:04:51,424 --> 00:04:59,799 That particular subset represents 61% of all OED3 quotations.
60 00:04:59,799 --> 00:05:03,269 (Remember, a number will come from periodicals and other publications 61 00:05:03,269 --> 00:05:07,140 that have no author and so can't have a gender.) 62 00:05:07,140 --> 00:05:13,546 In terms of editions, that 91% works out to 94% of pre-OED3 63 00:05:13,546 --> 00:05:19,152 material, and 81% of the new OED3 material 64 00:05:19,152 --> 00:05:23,890 -- again, within the subset that has an author and so on. Now 65 00:05:23,890 --> 00:05:28,661 this represents 337 thousand OED /sources/ (and that in 66 00:05:28,661 --> 00:05:32,965 this case means author-title combinations) that have been 67 00:05:32,965 --> 00:05:40,673 matched to an external database, which yields 1.4 million-odd OED 68 00:05:40,673 --> 00:05:46,245 quotations that I can assign to an LC (Library of Congress) shelf mark. 69 00:05:48,514 --> 00:05:53,252 One of the things that annotating so many quotations 70 00:05:53,252 --> 00:05:56,489 allows one to do, that perhaps one wasn't able to do 71 00:05:56,489 --> 00:06:01,361 in the same way before, is to look at the large population 72 00:06:01,361 --> 00:06:06,399 of women authors, and authors in general, as opposed to 73 00:06:06,399 --> 00:06:08,801 only the most-cited women authors. 74 00:06:08,801 --> 00:06:11,471 Because it's one job to look through the top hundred or top 75 00:06:11,471 --> 00:06:15,041 thousand most-cited sources in the OED and then count 76 00:06:15,041 --> 00:06:22,014 which ones of those are women versus those that are men, 77 00:06:22,014 --> 00:06:25,451 but doing that really only scratches the surface of the 78 00:06:25,451 --> 00:06:28,020 women that are represented there.
79 00:06:28,488 --> 00:06:31,457 And once we have the entire population, or close to the 80 00:06:31,457 --> 00:06:36,195 entire population within the subset, we can do things like 81 00:06:36,195 --> 00:06:41,968 split out the edition of the work in which the quotation 82 00:06:41,968 --> 00:06:45,872 was added and also the date of the quotation. 83 00:06:45,872 --> 00:06:51,978 If we look at OED1 quotations, we can plot the number 84 00:06:51,978 --> 00:06:57,950 of quotations added and split those out by gender over the 85 00:06:57,950 --> 00:07:01,687 year of that quotation. And that's what this graph is 86 00:07:01,687 --> 00:07:05,024 showing you. The blue dotted line then calculates the 87 00:07:05,024 --> 00:07:10,029 percentage within that time frame. So here we start at 1700 88 00:07:10,029 --> 00:07:14,534 and we go up to the end of the First Edition. And you can see 89 00:07:14,534 --> 00:07:20,206 that we have a generally low but rising trend in the percentage, 90 00:07:20,206 --> 00:07:24,777 all the way up to nearly 10%, with the exception of this bump 91 00:07:24,777 --> 00:07:29,415 around 1800, which we'll be seeing reoccurring throughout the 92 00:07:29,415 --> 00:07:35,421 data in different editions. The First Supplement is 93 00:07:35,421 --> 00:07:40,159 similar looking, but displays two little humps here and a much 94 00:07:40,159 --> 00:07:46,365 higher curve just between 1850 and 1900, before settling down 95 00:07:46,365 --> 00:07:50,336 at about the same amount, about the same percentage, just below 96 00:07:50,336 --> 00:07:52,472 10%, as the first edition. 97 00:07:52,638 --> 00:07:56,776 Interestingly enough, it's the Second Supplement that shows 98 00:07:56,776 --> 00:08:02,682 the most variation in its distribution, its percentage of 99 00:08:02,682 --> 00:08:06,986 female-authored quotations.
And we can see exaggerations 100 00:08:06,986 --> 00:08:11,691 here of that hump that I showed earlier, in the last slide, 101 00:08:11,691 --> 00:08:15,862 an exaggeration here again of the immediately post-1800 hump 102 00:08:15,862 --> 00:08:24,136 and a further bump here just after 1850 before coming 103 00:08:24,136 --> 00:08:29,509 down to that kind of region of between 7% and 10% that 104 00:08:29,509 --> 00:08:34,647 seems to govern the post-1800 landscape in the previous two 105 00:08:34,647 --> 00:08:40,052 editions and then rising slightly after 1925 or so. 106 00:08:40,052 --> 00:08:45,591 The remit of the Second Supplement was mainly, as you know, 107 00:08:45,591 --> 00:08:51,497 to attend to post-OED1 evidence. Robert Burchfield, the editor 108 00:08:51,497 --> 00:08:57,270 of the Second Supplement, when he did reach back into pre-1900 109 00:08:57,270 --> 00:09:00,506 material, was often doing so with a sense that the authors 110 00:09:00,506 --> 00:09:04,944 in question had been unjustly overlooked or unaccountably 111 00:09:04,944 --> 00:09:10,383 under-represented. And so here you see very quickly the top 112 00:09:10,383 --> 00:09:13,753 authors in each one of those humps, along with a running 113 00:09:13,753 --> 00:09:19,258 total of the percentage of female authors from that period in that supplement. 114 00:09:19,525 --> 00:09:25,565 Now, looking at OED3, we see a somewhat familiar curve here, 115 00:09:25,565 --> 00:09:33,039 but on a different, somewhat different scale. We have a 116 00:09:33,039 --> 00:09:37,643 generally increasing curve. We do have a certain number of humps 117 00:09:37,643 --> 00:09:42,081 here. There is this familiar 1800+ hump.
There is this other 118 00:09:42,081 --> 00:09:45,585 in the early 20th century and then we have a curve that 119 00:09:45,585 --> 00:09:48,721 accelerates rapidly towards the end, towards 30% 120 00:09:48,721 --> 00:09:49,455 (and that's the highest percentage reached in any edition). 121 00:09:49,455 --> 00:09:52,091 Now I can take all of those percentage graphs and put them 122 00:09:52,091 --> 00:09:58,230 on the same axis and you can see how they fare comparatively. 123 00:09:58,230 --> 00:10:02,468 But again, it's really only after 1970-1980 that things 124 00:10:02,468 --> 00:10:09,442 take off beyond the 12 or 13% mark, up again towards that 30%. 125 00:10:09,442 --> 00:10:18,718 To look at the OED3 curve here for another moment from 1800 to 126 00:10:18,718 --> 00:10:24,323 2000, now, we can also look at these three increases and 127 00:10:24,323 --> 00:10:28,494 observe the same kind of pattern occurring. 128 00:10:28,494 --> 00:10:36,235 This early 1800-1825 hump, 129 00:10:36,235 --> 00:10:42,274 which we've seen in every edition, is largely due to the 130 00:10:42,274 --> 00:10:47,413 literary writings of important authors of the time, and the top-10 authors 131 00:10:47,413 --> 00:10:50,449 account for 50% of all of the quotations by women that OED 132 00:10:50,449 --> 00:10:52,652 is able to add there. 133 00:10:52,652 --> 00:10:56,989 Now, things get somewhat more fragmented in that second time 134 00:10:56,989 --> 00:11:01,160 frame there, and by the time we get to the most recent 25 years 135 00:11:01,160 --> 00:11:04,497 of new quotations added in OED3, you can see a much higher 136 00:11:04,497 --> 00:11:11,470 fragmentation rate and a much more diverse group of authors. 137 00:11:11,470 --> 00:11:15,574 It's not until we get to Zadie Smith at number seven that a 138 00:11:15,574 --> 00:11:18,077 literary author is represented there.
139 00:11:19,345 --> 00:11:22,682 So I would argue that representativeness could be and perhaps 140 00:11:22,682 --> 00:11:26,819 even should be a question for lexicographers in both of the 141 00:11:26,819 --> 00:11:32,591 contexts I described earlier. That is, both in the case of 142 00:11:32,591 --> 00:11:37,763 vocabulary that is disproportionately used by female writers, 143 00:11:37,763 --> 00:11:42,501 perhaps disproportionately read by female readers, and 144 00:11:42,501 --> 00:11:48,307 also in the common lexis, the visibility of the female writing 145 00:11:48,307 --> 00:11:53,646 population within its representation as lexicographical 146 00:11:53,646 --> 00:11:56,615 evidence in the /Oxford English Dictionary/, a dictionary that 147 00:11:56,615 --> 00:12:05,324 we know carries a certain amount of public prestige and authoritativeness. 148 00:12:05,357 --> 00:12:08,360 One way of thinking about that first part -- the question of 149 00:12:08,394 --> 00:12:12,298 restricted vocabulary -- is to look at OED's own subject categories, 150 00:12:12,298 --> 00:12:18,270 that is to say the senses that the OED marks out 151 00:12:18,270 --> 00:12:23,642 as belonging to particular domains of thought and activity. 152 00:12:24,043 --> 00:12:27,646 So, for all the quotations in the current OED3, from 153 00:12:27,646 --> 00:12:36,655 1700 to 2020, what you see on this graph is the average of 154 00:12:36,689 --> 00:12:44,730 all quotations with a category, the average of all quotations 155 00:12:44,730 --> 00:12:49,602 with no category, and the average of each individual category 156 00:12:49,602 --> 00:12:55,508 in terms of its representation of female authors.
And you'll 157 00:12:55,541 --> 00:13:02,214 see that the only three categories in the OED that have more 158 00:13:02,214 --> 00:13:06,685 female representation than uncategorized, or let's say more 159 00:13:06,685 --> 00:13:15,227 general vocabulary, are these: Consumables, Trades and Crafts, 160 00:13:15,227 --> 00:13:19,265 and Education. And we can look at subcategories and see that 161 00:13:19,265 --> 00:13:24,470 within Consumables, it's Food & Cooking that have 30% female authorship, 162 00:13:24,470 --> 00:13:28,808 whereas Drinking & Tobacco has only 10%. The top 3 163 00:13:28,808 --> 00:13:32,945 categories in Trades and Crafts are Basket Making, Textiles, and 164 00:13:32,945 --> 00:13:39,018 Hairdressing -- all between 20 and 30% -- and in Education, 165 00:13:39,018 --> 00:13:46,292 General Education sits at 16% and University Education at 10%. 166 00:13:48,894 --> 00:13:53,599 Now, what we should expect as we look at different subsections of OED 167 00:13:53,599 --> 00:13:59,438 quotations is that as we get farther in the historical timeline, we 168 00:13:59,438 --> 00:14:03,776 get a higher representation of female-authored quotations. 169 00:14:03,776 --> 00:14:06,979 That's what the first charts showed us, for every single 170 00:14:06,979 --> 00:14:10,049 edition. And this indeed looks at every edition. We see that 171 00:14:10,049 --> 00:14:13,285 all of these categories have been pushed upwards. The Y axis 172 00:14:13,285 --> 00:14:18,324 remains the same as previous, but also the three top categories 173 00:14:18,324 --> 00:14:20,826 remain the same as previous. 174 00:14:20,826 --> 00:14:23,329 Everything has been pushed upwards, but nothing has been 175 00:14:23,329 --> 00:14:28,834 pushed leftwards in terms of the relation between the no-subject, 176 00:14:28,834 --> 00:14:36,442 general lexis, and the more specific topic-oriented lexis.
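[Editorial sketch] The category comparison described above (the female-authored share of each subject category set against the share for uncategorized quotations) could be computed roughly as follows. This is only a minimal illustration in Python: the toy records, the "(no category)" label, and the function name are assumptions made for the example, not the OED data or the speaker's actual pipeline.

```python
from collections import defaultdict

# Hypothetical toy records: (subject category or None, author gender).
# These values are invented for illustration; they are not OED data.
quotations = [
    ("Consumables", "F"), ("Consumables", "M"), ("Consumables", "M"),
    ("Education", "F"), ("Education", "M"),
    (None, "M"), (None, "M"), (None, "M"), (None, "F"),
]

def female_share_by_category(records):
    """Return {category: percentage of quotations with a female author}."""
    totals = defaultdict(int)
    female = defaultdict(int)
    for category, gender in records:
        # Quotations without a subject label are pooled as the baseline.
        key = category if category is not None else "(no category)"
        totals[key] += 1
        if gender == "F":
            female[key] += 1
    return {key: 100.0 * female[key] / totals[key] for key in totals}

shares = female_share_by_category(quotations)
baseline = shares["(no category)"]

# Categories whose female share exceeds the uncategorized baseline.
above_baseline = sorted(k for k, v in shares.items() if v > baseline)
print(shares)
print(above_baseline)
```

With real data the same grouping could be repeated per edition or per date range, which is the kind of cross-tabulation the talk goes on to describe.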
177 00:14:36,909 --> 00:14:40,112 The second thing that ought to push quotations upwards 178 00:14:40,112 --> 00:14:43,749 is the date of the addition of the quotation. 179 00:14:43,749 --> 00:14:46,886 You'll have guessed, of course, that new quotations in new entries 180 00:14:46,886 --> 00:14:55,160 that are written in the most recent time frame have the highest percentage. 181 00:14:55,160 --> 00:15:02,434 And indeed, that is what we see. And we do have some shift here in 182 00:15:02,434 --> 00:15:05,671 terms of what kind of categories are being more represented 183 00:15:05,671 --> 00:15:09,541 with the addition of the Social Sciences, where Sociology 184 00:15:09,541 --> 00:15:14,513 is among the highest in terms of its representation by 185 00:15:14,513 --> 00:15:15,547 female authors. 186 00:15:20,853 --> 00:15:24,857 In the course of this experiment, I did all sorts of 187 00:15:24,857 --> 00:15:30,596 cross-comparisons, I tried to apply as many parameters 188 00:15:30,596 --> 00:15:35,167 in terms of the lemma features and the publication features 189 00:15:35,167 --> 00:15:40,973 to crosstabulate in all sorts of ways. And the conclusion I came to is that 190 00:15:40,973 --> 00:15:44,944 there's virtually /no/ parameter combination that can produce 191 00:15:44,944 --> 00:15:48,213 anything like 50% female-authored quotations. 192 00:15:48,213 --> 00:15:50,582 That is to say, you can shift the date, you can shift the 193 00:15:50,582 --> 00:15:54,253 subject, you can shift the edition of the revision. 194 00:15:54,253 --> 00:16:01,627 And you're still, in almost all cases, below 30 or 35% female authorship. 195 00:16:01,627 --> 00:16:04,630 There are, of course, some exceptions.
If you get down into 196 00:16:04,630 --> 00:16:09,435 the granular nitty gritty, if you look at OED3 New Entries 197 00:16:09,435 --> 00:16:14,406 between 1950 and 2000, within the category of Hairdressing, 198 00:16:14,406 --> 00:16:17,910 yes, you're at 88% female authorship, but you're only talking 199 00:16:17,910 --> 00:16:22,715 about eight quotations there in total. 200 00:16:22,715 --> 00:16:28,153 And similarly, there must have been one or two women who wrote 201 00:16:28,153 --> 00:16:31,323 on Roman Law between 1990 and 2000 that have been picked up. 202 00:16:31,323 --> 00:16:33,392 But it's a total of three quotations that we're looking at. 203 00:16:33,392 --> 00:16:35,894 There may be one or two more there that aren't part of 204 00:16:35,894 --> 00:16:40,432 the dataset within those categories. We're looking at 205 00:16:40,432 --> 00:16:43,068 very small datasets; these are almost like random and meaningless 206 00:16:43,068 --> 00:16:46,805 fluctuations. And if you look at large groupings, you're 207 00:16:46,805 --> 00:16:52,378 almost always under 35%, and you're never, never close to 50%. 208 00:16:54,380 --> 00:16:58,417 In the next couple of slides, what I want to do is bring in 209 00:16:58,417 --> 00:17:03,122 the Library of Congress dataset in a more focused way. 210 00:17:03,122 --> 00:17:06,025 And what I've been able to do here is look at a subset of 211 00:17:06,025 --> 00:17:11,330 the OED that I can match to the Library of Congress shelfmarks. 212 00:17:11,330 --> 00:17:13,565 Because chronology is so important, and the date 213 00:17:13,565 --> 00:17:16,235 of the quotation is so important, what I have here is the most 214 00:17:16,235 --> 00:17:21,473 complicated graph that I am going to be showing you.
And 215 00:17:21,473 --> 00:17:26,245 what it does is split out for every year the proportion of 216 00:17:26,245 --> 00:17:31,417 female authorship within every one of those top-level 217 00:17:31,417 --> 00:17:36,755 LC shelf mark classes. So what you see in the middle, that 218 00:17:36,755 --> 00:17:43,195 bright orange line, is the total difference between the OED and 219 00:17:43,195 --> 00:17:49,435 the Library of Congress in terms of its representation of female authors per year. 220 00:17:49,935 --> 00:17:53,472 Now, I've taken out the middle of this 221 00:17:53,472 --> 00:17:56,542 curve because it's too noisy. The middle just represents 222 00:17:56,542 --> 00:18:00,946 the maximum and the minimum values achieved by the 223 00:18:00,946 --> 00:18:05,117 general line average. So we don't have a bunch of dots there, 224 00:18:05,117 --> 00:18:11,590 but we do have isolated the percentages by shelf mark that 225 00:18:11,590 --> 00:18:18,063 exceed the minimum or the maximum in each year. And we can 226 00:18:18,063 --> 00:18:25,270 pick out a couple of interesting effects. We have Education which 227 00:18:25,270 --> 00:18:32,711 is lifting the curve both in this phase here where OED is 228 00:18:32,711 --> 00:18:36,048 more proportionally female than the LOC. Remember, this 229 00:18:36,048 --> 00:18:40,052 is never close to 50% proportionally female. It's just +1 230 00:18:40,052 --> 00:18:46,125 or +2 points above how proportionally female the Library of Congress is. 231 00:18:46,125 --> 00:18:49,528 Music and Fine Arts do a lot of that work in the middle. And 232 00:18:49,528 --> 00:18:54,533 History does a little bit at the edges, just as the line 233 00:18:54,533 --> 00:19:04,276 begins to go below the 0% mark in about 1920, 1925 or so. 234 00:19:04,276 --> 00:19:09,381 Then we get this big gap where the OED is underperforming, if 235 00:19:09,381 --> 00:19:13,652 you want to put it that way, the Library of Congress.
236 00:19:13,652 --> 00:19:20,058 And important in that curve are, as you might have 237 00:19:20,058 --> 00:19:27,432 guessed, Medicine and Bibliography, but, you know, there's also a 238 00:19:27,432 --> 00:19:31,270 fair amount of Music and Fine Arts in this area here. 239 00:19:31,270 --> 00:19:35,307 I think most important to this curve here is how underrepresented 240 00:19:35,374 --> 00:19:40,379 Literature and Language is. That is to say, this is the biggest 241 00:19:40,379 --> 00:19:44,016 part of the two corpora. And so to have it down here at, 242 00:19:44,016 --> 00:19:49,788 let's say, -7 or -8 points 243 00:19:49,788 --> 00:19:53,458 is putting a lot of downwards pressure on this graph. 244 00:19:53,659 --> 00:19:54,893 The final graph... 245 00:19:55,394 --> 00:20:01,300 That is my exploration of this newly annotated dataset marking 246 00:20:01,300 --> 00:20:05,771 gender in the Oxford English Dictionary. I wanted really to 247 00:20:05,771 --> 00:20:10,809 raise more questions than I could answer in doing this, so 248 00:20:10,842 --> 00:20:15,314 if there are questions that occur to anyone in the course 249 00:20:15,314 --> 00:20:16,915 of listening to this presentation that they'd like to bring 250 00:20:17,382 --> 00:20:22,821 to the Q&A session, I'd be delighted to hear them and take 251 00:20:22,821 --> 00:20:27,659 a shot at answering them, or at least giving some ideas in that direction. 252 00:20:27,659 --> 00:20:31,163 The final slide that I'll put up right now and leave at the end 253 00:20:31,163 --> 00:20:35,567 of this video is a set of references, including where you 254 00:20:35,567 --> 00:20:39,571 can find this presentation going forward, and how you might be 255 00:20:39,571 --> 00:20:43,976 able to get in touch with me if you wish to pursue any lines of inquiry. 256 00:20:44,176 --> 00:20:45,043 Thank you. :-)
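[Editorial sketch] The OED versus Library of Congress comparison in the talk is expressed in percentage points: the female-authored share in the OED for a given year minus the corresponding share in the Library of Congress. A minimal Python sketch of that arithmetic follows. The counts, the dictionary layout, and the function names are hypothetical and chosen only to illustrate the calculation; they are not figures from the dataset.

```python
def female_pct(female, total):
    """Female-authored share as a percentage of all authored items."""
    return 100.0 * female / total

def point_difference(oed_counts, lc_counts):
    """Per-year OED share minus LC share, in percentage points.

    Both arguments map year -> (female_count, total_count).
    Years missing from either source, or with no items, are skipped.
    """
    diffs = {}
    for year in sorted(set(oed_counts) & set(lc_counts)):
        oed_f, oed_n = oed_counts[year]
        lc_f, lc_n = lc_counts[year]
        if oed_n == 0 or lc_n == 0:
            continue
        diffs[year] = female_pct(oed_f, oed_n) - female_pct(lc_f, lc_n)
    return diffs

# Invented counts for two years within one shelf mark class.
oed = {1900: (12, 100), 1950: (10, 100)}
lc = {1900: (10, 100), 1950: (18, 100)}

diffs = point_difference(oed, lc)
print(diffs)
```

A positive value means the OED is proportionally more female than the Library of Congress for that year and class; a negative value means it is less so. Neither says anything about closeness to 50%.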