I’ve been spending the weekend experimenting with vector space modelling and poetic language.
Vector space word embedding models use learning algorithms on very large corpora in order to map a unique location in n-dimensional space to each token (=word) in the corpus. “N-dimensional space” is just a mathy-sounding way of saying that multiple (or n) features are being measured against each other on the same basis.
I’m using the Word2Vec module in Python to do word embedding with four corpora: 1) a very large poetry corpus (POE); 2) a medium-sized selection of English Wikipedia (WIK); 3) all of the OED’s quotation data (OED); and 4) the British National Corpus (BNC). Word2Vec basically looks at what words are near what other words, and then thinks about the relationships that emerge from this a few billion times. It’s important to keep in mind that the model isn’t told anything about semantics: all of its knowledge is purely contextual.
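For concreteness, here is roughly what that set-up looks like, assuming gensim’s implementation of Word2Vec; the file names and training parameters below are placeholders for illustration, not the ones I actually used:

```python
# Rough sketch: train one Word2Vec model per corpus with gensim.
# Each corpus is assumed to be a plain-text file, one sentence per line,
# tokenised on whitespace.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

corpora = {
    'POE': 'poetry_corpus.txt',       # placeholder file names
    'WIK': 'wikipedia_sample.txt',
    'OED': 'oed_quotations.txt',
    'BNC': 'bnc.txt',
}

models = {}
for name, path in corpora.items():
    models[name] = Word2Vec(
        LineSentence(path),
        vector_size=100,   # dimensions of the space ('size' in older gensim versions)
        window=5,          # how many words either side count as context
        min_count=5,       # drop very rare tokens
        workers=4,
    )
    models[name].save(name + '.w2v')
```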
Using the word mappings arrived at by the model, distances between words can be calculated. These models have been shown to perform well at identifying semantic relationships such as synonymy, association, and analogy.
So, for instance, if you ask my BNC model for the 10 words with the shortest distance from ‘man’, it will return something like this (a code sketch follows the list):
woman 0.85687
boy 0.79566
girl 0.78907
soldier 0.7879
chap 0.75913
lad 0.75345
guy 0.74846
bloke 0.74239
sailor 0.73324
creature 0.72413
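Assuming the gensim set-up sketched above, a list like this comes from a single query (the exact scores will of course vary with the training run):

```python
# Nearest neighbours of 'man' in the BNC model, as listed above.
for word, score in models['BNC'].wv.most_similar('man', topn=10):
    print(word, round(score, 5))
```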
Note again that these results have nothing to do with the distance of one of these words from another in the linear text, but rather with their distance in a multidimensional representation of the interrelations between these words and all the other words in their immediate context.
And it works pretty well: the list corresponds to our intuitions about association with ‘man’, albeit representing different kinds of association. Some are members of the same class (woman, boy, girl – all types of person), some are subclasses (soldier, sailor – all types of man), some super-classes (creature – of which man is a type), some are more like synonyms used in specific contexts or with specific connotations (chap, guy, bloke).
This kind of model is also pretty intuitive when it comes to narrowing classes. We can isolate the type of relationship exemplified by a group of terms by “adding” (actually more like averaging) their vector representations together and finding the closest words to the sum. So [man+guy] returns a set with more words like both of those (a code sketch follows the list):
bloke 0.87162
chap 0.85805
fella 0.81898
lad 0.81394
feller 0.80844
boy 0.80681
girl 0.80502
geezer 0.79548
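In gensim terms (still a sketch), that “addition” is just a most_similar call with two positive words; the library averages their normalised vectors and ranks everything else by closeness to the result:

```python
# Nearest neighbours of the combined [man + guy] vector.
print(models['BNC'].wv.most_similar(positive=['man', 'guy'], topn=8))
```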
We can also subtract vectors, creating analogies. So, working from the examples above, the analogy “man is to guy as woman is to X” could be represented like this: what is close to “guy,” with “man” taken out and “woman” added back in, or [guy-man+woman]. The model returns “girl” as the top answer. For more examples and a longer discussion of using vector space models with text corpora, see Ben Schmidt, “Word Embeddings for the digital humanities” [25.10.15].
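The same call handles subtraction, so the analogy above would look like this (again assuming gensim):

```python
# "man is to guy as woman is to X", i.e. [guy - man + woman].
# The top answer reported above is 'girl'.
print(models['BNC'].wv.most_similar(positive=['guy', 'woman'],
                                    negative=['man'], topn=1))
```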
If these models are good at reproducing conceptual relationships based on texts, I’m thinking, maybe they can tell us something about the internal logic of distinct subsets of human discourse. Like poetry.
There are many factors that make these corpora different from each other, including register, subject matter, and time span. POE and OED are historical, running from c.800 to today, and weighted heavily towards the 17th and the 19th centuries. WIK and BNC are contemporary (1980-present). WIK and OED are broad in subject matter, whereas POE and BNC are more restricted. WIK is technical in register (and partial – only the first 100mb or so of the entire Wiki corpus), POE poetic by definition, BNC (more or less) journalistic and/or colloquial, and OED mixed.
For this approach to be of any use, the results of the different models based on these four corpora ought to agree more or less on those basic relationships (such as “man : woman :: boy : girl”) which we would expect to hold across time and over type of text, while also presenting telling divergences in other relationships.
My four models do agree on a set of basic test analogies. Although I’ve just begun to explore divergences, I have come across a few examples that appear to show significant differences in how words function within different kinds of text. I’ll share and comment on a few of these below. Although there’s nothing really revelatory about these at this stage, I see them as a good first step in validating the overall approach. I’ve put them in increasing order of interest.
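For what it’s worth, the agreement check can be scripted along these lines; the analogies below are illustrative stand-ins, not my actual test set:

```python
# Run a small set of test analogies against all four models and compare.
tests = [
    ('man', 'woman', 'boy', 'girl'),          # man : woman :: boy : girl
    ('king', 'queen', 'prince', 'princess'),  # illustrative only
]

for name, model in models.items():
    for a, b, c, expected in tests:
        guess = model.wv.most_similar(positive=[c, b], negative=[a], topn=1)[0][0]
        print(f'{name}  {a}:{b} :: {c}:{guess}  (expected: {expected})')
```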
5 Most Similar: ‘BANK’
I start with this word because it’s a classic test word for polysemy and ambiguity, with its various meanings belonging to very different subject matter. Sure enough, while each corpus has a different spelling/inflection of ‘bank’ as the nearest word (that’s a good thing), further down the lists we see different meanings of ‘bank’ are being activated in the different corpora, in ways that we might expect: POE has words to do with landscape (remember Daffodils: ‘Along the margin of a bay…’); WIK has words to do with banking [‘Clearstream’, just by the way, “is a post-trade services provider owned by Deutsche Börse AG”, according to Wikipedia]; and so does our BNC model (but weighted towards examples of UK financial institutions); and our dictionary model has both kinds of word, with perhaps an emphasis on the topographical, incorporating features of the modern landscape.
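All of these “5 Most Similar” comparisons boil down to the same query put to each model in turn; a small helper like this (hypothetical, but in keeping with the gensim sketches above) reproduces them:

```python
# Print the top-n neighbours of a word in each of the four models.
def compare(word, topn=5):
    for name, model in models.items():
        if word in model.wv:   # skip models whose vocabulary lacks the word
            neighbours = [w for w, _ in model.wv.most_similar(word, topn=topn)]
            print(f"{name}: {', '.join(neighbours)}")

compare('bank')
```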
We can see a similar thing going on, perhaps more dramatically, with ‘content’ and ‘apple’:
5 Most Similar: ‘CONTENT’
5 Most Similar: ‘APPLE’
Here we see the effect of WIK being not only contemporary, but also a work of reference. There may be some use of “content, adj.” in the Wiki article on “contentment”, but the corpus is much more likely to employ “content, n.” in the modern sense of “that which is contained in a document (or other medium).” Similarly, while the main Wikipedia article for “apple” is indeed for the fruit, what the model is showing is that there’s little there that correlates strongly to how “apple” is used elsewhere in the corpus. However, the next article, ‘Apple Inc.’, actually does use “apple” in contexts representative of the WIK corpus as a whole, alongside other words to do with computing.
So far, so ho-hum. Things start to get a little bit more interesting when you compare how the models diverge in the type of associations they cluster around, rather than the semantics of the associated words. For instance:
5 Most Similar: ‘BLUE’
I have repeated this with several colour words, and for most the pattern is the same: POE gives a list of members of the class the colour word names (i.e. varieties or shades of that colour), while the others give a list of other members of the class the colour belongs to (i.e. other colours). What might this be telling us, beyond the fact that poetry has a bigger and more finely grained colour vocabulary (a simple word frequency count might have told us that)?
5 Most Similar: ‘LIKE’
This is interesting, again because POE is consistently giving a whole other class of words: while the other models are mostly giving alternatives to two meanings of ‘like’ (i.e. either ‘resembling’ or ‘fancying’/‘preferring’), POE is giving functional alternatives. In first position is “as”, the other word that introduces similes.
I intend to do a few more tests along these lines, more systematically perhaps, to see what kinds of words are gravitating towards different types of relationship in the different corpora. But I’m also intrigued by the ability of word-embedding to reproduce analogies, because analogies are types of comparisons, and so are related to metaphor.
Metaphor is present in all of language, to one degree or another, but we tend to think of poetry as being more aware of and more interested in its metaphorical dimension. So, for your contemplation, here are a few common or conventional metaphors, reformulated as analogies that I’ve asked the various models to complete. Some work, some don’t. If the model returns a list that makes no sense at all, as some do, I think it’s fair to say that that metaphor is not a part of the logic of that corpus.
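Mechanically, each reformulated metaphor is put to the models just like the analogies above. For example, a conventional metaphor such as “life is a journey” might be phrased as life : journey :: death : X; this particular metaphor is an illustration of the method, not necessarily one of the items I tested:

```python
# A conventional metaphor reformulated as an analogy and put to each model.
for name, model in models.items():
    print(name, model.wv.most_similar(positive=['death', 'journey'],
                                      negative=['life'], topn=5))
```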
An interesting conversation developing over on my Facebook share of this article:
Hi, I’m a machine learning enthusiast interested in messing around with word2vec trained on poetry data and was wondering, where did you found your POE corpus? Is it available to the public?
I used a commercial corpus for this, which isn’t freely available to the public, unfortunately.