A Beginner’s Guide to word2vec AKA What’s the Opposite of Canada?

The idea for writing this post came from a single line in the appendix to a presentation: “what’s the opposite of Canada?”. While this could be the setup for some pretty poor jokes, it’s actually the entrance to a rabbit warren of fascinating geeky distractions.

It turns out that while we can typically group similar or related words together - identifying that there is some connection between “Canada” and “snow” for example - we generally have a much weaker intuition for opposites. There is obviously a relatively small set of words where we’d likely have consensus on opposites - mainly adjectives like “dark”, “tall”, “cold” etc. - but in general, “oppositeness” is a less well-defined concept than “similarity”.

This is one of those insights that helped researchers build better machine models for understanding language. These models are important because they underpin search - and advancements in this area are enabling trends like the move from keywords to intents.

An intro to 'keywords to intent' from Emerging Trends in Online Search

Those who’ve been around the industry for a while may remember Latent Dirichlet Allocation (LDA) being a topic of discussion at our first ever conference back in 2009. LDA is one model for understanding the topic of a page. The approach that I want to talk about today is less about pages (or “documents” in the computer science literature) and more about understanding words.

Surprising and cool results from thinking of words as vectors

The particular development that I want to talk about today is a model called word2vec. If you have a mathematical or computer science background, you should head straight on over to the TensorFlow tutorial on word2vec and get stuck in. If you don’t, I wanted to share some surprising and cool results that don’t rely on you knowing any complex mathematics.

[Sidenote: if you haven’t checked out TensorFlow yet, it’s a really important glimpse into Google’s strategy, and worth reading more about].

The easiest way to think about word2vec is that it figures out how to place words on a “chart” in such a way that their location is determined by their meaning. This means that words with similar meanings will be clustered together. This represents the intuitive part of my opening example - words with semantic relationships between them will be closer together than words without such relationships.
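To make the “chart” idea concrete, here is a minimal sketch that ranks words by how close they sit to “canada”. The two-dimensional coordinates are invented purely for illustration (a real word2vec model learns vectors with hundreds of dimensions from a corpus), and the helper names `cosine_similarity` and `most_similar` are my own rather than any library's API.

```python
import math

# Toy 2-D "chart" coordinates, invented for illustration only.
# A real word2vec model learns much higher-dimensional vectors from text.
TOY_VECTORS = {
    "canada": (0.9, 0.8),
    "snow": (0.8, 0.9),
    "hockey": (0.95, 0.7),
    "banana": (-0.7, 0.2),
    "beach": (-0.6, 0.5),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, top_n=2):
    """Rank every other word by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


neighbours = most_similar("canada", TOY_VECTORS)
print(neighbours)
```

Cosine similarity compares direction rather than raw distance, which is a common choice for word vectors; it is used here as an assumption, not as a claim about any particular implementation.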

The surprising part, though, is that the gaps and distances on the chart turn out to have meanings as well. If you go to the place where “king” appears, and move the same distance and direction as the step from “man” to “woman”, you end up in the area of the chart where “queen” appears. The same holds for all kinds of semantic relationships.

You can think of this as:

  • [king] - [man] + [woman] ~= [queen] (another way of thinking about this is that [king] - [queen] is encoding just the gendered part of [monarch])

  • [walking] - [swimming] + [swam] ~= [walked] (or [swam] - [swimming] is encoding just the “past-tense-ness” of the verb)

  • [madrid] - [spain] + [france] ~= [paris] (or [madrid] - [spain] ~= [paris] - [france] which is presumably roughly “capital”)

You can read some more interesting examples here and here (such as “[library] - [books] = [hall]”).

Although it gets a little technical in places, this post from Stitch Fix has some great discussion and a brilliant animated gif that explains the vector thing visually.

Bringing it back to the idea that started this post (“what’s the opposite of Canada?”) - we might imagine “opposites” as having to do with the direction of the vector, so that the opposite of a word is whatever sits nearest to its negated vector - though it seems that doesn’t result in entirely intuitive outcomes. This post explores this line of thought and highlights how unintuitive it is:

“If we just look for the word closest to negative hitler (-1 × hitler), we find Plantar fascia, the connective tissue that supports the arch of your foot.” [source]

Some academics have argued [warning: highly theoretical paper, PDF] that it’s no surprise that the negative of a word vector has little intuitive meaning, because humans aren’t really equipped with a general concept of “opposite”.
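One way to see why negating a vector gives odd answers: the cosine similarity between -v and any word w is exactly minus the similarity between v and w. So “the word nearest to negative X” is just “the word least similar to X”, whatever that happens to be. The sketch below checks this on invented toy vectors; the words, coordinates and helper names are all assumptions made for illustration.

```python
import math

# Invented 2-D toy vectors; "teapot" is simply placed far from "canada".
TOY_VECTORS = {
    "canada": (0.9, 0.8),
    "snow": (0.8, 0.9),
    "beach": (-0.6, 0.5),
    "banana": (-0.7, 0.2),
    "teapot": (-0.8, -0.6),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(target, vectors, exclude):
    """Word whose vector points most nearly the same way as `target`."""
    candidates = [w for w in vectors if w != exclude]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


def least_similar(word, vectors):
    """Word whose vector is least similar to that of `word`."""
    candidates = [w for w in vectors if w != word]
    return min(
        candidates, key=lambda w: cosine_similarity(vectors[word], vectors[w])
    )


negated = tuple(-x for x in TOY_VECTORS["canada"])
opposite = nearest(negated, TOY_VECTORS, exclude="canada")
furthest = least_similar("canada", TOY_VECTORS)
print(opposite, furthest)  # the two lookups always agree
```

In a real vocabulary with a very large number of words, the least similar word is typically something obscure and unrelated rather than a meaningful antonym.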

So what else can we do with these vectors?

Well, it turns out that given the right corpus of text to learn from, you can do word-by-word translation [PDF]. You can’t expect to translate a document, because the model has no concept of sentences or context, but you can ask “what is the Spanish word for ‘one’?”. The amazing thing is that in the vector space, this amounts to roughly a rotation - in other words, the relationships between words are roughly preserved across languages.
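Here is a minimal sketch of the rotation idea, under heavy assumptions: the English vectors are invented two-dimensional toys, and the “Spanish” vectors are manufactured by rotating them by a hidden 40 degrees and adding a little noise. The code then fits a rotation from three known word pairs and uses it to translate a fourth word it has not seen. As I understand it, the linked paper fits a more general linear map in far more dimensions, so treat this as an illustration of the principle rather than the paper's method; all names here are hypothetical.

```python
import math

TRUE_ANGLE = math.radians(40)  # hidden rotation, used only to build toy data

# Invented English vectors and a small English -> Spanish dictionary.
ENGLISH = {
    "one": (1.0, 0.1),
    "two": (0.8, 0.6),
    "three": (0.3, 0.9),
    "four": (-0.4, 0.8),
}
PAIRS = {"one": "uno", "two": "dos", "three": "tres", "four": "cuatro"}
# Small fixed offsets so the two spaces only approximately line up.
NOISE = {
    "uno": (0.03, -0.02),
    "dos": (-0.02, 0.03),
    "tres": (0.02, 0.02),
    "cuatro": (-0.03, -0.01),
}


def rotate(vec, angle):
    """Rotate a 2-D vector anticlockwise by `angle` radians."""
    x, y = vec
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


# Manufacture the toy "Spanish" space: rotated English plus noise.
SPANISH = {}
for en, es in PAIRS.items():
    rx, ry = rotate(ENGLISH[en], TRUE_ANGLE)
    nx, ny = NOISE[es]
    SPANISH[es] = (rx + nx, ry + ny)


def fit_rotation(pairs):
    """Least-squares 2-D rotation angle taking each source onto its target."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in pairs)
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in pairs)
    return math.atan2(cross, dot)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def translate(word, angle):
    """Rotate the English vector, then pick the nearest Spanish word."""
    mapped = rotate(ENGLISH[word], angle)
    return max(SPANISH, key=lambda es: cosine_similarity(mapped, SPANISH[es]))


# Learn the rotation from three known pairs; "four" is held out.
seed = [(ENGLISH[en], SPANISH[PAIRS[en]]) for en in ("one", "two", "three")]
fitted_angle = fit_rotation(seed)
translation = translate("four", fitted_angle)
print(round(math.degrees(fitted_angle), 1), translation)
```

The point of holding out “four” is that the mapping is learned from a small seed dictionary and then applied to words that were never paired up by hand.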

Applications

I tend to learn about this stuff in large part to understand the cutting edge of what computer language processing is capable of - I find it gives me a better understanding of how Google might be working, and where the limits might be. This is a bigger piece of the puzzle for me than actually finding real-world applications where I can use this stuff.

It’s interesting to note for example that a Google spokesperson said, of RankBrain:

“It’s related to word2vec in that it uses ‘embeddings’ - looking at phrases in high-dimensional space to learn how they’re related to one another.”

Having said that, there are applications for this kind of technology in building compelling websites at scale. Two examples I’ve come across are:

Dice building scalable skills pages

Dice uses it to power skills pages. They use a word vector model to understand that, for example, analytics should be classified as similar to business intelligence. This allows them to group together job adverts even if they don’t use the exact same language to describe the role.

Stitch Fix classifying items of clothing

The aforementioned post from Stitch Fix describes how they use this technology to build a recommendation engine that can, for example, take a top that a customer likes and find a similar one that is suitable for someone who is pregnant.

Concerns

There are a variety of theoretical challenges and constraints that come with this kind of approach - the biggest being that context is a huge part of language, and any “bag of words” model will be limited in its usefulness. I’m inclined to say that the biggest leaps forward from here are more likely to be document models than word models, but this stuff is still fascinating.

The broader concern is that any kind of machine learning is only as good as its source information (the “corpus” of text in this case). Specifically, when we’re talking about understanding language, we have to realise that any biases embedded in the corpus will remain in the model. If the training set has sexist language, for example, the machine will make sexist associations. It’s ridiculously easy to find spurious correlations in large datasets - and we know that people and human language carry a ton of biases. Without a lot of care, we could easily end up in a situation where our supposedly rational machines encode these biases. TechCrunch had an interesting article on the dilemma.

Recommended Reading

If you’ve enjoyed this high-level overview, and have more of an academic / mathematical mind-set, you might like to dive into some of the more theoretical areas:

About the author
Will Critchlow

Will founded Distilled with Duncan in 2005. Since then, he has consulted with some of the world’s largest organisations and most famous websites, spoken at most major industry events and regularly appeared in local and national press. For the...