Douwe Osinga's Blog: Semantics, Maps and Word2Vec

Thursday, September 29, 2016

This week I published a new project: worldmapof. It uses Word2Vec to calculate the distance between a given word and the name of each country, then colors the countries according to that distance. The results are often what you expect, with some interesting surprises thrown in. You want to see Colombia and Ethiopia light up for coffee, but wonder why Greenland also features prominently, only to learn one Google search later that Greenlandic coffee is a thing.

Coffee projected on a map
Word2Vec analyzes large amounts of text - in this case from Google News. By training a model to predict a word from its context, it associates a 300-dimensional vector with each word. The interesting thing about these vectors is that they carry some semantic meaning: if the distance between two vectors is small, the two words are related. So if the distance between the vector for Colombia and the vector for coffee is small, Colombia and coffee are related, and we can paint the country a brighter shade of green.
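To make that concrete, here is a minimal sketch of the distance computation, assuming gensim and the pretrained Google News vectors (the post doesn't say which library the project uses, so gensim is my assumption):

```python
# Minimal sketch: score countries against 'coffee' using cosine similarity.
# gensim and the standard pretrained Google News model are assumptions,
# not necessarily what worldmapof actually uses.
from gensim.models import KeyedVectors

# The commonly distributed 300-dimensional Google News model.
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

for country in ['Colombia', 'Ethiopia', 'Greenland', 'Norway']:
    # Higher cosine similarity means smaller distance, i.e. more related,
    # which would translate to a brighter green on the map.
    print(country, vectors.similarity('coffee', country))
```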

To make this work in an online situation, I imported the word2vec data into Postgres. If you want to play with that yourself, you can find the code on GitHub.
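For the curious, the import step could look roughly like this; the table layout and connection details here are my own sketch, not necessarily what the repo does:

```python
# Rough sketch of loading word2vec vectors into Postgres with psycopg2.
# The schema and names are hypothetical; see the GitHub repo for the
# actual code.
import psycopg2
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

conn = psycopg2.connect('dbname=word2vec')  # hypothetical database name
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS vectors '
            '(word TEXT PRIMARY KEY, vec FLOAT8[])')
# index_to_key is the gensim 4 attribute; for a real import of the full
# three-million-word vocabulary you'd want COPY instead of row inserts.
for word in vectors.index_to_key:
    cur.execute('INSERT INTO vectors (word, vec) VALUES (%s, %s)',
                (word, vectors[word].tolist()))
conn.commit()
```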

Since the underlying model is trained on a Google News archive, some biases shine through. Some countries don't appear often in the news - Chad, the Central African Republic and the Republic of Congo (not to be confused with the Democratic Republic of Congo) spring to mind. This makes the vectors for those countries unstable. One article about a guy who went walking in Chad, and suddenly Chad lights up for the word walk, even though the two aren't particularly related.

The US has the opposite problem. American news talks about "the average American" or "in the US" even when the subjects discussed aren't particularly American at all. So the US tends to do well for day-to-day terms and maybe scores a bit low for international queries. I created a small spin-off, usmapof, that uses the names of the US states instead. Comparing the maps for "Germany", "Sweden" and "Norway" gives you an idea of where migrants from those countries ended up. Or, if you want to know where hockey is popular:
Hockey lights up the north

It's fun to play with, but sometimes the limits of the model shine through. The data is somewhat old, so you can't use it well to illustrate current political events. Moreover, names of states are somewhat poor representations of the underlying entities. Washington usually does not mean the state. England makes New England light up for the US, but probably not because so many English settlers went there.

So I wonder if we can do better. What if, instead of running a skip-gram algorithm over windows of words, we preprocessed the text into entities first? Then quite possibly the model would learn which entities play similar roles, rather than which words do. We might even want to incorporate the roles entities play in sentences, which could allow the model to learn from a fragment like "Oil was found in Oklahoma" that oil is something that can be found and that Oklahoma is a place.
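A hedged sketch of what that preprocessing could look like, using spaCy for the entity recognition and gensim for the skip-gram training (both are my choices here, not tools the post names):

```python
# Sketch: collapse named entities into single tokens, then train a
# skip-gram model over the entity-level corpus. spaCy and gensim are
# assumptions; the post doesn't commit to any particular tools.
import spacy
from gensim.models import Word2Vec

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('merge_entities')  # built-in component: one token per entity

def entity_sentence(text):
    # Join multi-word entities with underscores so each one becomes a
    # single vocabulary item for the skip-gram model.
    return [tok.text.replace(' ', '_') for tok in nlp(text)]

corpus = ['Oil was found in Oklahoma.',
          'Coffee is grown in Colombia.']  # stand-in for a news archive
sentences = [entity_sentence(text) for text in corpus]

# sg=1 selects the skip-gram variant; the other parameters are purely
# illustrative.
model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)
```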

Maybe I should try SyntaxNet out for this and see what happens.
