
Friday, October 21, 2016

WorldSizer

When I was 11 or so, I saw my first Times World Atlas. I was blown away. It was so much better in every way compared to the school atlas I was used to. The maps were huge, detailed and beautiful. The thematic maps and diagrams visualized everything from land use to economy to desertification.
Countries resized to reflect their populations

These days, with Google Maps and data visualizations of every type, you don't hear much about atlases, other than that people cut up old ones to sell the individual maps as wall decorations. There was one type of visualization in the Times World Atlas that impressed me very much but that I haven't seen online: resized countries.

The idea is that you change the size of a country according to some statistic while trying to minimize the overall distortion of the map. As a kid, I wondered how you would calculate this - now I am pretty sure they just had an artist do their best. It is therefore with some pride that I am publishing an algorithm here to do something similar online: WorldSizer.

It's a two-step process. The first, offline step takes the country data from the CIA Factbook and the shape information from Natural Earth. The CIA data is nice because it is more evenly edited than, say, Wikipedia, but it does need some massaging. The country codes used by the CIA, especially, are seen nowhere else. One wonders if this has ever led to the wrong government being overthrown in a small Latin American country.

The shape files from Natural Earth have one or more shapes per country. The first step is to apply an area-preserving projection to the shapes. Resizing a Mercator map according to population would only cause confusion, since it would still show India too small and Canada too big. The next step replaces the points of the shapes with lists of indexes into a global list of points. Here we also try to make sure that points on shared borders are stored only once. Since points on borders are now shared, modifying the shape of one country will also modify the shape of the other.
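
A minimal sketch of that deduplication step, assuming each country comes in as a list of shapes and each shape as a list of (x, y) tuples after the projection; the rounding precision and the function name are my own invention, not necessarily what the real pipeline does:

def index_shapes(countries):
    points = []          # global list of points, shared between countries
    point_index = {}     # rounded coordinate -> index into points
    indexed = {}
    for country, shapes in countries.items():
        indexed[country] = []
        for shape in shapes:
            indexes = []
            for x, y in shape:
                # round so that points on a shared border map to the same index
                key = (round(x, 6), round(y, 6))
                if key not in point_index:
                    point_index[key] = len(points)
                    points.append([x, y])
                indexes.append(point_index[key])
            indexed[country].append(indexes)
    return points, indexed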

The online step takes the output of this and lets the user pick a measure that the world will be "resized" to. For each country, we calculate the deflation or inflation factor needed to reflect the chosen measure. Then, in an iterative process, we calculate for each shape how far its current area is off from the target area. If the shape is too small, all its points are pushed away from the center; if it is too big, they are pushed towards the center.
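
Roughly, one iteration of that loop could look like the sketch below. The step size and the square-root scaling are assumptions on my part, not necessarily what worldsizer.js does; because points on shared borders live in the global list, scaling one country automatically nudges its neighbours too.

import math

def polygon_area(poly):
    # shoelace formula for a simple polygon given as a list of (x, y) tuples
    return abs(sum(x1 * y2 - x2 * y1
                   for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]))) / 2

def resize_step(points, shapes, target_areas, step=0.05):
    """One iteration: push each shape's points away from or towards its center
    depending on whether it is smaller or bigger than its target area."""
    for shape_id, indexes in shapes.items():
        xs = [points[i][0] for i in indexes]
        ys = [points[i][1] for i in indexes]
        cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
        area = max(polygon_area([tuple(points[i]) for i in indexes]), 1e-9)
        # scale > 1 means the shape is too small and should grow
        scale = 1 + step * (math.sqrt(target_areas[shape_id] / area) - 1)
        for i in indexes:
            points[i][0] = cx + (points[i][0] - cx) * scale
            points[i][1] = cy + (points[i][1] - cy) * scale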

For islands, this is enough. For countries that share borders, a tug of war process plays out. Especially in a region where all shapes need to grow or shrink, it takes a while for things to stabilize. Each shape also tries to maintain its original shape - for example, if you look at the map for population, you see India grow and sort of fall out of Asia as a blob before regaining its normal shape (only much bigger).

The shared borders and the country shape maintenance keep continents mostly in shape, too. But without extra care, islands will just drift into the mainland or each other. We stop this from happening by calculating "bridges". For each island, we look for a larger land mass nearby and anchor it to it. This keeps Ireland next to Great Britain, Great Britain next to France and Sri Lanka close to India.
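
A hedged sketch of what picking such an anchor could look like; the real bridge logic in worldsizer.js may well differ, but the idea is simply to tie each island to the nearest land mass that is bigger than it:

def find_bridge(island_center, island_area, landmasses):
    """Anchor an island to the nearest landmass that is bigger than it.
    landmasses is a list of (center, area) pairs in the projected plane."""
    best, best_dist = None, None
    for center, area in landmasses:
        if area <= island_area:
            continue
        dist = (center[0] - island_center[0]) ** 2 + (center[1] - island_center[1]) ** 2
        if best is None or dist < best_dist:
            best, best_dist = center, dist
    return best

During the iterations the island can then be nudged so that its offset to the anchor keeps roughly its original direction and relative distance.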

Finally, we create some bridges by hand. Neither Spain nor Morocco are islands, but we still don't want them to crash into each other. Similarly, we attach Yemen to Somalia, Australia to New Zealand and Sweden to Denmark. In some situations this still leads to overlap. If you use proved oil reserves as a measure, the Middle East of course increases in size by a lot. But it has nowhere to go, so it pushes into the Mediterranean and squashes Syria into Greece.
Sadder: scaled to the number of people living with AIDS
One could add more bridges to force more map preservation, but this does come at a cost. The more forced the map, the less freedom the model has to preserve the shapes and reach the target sizes, and it starts to behave more and more like a water balloon: squeeze on one side and it just bulges out on the other.

As usual, the source code is on GitHub. It should be fairly straightforward to use worldsizer.js on another website with different data.

Wednesday, October 12, 2016

Styled Museums

The Prisma app became an overnight hit after it launched because of its great photo filters. Rather than just applying some face-feature transformation or adjusting the colors, it re-renders a picture in the style of a famous artwork. The results are remarkable and quite recognizable. And it is no secret how this is done: the basic algorithm is described in a paper published a year ago called "A Neural Algorithm of Artistic Style".

Besides the scientific paper and the startup executing on it, there's also an open source implementation of the algorithm. I played around with it a bit and it also works well, although it seems roughly 100x slower than Prisma. It got me thinking, what happens if you use this to re-render pictures of museums in the style of their most famous work? That way you see what the building is like and at the same time what to expect when you go in.

Styled Museums does exactly that. It has the top 100 museums (by their Wikipedia page view count) and their most popular works (same measure) and shows them on a world map. It uses the wiki_import framework to get the data. You can click around, find your favorite museum and see what happens to it.

I think the fact that artistic style transfer is available as a scientific paper, a startup and an open source implementation is indicative of a wider trend. We now live in a world where we have three forms of innovation. The traditional scientific method where publicly financed institutions produce papers describing new ideas; the startup world that funnels large amounts of private money into ideas to see if they come to commercial fruition; and finally the open source world where individuals build something and share it with the world to build up their public profile.

These three engines of innovation aren't silos. Google started out as a scientific experiment, became a startup and a commercial success and now publishes scientific papers and open sources part of their technology. Github is a startup that is not only based on an open source project, but also hosts other open source projects. Twitter open sourced their data processing engine, which now helps academics keep up with what their peers in Silicon Valley are up to.

It doesn't always seem fair. The founders of Google became billionaires with technology developed while they were employed by Stanford University, while the inventor of the world wide web works for a non-profit. For years Werner Koch maintained the GnuPG email encryption package on the salary of a postman, while the founder of Hotmail is worth more than 100 million dollars.

The 1980 Bayh-Dole Act in the US explains some of the difference between the US and Europe. It allows universities and companies to claim patent rights on research undertaken with federal funding. On one level this doesn't seem right - if the government paid for the research, shouldn't the patents end up with the government too? Then again, it turns out that the government isn't particularly good at doing interesting stuff with those patents - startups do much better.

And so we end up with Styled Museums. Inspired by Prisma, a VC funded company, I found the original paper which is based on research paid for mostly by the University of Tübingen, which in turn led me to an Open Source implementation. You can find the code used to get to Styled Museums (of interest is mostly the matching of museums and paintings) on Github, of course.

Wednesday, October 5, 2016

Introducing Karakame

This week's project is a camera project: Karakame. Sorry, Android guys, iOS only. The app takes 5 pictures with 3 seconds in between. After adjusting for small movements of the camera, it then picks, for each pixel position across the five images, the median value. The effect is that when pointed at a scene where people walk in and out, it removes those people from the aggregate picture.

It works reasonably well. The app is by no means a replacement for the main camera app, more a proof of concept. It seems like the sort of thing mainstream camera apps should add - if you have an app like that, you can get the source for this at https://github.com/DOsinga/Karakame. We were in Leipzig this weekend and I tried it out on a statue of Bach:
Bach in Leipzig

See? No people.

Karaoke famously means "Empty Orchestra" in Japanese - "hauntingly beautiful". Except that it doesn't quite. Kara means empty (see also Karate - empty hand), but the "oke" bit is just the last bit of the English word orchestra. So I called the app Karakame, from the almost-Japanese for "empty camera".

Some notes on the implementation. The app uses OpenCV, which you can quite easily integrate into iOS these days. I extracted the interoperability code into an OpenCVBitmap class, so have a look if you're interested in that sort of thing. The image stabilization works really well. I normalize to the middle bitmap (i.e. the third one if you take five pictures). Image stabilization means that some of the border pixels will be missing from some of the pictures, but by picking the median pixel value, most of the time we'll have values from other bitmaps.
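
The app itself is Swift plus OpenCV, but as a rough illustration of the idea, here is what aligning every frame to the middle one could look like in Python with OpenCV's ECC alignment. The motion model and termination criteria are my choices for the sketch, not necessarily what the app uses:

import cv2
import numpy as np

def align_to_reference(frames, ref_index=2):
    """Warp every frame onto the middle one using ECC image alignment."""
    ref_gray = cv2.cvtColor(frames[ref_index], cv2.COLOR_BGR2GRAY)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-6)
    aligned = []
    for i, frame in enumerate(frames):
        if i == ref_index:
            aligned.append(frame)
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        warp = np.eye(2, 3, dtype=np.float32)
        # estimate a small euclidean motion between the reference and this frame
        _, warp = cv2.findTransformECC(ref_gray, gray, warp,
                                       cv2.MOTION_EUCLIDEAN, criteria)
        h, w = ref_gray.shape
        aligned.append(cv2.warpAffine(frame, warp, (w, h),
                                      flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP))
    return aligned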

I also experimented with object detection. OpenCV comes with a set of detectors called haar cascades that can detect faces, cars and people - no deep learning needed. It works well for face detection, but for cars and people I didn't get a lot of good results. The idea was to leave pixels inside rectangles that were classified as cars or people out of the median voting, but I took that out again.
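
For reference, this is roughly what Haar-cascade face detection looks like with OpenCV's Python bindings, using one of the cascades that ship with the library; the parameters are typical defaults, not tuned values from the app:

import cv2

# load one of the bundled cascades; detectMultiScale returns (x, y, w, h) boxes
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def detect_faces(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)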

Finally, the median pixel implementation. Calculating medians in higher dimensions is expensive, so I decided to just calculate the medians for the red, green and blue channels separately. This could lead to weird results, but in my testing it seemed ok. I suppose I could do a little better by calculating the median for the three colors and, where they disagree, picking whatever pixel has the smallest distance to the other candidates.
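
The per-channel median itself is essentially a one-liner once the aligned frames are numpy arrays; a sketch:

import numpy as np

def median_image(frames):
    """Stack the aligned frames and take the per-channel median for each pixel."""
    stack = np.stack(frames)                      # shape: (n, height, width, 3)
    return np.median(stack, axis=0).astype(np.uint8)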

If you have read this far, you're probably ready to get the project from github: https://github.com/DOsinga/Karakame

Thursday, September 29, 2016

Semantics, Maps and Word2Vec

This week I published a new project: worldmapof. It uses Word2Vec to calculate the distance between a given word and the name of a country and then colors each of the countries according to that distance. The results are often what you expect with some interesting surprises thrown in. You want to see Colombia and Ethiopia light up for coffee, but wonder why Greenland also features prominently only to learn one Google search later that Greenlandic coffee is a thing.

Coffee projected on a map
Word2Vec analyses large amounts of text - in this case from Google News. By building a model to predict a word given its context, it associates a 300-dimensional vector with that word. The interesting thing about this vector is that it has some semantic meaning. If the distance between two vectors is small, the two words are related. So if the distance between the word Colombia and the word coffee is small, that means that Colombia and coffee are related and we can paint Colombia a brighter green.
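
With gensim and the Google News vectors this is easy to try yourself; gensim reports cosine similarity rather than distance, but the idea is the same. The file path is just wherever you downloaded the vectors:

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# coffee/Colombia should come out noticeably higher than coffee/Canada
print(model.similarity('coffee', 'Colombia'))
print(model.similarity('coffee', 'Canada'))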

To make this work in an online situation, I imported the word2vec data into postgres. If you want to play with that yourself, you can find the code on github.
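
A sketch of what such an import could look like, continuing from a KeyedVectors model like the one above; the table layout is my guess at a workable schema (one row per word, the 300 floats as a real[]), not necessarily what the project actually uses:

import psycopg2

conn = psycopg2.connect('dbname=word2vec')   # hypothetical database name
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS vectors (word TEXT PRIMARY KEY, vec REAL[])')
# with a recent gensim the words and vectors live in index_to_key and vectors
for word, vector in zip(model.index_to_key, model.vectors):
    cur.execute('INSERT INTO vectors (word, vec) VALUES (%s, %s)',
                (word, [float(x) for x in vector]))
conn.commit()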

Since the underlying model is trained on a Google News archive, some biases shine through. There are some countries that don't appear often in the news - Chad, the Central African Republic and the Republic of Congo (not to be confused with the Democratic Republic of Congo) spring to mind. This makes the vectors of those countries unstable. One article about a guy who went walking in Chad is enough to make Chad light up for the word walk, even though the two aren't particularly related.

The US has the opposite problem. American news talks about "the average American" or "in the US" when the subjects discussed aren't particularly American at all. So the US tends to do well for day-to-day terms and maybe scores a bit low for international queries. I created a small spin-off, usmapof, that uses the names of the US states instead. Comparing the maps for "Germany", "Sweden" and "Norway" gives you an idea of where migrants from those countries ended up. Or if you want to know where hockey is popular:
Hockey lights up the north

It's fun to play with, but sometimes you see the limits of the model shine through. The data is somewhat old, so you can't use it well to illustrate current political events. Moreover, names of states are somewhat poor representations of the underlying entities. Washington usually does not mean the state. England makes New England light up for the US, but probably not because so many English settlers went there.

So I wonder if we can do better. What if, instead of running a skip-gram algorithm over windows of words, we preprocessed the text into entities first? Then quite possibly the model would learn which entities have similar roles, rather than which words have similar roles. We might even want to somehow incorporate the roles of entities in sentences, which might allow the model to learn from a fragment like "Oil was found in Oklahoma" that oil is something that can be found and that Oklahoma is a place.

Maybe I should try SyntaxNet out for this and see what happens.

Tuesday, September 20, 2016

Project: Offline Movie Reviews

I used my first week of freedom to write a little toy-app: Offline Movie Reviews.

Airplanes don't fly faster than they did 40 years ago, nor do they provide us with more legroom. But we did make a lot of progress when it comes to personal entertainment on board. Most airlines these days will provide you with your own screen and a selection of movies to while the time away. They'll also usually insist that all their movies are just great. And while most will improve with each consumed Gin & Tonic, it still helps to pick one with a good base score. This is where the offline movie reviews app comes in.

It ships with the reviews of 15,000 of the most popular movies. Each usually has a thumbnail version of the movie poster and always the section from the Wikipedia article describing the reception. This will typically contain the scores on Rotten Tomatoes, Metacritic and/or IMDb and often comes with a quote or two from a movie critic. Enough to make a somewhat informed decision on how to spend the next two hours.

If you're not interested in how it technically works, you should download the app now, keep it on your phone for your next flight and stop reading.

Apart from the usefulness of the app, I wanted to accomplish two things: learn Swift and share with the world how we process and massage data at Triposo. When Swift came out, I liked some of the things, but was disappointed by how errors were handled and by the lack of real garbage collection. Meanwhile, error handling has improved and overall I must say it is a very pleasant language to develop in (even more so if you compare it directly to Objective-C and all [its awkwardness]). The app is not very complicated - master/detail with a tiny bit of care to make sure searches over the movies execute smoothly.

The data processing builds on my wiki_import project. Wiki_import imports dumps of Wikipedia, Wikidata and Wikistats into a Postgres database, after which we can query things conveniently and quickly. In this case we want to get our hands on all the movies from Wikipedia sorted by popularity. Wikipedia contains roughly a hundred thousand movies - including all of them would create a db of 700MB or so. We're shooting for roughly 100MB, or 15,000 movies. The query to get these movies is then quite straightforward:

SELECT wikipedia.*, wikistats.viewcount
FROM wikipedia JOIN wikistats ON wikipedia.title = wikistats.title
WHERE wikipedia.infobox = 'film'
ORDER BY wikistats.viewcount DESC
LIMIT 15000

For each movie, we collect a bunch of properties from the infobox using the mwparserfromhell package, an image and the critical reception of the movie. The properties have standard names, but their values can be formatted in a variety of ways, which requires some tedious stripping and normalizing - as always with Wikipedia parsing. The image processing is quite straightforward. I crop and compress the image right up to the pain limit to keep the size down. I switched to Google's WebP, which makes images look a lot better at these high compression levels.
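
As an illustration, extracting a few infobox fields with mwparserfromhell could look like this; the field list is hypothetical and the real code does a lot more normalizing of the values:

import mwparserfromhell

def film_properties(wikitext):
    """Pull a few infobox fields out of a film article (illustrative field list)."""
    code = mwparserfromhell.parse(wikitext)
    for template in code.filter_templates():
        if str(template.name).strip().lower().startswith('infobox film'):
            props = {}
            for field in ('name', 'director', 'starring', 'released', 'runtime'):
                if template.has(field):
                    # strip wiki markup from the parameter value
                    props[field] = template.get(field).value.strip_code().strip()
            return props
    return {}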

As you'd expect from user-generated content, the critical reception section on Wikipedia can hide under a number of headings. I might not have gotten all of them, but the great majority for sure. So we find the first of those headings and collect all the wikitext until we encounter a heading of the same or higher level. Feed that into mwparserfromhell's wiki stripper and voilà: a text with the reception and only a minimum of wiki artifacts (some image attributes go awry, it seems).
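
A sketch of that extraction; it leans on mwparserfromhell's section matching rather than the hand-rolled heading scan described above, and the heading list here is certainly incomplete:

import mwparserfromhell

def reception_text(wikitext):
    """Grab the first non-empty reception-like section and strip the wiki markup."""
    code = mwparserfromhell.parse(wikitext)
    sections = code.get_sections(
        matches='Reception|Critical response|Critical reception|Reviews',
        include_headings=False)
    for section in sections:
        text = section.strip_code().strip()
        if text:
            return text
    return None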

We then stick everything into a sqlite database with a full-text search index on the title of the movie and the starring field, so we can search both for the name of a movie and for who appears in it. That last bit isn't needed to decide which movie to watch, but I often find myself wondering: where did I see this actress before? Full-text search on iOS works fast and well these days and even gives you prefix search for free.
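
A minimal sketch of such a full-text table in SQLite; the schema and the example row are made up for illustration:

import sqlite3

conn = sqlite3.connect('movies.db')
conn.execute('CREATE VIRTUAL TABLE IF NOT EXISTS movies USING fts4(title, starring, reception)')
conn.execute('INSERT INTO movies (title, starring, reception) VALUES (?, ?, ?)',
             ('Example Movie', 'Some Actress, Some Actor', 'Generally well received.'))
# MATCH searches across the indexed columns; 'actre*' would be a prefix query
for (title,) in conn.execute("SELECT title FROM movies WHERE movies MATCH 'actress'"):
    print(title)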

You can find all the code on github.

Monday, September 12, 2016

Leaving Triposo

Wednesday, August 31 2016 was my last day as a full-time employee at Triposo, the travel guide company I started 5 years ago with my brothers and Jon Tirsen.

Triposo will continue to exist and will focus on delivering content and technology solutions for other companies. While I think that this is the best strategy for the company, it just isn't me. So Nishank Gopal, who has a lot more experience executing this sort of B2B strategy, will take over as CEO. I'll remain on the board and be involved as an adviser.

I'm taking some time off to think, write, code, learn and travel. With the company continuing, this isn't quite one of those Startup Post Mortems. I did want to share some thoughts on running travel companies though:

What worked and what didn't?

Triposo started out with a three pronged plan:
  • Build travel guides from targeted web crawls
  • Make the travel guides sticky by adding a travel log
  • Make money by selling tours and travel services on the go
The first prong worked rather well. We went from a few city guides that were basically mash ups of Wikipedia and Wikitravel to a travel guide that covered the world within the first year and kept improving the data quality from there on. I was especially proud when we launched the system that matched web pages automatically to our poi database and then ran opinion mining and fact extraction over those pages.

With this we could rank pois not just on one score, but on a variety of aspects - coffee, drinks, location - which in turn we could use for recommendations and personalization. On top of that we developed a nifty similarity measure for pois, powering our "people that like this place, also like".

The second prong, adding a travel log, started out promising. Being able to add photos and notes to entries in a travel guide and build a story that way was fun. For us. Our users didn't use the feature very much though. They used Facebook for sharing their travel experiences. And so we were confronted with a choice: do we keep betting on two things, or do we focus on the thing that really works well, our core travel guide? We went with the latter and killed the travel log.

Sometimes I think we shouldn't have. 5 years ago, Facebook was the place to share this sort of thing, but I wonder if nowadays there would be room for a sharing platform specifically for travel. Breadtrip seems to do well in this space. But you know what they say, being too early is just as bad as being too late.

We didn't pay a lot of attention to our third prong in the first years. People spend a lot of money on travel and half of that is spent during the trip. We figured that once we had a large enough user base, they could start spending that through us. The conversion rates we got linking to web pages from our app were quite low and it seemed to us that just natifying those flows should do the trick.

It didn't. Or not enough. In our presentations we always talked about the shift from desktop to mobile and from booking before a trip to during a trip. This trend is real, but we still have a long way to go. People are happy to research a hotel on their phone, but when it is time to make a booking and enter those credit card details, they'll often quickly switch to the desktop browser, leaving your poor travel guide without its margin.

The other issue was that for tours and activities we had almost no options that had same day availability. When your model is based on telling people at the breakfast table what they should be doing that day in the city where they are, this is a problem. Again, I'm sure this will get better in the next few years, but it didn't in time for us.

What do you do when things don't work?

This is a question people in the start-up world don't talk about much. The general opinion is that when you have a start-up, you focus on that one thing that you do best. That's how you become successful, that's how Google and Facebook did it. Only when you are huge do you diversify.

That's all very well, but what if the one thing you are good at isn't enough? Initially we were doing great, our user base was growing exponentially. But that growth wasn't really viral, it was just Apple and Google sending us downloads. With the travel log shut down, we were seeing bad retention numbers. With our bookings on the go not really taking off, we didn't have a real ecommerce play either.

So what do you do? "Pivot" is a popular answer. But for every success story about pivots there are ten failures and to me it always seemed like spending the money of your investors on an idea they didn't invest in. So you start thinking about things you could add that would fix retention or fix conversion. 

City walks, mini guides, a chat room for Triposo users at a location, printable posters, sponsored free wifi, audio guides, a chat bot that advises users about hotels and attractions, partly powered by a human - we built all these things and launched them. And then, when a feature doesn't quite take off, you are faced with the choice of removing it and disappointing the users that enjoyed it, or having it clutter up an already complex app.

Maybe this is the right strategy. You try stuff until you hit it out of the park or run out of money. But often I think we should just have focused on building the best travel guide possible. Improve the data quality, the data coverage and the smartness. And if that's not enough, well, then there just wasn't enough of a market for the original plan.

Can a travel planning app be a success?

A few months ago there was a popular blog post titled "Why you should never consider a travel planning startup." I was asked a few times about my opinion. Triposo was of course never a travel "planning" startup - we always focused on being helpful when you are on the road. But the arguments against it are very similar.

In short the article says: Getting lots of users for travel is hard, because people do it only once or twice a year. Getting people comfortable with something as complicated as a travel planning app is hard. Getting people to trust you enough to book through you rather than through an OTA they know is hard. Outbidding the site that pays you a commission for a hotel sale is hard.

This is all true and we've seen all of these things first hand at Triposo. But even though I'm writing a post about why Triposo as a consumer product hasn't taken off, I would still answer the question of whether a travel planning app can be a success with a yes.

First of all, these arguments are about all travel startups, not just the ones that do planning or help you while on the road. And yet using Kayak has become a habit. We actually succeeded in attracting a fair amount of users organically. And while we had trouble getting people to book through our app, Tripadvisor figured this out - I could read the reviews there and then go to Expedia to make my booking. And outbidding the guy who pays you a commission is the hallmark of the entire travel industry. How can Booking.com outbid the hotels themselves on Google?

We focused on being a travel guide that is helpful when you are at the destination, because people don't like to plan. It seems inevitable that there will be an app that will let you have a perfect experience on your trip without you doing more planning than necessary. An app that has all the travel information in the world and knows who you are, where you are and your mood. Unfortunately it looks like it won't be Triposo.

So what's next?

I'm taking some time off to learn, write, code, read and travel. I think that when it comes to technology things have never been as interesting as they are now, so taking a bit of time to figure out what's next seems like the best approach. I'll be doing some smaller projects around stuff I want to try out. A first small one you can find here: https://github.com/DOsinga/wiki_import - some scripts to import the wikipedia, wikidata and wikistats into postgres and make them searchable.

Triposo as a consumer product will continue and will remain "probably the best travel guide" in the app store. The engineering team will focus on data quality, coverage and smartness - in a way executing on the "focus on the one thing you're good at" strategy.  If you are interested in using the Triposo data and smartness for your own business, get in touch. There's some wonderful stuff there.

Sunday, June 5, 2016

Predictions for Euro 2016

Two years ago I coded up a small python model to simulate the World Cup. The results back then were more or less in line with the general predictions: Brazil to win.

I updated the model for the Euro 2016 tournament. My data source for matches had gone away, so I had to adjust for that, and I also introduced weights for previous games. Games that are friendlies, or that were played longer ago, weigh less. The oldest matches I take into account are from just after the World Cup.
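
The post doesn't spell out the weighting, but as a rough sketch it could look something like this; the half-life and the friendly factor are guesses for illustration, not the numbers the model actually uses:

from datetime import date

def match_weight(match_date, friendly, today=date(2016, 6, 5), half_life_days=365):
    """Weight a past result: friendlies count less and everything decays with age."""
    age_days = (today - match_date).days
    weight = 0.5 ** (age_days / half_life_days)   # exponential decay over time
    if friendly:
        weight *= 0.5                             # friendlies count for less
    return weight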

The results seem to differ more from the pundits than last time around. France is the favorite (25%), but that is because of the home advantage, which I set at 0.25 - historically the model has it between 0.2 and 0.3. Poland is the surprising number two with 21%. They did a decent job qualifying and had some good friendlies, so I find it hard to argue with.

Spain and England are basically tied at 11%. Of course England's performance could very well decide whether Brexit happens or not, so this is important.

The model does not like Germany's chances much, at 8%. The results from two years ago are now weighted at only 30% because of the time gone by.

Just to put my money where my model is, I made an actual bet on Poland to win.