Douwe Osinga's Blog: Project: Offline Movie Reviews

Tuesday, September 20, 2016

I used my first week of freedom to write a little toy-app: Offline Movie Reviews.

Airplanes don't fly faster than they did 40 years ago, nor do they provide us with more legroom. But we did make a lot of progress when it comes to personal entertainment on board. Most airlines these days will provide you with your own screen and a selection of movies to while away the time. They'll also usually insist that all their movies are just great. And while most will improve with each consumed gin and tonic, it still helps to pick one with a good base score. This is where the Offline Movie Reviews app comes in.



It ships with the reviews of 15 000 of the most popular movies. Each usually has a thumbnail version of the movie poster and always has the section from the Wikipedia article describing the reception. This will typically contain the scores on Rotten Tomatoes, Metacritic and/or IMDb and often comes with a quote or two from a movie critic. Enough to make a somewhat informed decision on how to spend the next two hours.

If you're not interested in how it technically works, you should download the app now, keep it on your phone for your next flight and stop reading.


Apart from the usefulness of the app, I wanted to accomplish two things: learn Swift and share with the world how we process and massage data at Triposo. When Swift came out, I liked some things about it, but was disappointed by how errors were handled and by the lack of real garbage collection. Meanwhile, error handling has improved and overall I must say it is a very pleasant language to develop in (even more so if you compare it directly to Objective-C and all its awkwardness). The app is not very complicated - master/detail with a tiny bit of care to make sure searches over the movies run smoothly.

The data processing builds on my wiki_import project. Wiki_import imports dumps of Wikipedia, Wikidata and Wikistats into a Postgres database, after which we can query things conveniently and fast. In this case we want to get our hands on all the movies from Wikipedia, sorted by popularity. Wikipedia contains roughly a hundred thousand movies - including all of them would create a db of 700MB or so. We're shooting for roughly 100MB or 15 000 movies. The query to get these movies is then quite straightforward:

SELECT wikipedia.*, wikistats.viewcount
FROM wikipedia
JOIN wikistats ON wikipedia.title = wikistats.title
WHERE wikipedia.infobox = 'film'
ORDER BY wikistats.viewcount DESC
LIMIT 15000;
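
The limit of 15 000 follows from the size budget: if a hundred thousand movies take about 700MB, each movie costs roughly 7KB on average, and 100MB buys a bit over 14 000 of them. A minimal sketch of that back-of-the-envelope calculation, assuming every movie takes about the same space (popular movies may well have longer articles); the function name is made up for illustration:

```python
def movies_within_budget(full_size_mb, full_count, budget_mb):
    """Estimate how many movies fit in budget_mb.

    Assumes all movies take the same space on average, which is
    only an approximation.
    """
    # Integer arithmetic avoids floating point rounding surprises.
    return budget_mb * full_count // full_size_mb


if __name__ == "__main__":
    # 700MB for 100 000 movies, with a budget of 100MB
    print(movies_within_budget(700, 100_000, 100))
```

Rounding that estimate up gives the 15 000 used in the query.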

For each movie, we collect three things: a bunch of properties from the infobox (parsed with the mwparserfromhell package), an image, and the critical reception of the movie. The properties have standard names, but their values can be formatted in a variety of ways, which requires some tedious stripping and normalizing - as always with Wikipedia parsing. The image processing is quite straightforward. I crop and compress the image up to the pain limit to keep the size down. I switched to using Google's WebP, which makes images look a lot better at these high compression levels.
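
To give an idea of the stripping and normalizing involved, here is a rough sketch in plain Python. It is not the code from the project (which uses mwparserfromhell); the function name and the set of handled formats are assumptions chosen for illustration. It turns a raw infobox value such as the starring field into a list of plain names:

```python
import re


def normalize_infobox_value(raw):
    """Turn a raw infobox value into a list of plain-text entries.

    Handles only a few common formats: wiki links, <br> separators
    and simple list templates. Real infoboxes need more cases.
    """
    # Drop HTML comments left behind by editors.
    text = re.sub(r"<!--.*?-->", "", raw, flags=re.S)
    # [[Target|Label]] -> Label, [[Target]] -> Target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Unwrap list templates such as {{plainlist|...}} or {{ubl|...}}
    text = re.sub(r"\{\{\s*(?:plainlist|unbulleted list|ubl)\s*\|",
                  "", text, flags=re.I)
    text = text.replace("}}", "")
    # Entries can be separated by <br>, newlines, bullets or pipes.
    parts = re.split(r"<br\s*/?>|\n|\*|\|", text, flags=re.I)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    print(normalize_infobox_value(
        "[[Sigourney Weaver]]<br />[[Tom Skerritt|Skerritt]]"))
```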

As you'd expect from user generated content, the critical reception section on Wikipedia can hide under a number of headings. I might not have gotten all of them, but the great majority for sure. So we find the first of those headings and collect all the wikitext until we encounter a heading of the same or a higher level (the same number of equals signs or fewer). Feed that into mwparserfromhell's wiki stripper and voilà: a text with the reception and only a minimum of wiki artifacts (some image attributes go awry, it seems).
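
The section collection can be sketched roughly like this. This is a simplified stand-in that works line by line with a regular expression rather than the project's actual code, and the list of headings is a guess for illustration, not the list the app uses:

```python
import re

# A wikitext heading: the same run of 2-6 equals signs on both sides.
HEADING_RE = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")

# Hypothetical set of headings; the real list is longer.
RECEPTION_HEADINGS = {"reception", "critical reception",
                      "critical response", "reviews"}


def extract_reception(wikitext):
    """Return the wikitext of the first reception-like section."""
    collected = []
    level = None  # heading level of the section we are inside, if any
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if level is None:
            # Still looking for the start of the section.
            if match and match.group(2).lower() in RECEPTION_HEADINGS:
                level = len(match.group(1))
            continue
        # Stop at a heading of the same level or higher up.
        if match and len(match.group(1)) <= level:
            break
        collected.append(line)
    return "\n".join(collected).strip()
```

Deeper subsections (for example a box office subsection below the reception heading) stay part of the collected text; only a heading at the same level or above ends the section. The result is still wikitext and would then go through the stripper.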

We then stick everything into a SQLite database with a full text search index on the title of the movie and the starring field, so we can search for both the name of a movie and who appears in it. That last bit isn't needed for deciding which movie to watch, but I often find myself wondering: where did I see this actress before? Full text search on iOS works fast and well these days and even gives you prefix search for free.
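
As an illustration of what such an index looks like, here is a small sketch using Python's sqlite3 module. It assumes the SQLite build includes the FTS5 extension; the post does not say which FTS version the app uses, and the table and function names are made up for this example:

```python
import re
import sqlite3


def build_index(movies):
    """Create an in-memory full text index over (title, starring) pairs."""
    conn = sqlite3.connect(":memory:")
    # Assumes FTS5 is compiled into this SQLite build.
    conn.execute("CREATE VIRTUAL TABLE movie_fts USING fts5(title, starring)")
    conn.executemany(
        "INSERT INTO movie_fts (title, starring) VALUES (?, ?)", movies)
    return conn


def search(conn, text):
    """Prefix search: every typed word must start some indexed word."""
    words = re.findall(r"\w+", text)
    if not words:
        return []
    # "word"* is FTS5 syntax for a prefix match; terms are ANDed.
    query = " ".join('"%s"*' % word for word in words)
    rows = conn.execute(
        "SELECT title FROM movie_fts WHERE movie_fts MATCH ? ORDER BY rank",
        (query,))
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_index([
        ("Alien", "Sigourney Weaver, Tom Skerritt"),
        ("Ghostbusters", "Bill Murray, Sigourney Weaver"),
        ("Heat", "Al Pacino, Robert De Niro"),
    ])
    print(search(conn, "sigour"))
```

Typing the first few letters of an actor's name is enough to find the movies they appear in, which is the prefix search mentioned above.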

You can find all the code on GitHub.
