Tuesday, October 9, 2012
This week we got the Triposo team together in Sitges, a nice beach town near Barcelona, for some thinking, discussions and general strategizing but also a hackathon in our finest of traditions.
I decided to look at language similarities. I have always been interested in the evolution of languages. It seems, though, that discussions about the similarity of languages are always a bit arbitrary. You need to compare lists of words, but how do you pick them? If you take the word 'town' in English, you'd translate it into German as 'Stadt' and into Dutch as 'plaats'. Those words aren't very similar at all. However, German also has the word 'Zaun', which sounds very similar to 'town' and means 'fence'. In Dutch there's the word 'tuin', which means 'garden'.
I wanted to take out this arbitrary part and do the comparisons fully automatically, and it occurred to me that if you only take the cardinal numbers (one to nine, for example) into account, you take this arbitrariness away. I wouldn't expect those words to change their meaning easily.
We've had phrasebooks in the Triposo apps for a long while now, based on the content from Wikitravel. So I went ahead and wrote a script that extracts, for each language, the phonetic version of the words one to nine from those Wikitravel pages. I then calculated the similarity between all language pairs by computing some sort of edit distance between the corresponding pairs of words.
Traditionally the edit distance between two words is the minimum number of edits (i.e. deletions and insertions) needed to change one word into the other. So 'town' and 'Zaun' have an edit distance of 6 (delete the 't', 'o' and 'w', then insert a 'z', 'a' and 'u'), and from that perspective they're not very similar. You can do better by assigning a likelihood to specific transitions. The 't' and the German 'z' are a bit similar. Vowel changes are also quite likely to happen, and so on.
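To make the difference concrete, here is a minimal Python sketch of both ideas: a plain insert/delete distance and a weighted variant. It is not the hackathon script. The function names, the list of similar consonants and all the costs are assumptions picked purely for illustration.

```python
VOWELS = set("aeiou")
# Hypothetical pairs of consonants we treat as "a bit similar".
SIMILAR_CONSONANTS = {frozenset("tz"), frozenset("td"), frozenset("pb"), frozenset("kg")}


def indel_distance(a, b):
    """Minimum number of single-letter insertions and deletions turning a into b."""
    a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1])  # letters match, no edit needed
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))  # delete ca or insert cb
        prev = cur
    return prev[-1]


def substitution_cost(x, y):
    """Assumed cost of turning letter x into letter y."""
    if x == y:
        return 0.0
    if x in VOWELS and y in VOWELS:
        return 0.5  # vowel shifts are cheap
    if frozenset((x, y)) in SIMILAR_CONSONANTS:
        return 0.5  # e.g. 't' and the German 'z'
    return 2.0  # no better than a delete plus an insert


def weighted_distance(a, b, indel=1.0):
    """Edit distance where likely transitions cost less than unlikely ones."""
    a, b = a.lower(), b.lower()
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * indel]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + indel,          # delete ca
                           cur[j - 1] + indel,       # insert cb
                           prev[j - 1] + substitution_cost(ca, cb)))
        prev = cur
    return prev[-1]


print(indel_distance("town", "Zaun"))     # 6
print(weighted_distance("town", "Zaun"))  # 3.0
```

With these made-up costs, 'town' and 'Zaun' end up half as far apart as under the plain insert/delete count, because the 't' to 'z' and 'o' to 'a' steps are treated as cheap.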
Based on these pairs I then calculated a tree of languages. We start by creating, for each language, a language group consisting of only that language. We then merge language groups that have small distances to each other. Subgroups that match well together and slightly less well with the other languages remain subgroups, and so a tree is built.
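The merging step can be sketched roughly like this in Python. This is an assumption about how such a tree could be built, not the actual code: it measures the distance between two groups as the average distance between their members (average linkage), and the distance values in the example are invented, not the real results.

```python
from itertools import combinations


def build_tree(languages, dist):
    """Greedy bottom-up clustering; returns the tree as nested tuples.

    dist maps frozenset({a, b}) to the distance between languages a and b.
    """
    # Every cluster is a pair: (subtree, list of member languages).
    clusters = [(lang, [lang]) for lang in languages]

    def group_distance(m1, m2):
        # Average linkage: mean distance over all cross-group language pairs.
        total = sum(dist[frozenset((x, y))] for x in m1 for y in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:
        # Find the two groups that are closest to each other.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: group_distance(clusters[p[0]][1], clusters[p[1]][1]))
        (t1, m1), (t2, m2) = clusters[i], clusters[j]
        merged = ((t1, t2), m1 + m2)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Invented toy distances, for illustration only.
languages = ["English", "German", "Dutch", "Finnish", "Estonian"]
dist = {frozenset(pair): 9 for pair in combinations(languages, 2)}
dist.update({
    frozenset(("German", "Dutch")): 2,
    frozenset(("English", "German")): 4,
    frozenset(("English", "Dutch")): 4,
    frozenset(("Finnish", "Estonian")): 3,
})

tree = build_tree(languages, dist)
print(tree)
```

On this toy input German and Dutch merge first, Finnish and Estonian pair up next, English then joins the German/Dutch group, and the two resulting groups are joined at the root.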
The result is below. As you can see, there are clear groups for the Germanic, Slavic and Romance languages. They fold together with some other languages into an Indo-European group. There are some other smaller groups that jump out (Turkic, Philippine, Arabic), but most really are islands. Finnish and Estonian match up quite nicely, too. I left out some of the languages that the model turns into singletons.
It works surprisingly well, given that the data is rather noisy and that it is based on phonetic spelling rendered in English, which just isn't great.