Douwe Osinga's Blog: Calculating the set of universal numbers

Tuesday, November 22, 2016

Calculating the set of universal numbers

Frederick II allegedly tried to find out what the universal language of humanity was by depriving a group of young children of any language input. He expected them to start speaking Latin, Greek, or maybe whatever Adam and Eve spoke in paradise. Instead, they went insane.

Esperanto and a set of even less successful competitors tried to construct a universal language (the rumor that Klingon has more speakers than Esperanto is not true, but the fact that people believe it tells you something). Linguists have gone the other way. Languages evolve and seem to have common ancestors. By studying this evolution, you can come up with theories as to what that common ancestor might have been. If you are very daring, you can take this all the way back to a supposed Proto-Human Language.

Trying to follow language evolution this far back is tricky, and the approach has been widely criticized. We can only observe language evolution for the last 2000 years or so - applying rules learned from that period to the 198 thousand years before is extrapolating by a large margin. For example, most languages have become simpler in the last 2000 years, but how did they become complex in the first place?

Algorithmically, there is a much simpler way to determine the most common language. Given a reasonable edit distance between two words, find for each word the median translation over all languages. The median translation is the one with the lowest sum of squared edit distances to all the others.
I've done just that for the numbers one to ten using the phrasebooks from Wikivoyage:
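The actual code isn't shown here, but the median-translation idea can be sketched in a few lines of Python. The function names and the toy word list below are my own illustrative choices, not the blog's code, and the distance here is plain Levenshtein without the SoundEx tweaks mentioned later:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def median_translation(translations):
    # the translation with the lowest sum of squared distances to all others
    return min(translations,
               key=lambda w: sum(levenshtein(w, o) ** 2 for o in translations))

# toy example: "three" across a few languages, in rough phonetic spellings
words = ['three', 'drie', 'drei', 'tre', 'tri', 'trzy']
print(median_translation(words))  # → tre
```

Squaring the distances penalizes outliers, so the winner has to be reasonably close to every translation rather than identical to a few and far from the rest.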

(Expanding this approach beyond the numbers one to ten might be doable, but it is harder - words change not just their pronunciation but also their meaning. The Dutch word "tuin", the English "town" and the German "Zaun" all have the same root, but mean garden, town and fence respectively.)

It is somewhat remarkable that this approach works at all. Wikivoyage uses a rather unscientific phonetic spelling based on how English speakers would pronounce a word. The edit distance I'm using is Levenshtein with some SoundEx thrown in - both approaches pre-date the microprocessor. The languages Wikivoyage covers are whatever its volunteers found interesting enough to add, and of those I can only use the ones that happen to parse. But the result does look reasonable to me.
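One way the SoundEx ingredient could work - and this is my guess at the combination, not the post's actual code - is to collapse phonetically similar consonants into SoundEx digit classes before measuring edit distance, so that e.g. "d" and "t" or "s" and "z" count as matches:

```python
# SoundEx consonant classes: letters in the same group sound alike
SOUNDEX = {c: d for d, group in
           {'1': 'bfpv', '2': 'cgjkqsxz', '3': 'dt',
            '4': 'l', '5': 'mn', '6': 'r'}.items()
           for c in group}

def phonetic(word):
    # map each consonant to its class digit; leave vowels and 'h'/'w'/'y' as-is
    return ''.join(SOUNDEX.get(c, c) for c in word.lower())

print(phonetic('drei'), phonetic('tree'))  # → 36ei 36ee
```

After this normalization a plain Levenshtein distance already treats "drei" and "tree" as one substitution apart instead of three, which is closer to how they sound.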

Is there a way to support this intuition? Why, yes there is! By aggregating the distances between the numbers at the language level, a language distance matrix can be calculated. This, in turn, can be used to calculate a language tree. Here it is:

There are some weird bits in the tree, but by and large you see the major language groupings appear as we know them from linguistics. The Slavic and Germanic groups look quite convincing, as does the Latin group, although the insertion of Welsh and Irish there seems debatable. Malagasy and Hawaiian get their own minigroup, which is quite interesting.

I think this is a promising approach. Using IPA for pronunciation, getting a more representative set of languages in (and maybe weighting them by number of speakers) and using a distance measure based on linguistic theory could all improve the results quite dramatically. If you want to play with the code so far, have a look at