Esperanto and a set of even less successful competitors tried to construct a universal language (the rumor that Klingon has more speakers than Esperanto is not true, but the fact that people believe it tells you something). Linguists have gone the other way. Languages evolve and seem to have common ancestors. By studying these evolutions, you can come up with theories as to what that common ancestor might have been. If you are very daring you can take this all the way to a supposed Proto-Human Language.
Trying to follow language evolution this far back is tricky and the approach has been widely criticized. We can only observe language evolution for the last 2000 years or so - applying rules learned from that window to the 198 thousand years before it is extrapolating by a large margin. For example, most languages seem to have become simpler in the last 2000 years, but then how did they become complex in the first place?
I've done just that for the numbers one to ten using the phrasebooks from Wikivoyage:
(Expanding this approach beyond the numbers one to ten might be doable, but is harder - words don't just change pronunciation, but also meaning. The Dutch word "tuin", the English "town" and the German "Zaun" all have the same root, but mean garden, town and fence respectively.)
It is somewhat remarkable that this approach works. Wikivoyage uses a rather unscientific phonetic spelling based on how English speakers would pronounce a word. The edit distance I'm using is Levenshtein with some SoundEx thrown in - both approaches predate microprocessors. The languages Wikivoyage covers are whatever its volunteers found interesting enough to add, and of those I can only use the ones that happen to parse. But it does look reasonable to me.
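To make "Levenshtein with some SoundEx thrown in" a bit more concrete, here is a minimal sketch of one way such a distance could look. It is not the code from the repository: the function names, the choice to charge 0.5 for a substitution within the same Soundex consonant class, and the 0.5 cost for swapping one vowel for another are all assumptions made for illustration.

```python
# Soundex consonant classes; vowels and h, w, y belong to no class.
SOUNDEX_CLASSES = ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]
SOUND_CODE = {ch: i for i, group in enumerate(SOUNDEX_CLASSES) for ch in group}
VOWELS = set("aeiou")


def substitution_cost(a, b):
    """Cost of replacing letter a by letter b (the 0.5 values are assumptions)."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5  # assumed: vowels drift easily
    if a in SOUND_CODE and SOUND_CODE.get(b) == SOUND_CODE[a]:
        return 0.5  # assumed: same Soundex class, e.g. t/d or b/p
    return 1.0


def phonetic_distance(s, t):
    """Levenshtein distance with Soundex-aware substitution costs."""
    s, t = s.lower(), t.lower()
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [float(i)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1.0,                          # delete a
                cur[j - 1] + 1.0,                       # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute
            ))
        prev = cur
    return prev[-1]


print(phonetic_distance("tres", "drei"))  # 1.5: t/d costs 0.5, s/i costs 1.0
```

The point of the discount is that sound shifts such as t to d count as a smaller change than swapping unrelated consonants, which plain Levenshtein cannot express.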
Is there a way to support this intuition? Why, yes there is! By aggregating the distances between the numbers on the language level, a language distance matrix can be calculated. This in turn we can use to calculate a language tree. Here it is:
There are some weird bits in the tree, but by and large you see the major language groupings appear as we know them from linguistics. The Slavic and Germanic groups look quite convincing, as does the Latin group, although the insertion there of Welsh and Irish seems debatable. Malagasy and Hawaiian get their own minigroup, which is quite interesting.
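The two steps described above, aggregating word distances into a language distance matrix and turning that matrix into a tree, could be sketched roughly as follows. This is an illustration under assumptions, not the repository's implementation: the aggregation is a plain mean of length-normalised distances, the tree is built with average-linkage clustering, plain Levenshtein stands in for the phonetic distance, and the sample data uses ordinary spellings of one to three rather than Wikivoyage's pronunciation hints. All function names are made up for this example.

```python
from itertools import combinations


def levenshtein(s, t):
    """Plain edit distance, used here as a stand-in word distance."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def language_distance(words_a, words_b, word_distance=levenshtein):
    """Mean length-normalised distance between numbers at the same position."""
    total = 0.0
    for a, b in zip(words_a, words_b):
        total += word_distance(a, b) / max(len(a), len(b), 1)
    return total / len(words_a)


def distance_matrix(numbers):
    """Map each unordered pair of languages to its distance."""
    return {
        frozenset((x, y)): language_distance(numbers[x], numbers[y])
        for x, y in combinations(numbers, 2)
    }


def build_tree(numbers):
    """Average-linkage clustering; returns the tree as nested tuples."""
    dist = distance_matrix(numbers)
    # Each cluster is a pair: (subtree so far, list of member languages).
    clusters = [(name, [name]) for name in numbers]

    def cluster_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters that are closest on average.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Ordinary spellings, not Wikivoyage pronunciations (illustrative data only).
NUMBERS = {
    "German": ["eins", "zwei", "drei"],
    "Dutch": ["een", "twee", "drie"],
    "Spanish": ["uno", "dos", "tres"],
    "Italian": ["uno", "due", "tre"],
}

print(build_tree(NUMBERS))
# (('Spanish', 'Italian'), ('German', 'Dutch'))
```

With only three numbers and four languages the result is about as easy as it gets; the grouping becomes much less clear-cut as soon as more distant languages are added, which is where a better word distance starts to matter.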
I think this is a promising approach. Using IPA for pronunciation, getting a more representative set of languages in (and maybe weighting them by number of speakers) and using a distance measure based on linguistic theories could all improve the results quite dramatically. If you want to play with the code so far, have a look at https://github.com/DOsinga/universal_numbers