New AI Algorithm is Cracking Undeciphered Languages
Scientists have created a new algorithm to hunt down similarities in ancient languages and it’s set to untangle the mystery of all undeciphered languages.
According to a new MIT report “most languages that have ever existed are no longer spoken .” The study of lost and “ undeciphered languages ” is made exceptionally challenging as so few ancient records exist to assist common machine-translation tools and algorithms like Google Translate. Since nowhere nearly enough is understood about the grammar, vocabulary, or syntax of ancient languages, many texts remain undeciphered. Without these, an entire body of knowledge about the people who spoke them has been inaccessible - until now says the team from MIT.
Tracking the Evolution of Undeciphered Languages
The team of researchers from MIT ’s Computer Science and Artificial Intelligence Laboratory ( CSAIL) recently created a new computer system that has the ability to “automatically decipher lost languages” without needing advanced knowledge of their relation to other languages - including pauses, punctuation, and inflection. Furthermore, this new system was tested for its capability of automatically determining any relationships between language groups, and in these tests it was established that the Iberian language of Spain is not related to Basque.
One of the five plaques of the Monumento a los Fueros (Paseo de Sarasate, Pamplona). This one was written in 1905 in Basque language and in an adaption of north-eastern Iberian Script. ( CC BY SA 3.0 )
In this new project, that was partly funded by the Intelligence Advanced Research Projects Activity, ( IARPA), MIT professor Regina Barzilay explains in a new paper that the system “relies on several principles grounded in insights from historical linguistics” because languages evolve in predictable ways. Dr. Barzilay explains that languages rarely add or omit entire sounds and that certain sound substitutions are likely to occur, for example, words with the “p” sound in the parent language might evolve a “b” sound in the offspring languages, but because of the significant pronunciation gap it is less likely a “p” would become a “k”.
- 10 of the World’s Oldest Languages Still Used Today
- Artificial Intelligence Inching Closer to Deciphering Long Lost Languages
- The Schoyen Collection: 20,000 Ancient Manuscripts from 134 Countries in 120 Languages
Translating Sounds in the Vast Silence of Cyber Space
By assembling all of the known linguistic patterns, the team of scientists developed a new “decipherment algorithm” that is designed to process and interpret what the researchers describe as the “vast space of possible transformations and the scarcity of a guiding signal in the input.” The new algorithm self-learns by embedding language sounds “into a multidimensional space where differences in pronunciation are reflected in the distance between corresponding vectors.”
What this means is that the new system, or algorithm, enables researchers to isolate language patterns expressing change, and it uses these to form new computational constraints and restrictions, and once these are segmented into words in a lost language, similarities with related languages can be mapped. Basically, it hunts down commonalities in sounds and suggests possible links.
Apart from identifying some signs for numbers, Linear A is still an undeciphered language. (Olaf Tausch/ CC BY 3.0 )
Programing the Vampire Phonetic Mirror
Floating in a conceptual cyber-space, the new algorithm acts like a ‘vampire phonetic mirror,’ (my words) in that it reflects any sound structures that it recognizes as similar to others, yet it offers no reflection from unrelated, or unconnected, sounds, (hence the vampire). The system can also identify the proximity between any two given languages and it can accurately determine “ language families .” This is why the team applied the new test (algorithm) on the Iberian and Basque languages, “as well as less likely candidates from Romance, Germanic, Turkic and Uralic families.”
While Basque and Latin were found to be closer to Iberian than other languages, they were still far too different to be considered “related,” and the team of scholars are currently in disagreement on the actual related language, with some scholars claiming Iberian doesn’t relate “to any known language,” according to the new paper.
The MIT researchers hope their connecting ancient texts to related words in known languages, a process known as “cognate-based decipherment,” is only the first step in the creation of a super-advanced system that will ultimately be able to identify the semantic meaning of words, even if it is unknown how exactly these ancient words were originally spoken.
Top Image: Rongorongo script is an undeciphered language. Source: Arthur Chapman/ CC BY NC 2.0
By Ashley Cowie