How are languages deciphered?

MIT researchers have developed a novel technique for automatically deciphering lost languages using machine learning and neural networks.

In their paper, titled “Neural Decipherment via Minimum-Cost Flow: from Ugaritic to Linear B”, Jiaming Luo, Yuan Cao, and Regina Barzilay use the new method to decipher two languages: Ugaritic, which dates from the 14th to 12th centuries BCE and was discovered in Syria in 1929, and Linear B, the oldest preserved form of written Greek.

As linguist Andrew Robinson wrote in his 2002 book Lost Languages: The Enigma of the World’s Undeciphered Scripts, a typical decipherment spans decades and requires encyclopedic domain knowledge, prohibitive manual effort and sheer luck.

Traditional techniques used to decipher one language are almost never applicable to another, so each new discovery can mean decades of work starting over. This new method has the potential to change that.

The researchers note, “When applied to the decipherment of Ugaritic, we achieve a 5.5% absolute improvement over state-of-the-art results. We also report the first automatic results in deciphering Linear B, a syllabic language related to ancient Greek, where our model correctly translates 67.3% of cognates.”

Cognates are words that share the same etymological origin. The novel method uses what is known about how words change over time to predict meaning, mapping words in the older texts against what they evolved into.
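To make the idea of cognates concrete, here is a minimal sketch that compares spellings with plain edit distance. This is an illustrative stand-in, not the researchers' method: the function name `edit_distance` and the word pairs are chosen for this example only. It uses Spanish "noche" and Italian "notte", which both descend from Latin "noctem".

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                 # delete ca
                curr[j - 1] + 1,             # insert cb
                prev[j - 1] + substitution,  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Cognates: both words come from Latin "noctem" (night).
cognate_distance = edit_distance("noche", "notte")

# Unrelated Spanish words: "noche" (night) and "perro" (dog).
unrelated_distance = edit_distance("noche", "perro")

print(cognate_distance, unrelated_distance)
```

The cognate pair differs in far fewer characters than the unrelated pair. Real decipherment is harder, because the lost language is usually written in a different script, so the character correspondences themselves have to be learned.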

Machine-learning models typically need huge datasets for tasks like this, but the new approach changes that.

The translation is achieved by using a neural sequence-to-sequence model to make comparisons with other known languages, essentially mapping them against each other to capture vocabulary-level structural sparsity.

Put simply, the method uses what is known about how words are used in context and in relation to one another, then looks for those same relationships in the undeciphered language.
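The "mapping against each other" step can be pictured as a matching problem: pair every word in one vocabulary with exactly one word in the other so that the total mismatch is as small as possible. The toy sketch below does this by brute force over three Spanish and three Italian words. It is an assumption-laden illustration rather than the paper's minimum-cost flow formulation: the helper names `mismatch` and `best_matching` are hypothetical, the cost comes from Python's `difflib` instead of a neural model, and both word lists share one alphabet, which a lost script would not.

```python
from difflib import SequenceMatcher
from itertools import permutations


def mismatch(a: str, b: str) -> float:
    """Cost of pairing two words: 0.0 for identical spellings, 1.0 for
    nothing in common. (Stand-in for a learned model's score.)"""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def best_matching(source_words, target_words):
    """Try every one-to-one pairing and keep the cheapest one.
    Brute force, so only suitable for very small word lists."""
    if len(source_words) != len(target_words):
        raise ValueError("this toy example needs equally sized word lists")
    best_total, best_pairs = float("inf"), []
    for ordering in permutations(target_words):
        total = sum(mismatch(s, t) for s, t in zip(source_words, ordering))
        if total < best_total:
            best_total, best_pairs = total, list(zip(source_words, ordering))
    return best_pairs


# Spanish and Italian descendants of Latin "noctem", "lactem" and "octo".
spanish = ["noche", "leche", "ocho"]
italian = ["otto", "notte", "latte"]  # deliberately shuffled

matching = best_matching(spanish, italian)
for source, target in matching:
    print(source, "->", target)
```

Requiring a one-to-one pairing is what makes the mapping sparse: each word is forced to commit to a single counterpart instead of spreading weak guesses across the whole vocabulary.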

Beyond Ugaritic and Linear B, the method has also demonstrated a significant improvement over existing work on Romance languages, according to the paper’s authors.