Microsoft Research project helps languages survive — and thrive


A woman named Boa Sr was the last link to a 65,000-year-old pre-Neolithic culture on the Andaman Islands in the Indian Ocean. When she died in 2010, the Bo language died with her.

If that sounds like an isolated incident, it isn’t. Every two weeks, a language is lost somewhere in the world.

Take the Mundas, a community of about a million people spread across the eastern Indian states of Jharkhand, Odisha and West Bengal.

“I learnt Mundari very late in life as my parents lived in another state where they were working, so we didn’t speak the language at home,” says Dr. Meenakshi Munda, a member of the Munda community and an assistant professor in the anthropology department at a university in Ranchi, Jharkhand. “I understand how identity matters for a community and our younger generation is losing its identity because they don’t know their language.”

The Munda community is concerned about the longevity of its language, as only prominent languages like Bengali, Hindi and Odia are taught to children in schools.

While there’s a written script for Mundari, it has negligible digital content or presence online, giving people even less incentive to invest in learning the language.

A handful of researchers at the Microsoft Research (MSR) lab in India have been working toward creating digital ecosystems for languages, like Mundari, that don’t have enough presence in the digital world.

“The way I define my job for myself is that no person in this world should be excluded from using any technology because they speak a different language,” says Kalika Bali of MSR India.

Bali is an expert in Natural Language Processing, the subfield of linguistics and artificial intelligence (AI) that focuses on training computer systems to understand spoken and written languages.

Her team works with local communities and native speakers to create the base datasets that will be used to build AI technologies for underrepresented languages. By involving the community in the data collection process, they hope to create a dataset that is both accurate and culturally relevant.

From its earliest years, the internet’s dominant language has been English. Over time, with improved access to the internet and growing demand for content in native languages, seven other widely spoken languages — including Chinese and Spanish — have come close to matching English in technological support. But that’s only eight out of nearly 6,000 languages around the world.

In fact, an estimated 88 percent of the world’s languages have little or no presence on the internet, leaving roughly 1.2 billion people unable to use their own language to navigate the digital world.

“As a result, the distinction between haves and have-nots became pretty stark,” explains Monojit Choudhury, principal data and applied scientist at Microsoft’s Turing India and Bali’s colleague. The researchers call languages that do not have resources required to build technology for a digital presence “low-resource languages.”

Under Project ELLORA — Enabling Low Resource Languages — building digital resources has a dual purpose: first, it is a step toward preserving a language for posterity; and second, it ensures that speakers of these languages can participate and interact in the digital world.

Project ELLORA, launched in 2015, began with basics. The first step was to map out what resources were already available, such as printed material like literature and the extent of a digital presence. In a 2020 paper, Bali and her colleagues outlined a six-tier classification, with the top tier representing resource-rich languages like English and Spanish, and the bottom tiers reflecting languages with little-to-no resources.

Project ELLORA’s work is to collect the required resources for these languages and build language models that meet their speakers’ digital needs.

Project ELLORA’s researchers work with the communities to define what this need is and what base technology can help fulfill it. “No language technology can be isolated from the people who are going to use it,” says Bali.

For Mundari, the researchers collaborated with IIT Kharagpur in 2018 and sponsored a study to find out what the community needs to keep the language alive.

What started off as a simple vocabulary game for school children to get them to learn the language soon morphed into sophisticated technology projects.

MSR researchers are currently working on a Hindi-to-Mundari text translation model as well as a speech recognition model that will give the community access to more content in Mundari.

A text-to-speech model, funded under the “Forward – Artificial Intelligence for all” initiative by the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) on behalf of the German Ministry for Economic Cooperation and Development, is also in the works.

But creating translation models for a language without significant digital content on which to train machine learning models is no easy feat.

The team, led by professors of IIT Kharagpur, initially worked with members of the community to have them manually translate sentences from Hindi to Mundari.

To speed up translation, MSR researchers developed a new technology called Interactive Neural Machine Translation (INMT), which predicts the next word as someone translates between languages.

“It (INMT) allows for humans to translate from one language to another more effectively. If I’m translating from Hindi to Mundari, when I start typing in Mundari, it gives me predictive suggestions in Mundari itself. It’s like the predictive text you get in smartphone keyboards, except that it does it across two languages,” Bali explains.
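Bali’s description of INMT (predictive suggestions in the target language as the translator types) can be sketched with a toy bigram-based translation memory. This is purely illustrative: the real INMT system uses a neural model, and the sample strings below are made-up placeholders, not actual Mundari.

```python
from collections import defaultdict


class PredictiveTranslationMemory:
    """Toy sketch of INMT-style assistance: suggest the next
    target-language word from previously completed translations."""

    def __init__(self):
        # bigram counts: last word -> {candidate next word: count}
        self.next_words = defaultdict(lambda: defaultdict(int))

    def add_translation(self, target_sentence):
        """Learn word-to-word transitions from a finished translation."""
        words = target_sentence.split()
        for i in range(len(words) - 1):
            self.next_words[words[i]][words[i + 1]] += 1

    def suggest(self, partial_target, k=3):
        """Return up to k likely next words for the partial translation."""
        words = partial_target.split()
        if not words:
            return []
        candidates = self.next_words.get(words[-1], {})
        return [w for w, _ in sorted(candidates.items(),
                                     key=lambda kv: -kv[1])[:k]]


memory = PredictiveTranslationMemory()
# Placeholder strings standing in for completed Mundari translations:
memory.add_translation("am do cetan menam")
memory.add_translation("am do bugin menam")
print(memory.suggest("am do"))  # suggests continuations seen after "do"
```

A production system would condition suggestions on the Hindi source sentence as well as the target prefix; this sketch only shows the interaction pattern, where each keystroke narrows the suggested completions.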

To build the dataset for text-to-speech, they collaborated with Karya, which started as a research project by Vivek Seshadri, a principal researcher at MSR. Karya is a digital work platform for capturing, labeling and annotating data used to build machine learning and AI models.

The team identified a male Mundari speaker, with Dr. Munda as the female speaker, and gave them the translated sentences to record on the Karya app on Android smartphones.

The recordings, along with the corresponding text, are securely uploaded to the cloud, where researchers can access them to train text-to-speech models.

“The idea is that between Microsoft Research, Karya and IIT Kharagpur, we will have data for machine translation, speech recognition and text-to-speech synthesis, so that all these three technologies can be built for Mundari,” elaborates Bali.

These connections between language and technology are basic building blocks that eventually could enable sophisticated systems like translation services on government websites or streaming platforms. These systems are already a reality for the language you are reading this article in.

The Munda community is not the only one involved in Project ELLORA’s work. Other native language development efforts include:

  • Aiding Gondi speakers, very few of whom understand other languages, in gaining access to information. Project ELLORA worked with partners CGNet Swara and IIIT Naya Raipur to build Adivasi Radio, a hub where news, videos and books can be accessed. The team produced 60,000 parallel sentences between Gondi and Hindi, which has led to the development of a machine translation service.
  • Working with the Idu Mishmi community in Arunachal Pradesh, in north-eastern India, to create a framework for a digital dictionary for the Idu Mishmi language, which now has fewer than 12,000 speakers. The digital dictionary will be used in schools to teach Idu Mishmi to children.
Members of the Idu Mishmi community collaborate with Pamir Gogoi (second from right), a research intern at MSR India, in Hunli, Arunachal Pradesh. Photo by Niyaldeep Boruah for Microsoft.

“We want to shorten the time cycle that it might otherwise take for these languages to have enough data to take advantage of the technology,” Bali says. “If AI can do all these wonderful things for speakers of English, then it should be able to do all these wonderful things for any other human being who does not speak English.”

Top photo: Dr. Meenakshi Munda records speech samples of text on Karya to help build text-to-speech models for Mundari. Photo by Sunil Bisoyi for Microsoft. 

Amal Shiyas is an assistant editor at