How AI and Digital Tools Are Racing to Save the World's Endangered Languages
As nearly 40% of the world's languages face extinction, indigenous communities and technologists are using artificial intelligence to rapidly archive, translate, and revitalize endangered tongues.
By Factlen Editorial Team
- Indigenous Data Sovereignty Advocates: Argue that communities must own and control their linguistic data to prevent exploitation by big tech.
- Open-Source Technologists: Focus on building massive, accessible AI models to break down language barriers globally.
- Linguistic Anthropologists: Emphasize that while tech is a useful tool, saving a language requires addressing social and political marginalization.
What's not represented
- Government Policymakers
- Elder Native Speakers
Why this matters
Language is the vessel for a community's history, ecological knowledge, and cultural identity. As AI breaks down the technical barriers to preserving low-resource languages, it offers a crucial lifeline to marginalized communities fighting to keep their heritage alive in the digital age.
Key points
- Nearly 40% of the world's roughly 6,700 languages are currently at risk of extinction.
- Tech companies are developing AI models that can translate across as many as 200 languages, including many 'low-resource' ones, without relying on massive datasets for each.
- Indigenous groups are pioneering 'data sovereignty' frameworks to ensure they retain ownership of their linguistic data.
- Experts warn that while AI is a powerful tool, true language preservation requires addressing the political and social marginalization of native speakers.
The world is home to roughly 6,700 distinct languages, and linguists warn that nearly 40% of them are at risk of extinction by the end of the century. [1] When a language fades, it takes with it centuries of ecological knowledge, unique cultural worldviews, and ancestral history. [8]
For decades, language preservation was a painstaking, analog process of recording elders on cassette tapes and compiling physical dictionaries. Today, a new wave of technologists and indigenous communities is turning to artificial intelligence to speed up that work. [2]
Generative AI and machine learning models can now transcribe, translate, and teach languages that were previously excluded from the digital realm entirely. [2] However, the intersection of Silicon Valley technology and indigenous heritage has sparked a complex debate about who owns linguistic data and whether code can truly save a culture. [1, 7]
The fundamental technical hurdle for endangered languages has always been data scarcity. [5] Large language models like those powering ChatGPT or standard Google Translate require billions of words of text to learn a language's grammar and vocabulary. [1]
Languages that are primarily oral, or those spoken by marginalized communities, simply do not have that volume of digitized text. [1] In the AI industry, these are known as "low-resource languages," and until recently, they were effectively invisible to modern algorithms. [2]
That bottleneck is beginning to break. Meta's "No Language Left Behind" (NLLB) project released an open-source AI model capable of direct translations across 200 languages, including many low-resource tongues like Luganda and Māori. [2]
To achieve this without massive datasets, researchers used a "sparse mixture-of-experts" architecture, which routes translation tasks through specialized sub-networks to prevent the AI from overfitting on limited data. [2] Crucially, NLLB translates directly between language pairs rather than passing through English as an intermediate step, which helps preserve local idioms and cultural nuances. [2]
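To make the routing idea concrete, here is a toy sketch in Python. It is not Meta's implementation: the class name `SparseMoELayer`, the random weights, and the layer sizes are all assumptions chosen for illustration. It shows only the core mechanism, in which a small "gate" scores every expert for a given input and only the top-scoring few actually run.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class SparseMoELayer:
    """Hypothetical toy layer: a gate picks top_k experts per input."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # One gating vector per expert (random here; learned in a real model).
        self.gate = [[rng.uniform(-1, 1) for _ in range(dim)]
                     for _ in range(num_experts)]
        # Each expert is a small dim x dim linear map.
        self.experts = [[[rng.uniform(-1, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def route(self, x):
        """Return (expert_index, weight) pairs for the top_k experts."""
        scores = [dot(w, x) for w in self.gate]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = ranked[:self.top_k]
        weights = softmax([scores[i] for i in chosen])
        return list(zip(chosen, weights))

    def forward(self, x):
        """Run only the chosen experts and blend their outputs."""
        out = [0.0] * len(x)
        for idx, weight in self.route(x):
            expert_out = [dot(row, x) for row in self.experts[idx]]
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


layer = SparseMoELayer(dim=4, num_experts=8, top_k=2)
token = [0.5, -1.0, 0.25, 2.0]
print(layer.route(token))    # two experts selected out of eight
print(layer.forward(token))  # blended output of just those two
```

Because only two of the eight experts run for any one input, the model can hold a lot of specialized capacity while each example touches only a small part of it.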
But while big tech companies are building massive, universal models, indigenous communities are increasingly demanding control over how their languages are digitized. [1] This movement, known as Indigenous Data Sovereignty, argues that communities, not for-profit corporations, must own and benefit from their linguistic heritage. [1, 8]
A pioneering example is unfolding in New Zealand. Te Hiku Media, a Māori broadcasting organization, refused to hand over its decades of archival audio to global tech giants. [3] Instead, it built its own automatic speech recognition (ASR) model using the open-source NVIDIA NeMo toolkit. [3]

By crowdsourcing voice recordings from the community under a strict "Kaitiakitanga" (guardianship) license, Te Hiku developed an AI that transcribes spoken Te Reo Māori with 92% accuracy. [3] The license stipulates that the data may be used only in ways that benefit the Māori people, a safeguard against commercial exploitation. [3]
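The reporting does not say how that 92% figure was measured. One widely used yardstick for speech recognition is word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the machine transcript into the human reference, divided by the length of the reference. The sketch below assumes that metric purely for illustration; the function name and the placeholder sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for real transcripts.
wer = word_error_rate("a b c d", "a b x d")
print(wer)      # 0.25: one wrong word out of four
print(1 - wer)  # 0.75 if "accuracy" is read as 1 - WER
```

Under this reading, a 92% accuracy score would correspond to roughly eight word errors per hundred words of reference transcript.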
The integration of AI into daily life is also proving vital for language visibility. In New Zealand, the media organization Stuff partnered with Microsoft and Straker Translations to build a custom AI tool that rapidly translates entire news articles into Te Reo Māori. [4]
Because Te Reo Māori is highly contextual, with a single word able to carry multiple meanings depending on the speaker's intent, the AI does not publish directly. [4] Instead, it generates a high-speed draft that is then reviewed and refined by a human kaiwhakamāori (translator). [4] This human-in-the-loop system allows the newsroom to scale its bilingual output dramatically without sacrificing cultural accuracy. [4]
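The draft-then-review workflow can be sketched as a simple publishing gate. This is an illustrative outline, not the Stuff, Microsoft, or Straker system: every name in it (`Draft`, `make_draft`, `review`, `publish`, and the stand-in `fake_translate` function) is hypothetical, and the machine translation step is replaced by a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    source_text: str
    machine_text: str
    reviewed_text: Optional[str] = None
    reviewer: Optional[str] = None

    @property
    def approved(self) -> bool:
        # A draft counts as approved only once a human has signed off.
        return self.reviewed_text is not None


def make_draft(source_text: str, translate: Callable[[str], str]) -> Draft:
    """Produce a fast machine draft; nothing is published at this stage."""
    return Draft(source_text=source_text, machine_text=translate(source_text))


def review(draft: Draft, reviewer: str, final_text: str) -> Draft:
    """Record the human translator's corrected version."""
    draft.reviewed_text = final_text
    draft.reviewer = reviewer
    return draft


def publish(draft: Draft) -> str:
    """Release only the human-reviewed text, never the raw machine draft."""
    if not draft.approved:
        raise PermissionError("draft has not been reviewed by a translator")
    return draft.reviewed_text


def fake_translate(text: str) -> str:
    # Placeholder standing in for a real translation model.
    return f"[machine draft] {text}"


draft = make_draft("Example headline", fake_translate)
review(draft, reviewer="kaiwhakamāori", final_text="[reviewed translation]")
print(publish(draft))
```

The key design choice is that `publish` refuses unreviewed drafts, so speed comes from the machine while the final wording stays with a person.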
Beyond translation, AI is being used to preserve scripts that are on the brink of vanishing. At Dartmouth College, researchers developed "NüshuRescue," an AI framework dedicated to preserving Nüshu, a centuries-old script created by women in China's Hunan province to communicate in secret. [5]
Using just 35 pairs of matching sentences, the researchers trained a large language model to understand and translate the rare script. [5] The project demonstrated that modern AI can rapidly produce valuable linguistic resources even from very small amounts of training data. [5]
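One common way to get useful output from a language model with only a handful of sentence pairs is few-shot prompting, in which the pairs are shown to the model as worked examples ahead of a new sentence. Whether NüshuRescue works this way is not stated here, so the sketch below is an assumption for illustration only; the function name and the placeholder sentences are hypothetical.

```python
def build_few_shot_prompt(pairs, query, source_name="Source", target_name="Target"):
    """Lay out example translations, then the new sentence to translate."""
    if not pairs:
        raise ValueError("at least one example pair is required")

    lines = [f"Translate from {source_name} to {target_name}.", ""]
    for source_sentence, target_sentence in pairs:
        lines.append(f"{source_name}: {source_sentence}")
        lines.append(f"{target_name}: {target_sentence}")
        lines.append("")
    # The prompt ends on an open target line for the model to complete.
    lines.append(f"{source_name}: {query}")
    lines.append(f"{target_name}:")
    return "\n".join(lines)


# Placeholder strings stand in for real sentence pairs.
examples = [
    ("source sentence 1", "target sentence 1"),
    ("source sentence 2", "target sentence 2"),
]
print(build_few_shot_prompt(examples, "new source sentence"))
```

With 35 pairs rather than two, the same layout still fits comfortably in a single prompt, which is part of why tiny datasets can be workable.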

In Australia, digital tools are stepping in where human speakers have almost entirely disappeared. The Ngalia language, native to the Western Desert, has only three known fluent speakers. [6]
To prevent the language from dying with them, developers launched the Mamutjitji Story app, an interactive educational tool that uses a local Dreamtime story about an antlion to teach Ngalia vocabulary and science concepts to children. [6]
Despite these technological triumphs, linguists and anthropologists caution against viewing AI as a panacea. [7] Critics argue that artificial intelligence cannot solve the material, social, and political problems that drive language endangerment in the first place. [7]
As anthropologist Gerald Roche points out, languages do not simply "die" of natural causes; they are often actively suppressed by state policies that defund minority schools, mandate dominant languages in workplaces, and marginalize indigenous populations. [7]
An AI app cannot build a bilingual school, pass protective legislation, or erase the systemic discrimination that forces a community to abandon its mother tongue for economic survival. [7, 8]
Ultimately, technology is a powerful amplifier, but it is not a substitute for human agency. [8] AI can archive the past and build bridges for the future, but the true survival of the world's endangered languages will depend on the communities fighting to speak them, and the political will to let those voices be heard. [1, 7, 8]
How we got here
2022
The United Nations declares the start of the International Decade of Indigenous Languages.
Late 2022
Meta launches the 'No Language Left Behind' project to translate 200 languages.
Jan 2024
Te Hiku Media achieves 92% accuracy with its community-owned Te Reo Māori speech recognition model.
May 2024
The Mamutjitji Story app launches to preserve the Ngalia language, which has only three remaining speakers.
Viewpoints in depth
Indigenous Data Sovereignty Advocates
Argue that communities must own and control their linguistic data.
This camp, heavily represented by indigenous media organizations and legal scholars, views the rush to digitize endangered languages with cautious optimism. They argue that historical patterns of extraction, in which outside researchers take cultural artifacts for their own benefit, are repeating themselves in the AI era. By using specialized licenses like Kaitiakitanga, they aim to ensure that the data used to train AI models remains the property of the community, preventing tech giants from monetizing their ancestral tongues without permission or compensation.
Open-Source Technologists
Focus on building massive, accessible AI models to break down language barriers.
Researchers at major tech firms and universities emphasize the sheer scale of the language extinction crisis, arguing that only automated, open-source AI can work fast enough to archive and translate these languages before they disappear. They point to breakthroughs like sparse mixture-of-experts architectures, which allow models to learn from tiny datasets. For this camp, making these powerful models freely available to developers worldwide is the most effective way to democratize language preservation and integrate marginalized voices into the global internet.
Linguistic Anthropologists
Emphasize that saving a language requires addressing social and political marginalization.
Scholars studying the root causes of language death argue that AI, while technologically impressive, treats the symptom rather than the disease. They point out that languages do not die naturally; they are often actively suppressed by state policies, economic pressures, and systemic racism that force communities to adopt dominant languages. From this perspective, an AI translation app is useless if a community is legally barred from speaking its language in schools or workplaces. They advocate for pairing technological tools with fierce political advocacy and material funding for indigenous communities.
What we don't know
- It remains unclear how effectively AI-generated language tools will translate into actual fluency for younger generations.
- The long-term legal enforceability of indigenous data sovereignty licenses against global tech scraping operations is still being tested.
Key terms
- Low-resource language: A language that lacks large amounts of digitized text or audio data, making it difficult to train standard AI models.
- Data sovereignty: The principle that indigenous communities should own, control, and benefit from their own cultural and linguistic data.
- Automatic Speech Recognition (ASR): Technology that converts spoken language into written text.
- Sparse mixture-of-experts: An AI architecture that routes tasks through specialized sub-networks, allowing the model to learn efficiently from limited data.
- Kaitiakitanga: A Māori concept of guardianship and protection, used as a licensing framework to protect indigenous data from commercial exploitation.
Frequently asked
Can AI actually make someone fluent in an endangered language?
AI cannot replace human immersion, but it provides crucial tools like real-time translation, interactive apps, and vast digital archives that make learning and practicing the language much easier for new generations.
Why don't standard translation apps support these languages?
Standard apps rely on massive datasets of digitized text to train their algorithms. Endangered languages often lack this digital footprint, making them 'low-resource' and invisible to traditional AI.
What is data sovereignty in the context of AI?
It is the legal and ethical framework ensuring that the communities who provide the language data retain ownership of it, preventing tech companies from exploiting their cultural heritage for profit.
Sources
[1] Brookings Institution, "Adapting small language models for Indigenous languages" (Indigenous Data Sovereignty Advocates)
[2] Digital Digest, "No Language Left Behind: Meta's AI-driven quest to preserve linguistic diversity" (Open-Source Technologists)
[3] NVIDIA, "Te Hiku Media Uses Trustworthy AI to Preserve Māori Language" (Indigenous Data Sovereignty Advocates)
[4] Microsoft, "How generative AI is transforming te reo Māori translation" (Open-Source Technologists)
[5] Dartmouth College, "Computer scientists and linguists build AI tech to strengthen endangered languages" (Open-Source Technologists)
[6] Green Network Asia, "Mamutjitji Story: Preserving the Ngalia Language with an Educational App" (Open-Source Technologists)
[7] Global Voices, "Artificial intelligence cannot solve the material, social and political problems that are driving global language endangerment" (Linguistic Anthropologists)
[8] Bowdoin College, "AI in Revitalizing and Teaching Endangered Languages" (Indigenous Data Sovereignty Advocates)