How AI is Being Used to Save the World's Endangered Languages
Indigenous communities and computer scientists are pioneering new artificial intelligence techniques to preserve and revitalize languages on the brink of extinction.
By Factlen Editorial Team
- Indigenous Data Sovereignty Advocates: Argue that language data belongs to the community and must be protected from commercial exploitation by tech giants.
- Computational Linguists: Focus on the technical mechanisms of training models on extremely small datasets using phoneme transfer and modular architectures.
- Commercial AI Developers: Aim to build massive, universal models that can scale to support thousands of languages simultaneously for global digital inclusion.
What's not represented
- Elder native speakers who do not use digital technology
Why this matters
Language is the vessel for human history, culture, and identity. As AI reshapes global communication, ensuring it can understand and teach endangered languages prevents the digital erasure of minority cultures and offers a blueprint for how technology can serve, rather than exploit, vulnerable communities.
Key points
- Nearly 40% of the world's 7,000 languages are at risk of extinction, a decline accelerated by their lack of digital presence.
- AI models traditionally require massive datasets, leaving 'low-resource' languages behind.
- New AI techniques, like phoneme transfer and zero-shot translation, allow models to learn from minimal data.
- Indigenous communities are building their own AI tools to preserve oral histories and teach pronunciation.
- Data sovereignty licenses are being used to prevent tech companies from exploiting cultural heritage.
The world is experiencing a quiet extinction crisis. Nearly 40% of the globe's 7,000 spoken languages are at risk of vanishing within the next century, taking with them irreplaceable cultural knowledge, oral histories, and unique ways of understanding the human experience.
Historically, the digital age accelerated this decline. The architecture of the internet was built on a handful of dominant languages, forcing minority language speakers to assimilate to English, Mandarin, or Spanish in order to participate in the global economy and access digital services.
But a surprising reversal is now underway. Artificial intelligence, a technology long criticized for homogenizing global culture, is being re-engineered by indigenous communities and computational linguists to preserve and revitalize the world's most vulnerable languages.
The core technical hurdle in this endeavor is what researchers call the "low-resource" problem. Modern AI systems, such as the large language models powering popular chatbots, are notoriously data-hungry, requiring billions of words of text to accurately understand grammar, syntax, and context.[6]

Endangered languages simply do not possess this massive digital footprint. Many are primarily oral traditions, lacking standardized orthography, parallel translated texts, or extensive internet archives to scrape for training data.[6]
To bridge this gap, computer scientists are developing decoupled architectures and "phoneme transfer" techniques. For example, the WARDEN AI system was recently built to transcribe Wardaman, an endangered Australian indigenous language, using a mere six hours of annotated audio.[2]
By separating the transcription and translation processes, and leveraging the acoustic similarities between related regional languages, these modular AI systems can learn to recognize complex sound patterns with a fraction of the data previously required by unified models.[2]
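The decoupled design can be illustrated with a toy sketch. This is not the WARDEN implementation: the phoneme map, lexicon, and glossary below are invented placeholders (they contain no real Wardaman data), and every function name is hypothetical. The sketch only shows the shape of such a pipeline, in which phonemes recognized with a related language's inventory are mapped onto the target language, transcribed into words, and then translated in a separate stage.

```python
# Toy sketch of a decoupled transcription/translation pipeline.
# All data below is invented for illustration; it is NOT Wardaman.

# Assumption: a recognizer trained on a related "donor" language emits donor
# phonemes, and a small hand-built map converts them to target phonemes.
PHONEME_MAP = {"b": "p", "d": "t", "g": "k"}

# Assumption: a tiny lexicon built from a few hours of annotated audio.
LEXICON = {("p", "a", "t", "a"): "pata", ("k", "a"): "ka"}

# Assumption: a separate word-to-English glossary used by the translation stage.
GLOSSARY = {"pata": "water", "ka": "here"}


def transfer_phonemes(donor_phonemes):
    """Map donor-language phonemes onto the target inventory."""
    return [PHONEME_MAP.get(p, p) for p in donor_phonemes]


def transcribe(phonemes):
    """Segment a phoneme sequence into words by greedy longest match."""
    words, i = [], 0
    max_len = max(len(key) for key in LEXICON)
    while i < len(phonemes):
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            chunk = tuple(phonemes[i:i + n])
            if chunk in LEXICON:
                words.append(LEXICON[chunk])
                i += n
                break
        else:
            words.append("?")  # phoneme not covered by the lexicon
            i += 1
    return words


def translate(words):
    """Translate word by word; unknown words are flagged, not guessed."""
    return [GLOSSARY.get(w, f"<{w}>") for w in words]


donor_output = ["b", "a", "d", "a", "g", "a"]
words = transcribe(transfer_phonemes(donor_output))
print(words, translate(words))
```

Because the stages are separate, the lexicon or glossary can be corrected by community speakers without retraining the acoustic stage, which is one plausible reason modular designs suit very small datasets.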
One of the most successful community-led initiatives is happening in New Zealand. Te Hiku Media, a Māori broadcasting organization, embarked on a mission to digitize decades of archival recordings from native speakers born in the late 19th century.[1]

Realizing that manual transcription of these archives would take lifetimes, Chief Technology Officer Keoni Mahelona and CEO Peter-Lucas Jones built bespoke natural language processing tools specifically tailored for Te Reo Māori.[1][5]
Using 300 hours of annotated audio, they trained an automatic speech recognition model that outperforms commercial alternatives. They also launched Rongo, an application that helps users practice their pronunciation by providing real-time AI feedback to restore the authentic native sound of the language.[1]
Similar breakthroughs are happening with "no-resource" languages. Researchers at Loyola Marymount University recently developed a system to translate Owens Valley Paiute. Instead of feeding the model massive datasets, they explicitly taught the AI the language's grammar and vocabulary rules first, mimicking how a human linguist would learn.[4]
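A rules-first approach of this kind can be sketched in a few lines. The example below is a simplified illustration, not the Loyola Marymount system: the vocabulary is invented (it is not Owens Valley Paiute), and the subject-object-verb word order and tense suffixes are assumptions chosen only to show how explicit rules, rather than training data, can drive the output.

```python
# Toy rule-based sentence builder. The vocabulary and grammar rules are
# invented placeholders and do NOT represent Owens Valley Paiute.

NOUNS = {"dog": "zog", "fish": "mip"}
VERBS = {"see": "tala", "eat": "runo"}
TENSE_SUFFIX = {"present": "-ti", "past": "-ku"}


def build_sentence(subject, verb, obj, tense="present"):
    """Build a sentence using an assumed subject-object-verb order.

    Raises KeyError instead of guessing when a word or tense has no
    rule-backed entry, so the system never fabricates a translation.
    """
    for word, table in ((subject, NOUNS), (obj, NOUNS), (verb, VERBS)):
        if word not in table:
            raise KeyError(f"no rule-backed entry for {word!r}")
    if tense not in TENSE_SUFFIX:
        raise KeyError(f"no rule for tense {tense!r}")
    return f"{NOUNS[subject]} {NOUNS[obj]} {VERBS[verb]}{TENSE_SUFFIX[tense]}"


print(build_sentence("dog", "see", "fish", tense="past"))
```

The design choice worth noting is the refusal path: a word outside the encoded vocabulary produces an error rather than a plausible-looking invention.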
Big tech companies are also entering the space, driven by the need to expand their global reach. Google's "1,000 Languages Initiative" aims to build a Universal Speech Model (USM) capable of supporting the world's 1,000 most spoken languages, including marginalized ones.[3]

Trained on 12 million hours of speech and 28 billion sentences across 300 languages, the USM uses "zero-shot" machine translation, in which the model translates a language pair without ever having seen a direct, human-translated example for that pair.[3]
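The intuition behind zero-shot translation can be shown with a deliberately simple analogy. The sketch below is not how the USM works internally: it replaces a neural network's learned shared representation with an explicit table of concept IDs, and the helper names `train` and `translate` are hypothetical. It shows only that if Spanish-English and French-English pairs place words in one shared space, Spanish-to-French translation becomes possible without any Spanish-French example.

```python
# Toy analogy for zero-shot translation via a shared representation.
# A real multilingual model learns this space implicitly; here it is an
# explicit dictionary of concept IDs, which is an illustrative assumption.

def train(pairs):
    """Assign a shared concept ID to every word seen in a parallel pair."""
    concept_of = {}
    next_id = 0
    for (lang_a, word_a), (lang_b, word_b) in pairs:
        cid = concept_of.get((lang_a, word_a), concept_of.get((lang_b, word_b)))
        if cid is None:
            cid = next_id
            next_id += 1
        concept_of[(lang_a, word_a)] = cid
        concept_of[(lang_b, word_b)] = cid
    return concept_of


def translate(concept_of, word, src, tgt):
    """Translate through the shared space; return None if no match exists."""
    cid = concept_of.get((src, word))
    if cid is None:
        return None
    for (lang, candidate), other in concept_of.items():
        if lang == tgt and other == cid:
            return candidate
    return None


# Training data contains only es-en and fr-en pairs, never es-fr.
PAIRS = [
    (("es", "agua"), ("en", "water")),
    (("fr", "eau"), ("en", "water")),
    (("es", "casa"), ("en", "house")),
    (("fr", "maison"), ("en", "house")),
]

space = train(PAIRS)
print(translate(space, "agua", "es", "fr"))
```

In a production model the shared space is continuous and learned from speech and text at scale, so matches are approximate rather than exact lookups.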
However, the intersection of AI and indigenous heritage is fraught with ethical tension. When data is extracted without explicit consent, communities risk having their cultural assets appropriated and commodified by Silicon Valley.[4]
This fear is not hypothetical. In late 2024, a series of language-learning books sold online were discovered to be entirely AI-generated, containing fabricated and erroneous translations for languages like Mi'kmaq, Mohawk, and the extinct Siberian language Omok.[4]

To protect against exploitation, communities are pioneering new frameworks for data sovereignty. Te Hiku Media operates under a Kaitiakitanga (guardianship) license, which ensures that the Māori people retain ownership of their data and strictly prohibits its use in commercial surveillance or human rights violations.[1]
How we got here
2014
Te Hiku Media begins digitizing archival recordings of Māori speakers born in the late 19th century.
Nov 2022
Google announces its 1,000 Languages Initiative to build a Universal Speech Model.
Jul 2023
Te Hiku Media launches bespoke AI speech recognition tools for Te Reo Māori.
Dec 2024
AI-generated books with fabricated translations of indigenous languages, including the extinct language Omok, are discovered online.
Early 2026
Researchers successfully deploy modular AI systems to transcribe languages with as little as six hours of audio.
Viewpoints in depth
Indigenous Data Sovereignty Advocates
Argue that language data belongs to the community and must be protected from commercial exploitation.
Advocates in this camp emphasize that language is not merely a communication tool, but a sacred vessel of cultural identity. They argue that when tech companies scrape indigenous data to train commercial models, it constitutes a modern form of digital colonialism. To combat this, groups like Te Hiku Media have pioneered legal frameworks like the Kaitiakitanga license, ensuring that the community retains absolute ownership over their linguistic data and dictates exactly how and where it can be used.
Computational Linguists
Focus on the technical mechanisms of training models on extremely small datasets.
For researchers in this field, the challenge is fundamentally mathematical: how to teach a machine a language when the data simply does not exist. They advocate for moving away from the massive, data-hungry architectures favored by big tech, and instead focus on modular systems. By using techniques like phoneme transfer and explicitly programming grammatical rules into the model, they argue that AI can be effectively deployed for languages with only a few hours of recorded audio.
Commercial AI Developers
Aim to build massive, universal models that can scale to support thousands of languages simultaneously.
Commercial developers argue that the most efficient way to prevent digital language extinction is through scale. Initiatives like Google's Universal Speech Model are built on the premise that languages are interconnected. By training a single, massive model on hundreds of languages simultaneously, they believe the AI can use the underlying structure of high-resource languages to infer the rules of low-resource ones, ultimately bringing digital inclusion to billions of marginalized speakers.
What we don't know
- Whether AI-generated language tools can capture the deep cultural context and idioms that human elders provide.
- How intellectual property laws will adapt to protect indigenous data sovereignty on a global scale.
- Whether the rapid advancement of commercial AI models will eventually override community-led preservation efforts.
Key terms
- Low-resource language: A language with limited digital text or audio data available for training computational models.
- Phoneme transfer: An AI technique that leverages the acoustic similarities between related languages to transcribe audio with very little training data.
- Zero-shot translation: A machine learning capability where a model translates between two languages without having seen direct translations between them during training.
- Data sovereignty: The principle that indigenous communities should control the collection, ownership, and application of their own cultural and linguistic data.
Frequently asked
Can AI perfectly translate endangered languages?
Not yet. Because these languages lack massive written datasets, AI models still struggle with cultural nuance and complex grammar, though new techniques are rapidly improving accuracy.
What is a 'low-resource' language?
In computer science, it refers to a language that lacks the large digital datasets (like millions of translated web pages or transcribed audio hours) typically required to train AI models.
How are communities protecting their language data?
Many are using specialized data licenses, such as the Kaitiakitanga license in New Zealand, which ensures the community retains ownership and prevents tech companies from commercializing their heritage.
What is zero-shot translation?
It is a machine learning technique where an AI model learns to translate a language without ever being trained on direct, human-translated examples of that specific language pair.
Sources
[1] ITU (Indigenous Data Sovereignty Advocates): How AI is helping revitalise indigenous languages
[2] StartupHub.ai (Computational Linguists): WARDEN: Tackling Low-Resource Language AI
[3] Silicon Republic (Commercial AI Developers): Google marks new milestone in its 1,000 languages AI initiative
[4] Viterbi Conversations in Ethics (Indigenous Data Sovereignty Advocates): Preserving the Past: AI in Indigenous Language Preservation
[5] Hybrid Vigor (Indigenous Data Sovereignty Advocates): Robotic Revival: Can AI be used to revitalize endangered languages?
[6] International Journal of Research (Computational Linguists): Artificial Intelligence Translation Approaches for Endangered Language Preservation and Revitalization
[7] Factlen Editorial Team (Computational Linguists): Synthesis by Factlen editorial team