Factlen Explainer · Language Tech · Jun 20, 2026, 12:15 AM · 5 min read

How Indigenous Communities Are Using AI to Save Endangered Languages

Faced with the threat of digital extinction, indigenous and minority communities are pioneering bespoke AI models to preserve their languages on their own terms.

By Factlen Editorial Team

Indigenous Data Sovereignty Advocates (40%)
Argue that language data is a cultural asset that must be protected from corporate extraction and governed by the community.
Open-Source AI Collaborators (40%)
Believe that decentralized, open-access research is the fastest way to build technological capacity for underrepresented languages.
Commercial Tech Developers (20%)
Focus on scaling massive, generalized multilingual models to bridge the digital divide across thousands of languages simultaneously.

What's not represented

  • Elderly native speakers who do not use digital technology

Why this matters

As the world becomes increasingly digitized, languages that cannot interface with modern technology risk vanishing entirely. By taking control of AI development, marginalized communities are ensuring their cultural heritage survives the digital age.

Key points

  • Nearly 40% of the world's 7,000 languages are currently endangered, facing the threat of 'digital extinction.'
  • Small Language Models (SLMs) allow developers to build AI tools using a fraction of the data required by massive commercial models.
  • New Zealand's Te Hiku Media built a highly accurate Māori speech recognition AI while strictly protecting their data sovereignty.
  • The Masakhane network is using open-source collaboration to build translation tools for over 40 African languages.
  • AI cannot replace human speakers, but it provides the digital infrastructure necessary for younger generations to learn.

7,000: Approximate global languages spoken
40%: Proportion of languages classified as endangered
92%: Accuracy of Te Hiku's Māori speech recognition AI
35: Sentence pairs used to train the Nüshu AI model

The digital age has long been a double-edged sword for the world's languages. While the internet connected the globe, it also reinforced a linguistic hierarchy that heavily favors dominant languages such as English, Mandarin, and Spanish.[3]

For thousands of indigenous and minority communities, this digital dominance created a crisis known as "digital extinction." If a language cannot be typed on a smartphone keyboard, searched on Google, or understood by a voice assistant, it risks vanishing from the daily lives of younger, digitally native generations.[3]

Today, nearly 40 percent of the world’s roughly 7,000 languages are considered endangered. In places like southwestern Ethiopia, the Ongota language has virtually no digital footprint, leaving its remaining elderly speakers invisible to the modern web and their knowledge systems isolated from the digital record.[2][3]

But a profound shift is underway. Artificial intelligence, the very technology that initially threatened to accelerate linguistic homogenization through massive, English-centric Large Language Models (LLMs), is now being repurposed by indigenous communities as a powerful tool for cultural preservation.[1][7]

Only a tiny fraction of the world's languages currently possess the digital footprint required to thrive online.

The breakthrough lies in the shift from massive, resource-heavy models to Small Language Models (SLMs) and targeted machine learning frameworks. Unlike commercial chatbots, which require billions of parameters and vast oceans of scraped internet text, these new tools are designed to operate efficiently on minimal data.[2]

This is crucial for "low-resource languages": those that lack the massive digitized archives of books, articles, and websites required to train traditional AI. Using techniques such as transfer learning, in which a model already trained on data-rich languages is adapted to a new one, and targeted automatic speech recognition, computer scientists can now build functional AI tools with a fraction of the data.[4][5]
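
To make the idea of transfer learning concrete, here is a minimal Python sketch. It is an illustration only, not the method used by any project in this article: the "pretrained" encoder weights, the toy dataset, and the function names (frozen_encoder, train_head, predict) are all hypothetical. What it shows is the general pattern: the large, expensive part of a model is reused unchanged, while only a tiny task-specific layer is trained on a handful of examples.

```python
import math

# Stand-in for an encoder pretrained on a data-rich language.
# These weights are hard-coded placeholders (an assumption for illustration)
# and are never updated during training: they stay "frozen".
FROZEN_WEIGHTS = [[0.9, -0.4], [-0.3, 0.8]]


def frozen_encoder(x):
    """Map a raw 2-d input to features using the fixed, reused weights."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]


def train_head(examples, epochs=200, lr=0.5):
    """Fit only a small logistic-regression head on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in examples:
            feats = frozen_encoder(x)
            z = sum(w * f for w, f in zip(weights, feats)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))
            error = pred - label  # gradient of the log loss with respect to z
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias


def predict(head, x):
    """Classify an input with the frozen encoder plus the trained head."""
    weights, bias = head
    feats = frozen_encoder(x)
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1 if z > 0 else 0


# Hypothetical "low-resource" task: only four labeled examples.
tiny_dataset = [([1.0, 0.0], 1), ([0.8, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]
head = train_head(tiny_dataset)
print([predict(head, x) for x, _ in tiny_dataset])
```

Because only three numbers are learned here (two weights and a bias), four labeled examples are enough to fit them. Real systems apply the same principle at far larger scale, reusing an encoder with millions of parameters and adapting a small part of it to the new language.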

At Dartmouth College, researchers recently demonstrated this potential by building an AI framework called NüshuRescue. Nüshu is a centuries-old script created by Yao women in China's Hunan province to communicate in secret, which faced near-total erasure in the modern era.[4]

Using just 35 pairs of matching sentences in Chinese and Nüshu, the researchers trained a model to accurately translate and expand the digital database of the rare script. The team is now exploring how this low-data framework can be applied to other endangered languages, such as Cherokee.[4]

Researchers have successfully trained AI models to translate rare scripts using as few as 35 sentence pairs.

Beyond academic labs, the most impactful AI revitalization efforts are being driven directly by the communities themselves. In New Zealand, the Māori-owned nonprofit broadcaster Te Hiku Media has become a global pioneer in indigenous artificial intelligence.[1][5]

In 2018, Te Hiku launched a grassroots campaign, asking Māori speakers across the country to record themselves reading text. They gathered over 300 hours of annotated audio, which they used to train a bespoke automatic speech recognition system tailored specifically to the nuances of te reo Māori.[5]

The results were striking. Te Hiku’s model transcribed the language with 92 percent accuracy, outperforming general-purpose models built by large technology companies. The tool now powers real-time transcription and pronunciation feedback for language learners.[1][5]
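
The sources do not say how the 92 percent figure was measured. Speech recognition systems are commonly scored by word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the correct one, divided by the length of the correct transcript. "Accuracy" is then often quoted as one minus WER. The Python sketch below shows that calculation under this assumption. The function names and sample sentences are illustrative placeholders, not Te Hiku's evaluation code or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy quoted as one minus WER (an assumed convention, see above)."""
    return 1.0 - word_error_rate(reference, hypothesis)


# Placeholder transcripts: ten reference words, one of them mis-transcribed.
reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown fox jumped over the lazy dog today"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")      # WER: 0.10
print(f"Accuracy: {word_accuracy(reference, hypothesis):.2f}")  # Accuracy: 0.90
```

Under this convention, a 92 percent accuracy rate would correspond to roughly eight word errors per 100 reference words. Note that WER can exceed 1 when a transcript contains many inserted words, so accuracy computed this way can be negative.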

But Te Hiku’s most significant contribution to the field isn't just technical; it is ethical. The organization strictly adheres to the principle of indigenous data sovereignty, arguing that language data is a deeply sacred cultural asset, akin to physical land.[6][7]

When large technology companies offered to buy their dataset to improve commercial voice assistants, Te Hiku refused. Instead, they established specialized data licenses ensuring that any AI built with Māori voices directly benefits the Māori people, preventing corporate extraction of their cultural heritage.[2][6]

Community-led AI models often outperform general commercial systems by training specifically on local dialects and nuances.

A different but equally powerful model of community-led AI is taking root in Africa. Masakhane, a grassroots organization of researchers spanning more than 30 African countries, is working to build natural language processing tools for the continent's roughly 2,000 languages.[1][6]

Unlike Te Hiku’s strictly guarded data approach, Masakhane embraces an open-source, collaborative ethos. Their philosophy is "by Africans, for Africans," and their decentralized network has successfully developed machine translation models for over 40 African languages that were previously ignored by Silicon Valley.[6]

These tools are not just academic exercises; they have real-world stakes. In regions where critical information regarding healthcare, agriculture, and government services is only broadcast in colonial languages, AI translation can be a matter of survival and economic inclusion.[7]

The success of these grassroots initiatives has forced the broader tech industry to take notice. Major players are now launching grants and partnerships, such as the LINGUA Africa initiative, to fund community-led AI projects rather than simply scraping data without permission.[7]

Small Language Models require vastly less data and computing power, making them ideal for grassroots preservation efforts.

Ultimately, linguists and activists agree that artificial intelligence cannot save a language on its own. A language only survives if it is spoken at home, taught in schools, and used in daily life to form meaningful human connections.[3]

What AI provides is critical digital infrastructure. By automating the painstaking work of transcription, building interactive learning apps, and ensuring that indigenous languages can interface with modern software, these tools remove the technological barriers to fluency.[5]

For the first time in the digital age, technology is not just accelerating the loss of ancient voices. In the hands of the communities who own them, it is helping those voices speak to the future.[7]

How we got here

  1. 2018

    Te Hiku Media launches a crowdsourcing campaign to collect annotated Māori audio data.

  2. 2020

    The Masakhane community forms to advance natural language processing for African languages.

  3. 2022

    The UN launches the International Decade of Indigenous Languages to highlight the extinction crisis.

  4. 2024

    Te Hiku's CEO is recognized globally for pioneering indigenous data sovereignty in AI.

  5. 2025

    Dartmouth researchers successfully train an AI model on the rare Nüshu script using just 35 sentence pairs.

Viewpoints in depth

The Data Sovereignty View

Protecting language data from corporate extraction.

Advocates for indigenous data sovereignty, such as New Zealand's Te Hiku Media, view language as a deeply sacred cultural asset, akin to physical land. They argue that historical patterns of colonization are repeating themselves in the digital realm, where massive tech companies scrape indigenous data to train commercial models without compensating the communities. For this camp, the priority is strict community governance: ensuring that the people who own the language control how it is digitized, who has access to it, and who profits from the resulting AI tools.

The Open-Source Collaborative View

Accelerating progress through shared, decentralized research.

Grassroots networks like Africa's Masakhane champion an open-source philosophy, arguing that the sheer scale of the language preservation crisis requires radical collaboration. Because low-resource languages lack the financial incentives to attract Silicon Valley investment, this camp believes that progress depends on researchers freely sharing datasets, code, and breakthroughs across borders. By decentralizing AI development, they aim to rapidly build technological capacity for dozens of languages simultaneously, prioritizing broad access and utility over strict data siloing.

What we don't know

  • Whether small, community-led AI models can secure the long-term funding needed to maintain their digital infrastructure.
  • How international copyright law will ultimately handle the scraping of indigenous language data by commercial AI companies.

Key terms

Small Language Model (SLM)
An AI model designed to perform specific tasks efficiently using far less data and computing power than the massive models behind products like ChatGPT.
Low-Resource Language
A language that lacks the large volumes of digitized text and audio required to train traditional artificial intelligence systems.
Data Sovereignty
The principle that indigenous communities have the right to own, control, and govern their own cultural and linguistic data.
Digital Extinction
The process by which a language falls out of use because it cannot be utilized on modern digital devices or the internet.
Natural Language Processing (NLP)
A branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language.

Frequently asked

Can artificial intelligence actually save an endangered language?

AI cannot save a language on its own, as languages only survive if they are actively spoken in daily life. However, AI provides critical digital infrastructure—like automated transcription and translation apps—that makes it much easier for younger generations to learn and use the language.

What makes a language 'low-resource' in AI?

A low-resource language is one that lacks large amounts of digitized text and audio data. Because traditional AI models require massive datasets scraped from the internet to function, languages without a heavy digital footprint are often left behind.

Why don't tech companies just buy the data from these communities?

Many indigenous communities refuse to sell their language data to commercial tech companies, citing data sovereignty. They argue that language is a cultural asset that should not be monetized by outside corporations, and that the communities themselves should own the resulting AI tools.

Sources

Source coverage

7 outlets

3 viewpoints surfaced

  1. [1] World Economic Forum (viewpoint: Indigenous Data Sovereignty Advocates)

    How AI is helping preserve Indigenous languages

  2. [2] Brookings Institution (viewpoint: Indigenous Data Sovereignty Advocates)

    Indigenous language models: Small models, big impact

  3. [3] Tech Policy Press (viewpoint: Commercial Tech Developers)

    How Multilingual AI Can Protect Language and Improve Global Technology

  4. [4] Dartmouth College

    Computer scientists and linguists build AI tech to strengthen endangered languages

  5. [5] IGI Global (viewpoint: Open-Source AI Collaborators)

    Case Studies of AI-Driven Language Revitalization

  6. [6] arXiv (viewpoint: Open-Source AI Collaborators)

    Generative AI and Large Language Models for Language Preservation

  7. [7] Factlen Editorial Team

    Synthesis by Factlen editorial team