Factlen Explainer · Language Tech · Jun 20, 2026, 1:48 PM · 6 min read · #3 of 3 in culture

How Indigenous Technologists Are Rewiring AI to Save Endangered Languages

Faced with a digital landscape that favors English, Indigenous communities are building custom AI models and wearable robots to revitalize their native languages on their own terms.

By Factlen Editorial Team

Indigenous Technologists 45% · Academic AI Researchers 35% · Digital Rights Advocates 20%
Indigenous Technologists
Argue that language AI must be built on a foundation of data sovereignty, ensuring communities control their cultural knowledge rather than surrendering it to tech giants.
Academic AI Researchers
Focus on developing new model architectures and transfer-learning techniques that can accurately process languages with very small amounts of training data.
Digital Rights Advocates
Highlight the structural inequalities of the global language data gap, warning that AI will exacerbate digital exclusion if it remains English-centric.

What's not represented

  • Elders and traditional knowledge keepers who may be skeptical of digitizing sacred or nuanced oral histories.
  • Public school administrators tasked with integrating these experimental AI tools into formal language curricula.

Why this matters

Roughly 40 percent of the world's 7,000 languages are at risk of disappearing, taking centuries of cultural and scientific knowledge with them. By forcing AI to work for low-resource languages, technologists are proving that the digital age does not have to be an extinction event for global diversity.

Key points

  • Mainstream AI models struggle with non-dominant languages due to a lack of digitized training data, known as the global language data gap.
  • Indigenous technologists are building custom AI tools, like the Anishinaabemowin-speaking 'Skobot', to teach endangered languages to youth.
  • New Zealand's Te Hiku Media crowdsourced 300 hours of speech to build a highly accurate Māori speech recognition model.
  • Communities are using 'Kaitiakitanga' licenses to maintain data sovereignty, preventing tech giants from extracting their cultural knowledge.
  • Researchers are developing 'transfer learning' techniques to train AI on low-resource languages using significantly less data.

40%: Global languages at risk of extinction
92%: Accuracy of Te Hiku Media's Māori ASR model
300 hours: Labeled speech data crowdsourced in 10 days
50%+: Web domains written in English

For decades, the internet has operated as a linguistic homogenizer. With more than half of all web domains written in English, the digital age has inadvertently accelerated the decline of minority languages. According to the United Nations, roughly 40 percent of the world's 7,000 languages are currently at risk of extinction, with one Indigenous language lost every two weeks. But a new generation of Indigenous technologists is flipping the script, transforming artificial intelligence from a tool of cultural erasure into an engine for language revitalization.[1][2][3][5]

One of the most visible examples of this shift sits on the shoulders of children. Danielle Boyer, a 24-year-old Anishinaabe roboticist, recently designed the "Skobot," a small, brightly colored wearable robot shaped somewhat like a parrot. Equipped with an internally developed AI model, the motion-activated toy converses fluently in Anishinaabemowin, the endangered language of the Anishinaabe nation in North America. When a child asks the robot how to say a specific word, the AI interprets the audio and responds in real time, simulating the natural, immersive conversation that is often missing from modern digital environments.[2]

Innovations like the Skobot are necessary because the broader AI industry has largely left the global majority behind. A 2025 paper from the Stanford Institute for Human-Centered Artificial Intelligence highlighted a structural disparity known as the "global language data gap." Mainstream large language models (LLMs) developed by major tech firms rely heavily on publicly available text scraped from the internet. Because the web is overwhelmingly English-centric, these models perform exceptionally well in dominant languages but fail spectacularly when tasked with anything else.[1][3][4]

Mainstream AI models rely on web scraping, heavily skewing their capabilities toward English and leaving low-resource languages behind.

In the field of artificial intelligence, languages outside this dominant cluster are termed "low-resource." This designation has nothing to do with the number of native speakers a language has; even widely spoken languages like Urdu fall into this category. Instead, "low-resource" refers strictly to a scarcity of machine-readable, digitized, and annotated text available to train algorithms. Without a large corpus of such data, standard AI transcription and translation tools struggle with cultural nuance, introduce biases, and frequently hallucinate incorrect grammar.[3][4][6]

To bridge this gap, Indigenous communities are taking data collection into their own hands, refusing to wait for Silicon Valley to notice them. In New Zealand, Te Hiku Media, an iwi-led (tribal) broadcasting organization, recognized that te reo Māori needed a robust digital presence to survive. Rather than relying on existing tech giants, they launched "Kōrero Māori," a massive crowdsourcing initiative designed to build a custom automatic speech recognition (ASR) model from scratch.[1][5]

The response from the Māori community was unprecedented. In just ten days, over 2,500 individuals signed up to read more than 200,000 phrases, generating over 300 hours of highly accurate, labeled speech data. Using the open-source NVIDIA NeMo toolkit and advanced tensor core GPUs, Te Hiku Media trained a speech-to-text model that now transcribes te reo Māori with 92 percent accuracy. It can even seamlessly transcribe bilingual speech, switching between English and te reo with an 82 percent accuracy rate.[5]
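The reporting above gives accuracy figures without naming the metric behind them. One common convention in speech recognition, assumed here purely for illustration, is to report word-level accuracy as one minus the word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the model's transcript into the reference transcript, divided by the number of reference words. A minimal sketch, in which the function name and the example phrase are hypothetical and not taken from Te Hiku Media's pipeline:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # reference word missing from hypothesis
                curr[j - 1] + 1,                  # extra word in hypothesis
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative only: a four-word reference with one word transcribed
# without its macron, which counts as a single substitution.
wer = word_error_rate("kei te pēhea koe", "kei te pehea koe")
print(f"WER: {wer:.2f}, word accuracy: {1 - wer:.2f}")
```

Under this assumed convention, a "92 percent accuracy" figure would correspond to roughly eight word errors per hundred reference words. Real evaluations usually also normalize casing and punctuation before scoring, which this sketch leaves out.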

The success of Te Hiku Media is not just a technical triumph; it is a blueprint for Indigenous data sovereignty. Historically, marginalized communities have seen their cultural artifacts and knowledge extracted and monetized by outside entities. To prevent this, Te Hiku Media collected its data under a strict "Kaitiakitanga" (guardianship) license. This legal and cultural framework ensures that the data, and the AI models built from it, remain under Māori control and are used exclusively for the benefit of the Māori people.[1][5][6]

Wearable AI robots, like the Anishinaabemowin-speaking 'Skobot', simulate natural conversation to help children learn endangered languages.

This insistence on sovereignty is reshaping how AI research is conducted globally. At the prestigious NeurIPS AI conference, recent workshops have centered entirely on building LLM architectures tailored to low-resource linguistic features through ethical, community-centered dataset collection. Researchers are moving away from brute-force data scraping and toward "transfer learning"—a technique where an AI model applies the underlying structural knowledge it learned from a high-resource language to a low-resource one, drastically reducing the amount of native data required.[6]
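The idea behind transfer learning can be shown with a deliberately tiny toy, not any research group's actual method. In the sketch below a one-parameter model is first trained on a plentiful "source" dataset, then fine-tuned for a few steps on three "target" examples. The same few steps are also run from a blank start for comparison. All names, slopes, and dataset sizes are hypothetical values chosen for illustration.

```python
def fit_slope(xs, ys, start, steps, lr=0.01):
    """Gradient descent on mean squared error for the one-parameter model y = w * x."""
    w = start
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w


# Hypothetical data-rich source task: 50 examples following y = 2.0 * x.
SOURCE_SLOPE = 2.0
source_x = [i / 10 for i in range(1, 51)]
source_y = [SOURCE_SLOPE * x for x in source_x]

# Hypothetical data-poor target task: only 3 examples, following a related
# but slightly different pattern, y = 2.2 * x.
TARGET_SLOPE = 2.2
target_x = [1.0, 2.0, 3.0]
target_y = [TARGET_SLOPE * x for x in target_x]

# Step 1: pretrain on the source task, where data is plentiful.
pretrained = fit_slope(source_x, source_y, start=0.0, steps=200)

# Step 2: fine-tune on the target task for only a few steps,
# starting from the pretrained weight instead of from zero.
transferred = fit_slope(target_x, target_y, start=pretrained, steps=5)

# Baseline: the same few steps on the target task with no pretraining.
scratch = fit_slope(target_x, target_y, start=0.0, steps=5)

print("error with transfer:   ", abs(transferred - TARGET_SLOPE))
print("error without transfer:", abs(scratch - TARGET_SLOPE))
```

Because the two tasks share most of their structure, the fine-tuned model ends up much closer to the target than the model trained from scratch on the same three examples. Real systems do the same thing with millions of parameters, and the benefit depends on how closely the high-resource and low-resource languages actually resemble each other.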

Institutions like Mila, the Quebec AI Institute, are pushing this further through their First Languages AI Reality (FLAIR) initiative. Developing an ASR model for a new language typically requires hundreds of hours of pristine audio. FLAIR is pioneering foundational research to create custom voice models for endangered languages using a fraction of that data. These lightweight models can then power voice-controlled technology, audio transcription, and immersive virtual reality experiences for Indigenous youth.[8]

The shift from static preservation to dynamic interaction is critical for intergenerational transfer. Younger speakers engage primarily through smartphones and interactive media, making traditional, static digital archives less effective. To address this, New Zealand-based software company Kiwa Digital partnered with Amazon Web Services to launch CultureQ, a generative AI platform. By embedding conversational AI into cultural archives, users can ask questions and hear the language spoken aloud, turning historical texts into living dialogues.[7]

Te Hiku Media successfully crowdsourced 300 hours of labeled te reo Māori speech data in just ten days.

Despite these breakthroughs, significant technical and ethical uncertainties remain. AI models, by their nature, recognize patterns and calculate probabilities; they do not "understand" culture. Linguists warn that general-purpose AI can inadvertently simplify or misrepresent Indigenous knowledge, stripping away the deep contextual tones and morphologies that give these languages their meaning. There is a persistent fear that synthetic data generation—using AI to create artificial training text—could slowly dilute the authenticity of the language over time.[3][4][6]

To mitigate these risks, the consensus among researchers and community leaders is that AI must remain a "human-in-the-loop" technology. In newsrooms and classrooms experimenting with these tools, "hybrid translation," in which AI outputs are rigorously reviewed by native speakers before publication, is becoming the gold standard. The goal is not to replace human teachers or elders, but to give them scalable tools that amplify their reach.[3][7]

Under a Kaitiakitanga framework, data remains under community control, preventing extraction by outside tech companies.

The implications of this work extend far beyond linguistics. Studies have shown that a strong connection to linguistic heritage correlates with tangible public health benefits in Indigenous communities, including lower rates of teen suicide, diabetes, and excessive alcohol consumption. Language is the vessel for identity, and preserving it has a profound stabilizing effect on a community's social fabric.[2][6]

By forcing cutting-edge technology to adapt to their needs, Indigenous technologists are proving that the future of AI does not have to be a monolith. From wearable robots in Michigan to sovereign data centers in New Zealand, these initiatives demonstrate that with the right ethical frameworks and community leadership, artificial intelligence can be harnessed to protect the very diversity it once threatened to erase.[1][2][5]

How we got here

  1. 2013

    Te Hiku Media convenes with community elders to form a strategy for sharing Māori content in the digital era.

  2. 2024

    Te Hiku Media's CEO is recognized on the TIME100 AI list for pioneering Indigenous data sovereignty in machine learning.

  3. 2025

    Stanford HAI publishes a paper detailing how mainstream large language models fail users in the global majority.

  4. 2025

    The NeurIPS conference hosts dedicated workshops on centering low-resource languages in the age of LLMs.

Viewpoints in depth

Indigenous Technologists' View

Emphasizes that language revitalization must be paired with strict data sovereignty to protect cultural heritage.

For Indigenous developers, the rush to build multilingual AI models by massive tech corporations represents a new form of digital colonialism. They argue that simply scraping the internet for native languages extracts cultural knowledge without compensating or empowering the communities it belongs to. By building their own models under frameworks like the Kaitiakitanga license, these technologists ensure that the tools serve the community first. They view AI not just as a translation engine, but as a sovereign digital asset that can foster intergenerational connection on their own terms.

Academic AI Researchers' View

Focuses on the technical challenge of rewiring AI architectures to learn efficiently from scarce data.

Computer scientists and linguists at institutions like Mila and Stanford are tackling the 'global language data gap' from an architectural standpoint. Standard LLMs are data-hungry: they contain billions of parameters and need correspondingly large volumes of training text to function well. Because low-resource languages will never have the same volume of digitized text as English, researchers are pioneering techniques like transfer learning and synthetic data generation. Their goal is to create lightweight, highly adaptable models that can grasp complex morphologies and tonal nuances without needing massive, brute-force datasets, thereby democratizing access to AI technology.

What we don't know

  • Whether synthetic data generation will eventually dilute the authentic nuances and idioms of endangered languages.
  • How quickly these custom, community-led AI models can scale to cover the thousands of other low-resource languages currently at risk.

Key terms

Low-Resource Language
A language that has very little digitized text or annotated data available online, making it difficult to train standard artificial intelligence models.
Automatic Speech Recognition (ASR)
Technology that allows a computer to identify and process human voice inputs, converting spoken language into written text.
Data Sovereignty
The concept that a community or nation has the right to control the collection, ownership, and application of its own data.
Kaitiakitanga
A Māori concept of guardianship and protection, used in this context as a licensing framework to protect Indigenous data from commercial exploitation.
Transfer Learning
A machine learning technique where an AI model applies knowledge gained from a data-rich task (like English translation) to help it learn a data-poor task (like translating an endangered language).

Frequently asked

What makes a language 'low-resource' in AI?

A low-resource language is one that lacks a large volume of digitized, machine-readable text and audio data on the internet, which is necessary to train standard AI models. It does not necessarily mean the language has few human speakers.

How did Te Hiku Media build its Māori AI model?

Te Hiku Media launched a crowdsourcing campaign that gathered 300 hours of labeled speech from over 2,500 Māori speakers in just ten days. They used this data to train a custom speech recognition model that operates with 92% accuracy.

What is data sovereignty?

Data sovereignty is the principle that data is subject to the laws and governance structures of the nation or community it comes from. For Indigenous groups, it ensures their cultural knowledge and language data cannot be exploited or monetized by outside tech companies.

Can AI perfectly translate cultural nuances?

Not currently. AI models recognize statistical patterns rather than truly understanding culture, meaning they can sometimes simplify or misrepresent deep linguistic nuances. Researchers recommend keeping human experts in the loop to review AI outputs.

Sources

Source coverage

8 outlets

3 viewpoints surfaced

  1. [1] Factlen Editorial Team: Synthesis by Factlen editorial team
  2. [2] Smithsonian Magazine (Indigenous Technologists): How a 24-Year-Old Roboticist is Preserving Indigenous Languages
  3. [3] Nieman Journalism Lab (Academic AI Researchers): Studies on AI transcription and translation in journalism reveal “low-resource” language gap
  4. [4] Global Voices (Digital Rights Advocates): Lost in translation: How AI models impact low-resource language communities
  5. [5] NVIDIA (Indigenous Technologists): Māori Speech AI Model Helps Preserve and Promote New Zealand Indigenous Language
  6. [6] NeurIPS (Academic AI Researchers): Centering Low-Resource Languages and Cultures in the Age of Large Language Models
  7. [7] Amazon Web Services (Indigenous Technologists): A GenAI Approach to Revitalizing Indigenous Language for the Digital Age
  8. [8] Mila (Academic AI Researchers): First Languages AI Reality (FLAIR)