How Indigenous Engineers Are Using AI to Revitalize Endangered Languages
Faced with a global crisis of language extinction, Indigenous technologists are leveraging small language models and data sovereignty frameworks to build bespoke AI tools that preserve and teach ancestral tongues.
By Factlen Editorial Team
- Indigenous Technologists: Advocate for community-led AI development and strict data sovereignty to protect cultural heritage from commercial exploitation.
- Computational Linguists: Focus on the technical mechanisms of Small Language Models and NLP to efficiently process low-resource languages.
- Cultural Ethicists: Warn about the risks of AI hallucinations and digital colonialism, emphasizing that technology cannot replace human elders.
What's not represented
- Elders without digital access
- Linguists favoring traditional analog documentation
Why this matters
Language is the vessel for cultural identity and ecological knowledge. By proving that AI can be tailored to low-resource languages without exploiting community data, these projects offer a scalable blueprint for preserving human diversity in the digital age.
Key points
- Approximately 40 percent of the world's 6,700 languages are currently endangered, with one disappearing every two weeks.
- Indigenous engineers are using natural language processing to rapidly transcribe decades of archival audio that would take lifetimes to process manually.
- Small Language Models (SLMs) allow developers to build accurate AI tools using limited datasets, bypassing the massive data requirements of Big Tech models.
- Communities are implementing data sovereignty licenses to prevent commercial tech companies from scraping and exploiting their cultural heritage.
- Despite AI's promise, experts warn that digital tools must complement, rather than replace, traditional human-to-human language transmission.
Language is the ultimate vessel of human culture, carrying unique worldviews, ecological knowledge, and ancestral histories. Yet, the world is currently experiencing a silent mass extinction of human expression. Of the roughly 6,700 languages spoken globally today, linguists estimate that at least 40 percent are endangered. The statistics are grim: on average, one Indigenous language dies every two weeks as its last fluent elder passes away. For decades, the standard response to this crisis was manual documentation—linguists recording elders on cassette tapes and painstakingly transcribing the audio into dictionaries. But this analog approach is fundamentally mismatched to the speed of the crisis. Now, a new generation of Indigenous engineers and computational linguists is turning to artificial intelligence to reverse the tide, transforming machines from tools of global homogenization into instruments of cultural resilience.[2][3]
The traditional bottleneck in language preservation has always been the sheer volume of labor required to process archival data. When media organizations or universities digitize old recordings of native speakers, the audio often sits in servers for years because only a handful of living people possess the hyper-specific regional fluency required to transcribe it accurately. Manual transcription can take up to ten hours for every one hour of audio. Artificial intelligence, specifically natural language processing (NLP), offers a way to clear this backlog at an unprecedented scale. By training machine learning models to recognize the phonetic patterns and grammatical structures of endangered languages, technologists are building automated speech recognition systems that can process decades of archival tape in a matter of days, unlocking a treasure trove of linguistic data for younger generations.[1][7]
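The scale of that backlog is easier to see as arithmetic. The sketch below uses the ten-to-one ratio quoted above; the archive size, the number of transcribers, the 1,600-hour working year, and the two-hours-per-hour review rate for machine-drafted transcripts are illustrative assumptions, not figures from any of the projects described here.

```python
# Ten hours of labor per hour of audio is the upper bound quoted in the article.
MANUAL_HOURS_PER_AUDIO_HOUR = 10
# Assumed length of one full-time working year (illustrative only).
WORK_HOURS_PER_YEAR = 1_600


def backlog_years(audio_hours, transcribers,
                  labor_hours_per_audio_hour=MANUAL_HOURS_PER_AUDIO_HOUR):
    """Return the years a team needs to work through an audio archive."""
    if transcribers <= 0:
        raise ValueError("transcribers must be positive")
    total_labor_hours = audio_hours * labor_hours_per_audio_hour
    return total_labor_hours / (transcribers * WORK_HOURS_PER_YEAR)


# Hypothetical archive: 5,000 hours of tape and two fluent transcribers.
archive_hours = 5_000
manual_years = backlog_years(archive_hours, transcribers=2)

# Assumed workflow: a speech recognizer drafts the text and a fluent speaker
# spends about two hours correcting each hour of audio.
assisted_years = backlog_years(archive_hours, transcribers=2,
                               labor_hours_per_audio_hour=2)

print(f"manual: {manual_years} years, assisted: {assisted_years} years")
```

Under these assumptions the same two people would need about fifteen and a half years by hand and a little over three years when correcting machine drafts. The real ratio depends on how accurate the recognizer is for the dialect in question.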
The technological mechanism driving this revitalization is a shift away from massive, generalized AI toward Small Language Models (SLMs). Large Language Models (LLMs) like GPT-4 or Gemini rely on billions of parameters and require massive training datasets, often millions of parallel sentence pairs scraped from the internet. This brute-force approach works well for high-resource languages like English or Spanish, but it performs poorly for Indigenous languages that have little to no digital footprint. SLMs, by contrast, are designed to be efficient and targeted. They can be trained or fine-tuned on much smaller, meticulously curated datasets. This allows developers to build functional translation tools, spellcheckers, and word predictors using only a few hundred hours of audio or a few thousand pages of text, making AI accessible to communities that Big Tech's massive models have historically left behind.[3]

The efficiency of Small Language Models also addresses a critical infrastructure problem for remote Indigenous communities. Because SLMs require significantly less computational power than their massive counterparts, they can be deployed locally on affordable hardware or mobile devices without requiring a constant, high-bandwidth internet connection. This is particularly vital for communities in rural areas or low- and middle-income countries where cloud-based AI is impractical. Furthermore, technologists are using a technique called chain-of-thought distillation, in which a larger model's step-by-step reasoning is used to train a smaller, language-specific model. The aim is an AI that is accurate within its specific linguistic context and that also writes out its reasoning steps, so that human linguists can check them and confirm that cultural nuances are being handled correctly.[3]
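The article does not describe how any particular team implements chain-of-thought distillation. In its general form, a larger "teacher" model writes out its reasoning alongside each answer, and the smaller "student" model is fine-tuned to reproduce both. The sketch below shows only the data-preparation step. The `TeacherOutput` type, the `approved` review flag, and the "Reasoning / Answer" target layout are hypothetical choices made for illustration, and the placeholder strings stand in for real prompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeacherOutput:
    """One prompt answered by the larger (teacher) model."""
    prompt: str
    rationale: str   # step-by-step reasoning text produced by the teacher
    answer: str
    approved: bool   # hypothetical flag set by a fluent human reviewer


def build_student_examples(outputs: list[TeacherOutput]) -> list[dict[str, str]]:
    """Turn approved teacher outputs into fine-tuning pairs for the student."""
    examples = []
    for out in outputs:
        if not out.approved:
            continue  # skip anything a reviewer has not signed off on
        # The student learns to emit the reasoning first, then the answer.
        target = f"Reasoning: {out.rationale}\nAnswer: {out.answer}"
        examples.append({"input": out.prompt, "target": target})
    return examples


outputs = [
    TeacherOutput("prompt A", "step 1; step 2", "answer A", approved=True),
    TeacherOutput("prompt B", "step 1", "answer B", approved=False),
]
student_examples = build_student_examples(outputs)
print(student_examples)
```

Filtering on a human approval flag is an assumption added here to reflect the community oversight the article emphasizes. It is one way to keep unreviewed teacher output out of the student's training set.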
One of the most successful blueprints for this approach comes from New Zealand, where the Māori media organization Te Hiku Media has pioneered the use of AI for language revitalization. Decades of colonial repression had severely threatened Te Reo Māori; by 1960, only one in four Māori spoke their native tongue. In 2014, Te Hiku began digitizing a vast archive of recordings from Māori elders born in the late 19th century. Realizing that manual transcription would take lifetimes, Chief Technology Officer Keoni Mahelona and CEO Peter-Lucas Jones decided to build their own bespoke natural language processing tools. They launched an open-source app called Kōrero Māori, crowdsourcing voice donations from the community to build a robust dataset that accurately reflected the language's unique cadence and regional variations.[1][4]
The community response to Kōrero Māori was overwhelming. In just the first ten days of the app's launch, Te Hiku's data science team collected over 300 hours of annotated audio recordings. Using this hyper-local dataset, they trained an automatic speech recognition system that vastly outperformed commercial models built by Silicon Valley giants. Today, Te Hiku uses this AI to transcribe historical broadcasts, power pronunciation apps like Rongo, and develop voice assistants that allow New Zealanders to interact with everyday technology entirely in Te Reo Māori. By proving that high-quality AI can be built from the ground up by the community it serves, Te Hiku has provided a scalable template for other Indigenous groups worldwide.[1]
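The article does not say which metric was used to compare Te Hiku's system with commercial ones. A standard way to score speech recognition is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's output into a human reference transcript, divided by the number of words in the reference. The sketch below is a minimal implementation of that general metric, not Te Hiku's evaluation code, and the sample sentences are deliberately English placeholders rather than invented text in any Indigenous language.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder transcripts: one substituted word and one dropped word.
reference = "the river runs north today"
hypothesis = "the river run north"
print(word_error_rate(reference, hypothesis))
```

A lower score is better, and a score can exceed 1.0 when the system inserts many extra words. Comparing two systems fairly requires scoring both against the same reference transcripts.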
In North America, similar community-led initiatives are racing against a ticking clock. Michael Running Wolf, a Northern Cheyenne engineer and founder of the First Languages AI Reality (FLAIR) initiative, notes that the United States stands to lose the majority of its Native American languages within the next decade if drastic action is not taken. FLAIR, operating in collaboration with the Mila-Quebec Artificial Intelligence Institute, is currently building speech recognition models tailored for over 200 endangered Indigenous languages across the continent. Running Wolf's team is actively training a new cohort of Native American, Alaska Native, and Native Hawaiian computer scientists, ensuring that the people building the technology intimately understand the cultural weight of the data they are processing.[2]

The movement is also gaining momentum in South America. In Colombia, researchers at the Universidad de los Andes are developing machine translation models for low-resource languages like Wayuunaiki, Nasa Yuwe, Inga, and Arhuaco. Working directly with native speakers and expert translators, the team gathers high-quality translations of short phrases to fine-tune high-resource language models. In 2025, their datasets were officially included in the AmericasNLP initiative, marking a historic milestone for Colombian Indigenous representation in computational linguistics. These tools are not designed to replace human learning, but to create digital bridges—allowing a young Wayuu student to instantly translate educational materials or government documents into their ancestral tongue.[5]
However, the intersection of artificial intelligence and Indigenous heritage is fraught with ethical tension. The primary concern is data sovereignty—the right of Indigenous communities to control how their languages and cultural knowledge are collected, stored, and monetized. Historically, Western academic and corporate institutions have treated Indigenous data as a public resource to be extracted. Today, many communities fear a new wave of digital colonialism, where Big Tech companies scrape Indigenous language data to train commercial AI models without providing compensation or attribution. When a massive tech conglomerate absorbs an endangered language into a global LLM, the community loses control over how their ancestors' voices are used, potentially exposing sacred or restricted knowledge to the public domain.[6]
To combat this, Indigenous technologists are pioneering new legal and ethical frameworks for AI development. Te Hiku Media, for example, licenses its technology under the Kaitiakitanga license. Rooted in the Māori concept of guardianship, this license explicitly prohibits the use of the community's data by commercial tech giants or for any applications that violate human rights. It ensures that the data remains the sovereign property of the Māori people. By rejecting the Silicon Valley ethos that all data wants to be free, these communities are proving that AI development can be rigorous, innovative, and deeply respectful of cultural boundaries. They are building walled gardens where their languages can flourish safely.[4]

The necessity of community oversight was starkly illustrated in December 2024, when a series of AI-generated language-learning books appeared on major online retail platforms. The books claimed to teach endangered languages like Abenaki, Mohawk, and Omok—a Siberian language that has been extinct since the 18th century. However, Indigenous linguists quickly discovered that the AI had hallucinated the content, filling the books with fabricated words, incorrect grammar, and nonsensical translations. For communities fighting desperately to preserve the authenticity of their heritage, these AI-generated fabrications were not just inaccurate; they were actively harmful, threatening to pollute the limited pool of learning resources with algorithmic noise and undermining the trust of new learners.[4]
This incident underscores a fundamental truth recognized by Indigenous engineers: artificial intelligence is a tool for preservation, not a replacement for human connection. Language is a living, breathing ecosystem that requires human relationships, cultural context, and intergenerational transmission to survive. An AI chatbot can simulate a conversation in Cherokee or Navajo, but it cannot impart the spiritual significance of the words or the ancestral stories attached to them. Technologists emphasize that digital tools must be designed to complement, rather than supplant, traditional practices like immersive classroom learning and time spent with community elders. The goal is not to create a perfect digital archive of a dead language, but to lower the barrier to entry for the living.[2][6]

Looking ahead, the synthesis of AI and Indigenous knowledge promises a profound shift in how marginalized languages interact with the modern world. Researchers are developing real-time translation tools that could soon allow a native speaker of an endangered language to navigate the internet, access healthcare, or interact with government services without having to default to English or Spanish. Educational chatbots are being designed to act as tireless, interactive tutors, providing young learners with a safe space to practice their ancestral language without the fear of making mistakes in front of fluent elders. These innovations are transforming the digital landscape from a threat to linguistic diversity into its most powerful ally.[3][6]
Ultimately, the application of artificial intelligence to endangered languages represents a paradigm shift from preservation as static archival storage to preservation as active, participatory engagement. By taking ownership of the technology, Indigenous communities are ensuring that their languages do not merely survive in museum databases, but thrive in the digital media ecosystems of the 21st century. The work of these engineers and linguists proves that the future of artificial intelligence does not have to be an English-only monolith. When guided by the communities themselves, AI can help resurrect the voices of the past and secure them for the generations to come.[6][7]
How we got here
1960
Following decades of colonial repression, only one in four Māori people in New Zealand spoke their native language.
2014
Te Hiku Media begins digitizing archival recordings of Māori elders, eventually realizing that manual transcription would take too long.
2022
The United Nations declares 2022-2032 the International Decade of Indigenous Languages to draw global attention to the extinction crisis.
Dec 2024
AI-generated books for endangered languages are discovered to contain fabricated translations, sparking widespread ethical concerns.
2025
Colombian Indigenous languages are included in the AmericasNLP initiative for the first time, marking a milestone for South American language tech.
Viewpoints in depth
Indigenous Technologists
Prioritizing data sovereignty and community control over AI development.
For Indigenous engineers, the primary concern is preventing a new wave of digital colonialism. They argue that if Big Tech companies are allowed to scrape endangered languages to train commercial Large Language Models, communities lose control over their ancestral knowledge. By utilizing frameworks like the Kaitiakitanga license, these technologists ensure that AI tools are built by and for the community, keeping cultural data out of the public domain and protecting it from commercial exploitation.
Computational Linguists
Leveraging Small Language Models to solve the low-resource data problem.
Researchers in this camp focus on the technical hurdles of building AI for languages with virtually no digital footprint. They advocate for the use of Small Language Models (SLMs) and chain-of-thought distillation, which require significantly less training data and computational power than massive models like GPT-4. By combining audio recordings with translated text, these linguists are proving that highly accurate, localized AI tools can be deployed on affordable hardware in remote areas.
Cultural Ethicists
Warning against the dilution of culture and the dangers of AI hallucinations.
Ethicists and cultural preservationists emphasize that AI is a double-edged sword. They point to incidents where AI generated fabricated translations for extinct languages, polluting the limited pool of educational resources. This camp argues that while AI can process archival data and simulate conversation, it cannot transmit the spiritual and cultural context of a language. They insist that digital tools must remain supplementary to traditional, human-to-human intergenerational learning.
What we don't know
- It remains unclear how international intellectual property laws will adapt to enforce Indigenous data sovereignty licenses globally.
- The long-term impact of AI-assisted language learning on the organic evolution and natural cadence of endangered languages is still unknown.
Key terms
- Small Language Model (SLM): An artificial intelligence model designed to be highly efficient and targeted, requiring significantly less training data and computational power than massive models like GPT-4.
- Data Sovereignty: The right of a community or nation to govern the collection, ownership, and application of its own data, protecting it from external exploitation.
- Natural Language Processing (NLP): A branch of artificial intelligence that helps computers understand, interpret, and generate human language.
- Kaitiakitanga License: A legal framework rooted in the Māori concept of guardianship, used to ensure that digital data is used ethically and remains under community control.
- Low-Resource Language: A language that lacks large amounts of digital text or audio data, making it difficult to train traditional machine learning models.
Frequently asked
How can AI learn a language without a large written internet presence?
Developers use Small Language Models (SLMs) and multimodal datasets. By combining a few hundred hours of spoken audio recordings from elders with translated text, they can train highly accurate, targeted models without needing millions of written documents.
What is data sovereignty in the context of AI?
Data sovereignty is the principle that Indigenous communities have the right to control their cultural and linguistic data. It prevents commercial tech companies from scraping their languages to train global AI models without permission or compensation.
Can artificial intelligence completely save an endangered language?
No. While AI is a powerful tool for transcribing archives and creating educational apps, linguists and elders agree that a language only truly survives through active human relationships, cultural context, and intergenerational transmission.
What went wrong with the AI-generated language books in 2024?
In December 2024, AI-generated books claiming to teach endangered languages like Abenaki and Omok were found to contain fabricated words and incorrect grammar. This highlighted the risk of AI 'hallucinations' polluting the limited resources available to new learners.
Sources
[1] World Economic Forum (Indigenous Technologists): This Māori leader trained AI to speak his language and preserve its wisdom
[2] NBCU Academy (Indigenous Technologists): How Indigenous Engineers Are Using AI to Preserve Their Culture
[3] Brookings Institution (Computational Linguists): Can small language models revitalize Indigenous languages?
[4] Viterbi Conversations in Ethics (Cultural Ethicists): Preserving the Past: AI in Indigenous Language Preservation
[5] Universidad de los Andes (Computational Linguists): Machine Translation for Indigenous Language Preservation
[6] Journal of Humanities Research Sustainability (Cultural Ethicists): Digital Resurrection: AI's Role In Revitalizing Endangered Languages
[7] Factlen Editorial Team (Indigenous Technologists): Synthesis by Factlen editorial team