Factlen Explainer · Endangered Languages · Jun 16, 2026, 11:06 AM · 6 min read

How Artificial Intelligence is Racing to Save the World's Endangered Languages

With roughly 40 percent of the globe's 7,000 languages facing extinction, technologists and indigenous communities are using advanced AI models to document, translate, and revitalize fading linguistic heritage.

By Factlen Editorial Team


Indigenous Technologists 40% · Big Tech AI Researchers 35% · Cultural Anthropologists 25%

Indigenous Technologists: Argue that language revitalization must be community-led, prioritizing data sovereignty and cultural accuracy over sheer technological scale.
Big Tech AI Researchers: Focus on building massive, universal models that can overcome the 'low-resource' data barrier through advanced machine learning techniques.
Cultural Anthropologists: Warn against the risks of AI creating synthetic, dominant-culture-biased versions of languages, emphasizing that language is a worldview.

What's not represented

· Elder native speakers who do not use digital technology
· Government education policymakers

Why this matters

Language is the primary vessel for human culture, history, and worldview. By democratizing the tools to preserve fading dialects, AI is helping marginalized communities reclaim their heritage and ensuring that centuries of unique human knowledge are not permanently erased by globalization.

Key points

Nearly 40% of the world's 7,000 languages are at risk of extinction by the end of the century.
Generative AI and self-supervised learning allow models to process languages with very little digital data.
Meta and Google are building massive universal models capable of translating hundreds of underserved languages.
Indigenous technologists are leading community-driven projects to ensure cultural accuracy and data sovereignty.
Experts warn that AI must be carefully guided to avoid imposing dominant-culture biases onto indigenous translations.

7,000: Approximate global languages
40%: Languages at risk of extinction
Every two weeks: One language lost
1,100+: Languages supported by Meta's MMS

The human race is currently experiencing a silent, unprecedented crisis of cultural erasure. Of the approximately 7,000 languages spoken across the globe today, linguists estimate that nearly 40 percent are at risk of extinction by the end of the century. The United Nations paints an even starker picture, noting that an indigenous language disappears roughly every two weeks. When the last fluent speaker of a language passes away, humanity loses more than just a unique vocabulary; it loses entire frameworks of philosophy, ecological knowledge, humor, and historical memory. For decades, the sheer scale of this linguistic erosion outpaced the ability of researchers to document it.[4][5][6]

Historically, the digital revolution accelerated this decline rather than halting it. The internet and early natural language processing (NLP) systems were overwhelmingly built for dominant languages like English, Mandarin, and Spanish. Computer scientists classified languages spoken by smaller populations as "low-resource languages," meaning they lacked the massive datasets of digitized text, translated books, and transcribed audio required to train traditional machine learning algorithms. If a language could not be processed by search engines, translation apps, or digital keyboards, its speakers, especially the young, were economically and socially incentivized to abandon it in favor of a dominant tongue.[1][5][7]

The rapid pace of global language extinction has prompted urgent digital intervention.

However, a profound shift in artificial intelligence architecture is transforming technology from a threat to linguistic diversity into one of its most powerful guardians. The advent of generative AI and self-supervised learning has fundamentally altered how machines understand human speech. Instead of requiring millions of perfectly translated sentence pairs to learn a language, modern AI models can learn the underlying "shape" and acoustic patterns of a language directly from raw, untranslated audio. This breakthrough is allowing technologists to build robust digital infrastructure for languages that have very little written history.[1][2][7]
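One common form of self-supervision is masked prediction: hide parts of an unlabeled sequence and train a model to fill them back in. The sketch below shows only how such training examples can be built from raw data with no human annotation. It is an illustration under stated assumptions, not the pipeline of any model named in this article. The function name, the mask token, and the toy "acoustic units" are all hypothetical.

```python
import random

MASK = "<mask>"


def make_masked_examples(frames, mask_rate=0.3, seed=0):
    """Turn one unlabeled sequence into (masked_input, targets).

    targets maps each hidden position to the original frame, so the
    "labels" come from the data itself rather than from a human annotator.
    """
    rng = random.Random(seed)
    masked = list(frames)
    targets = {}
    for position, frame in enumerate(frames):
        if rng.random() < mask_rate:
            masked[position] = MASK
            targets[position] = frame
    return masked, targets


# Hypothetical stand-in for acoustic units extracted from raw audio.
recording = ["ka", "ti", "mo", "ka", "ra", "ti", "mo", "na"]
masked, targets = make_masked_examples(recording, mask_rate=0.3, seed=0)

# Restoring the hidden frames recovers the original recording exactly.
restored = [targets.get(i, frame) for i, frame in enumerate(masked)]
print(restored == recording)  # True
```

A real system would train a neural network to predict the entries in targets from the masked input. The point here is only that untranscribed recordings are enough to generate the training signal.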

The scale of this technical ambition is largely being driven by massive investments from major technology companies. Meta's Fundamental AI Research (FAIR) lab launched the "No Language Left Behind" (NLLB) project, an open-source machine translation engine capable of delivering high-quality translations directly between 200 languages, bypassing the traditional need to route translations through English first. To achieve this, Meta developed a "Sparsely Gated Mixture of Experts" model, a routing system that ensures languages with minimal data still receive robust computational support without the AI simply memorizing and overfitting the tiny datasets.[1][7]
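The routing idea behind a sparsely gated mixture of experts can be sketched in a few lines. A small gate scores every expert for each input, only the top few experts run, and their outputs are blended by the gate's weights. This is a toy illustration, not Meta's implementation. The expert functions, gate weights, and the choice of two active experts are assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(gate_logits, k=2):
    """Keep the k highest-scoring experts and renormalize their weights.

    Returns {expert_index: weight}; every other expert gets no work,
    which is what makes the gate "sparse".
    """
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return dict(zip(chosen, weights))


def moe_layer(token, gate_weights, experts, k=2):
    """Route one token vector through its top-k experts and mix the outputs."""
    # One gate logit per expert: dot product of the token with that expert's gate row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    routing = top_k_gate(logits, k)
    output = [0.0] * len(token)
    for expert_index, weight in routing.items():
        expert_out = experts[expert_index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, routing


# Hypothetical toy setup: 4 experts, each just scales the input differently.
experts = [lambda v, s=s: [s * x for x in v] for s in (0.5, 1.0, 2.0, 3.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

output, routing = moe_layer([0.2, 0.9], gate_weights, experts, k=2)
print(sorted(routing))  # [1, 2]: only two of the four experts were activated
```

Because each input touches only a fraction of the experts, total model capacity can grow without a matching rise in compute per input. In a trained model the gate weights are learned rather than fixed by hand as they are here.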

Meta subsequently expanded this effort into the Massively Multilingual Speech (MMS) project, which scales audio transcription to over 1,100 languages. Crucially, the system uses "zero-shot" speech recognition, which lets the AI transcribe audio in languages it has never explicitly been trained on by inferring phonetic rules from its vast multilingual baseline. In parallel, Google launched its "1,000 Languages Initiative," which aims to build a single, universal AI model supporting the world's 1,000 most spoken languages. Google's Universal Speech Model (USM) was trained on 12 million hours of speech and 28 billion sentences of text, evidence that a single, massive neural network can cross-pollinate linguistic rules to better understand under-represented dialects.[1][2][7]

Advancements in self-supervised learning have drastically increased the number of languages AI can process.


While Silicon Valley provides the computational horsepower, the most effective preservation efforts are being led by the communities themselves. Indigenous technologists are taking open-source AI frameworks and adapting them to their specific cultural needs. The First Languages A.I. Reality (FLAIR) Initiative, co-founded by Northern Cheyenne technologist Michael Running Wolf, focuses on developing adaptable AI tools for global Indigenous language revitalization. The project's core philosophy is that AI should not just archive a language for academic study, but actively increase the number of daily speakers through accessible, interactive technologies.[5][7]

In New Zealand, Te Hiku Media has pioneered a similar community-first approach. Built by Māori technologists, the organization developed an AI model specifically aimed at preserving and revitalizing te reo Māori. By keeping the development in-house, the community ensures that the nuances of their language are respected and that the resulting data remains under Māori sovereignty, rather than being extracted and monetized by external corporations. This model of digital self-determination is becoming a blueprint for other indigenous groups worldwide.[4][6]

The power of AI to work with microscopic datasets was recently demonstrated at Dartmouth College, where researchers built an AI-driven framework called NüshuRescue. Nüshu is a 400-year-old script created by Yao women in China's Hunan province to communicate in secret. With many texts lost to history, the research team, led by graduate student Ivory Yang, used a mere 35 pairs of matching sentences to train a large language model. The AI successfully learned to expand the database of the rare script, suggesting that generative AI can rapidly produce valuable linguistic resources from very little data.[3][6]

Researchers at Dartmouth used just 35 sentence pairs to train an AI to translate Nüshu, a 400-year-old script created by women in China.

Despite these uplifting breakthroughs, the intersection of AI and cultural heritage is fraught with complex ethical challenges. Cultural anthropologists and linguists warn against the risk of creating "synthetic" languages. Because foundational AI models are still predominantly trained on Western, English-centric data, they can inadvertently impose dominant-culture biases onto indigenous translations. If an AI translates an indigenous idiom using Western philosophical framing, the output may be grammatically correct but culturally hollow, resulting in a "stilted" version of the language that native elders do not recognize.[4][5][7]

Furthermore, the rush to digitize endangered languages has sparked intense debates over data privacy and cultural appropriation. Many indigenous communities possess sacred oral traditions, stories, and songs that are not meant for public consumption. When tech companies scrape the internet for audio data to train their universal models, they risk ingesting and commodifying this protected knowledge without informed consent. Consequently, advocates are pushing for robust "data sovereignty" frameworks, ensuring that communities retain legal and technical ownership over their linguistic data.[5][7]


To navigate these risks, researchers emphasize that AI must be viewed as a supportive tool rather than a standalone savior. Technology can transcribe thousands of hours of archival tape, generate interactive learning apps for children, and provide real-time translation bridges. However, it cannot replicate the intimate, intergenerational transfer of knowledge that occurs when a parent speaks to a child. The ultimate goal of these AI initiatives is not to replace human speakers with chatbots, but to lower the barrier to entry so that fading languages can once again become living, breathing mediums of daily life.[4][5][6]

As the International Decade of Indigenous Languages progresses, the collaboration between community leaders, linguists, and computer scientists is proving that globalization does not have to be a one-way street toward cultural homogenization. By harnessing the very technologies that once threatened to erase them, marginalized communities are ensuring that their ancestral voices will continue to resonate, adapt, and thrive in the digital age.[1][4][7]

How we got here

2022
Meta launches the No Language Left Behind project and Google announces its 1,000 Languages Initiative.
2023
Google details its Universal Speech Model, trained on 12 million hours of speech across 300+ languages.
2024
Meta adds zero-shot speech recognition to its Massively Multilingual Speech project, scaling to 1,100 languages.
2025
Research projects like Dartmouth's NüshuRescue demonstrate that generative AI can help revitalize scripts using minimal data.

Viewpoints in depth

Indigenous Technologists

Advocating for digital self-determination and community control over linguistic data.

For indigenous developers and community leaders, the preservation of language is inseparable from the protection of cultural sovereignty. Organizations like Te Hiku Media and the FLAIR Initiative argue that while Big Tech's tools are powerful, the data used to train them must remain under the control of the communities themselves. They emphasize that language revitalization is not just an academic exercise in archiving vocabulary, but a deeply human effort to increase the number of active speakers. By building and controlling their own AI models, these groups ensure that sacred texts are not commodified and that the technology serves the specific educational needs of their youth.

Big Tech AI Researchers

Focusing on the computational scale required to break the 'low-resource' data barrier.

Computer scientists at major tech firms view the language extinction crisis as a massive data routing problem that can be solved through advanced neural architectures. By developing techniques like self-supervised learning and Sparse Mixture-of-Experts, researchers at Meta and Google are proving that AI no longer needs millions of translated sentences to understand a language. Their perspective is rooted in scale: by building a single, universal model that understands the phonetic baseline of human speech, they can provide digital infrastructure for hundreds of marginalized languages simultaneously, integrating them into global platforms like search engines and social media.

Cultural Anthropologists

Warning against the loss of nuance and the risk of creating synthetic, culturally hollow languages.

Linguists and anthropologists offer a cautious counterweight to the technological optimism surrounding AI preservation. They point out that language is not merely a system of information exchange, but a reflection of a unique worldview, complete with untranslatable idioms, humor, and ecological philosophies. Because foundational AI models are largely trained on Western data, anthropologists warn that AI can inadvertently impose English-centric logic onto indigenous translations. This risks creating a 'stilted' or synthetic version of the language that is grammatically correct but stripped of its historical and cultural soul, fundamentally altering the heritage it is meant to protect.

What we don't know

Whether AI-generated language tools will actually result in a sustained increase in fluent, daily speakers among younger generations.
How international intellectual property laws will adapt to protect indigenous data sovereignty from being scraped by commercial AI models.
The long-term cultural impact of having a machine, rather than a human elder, serve as the primary reference point for a revitalized language.

Key terms

Low-resource language: A language that lacks large amounts of digitized text or audio data, making it difficult to train traditional machine learning models.
Zero-shot learning: A machine learning capability where an AI model can accurately process or transcribe a language it has never explicitly been trained on by applying generalized rules.
Natural Language Processing (NLP): A branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language.
Self-supervised learning: A training method where an AI learns patterns from raw, unlabeled data (like pure audio) rather than relying on human-annotated examples.

Frequently asked

What makes a language endangered?

A language becomes endangered when it is no longer passed down to younger generations, often due to globalization, migration, or the dominance of major languages in education and the economy.

How does AI learn a language with very little data?

Modern AI uses self-supervised learning and zero-shot recognition to analyze the acoustic patterns of raw audio, allowing it to infer linguistic rules without needing millions of translated text documents.

What is data sovereignty?

Data sovereignty is the principle that indigenous communities should retain legal and technical ownership over their cultural and linguistic data, preventing tech companies from extracting it without consent.

Can AI make someone fluent in a dying language?

While AI can provide interactive learning tools, translations, and vast digital archives, experts agree that true fluency and cultural transmission still require human-to-human interaction.

Sources

[1] Meta FAIR, "No Language Left Behind: Scaling Human-Centered Machine Translation" (Big Tech AI Researchers)
[2] Google Blog, "Supporting 1,000 languages with AI" (Big Tech AI Researchers)
[3] Dartmouth College, "Computer scientists and linguists build AI tech to strengthen endangered languages" (Indigenous Technologists)
[4] EurekAlert, "AI could be the future for preserving marginalized cultures, say experts" (Cultural Anthropologists)
[5] Bowdoin College, "Artificial Intelligence and the Future of Indigenous Language Revitalization" (Indigenous Technologists)
[6] BBC, "What in the World: Can AI save endangered languages?" (Cultural Anthropologists)
[7] Factlen Editorial Team, "Synthesis by Factlen editorial team" (Cultural Anthropologists)
