Is there a Multicultural Advantage in AI?

This post ventures into the land of pure hypothesis. Caveat emptor.

When GPT-5 launched, everyone on the internet (with the notable exception of one Mr. Altman) was disappointed by how uninspiring it was. The general consensus is that AI progress under the current structure of LLMs is plateauing. GPT 5.1/Opus 4.5/Gemini Pro might be beating benchmarks, but truly novel problems don’t seem to be getting solved. The usual explanation is that we’ve already scraped all the quality text on the internet: every book ever written, every Wikipedia article, every Stack Overflow answer, and every Reddit thread (which is why the big content aggregators now lock down their content unless you pay the piper) has already been digested by AI.

According to Richard Sutton’s Bitter Lesson, computation and data beat clever algorithms over the long run. But this cuts both ways: if data is the thing that makes AI better, then running out of new data means you’ve painted yourself into a corner.

Is there an Anglo-Centric Bottleneck?

Nobody really understands how LLMs became such amazing worldbuilders just by reading more and more text. What is clear, however, is that the world they built is primarily Anglo-centric, since that’s where most of the data comes from. The concepts and connections an LLM forms are therefore heavily skewed toward the English-language world.

However, the “exhausted internet” only applies to English. Maybe throw in some European languages if we’re being generous.

But there’s a ridiculous amount of untapped data across languages spoken in India, China, Southeast Asia, Africa, and Latin America. Conversations happening across WhatsApp groups in Hindi, technical discussions in Mandarin forums, legal reasoning in Bengali, agricultural knowledge in Tamil - none of this has been properly captured.

The crucial difference is that this isn’t just “more of the same data.” Different languages encode fundamentally different ways of thinking about the world.

Hello Sapir-Whorf Hypothesis, meet Transformers

It’s tempting to think of languages as just different vocabularies mapped onto the same concepts. But it may not be that simple: there are good arguments that different languages embody fundamentally different ways of thinking about the world.

Take Turkish. Every time you make a statement, you have to specify whether you saw it directly, heard it from someone, or inferred it. There’s no way around it - the grammar forces you to mark how you got this information. Compare that to English where “it rained yesterday” could mean anything from “I was soaked” to “someone told me” to “the ground is wet.”

Or spatial navigation: some languages don’t have “left” and “right” at all. Instead, speakers of languages like Guugu Yimithirr use absolute directions (north/south/east/west) for everything. “Move the cup a bit north” instead of “move the cup to the right.” This fundamentally changes how speakers think about and navigate space.

Why does this matter for AI? Training on diverse languages doesn’t just add more facts to the pile. It creates a higher-dimensional latent space that allows for better reasoning. An AI that understands both “linear time” (English: “looking forward to the future”) and “vertical time” (Mandarin: “shàng ge yuè” = “up month” = last month) isn’t just bilingual - it has two distinct representations of the same concept. Add Hindi, where “kal” means both yesterday and tomorrow - the same word for opposite temporal directions, disambiguated only by context - and you get a third.

NOTE: the above examples are from Wikipedia/ChatGPT, I DO NOT speak these languages myself!

Each language gives the model a different geometric representation of the same abstract concept, which means it can reason about “time” from multiple angles simultaneously. If you only train a model on English descriptions of “time,” you get some of the axes in the latent space. Add Mandarin and you get many more axes that are genuinely orthogonal. The model can now navigate a richer conceptual space.
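To make the geometric claim a bit more concrete, here’s a toy numpy sketch. Everything in it is made up for illustration - synthetic vectors, not real model embeddings - but it shows the kind of measurement the hypothesis suggests: the “effective dimensionality” (participation ratio of the variance spectrum) of an embedding cloud grows when genuinely orthogonal conceptual axes are added.

```python
# Toy sketch: effective dimensionality of an embedding cloud, measured
# by the participation ratio of its variance spectrum. Synthetic data
# only - these are NOT real LLM embeddings.
import numpy as np

def effective_dim(embeddings: np.ndarray) -> float:
    """Participation ratio: (sum of variances)^2 / (sum of squared variances)."""
    centered = embeddings - embeddings.mean(axis=0)
    var = np.linalg.svd(centered, compute_uv=False) ** 2
    return var.sum() ** 2 / (var ** 2).sum()

rng = np.random.default_rng(0)
n, dim = 2_000, 512

# "English-only" cloud: variance concentrated along 8 conceptual axes.
basis_en = rng.standard_normal((dim, 8))
english = rng.standard_normal((n, 8)) @ basis_en.T

# "Multilingual" cloud: the same 8 axes plus 24 extra, roughly
# orthogonal axes contributed by other languages.
basis_all = np.hstack([basis_en, rng.standard_normal((dim, 24))])
multilingual = rng.standard_normal((n, 32)) @ basis_all.T

print(f"effective dim, English-only : {effective_dim(english):5.1f}")
print(f"effective dim, multilingual : {effective_dim(multilingual):5.1f}")
```

Whether real multilingual training actually behaves this way is exactly the open question - the sketch only shows what “more orthogonal axes” would look like if it does.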

Different Cultures are Different Worlds

This goes deeper than linguistic structure. Different cultures have different ways of thinking about the universe. I will give a few examples from my part of the world.

The undisputed GOAT of mathematical intuition, Srinivasa Ramanujan, famously arrived at theorems through pure thought - he’d see the answer first, then work backwards to find the proof (and often didn’t bother to). Western mathematics goes in the opposite direction: axioms → theorems → proofs. He had very little formal training, but obviously had a fundamentally different way of navigating the mathematical manifold. Could an AI trained on both Indian and Western traditions take advantage of this and pull mathematical proofs out of thin air a la Ramanujan?

There are WhatsApp university posts claiming Sanskrit is the best language to program computers, because of Panini’s Ashtadhyayi - a 2,500-year-old Sanskrit grammar text that’s essentially a manual for a formal programming language. It uses recursion, metarules, and a sophisticated notation system to describe linguistic rules. The claim is obvious hyperbole - Python is hard enough already, forget memorizing the twelve inflections of “Rama” just to write a for-loop. But an AI could connect the underlying structures: a model trained natively on Sanskrit might think about code in genuinely different computational paradigms.
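For a flavor of what “grammar as a formal system” means here, below is a toy Python sketch of ordered rewrite rules in the spirit of sandhi (word-joining) rules. To be clear, the rules and the apply_sandhi helper are my drastically simplified illustrations of the idea, not actual Ashtadhyayi sutras.

```python
# Toy illustration of Panini-style ordered rewrite rules.
# The rules below are heavily simplified stand-ins for sandhi rules,
# NOT actual Ashtadhyayi sutras.
import re

# Each rule: (pattern, replacement). Rules fire in a fixed order of
# precedence, and the first rule that applies blocks the rest -
# loosely analogous to how Panini's sutras override one another.
RULES = [
    (re.compile(r"a\+i"), "e"),   # a + i -> e  (guna sandhi, simplified)
    (re.compile(r"a\+u"), "o"),   # a + u -> o  (guna sandhi, simplified)
    (re.compile(r"a\+a"), "ā"),   # a + a -> ā  (vowel lengthening, simplified)
]

def apply_sandhi(left: str, right: str) -> str:
    """Join two words, letting the first matching rule rewrite the boundary."""
    form = f"{left}+{right}"
    for pattern, replacement in RULES:
        form, n = pattern.subn(replacement, form)
        if n:  # a rule fired; stop, like rule-blocking in the grammar
            break
    return form.replace("+", "")  # default: plain concatenation

print(apply_sandhi("rama", "iti"))  # -> rameti
print(apply_sandhi("na", "asti"))   # -> nāsti
```

The point isn’t the toy itself - it’s that a 2,500-year-old grammar is naturally expressible as an ordered rule system any programmer would recognize.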

Legal reasoning is another example. In Western legal traditions, “law” is a set of codified rules. In Indian legal philosophy, “Dharma” is context-dependent - the same action can be righteous or unrighteous depending on the situation, your role, and the consequences. An AI that understands both approaches has a richer model of how humans think about right and wrong.

Right now, are AI systems overfitting to a particularly Anglo-centric view of the world?

There are good reasons why nobody has done this yet

Of course, training LLMs natively like this is not easy. The first problem is just getting the data. The valuable stuff isn’t sitting in public GitHub repos - it’s in WhatsApp groups, family forwards, regional forums, PDF scans of old legal documents, and oral histories that aren’t even digitized.

But even if you could scrape every WhatsApp forward, you’d hit a second problem: current LLM tokenizers are optimized for English and Latin scripts. A single Hindi word might get split into many tokens, making the model far less efficient. Languages with complex scripts like Tamil or Bengali need fundamentally different handling - better-balanced tokenizers, language-specific adapter layers, maybe even different attention patterns, the whole works. Companies are still figuring out how to do this.
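You can see the tokenizer problem directly with OpenAI’s open-source tiktoken library. A quick sketch (the non-English sentences are rough translations used purely for illustration, and exact token counts depend on the tokenizer version - the cross-script pattern is the point):

```python
# Rough check of tokenizer "fertility" (tokens per word) across scripts,
# using OpenAI's open-source tiktoken library. Exact counts vary by
# tokenizer version; the cross-script pattern is what matters.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era BPE tokenizer

samples = {
    "English": "The farmer watered the field yesterday.",
    "Hindi":   "किसान ने कल खेत में पानी दिया।",
    "Tamil":   "விவசாயி நேற்று வயலில் தண்ணீர் பாய்ச்சினார்.",
}

for lang, text in samples.items():
    tokens = enc.encode(text)
    words = len(text.split())
    print(f"{lang:8}: {words} words -> {len(tokens)} tokens "
          f"({len(tokens) / words:.1f} tokens/word)")
```

The same short sentence typically costs several times more tokens in Devanagari or Tamil script, which translates directly into less effective context and higher inference cost for those languages.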

The third problem is that nobody actually speaks in neat, separated languages. Multilingual speakers code-switch constantly - “Yaar, I was at the market today and the bhai was like…” That fluidity is messy and hard to label, and it’s exactly the kind of data that would be most valuable.
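Here’s a toy sketch of why that’s hard to label. Naive character-level language ID (the script_of helper below is my own strawman, not any real pipeline) sees romanized Hindi as indistinguishable from English:

```python
# Toy sketch: naive script detection on code-switched text. Romanized
# Hindi words come back "LATIN", indistinguishable from English - so
# word-level language labels need context, not just characters.
import unicodedata

def script_of(word: str) -> str:
    """Crudely label a word by the Unicode script of its first letter."""
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            return "DEVANAGARI" if "DEVANAGARI" in name else "LATIN"
    return "OTHER"

sentence = "Yaar, I was at the market today and the bhai was like arre wah"
for word in sentence.split():
    print(f"{word:7} -> {script_of(word)}")

# "Yaar", "bhai", "arre", and "wah" are Hindi, but every label comes
# back LATIN: the script alone can't see the code-switching.
```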

And even if you solve all of that, how do you prove your model is better? Current benchmarks are Anglo-centric as well! A model that’s genuinely better at reasoning about Dharma or understanding Hindi humor won’t show up on MMLU scores. You’d need entirely new evaluation frameworks, and good luck getting those accepted by the global research community.

An opportunity for the “Global South”?

However, if real value can actually be demonstrated this way, does that mean countries with more heterogeneous languages (and cultures) have an inherent advantage?

I honestly don’t know, but it is undeniable that current LLMs have captured only the internet of the last 30 years. Across the world there are civilizations and worldviews two orders of magnitude older, and far more diverse, than the English-speaking web - but they are trapped in PDFs, oral histories, and dialects.

Maybe China already understands this. They’re building models trained heavily on Chinese data, for Chinese use cases, with Chinese cultural context. The US had a head start because the internet was English-first. But that advantage could be temporary.

India could be next. The diversity within the country is greater than that of almost any other nation (bar some African nations, which face even more structural constraints). We’re talking about 22 official languages, hundreds of dialects, and multiple writing systems - all encoding different cultural knowledge.

So can India or other diverse nations actually capitalize on this? I have no idea. However, it is clear that the current approach of training bigger models on the same English-heavy internet has diminishing returns. If the next breakthrough in AI comes from genuinely different ways of thinking about the world, then the advantage goes to whoever can unlock that diversity first.

The West has the infrastructure advantage, but infrastructure is already a commodity. What isn’t commoditizing (because it is hard!) is thousands of languages, ten thousand years of philosophical traditions, and fundamentally different ways of encoding knowledge. That might matter more than anyone realizes.