
Sanskritam: The world's first tokenizer


Tokenizers are the translators of natural language for LLMs. They’re the first essential step, without which LLMs would be blind to text. Interestingly, Sanskritam mirrors the logic of tokenizers in a surprisingly modern way. Let’s explore how tokenizers work, how LLMs use them, and why Sanskrit’s compositional system is so well suited to this machinery.

Part 1: Tokenizers in LLMs

What are Tokenizers?

When we input a sentence like “The sky is blue” into ChatGPT or any other LLM, the model doesn’t see words or letters. A computer only understands numbers and the relations between them. Translating text into numbers is the tokenizer’s job: it breaks a sentence down into smaller pieces called tokens, and each token is mapped to a number, a unique ID. These tokens can be full words (like “sky”) or just parts of a word (like “run” and “ning” in “running”).

The model then learns relations over these IDs to understand how each token is used.
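For instance, here is a minimal sketch using the open-source tiktoken library (any tokenizer library would do; the exact IDs depend entirely on which vocabulary is loaded):

```python
# Minimal sketch with the tiktoken library (pip install tiktoken).
# The specific IDs depend on the vocabulary; what matters is that text goes
# in, a list of integers comes out, and the mapping is reversible.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era vocabulary

ids = enc.encode("The sky is blue")
print(ids)              # a short list of integer token IDs
print(enc.decode(ids))  # "The sky is blue"
```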

Why Not Just Use Whole Words?

Because language is messy.

Let’s say the model only learned the word “happy” during training. What happens when it sees:

happiness or unhappy or happiest?

It has no idea what those mean if it treats each word as a totally new token.

Instead, tokenizers break words into meaningful parts it already knows:

unhappiness → [“un”, “happi”, “ness”]

Even if it’s never seen “unhappiness” before, it’s seen “un-”, “happy”, and “-ness” enough to understand what it probably means.
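A toy version of this idea, using a greedy longest-match split against a tiny hand-made vocabulary (real tokenizers like BPE or WordPiece learn their vocabularies from data, so this is only an illustration):

```python
# Toy subword splitter: greedy longest-match against a tiny vocabulary.
# Real tokenizers (BPE, WordPiece, etc.) learn their vocabularies from data;
# this hand-made vocab is only for illustration.
VOCAB = {"un", "happi", "happy", "ness", "ing", "run"}

def split_subwords(word: str) -> list[str]:
    pieces, i = [], 0
    while i < len(word):
        # take the longest vocabulary entry that matches at position i
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character: fall back to char-level
            i += 1
    return pieces

print(split_subwords("unhappiness"))  # ['un', 'happi', 'ness']
```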

Types of Tokenizers

  1. Character-level – Every character has its own token, e.g. ‘H’, ’e’, ’l’, ’l’, ‘o’. It offers precision, but it is incapable of modelling larger relations between words and sentences.
  2. Word-level – Each word is a token, e.g. ‘sky’, ‘blue’, ‘butterfly’. Simple, yet it struggles with new words it hasn’t seen before.
  3. Subword-level – Breaks common patterns into reusable chunks, e.g. ‘unhappiness’ → [‘un’, ‘happi’, ’ness’]. It’s the most common approach today, employed for its versatility.
  4. Byte-level – Each byte of data is a token. Treats even punctuation and formatting as meaningful data. Used by GPT-2 and GPT-3.

Subword and byte-level tokenization are the accepted sweet spot: they mimic how humans process language in chunks.
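The difference between the levels is easy to see on a single string (a rough sketch; production tokenizers add many details on top):

```python
# The same text at different tokenization granularities (illustrative only).
text = "Hello"

char_tokens = list(text)                  # character-level
word_tokens = "The sky is blue".split()   # word-level (naive whitespace split)
byte_tokens = list(text.encode("utf-8"))  # byte-level: raw byte values

print(char_tokens)  # ['H', 'e', 'l', 'l', 'o']
print(word_tokens)  # ['The', 'sky', 'is', 'blue']
print(byte_tokens)  # [72, 101, 108, 108, 111]
```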

How Tokenizers Power LLMs

Once the input text is converted into tokens and each token into its unique ID, that ID is looked up in a large table: the embedding table. Each token is converted into a dense vector, a list of numbers representing that token’s relationship to every other token the model has seen. These vectors are the true language of the model. They flow through the transformer’s layers, attention blocks, and hidden states to produce meaningful output.
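A minimal sketch of that lookup (the table sizes and IDs here are made up; real models learn tables with tens of thousands of rows and hundreds of dimensions):

```python
# Minimal sketch of the embedding lookup: token ID -> dense vector.
import numpy as np

vocab_size, embed_dim = 1000, 8
embedding_table = np.random.randn(vocab_size, embed_dim)  # learned during training

token_ids = [791, 42, 374]            # hypothetical IDs from the tokenizer
vectors = embedding_table[token_ids]  # one row per token

print(vectors.shape)  # (3, 8) -- these vectors flow into the transformer layers
```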

But it all starts with breaking a sentence into the right parts.

Part 2: Composition in Sanskritam

Sanskritam is a timeless language; modern scholarship has been unable to trace its roots back more than a few thousand years. Its grammar is codified in Panini’s Ashtadhyayi, considered the world’s first grammar rulebook. It lays out the rules for how words are formed, modified, and combined, ensuring consistency.

And astonishingly, it works much like a tokenizer.

Dhatus — The tokens

Sanskritam builds from Dhātus— root words that carry raw, atomic meaning. They’re too tiny to be full words, yet their meaning is preserved and expanded as larger words are formed. It is this compositional nature of Sanskrit that makes it so powerful.

Examples:

  • गम् (gam) — to go
  • ज्ञा (jñā) — to know
  • भू (bhū) — to be
  • कृ (kṛ) — to do or make
  • वद् (vad) — to speak
  • श्रु (śru) — to hear

Just like a tokenizer chops a word into reusable parts, Sanskritam naturally builds up from dhatus.

From Dhatu to word

Let’s build a word step by step:

  • Root: jñā = to know
  • Add suffix ana → jñāna = knowledge
  • Add prefix vi → vijñāna = analytical knowledge (science)

Another one:

  • Root: bhū = to be
  • Add tva → bhāva = state of being
  • Add mahat → mahabhāva = exalted state of emotion

Each dhatu adds more meaning. The original meaning is preserved and expanded.

Compare that to how tokenizers work:

  • understanding → [“under”, “stand”, “ing”]
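Seen as data, the two systems look almost identical: a word is just a sequence of reusable pieces (the surface forms also undergo sandhi, so the plain lists below show only the structure, not the final spelling):

```python
# A Sanskrit word seen as a sequence of reusable pieces, exactly like tokenizer
# output. Surface forms also undergo sandhi; this shows only the structure.
WORDS = {
    "jñāna":         ["jñā", "ana"],             # knowledge = root + suffix
    "vijñāna":       ["vi", "jñā", "ana"],       # science   = prefix + root + suffix
    "understanding": ["under", "stand", "ing"],  # the English analogue
}

for word, pieces in WORDS.items():
    print(f"{word} -> {pieces}")
```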

Samāsa

One of Sanskritam’s hallmarks is samāsa: the ability to fuse words together into a compact compound word with richer meaning. The joining is governed by the Ashtadhyayi, without exceptions.

Examples:

  • rāja (king) + puruṣa (man) → rājapuruṣa = king’s man or official
  • dharma (duty) + kṣetra (field) → dharmakṣetra = battlefield of duty
  • atma (self) + jñāna (knowledge) → ātmajñāna = self-knowledge

Each compounded word is denser with more meaning. A single word might encode what takes a full phrase in English.

Sandhi

Another powerful tool is sandhi, where words are joined based on sounds across word boundaries:

  • tat + api → tadapi (that + also)
  • rama + īśvara → rameśvara (Rama + Lord)

It improves rhythm, clarity, and euphony. The rules are algorithmic and consistent.

Tokenizers break text down, while sandhi intelligently joins it back together based on sound, in a way that is logical and natural for human utterance.
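A toy sandhi joiner covering just the two examples above (the Ashtadhyayi’s actual rule set is far larger, but equally mechanical: inspect the boundary sounds and apply the matching rule):

```python
# Toy sandhi joiner: only two hand-written rules, covering the two examples
# above. The real rule set is far larger, but just as mechanical.
VOWELS = set("aāiīuūeo")

def sandhi_join(left: str, right: str) -> str:
    # Rule 1 (guna sandhi, simplified): a/ā + i/ī -> e
    if left[-1] in "aā" and right[0] in "iī":
        return left[:-1] + "e" + right[1:]
    # Rule 2 (consonant sandhi, simplified): final t voices to d before a vowel
    if left[-1] == "t" and right[0] in VOWELS:
        return left[:-1] + "d" + right
    return left + right  # no rule applies: plain concatenation

print(sandhi_join("tat", "api"))      # tadapi
print(sandhi_join("rāma", "īśvara"))  # rāmeśvara
```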

Why Sanskrit Excels for Machines

Sanskritam wasn’t built for machines—but it might as well have been. Here’s why:

No ambiguity in structure

In English, word order matters: “Dog bites man” ≠ “Man bites dog.” In Sanskrit, grammatical function is marked directly on the word using vibhaktis (case endings). That means:

“रामः सीतां पश्यति” (Rāmaḥ Sītāṁ paśyati) – Rāma sees Sītā

Can also be written as:

“सीतां रामः पश्यति” or “पश्यति रामः सीतां”

All still mean the same thing. Because रामः is marked as the subject (nominative case), and सीतां as the object (accusative case), the model doesn’t have to guess based on order.

This is a huge relief for tokenizers and LLMs, which struggle with ambiguity in languages like English.
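A sketch of why this helps: with case endings, the grammatical role can be read off each word on its own, so every ordering produces the same analysis (the role table below is reduced to just the three words in the example):

```python
# Sketch: role assignment from case-marked forms alone, ignoring word order.
# A real analyzer would cover all vibhaktis, genders, and numbers.
ROLES = {
    "रामः":   "subject (nominative)",
    "सीतां":  "object (accusative)",
    "पश्यति": "verb (3rd person singular)",
}

for sentence in ["रामः सीतां पश्यति", "सीतां रामः पश्यति", "पश्यति रामः सीतां"]:
    analysis = {word: ROLES[word] for word in sentence.split()}
    print(analysis)  # identical role assignment for every ordering
```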

Phonetic spelling

Sanskrit is primarily written in the Devanagari script, where every letter maps to one consistent sound.

क = ka, ख = kha, ग = ga, घ = gha, …

This means:

  • No silent letters
  • No multiple spellings for the same sound
  • Easy-to-predict pronunciation

For machine understanding, this reduces confusion and allows clean, lossless tokenization.
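In code, that one-to-one mapping is nothing more than a plain table lookup (a tiny slice of the consonant series, with an approximate transliteration):

```python
# A tiny slice of the Devanagari consonant series: each letter maps to exactly
# one sound, so transliteration is a plain table lookup with no special cases.
SOUND = {"क": "ka", "ख": "kha", "ग": "ga", "घ": "gha", "च": "ca", "छ": "cha"}

for letter, sound in SOUND.items():
    print(letter, "->", sound)
```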

Compare that to English:

“Though” vs “Through” vs “Tough” vs “Thought”

These all look similar but sound wildly different. Sanskritam avoids this mess entirely.

Rule-driven transformations

Every transformation in Sanskrit—from root to sentence—follows precise rules. There’s a grammar engine under the hood.

For example:

gam + ti → gacchati (he/she goes)

gam + tum → gantum (to go)

gam + ta → gata (gone)

This is predictable, logical, and decomposable— perfect for tokenizing and generating.
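As a rough sketch, the three forms above can be generated from a small suffix-keyed rule table (the stem change gam → gaccha before “ti” is itself a Paninian rule, hard-coded here only to keep the sketch short):

```python
# Sketch: deriving the three forms of gam shown above from suffix-keyed rules.
def derive(root: str, suffix: str) -> str:
    if suffix == "ti":
        return "gacchati" if root == "gam" else root + suffix
    if suffix == "tum":
        return root[:-1] + "ntum"   # gam + tum -> gantum
    if suffix == "ta":
        return root[:-1] + "ta"     # gam + ta  -> gata
    return root + suffix

for suffix in ["ti", "tum", "ta"]:
    print(f"gam + {suffix} -> {derive('gam', suffix)}")
```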

Compare this with English verbs like:

run → ran → running → runner

Where do these forms come from? Often from memory, not logic.

Compositional clarity

In Sanskrit, you can look at a word and usually break it down into meaningful parts. Let’s take:

Dharma = dhṛ (to uphold) + manin (abstract suffix) → that which upholds

Vidyālaya = vidyā (knowledge) + ālaya (abode) → school

Ānanda = ā (towards) + nanda (joy) → bliss

Tattva = tat (that) + tva (ness) → that-ness or essence

Even big compound words can be parsed with some effort. For example:

Satyadharmaparāyaṇaḥ = satya (truth) + dharma (duty) + parāyaṇa (devoted to) → devoted to the duty of truth

This makes Sanskrit extremely transparent and easy to understand for machines and humans alike.

This is a dream for tokenization. It’s like Sanskrit was designed to be parsed.
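In fact, the same greedy longest-match splitter used earlier for English subwords works here: matched against a small hand-made lexicon, the long compound above falls apart into its pieces (real segmenters must also undo sandhi at the joins; these examples happen to concatenate cleanly):

```python
# Toy compound segmentation: greedy longest-match against a hand-made lexicon.
LEXICON = {"satya", "dharma", "parāyaṇaḥ", "vidyā", "ālaya", "ātma", "jñāna"}

def segment(compound: str) -> list[str]:
    parts, i = [], 0
    while i < len(compound):
        for j in range(len(compound), i, -1):
            if compound[i:j] in LEXICON:
                parts.append(compound[i:j])
                i = j
                break
        else:
            return []  # no full segmentation found
    return parts

print(segment("satyadharmaparāyaṇaḥ"))  # ['satya', 'dharma', 'parāyaṇaḥ']
print(segment("ātmajñāna"))             # ['ātma', 'jñāna']
```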

Reality

Surprisingly, a Sanskrit-based tokenizer faces only a few roadblocks.

  • Multiple scripts – While Sanskrit is primarily written in Devanagari, it is also written in Telugu, Kannada, and other scripts, which makes it harder to tokenize consistently.
  • Low digital footprint – There are no large-scale text corpora. While some sources do exist, they’re nowhere near as comprehensive as English.
  • Ambiguity via compounding – Technically, words can be joined via sandhi, samāsa, or pratyaya to create arbitrarily long words, but recovering the underlying meaning becomes hard.

Example:

tatsatyadharmaparāyaṇaḥ = “he who is devoted to the truth and righteousness of that (entity/person)”

But unless you’re a trained Sanskritist (or an LLM with a syntax tree), it’s hard to understand its precise meaning. Is it:

“that devotion to truthful dharma”

or “truthful devotion to that dharma”

or “devotion to that which is true and dharma”

Each reading slightly shifts the interpretation.

Another:

satyavākyaḥ (सत्यवाक्यः) → satya (सत्य, truth) + vākya (वाक्य, speech)

At first glance, satyavākyaḥ might seem to mean:

“one who speaks the truth” or “of true utterance”

In reality, it means “one whose speech becomes the truth”. Such nuances and caveats are hard for machines to understand.
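The ambiguity can be made concrete: even when the pieces of a compound are known, several bracketings of those pieces are possible, and the surface form alone does not say which modifier attaches where. A toy enumeration of binary bracketings:

```python
# Toy illustration of compound ambiguity: the pieces are known, but several
# bracketings (and hence readings) are possible.
def bracketings(parts: tuple):
    if len(parts) == 1:
        return [parts[0]]
    results = []
    for k in range(1, len(parts)):
        for left in bracketings(parts[:k]):
            for right in bracketings(parts[k:]):
                results.append((left, right))
    return results

pieces = ("tat", "satya", "dharma", "parāyaṇaḥ")
for tree in bracketings(pieces):
    print(tree)
# five distinct bracketings -> five candidate groupings for one compound
```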


Conclusion

Despite its elegance, the lack of training data and the difficulty of navigating these nuances leave us with a few roadblocks. Even so, it would be a fun experiment to see how a model trained only on the Valmiki Ramayana or the Mahabharata performs.

We are still a few years away from a homegrown, Indian LLM that can truly ‘understand’ Sanskritam. But if you ever wanted a human language that thinks like a machine, it already existed.

It was called Sanskritam.
