Tokenizer

SUTRA’s multilingual tokenizer significantly reduces the cost and computational burden of LLMs by efficiently handling diverse languages with a balanced vocabulary. This breakthrough approach ensures better performance, especially for non-English languages, making multilingual models more accessible and cost-effective.

Try out SUTRA’s Tokenizer on Hugging Face

SUTRA’s Tokenizer

One of the primary reasons other models are slow and inefficient in non-English languages is due to their tokenizers, which have vocabularies primarily focused on English. Bilingual models often extend this vocabulary to include other languages, which can hamper performance in English. In contrast, SUTRA's vocabulary is trained with balanced data from multiple languages, leading to an efficient token distribution across languages.

The above plot was generated by classifying the tokens in the vocabulary and computing a histogram over the language distribution. Tokens from major Indian languages such as Hindi, Gujarati, Tamil, Bengali, Marathi, Urdu, Malayalam, Telugu, Punjabi, and Kannada were grouped together as Indian languages.
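As a rough illustration of how such a breakdown can be computed, the sketch below classifies each token in a tokenizer’s vocabulary by the Unicode script of its first letter and tallies a histogram. The script-to-language grouping is a simplification, and the model ID is only a placeholder; this is not SUTRA’s actual analysis pipeline.

```python
import unicodedata
from collections import Counter
from transformers import AutoTokenizer

# Placeholder model ID -- substitute the tokenizer checkpoint you want to inspect.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-multilingual-tokenizer")

# Scripts used by the major Indian languages listed above
# (Urdu is written in the Arabic script; Hindi and Marathi in Devanagari).
INDIC_SCRIPTS = {
    "DEVANAGARI", "GUJARATI", "TAMIL", "BENGALI", "ARABIC",
    "MALAYALAM", "TELUGU", "GURMUKHI", "KANNADA",
}

def classify(token: str) -> str:
    """Very rough classification of a token by the script of its first alphabetic character."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            script = name.split()[0] if name else ""
            if script == "LATIN":
                return "English/Latin"
            if script in INDIC_SCRIPTS:
                return "Indian languages"
            return "Other"
    return "Symbols/Numbers"

# Histogram of the vocabulary over language groups.
histogram = Counter(classify(tok) for tok in tokenizer.get_vocab())
print(histogram)
```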

Our approach involves training a SentencePiece-based tokenizer with uniform sampling from a wide range of languages, using a curated and balanced dataset. SUTRA’s tokenizer has a significantly larger vocabulary, allowing it to represent multiple languages efficiently at the same time. By avoiding an excessive English bias and maintaining a reasonable level of granularity, our tokenizer and model better preserve semantic meaning across different languages. Text tokenized with our tokenizer leads to an 80% to 200% reduction in overall tokens consumed across languages, which is critical for bringing down the cost of inference when deploying these models for cost-sensitive use cases.
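For readers unfamiliar with SentencePiece, the sketch below shows the general shape of training such a tokenizer on a balanced multilingual corpus. The file paths, language list, sampling size, and vocabulary size are illustrative assumptions, not SUTRA’s actual training configuration.

```python
import random
import sentencepiece as spm

# Illustrative assumption: one pre-cleaned text file per language.
corpora = {
    "english": "data/english.txt",
    "hindi": "data/hindi.txt",
    "gujarati": "data/gujarati.txt",
    "tamil": "data/tamil.txt",
}

# Uniform sampling: draw the same number of lines from every language so that
# no single language dominates the learned vocabulary.
lines_per_language = 100_000
with open("balanced_corpus.txt", "w", encoding="utf-8") as out:
    for path in corpora.values():
        with open(path, encoding="utf-8") as f:
            lines = f.readlines()
        for line in random.sample(lines, min(lines_per_language, len(lines))):
            out.write(line)

# Train a SentencePiece unigram model with a large multilingual vocabulary.
spm.SentencePieceTrainer.train(
    input="balanced_corpus.txt",
    model_prefix="multilingual_tokenizer",
    vocab_size=256_000,          # large vocabulary to cover many scripts
    character_coverage=0.9995,   # keep rare characters from non-Latin scripts
    model_type="unigram",
)
```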

The example above shows a comparison between SUTRA’s tokenizer and those of other leading LLMs. SUTRA’s tokenizer consumes fewer tokens across languages and is more efficient than the tokenizers of leading models such as Llama (from Meta), Gemma (from Google), and GPT-3.5, GPT-4, and even GPT-4o (from OpenAI).
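A simple way to reproduce such a comparison is to tokenize the same sentence with several tokenizers and count the resulting tokens. The sketch below uses Hugging Face AutoTokenizer plus tiktoken for the OpenAI encodings; the SUTRA checkpoint name is a placeholder, and the other checkpoint names may differ or require gated access on Hugging Face.

```python
import tiktoken
from transformers import AutoTokenizer

# Placeholder / illustrative checkpoint names -- substitute the tokenizers you want to compare.
hf_tokenizers = {
    "SUTRA": "your-org/sutra-tokenizer",
    "Llama": "meta-llama/Llama-2-7b-hf",
    "Gemma": "google/gemma-7b",
}

# The same sentence in English and Hindi.
samples = {
    "English": "Multilingual models should be efficient in every language.",
    "Hindi": "बहुभाषी मॉडल हर भाषा में कुशल होने चाहिए।",
}

# Count tokens produced by each Hugging Face tokenizer.
for name, checkpoint in hf_tokenizers.items():
    tok = AutoTokenizer.from_pretrained(checkpoint)
    counts = {lang: len(tok.encode(text, add_special_tokens=False))
              for lang, text in samples.items()}
    print(name, counts)

# Count tokens produced by the GPT-4o encoding via tiktoken.
enc = tiktoken.encoding_for_model("gpt-4o")
print("GPT-4o", {lang: len(enc.encode(text)) for lang, text in samples.items()})
```

Lower counts for the same text mean fewer tokens billed and fewer positions consumed in the context window, which is where the cost savings described above come from.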