Tokenizer
SUTRA’s multilingual tokenizer significantly reduces the cost and computational burden of LLMs by efficiently handling diverse languages with a balanced vocabulary. This breakthrough approach ensures better performance, especially for non-English languages, making multilingual models more accessible and cost-effective.
Try out SUTRA’s Tokenizer on Hugging Face
SUTRA’s Tokenizer
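As a minimal sketch of trying the tokenizer programmatically (assuming it is published as a standard Hugging Face tokenizer; the repository id below is a placeholder, not the confirmed repo name):

```python
from transformers import AutoTokenizer

# Repository id is illustrative -- substitute the actual SUTRA tokenizer repo on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("TWO/sutra-tokenizer")

text = "नमस्ते, आप कैसे हैं?"  # "Hello, how are you?" in Hindi
token_ids = tokenizer.encode(text)
print(len(token_ids), tokenizer.convert_ids_to_tokens(token_ids))
```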
One of the primary reasons other models are slow and inefficient in non-English languages is that their tokenizers have vocabularies focused primarily on English. Bilingual models often extend this vocabulary to include other languages, which can hamper performance in English. In contrast, SUTRA's vocabulary is trained on balanced data from multiple languages, leading to an efficient token distribution across languages.
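To see this effect concretely, one can count how many tokens different tokenizers spend on the same sentence. The sketch below uses two publicly available tokenizers purely for illustration (an English-centric one and a multilingual one, not SUTRA's own); the specific sentences are also just examples:

```python
from transformers import AutoTokenizer

# Tokenizer ids are examples only; any English-centric vs. multilingual pair illustrates the point.
english_centric = AutoTokenizer.from_pretrained("gpt2")
multilingual = AutoTokenizer.from_pretrained("xlm-roberta-base")

samples = {
    "English": "The weather is pleasant today.",
    "Hindi": "आज मौसम सुहावना है।",
}

for language, sentence in samples.items():
    n_en = len(english_centric.encode(sentence))
    n_ml = len(multilingual.encode(sentence))
    print(f"{language}: english-centric={n_en} tokens, multilingual={n_ml} tokens")
```

Typically, an English-centric vocabulary splits non-English text into many more, much shorter tokens, which directly inflates sequence length, latency, and inference cost.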

Our approach involves training a SentencePiece-based tokenizer with uniform sampling from a wide range of languages, using a curated and balanced dataset. SUTRA’s tokenizer has a significantly larger vocabulary, allowing it to represent multiple languages efficiently at the same time. By avoiding an excessive English bias while maintaining a reasonable level of granularity, our tokenizer and model better preserve semantic meaning across languages. Text tokenized with our tokenizer shows an 80% to 200% reduction in overall tokens consumed across languages (i.e., other tokenizers consume roughly 1.8x to 3x as many tokens), which is critical for bringing down inference costs when deploying these models for cost-sensitive use cases.
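A minimal sketch of this kind of training pipeline, using the open-source sentencepiece library and assuming per-language text files (the file paths, sample counts, and vocabulary size are illustrative, not SUTRA's actual training configuration):

```python
import random
import sentencepiece as spm

# Hypothetical per-language corpora; in practice these would be large curated text files.
corpora = {"en": "data/en.txt", "hi": "data/hi.txt", "ko": "data/ko.txt", "ar": "data/ar.txt"}
samples_per_language = 100_000  # uniform sampling keeps any single language from dominating

# Build a balanced training corpus by sampling the same number of lines from each language.
with open("balanced_corpus.txt", "w", encoding="utf-8") as out:
    for path in corpora.values():
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        for line in random.sample(lines, min(samples_per_language, len(lines))):
            out.write(line + "\n")

# Train a SentencePiece model on the balanced corpus with a large shared vocabulary.
spm.SentencePieceTrainer.train(
    input="balanced_corpus.txt",
    model_prefix="multilingual_tokenizer",
    vocab_size=256_000,          # large vocabulary to cover many scripts
    character_coverage=0.9995,   # retain rare characters from diverse scripts
    model_type="unigram",
)
```

Uniform sampling is the key design choice here: because each language contributes equally to the training corpus, frequent subwords from every script earn their own vocabulary entries, rather than English word pieces crowding out everything else.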