The Token Problem of English-Centric Tokenizers

Tokenization, a critical step in the NLP pipeline, converts text into a sequence of tokens, where each token represents a subword or a word. Although English-centric tokenizers can process text in non-English languages, they fail to capture language-specific nuances and are highly inefficient on other languages, especially non-Romanized ones. For Indian languages like Hindi, Gujarati, or Tamil in particular, we find that the tokenizers of leading LLMs such as Llama-2, Mistral, and GPT-4 consume 4.5X to 8X more tokens than they do for English.
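To make the fertility gap concrete, here is a minimal, self-contained sketch. It is not the actual tokenizers compared above; a toy byte-level fallback stands in for a BPE model that has no merges for a script (which is roughly what happens when an English-trained vocabulary meets Devanagari), and the `fertility` helper is an illustrative name.

```python
# "Token fertility" = average number of tokens a word is split into.
def fertility(tokenize, words):
    return sum(len(tokenize(w)) for w in words) / len(words)

# Toy stand-in for a tokenizer with no merges for a script:
# every UTF-8 byte becomes one token (worst-case fallback).
byte_tok = lambda w: list(w.encode("utf-8"))

english = ["hello", "world"]          # ASCII: 1 byte per character
hindi = ["नमस्ते", "दुनिया"]             # Devanagari: 3 bytes per codepoint

print(fertility(byte_tok, english))   # 5.0 tokens per word
print(fertility(byte_tok, hindi))     # 18.0 tokens per word, a 3.6X blow-up
```

Real BPE vocabularies are not this pathological, but the mechanism is the same: scripts underrepresented in the training corpus fall back to ever-smaller pieces, inflating token counts.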

[Figure: Token Efficiency Comparison]

Better & More Efficient Tokenizers

The first step in adding language-specific skills is decreasing the token fertility (the average number of tokens a word is split into) of the tokenizer on non-English text. This makes inference more efficient as well as more semantically meaningful. We train a SentencePiece tokenizer on a large proprietary multilingual corpus of 500K+ documents and then merge it with a pre-trained English tokenizer to expand the vocabulary. Text tokenized with our tokenizers shows an 80% to 200% reduction in overall tokens consumed across languages, which is critical for bringing down inference cost when deploying these models for cost-sensitive use cases. Furthermore, we found that models fine-tuned with these efficient multilingual tokenizers performed better than those trained on 8X more tokens with monolingual tokenizers.
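The merge step above can be sketched in plain Python. This is an illustrative simplification, not the post's actual pipeline: `merge_vocabs`, the sample vocabularies, and the pieces are all made up. The key invariant it demonstrates is that new multilingual pieces are appended after the existing English vocabulary, so the pre-trained tokenizer's ids (and hence the model's embedding rows) stay valid.

```python
# Hedged sketch: fold new SentencePiece pieces into a base vocabulary.
def merge_vocabs(base_vocab, new_pieces):
    """base_vocab: dict mapping piece -> id; new_pieces: iterable of pieces.

    Pieces already in the base vocab are skipped; genuinely new pieces
    get fresh ids appended after the existing range.
    """
    merged = dict(base_vocab)
    next_id = max(merged.values()) + 1 if merged else 0
    for piece in new_pieces:
        if piece not in merged:        # skip overlap with the English vocab
            merged[piece] = next_id
            next_id += 1
    return merged

base = {"<unk>": 0, "▁the": 1, "▁and": 2}        # toy English vocab
hindi_pieces = ["▁नम", "स्ते", "▁and"]             # "▁and" already exists
merged = merge_vocabs(base, hindi_pieces)
print(len(merged))                                # 5: two new pieces added
```

In practice the merge operates on SentencePiece model protos rather than plain dicts, and the model's embedding matrix must be resized to cover the new ids before fine-tuning; the append-only id assignment shown here is what makes that resize safe.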