Quantitative Evaluation

We evaluate SUTRA models on a variety of NLU and NLG tasks. To test the knowledge and reasoning capabilities of the model, we evaluate on machine-translated versions of benchmarks such as MMLU. While not perfect, these evaluations give an indication of the trends in LLM performance for non-English languages.
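
As an illustration of this setup, the sketch below shows how a multiple-choice benchmark item might be machine-translated and scored in a zero-shot fashion. The `translate` and `model_answer` callables are hypothetical placeholders for the translation system and the model under evaluation, and the prompt template is an assumption for illustration, not the exact one used for SUTRA.

```python
# Minimal sketch of a translate-then-evaluate loop for a multiple-choice
# benchmark such as MMLU. `translate` and `model_answer` are hypothetical
# placeholders, not part of any specific library.
from typing import Callable

LETTERS = "ABCD"

def to_prompt(item: dict) -> str:
    """Format a multiple-choice item as a zero-shot prompt."""
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(item["choices"]))
    return f"{item['question']}\n{options}\nAnswer:"

def evaluate_translated(items: list, translate: Callable, model_answer: Callable) -> float:
    """Translate each item into the target language, query the model,
    and return accuracy over the translated benchmark."""
    correct = 0
    for item in items:
        translated = {
            "question": translate(item["question"]),
            "choices": [translate(c) for c in item["choices"]],
        }
        # Take the first letter of the model's reply as its chosen option.
        prediction = model_answer(to_prompt(translated)).strip().upper()[:1]
        correct += prediction == LETTERS[item["answer"]]
    return correct / len(items)

# Example with an identity "translation" and a dummy model that always answers "C".
items = [{"question": "What is the capital of France?",
          "choices": ["Berlin", "Madrid", "Paris", "Rome"], "answer": 2}]
print(evaluate_translated(items, lambda s: s, lambda prompt: "C"))  # 1.0
```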

Overall, the evaluation results demonstrate that SUTRA models are on par in English with state-of-the-art models like GPT-3.5. On non-English languages such as Hindi, Gujarati, Tamil, and Korean, SUTRA models consistently outperform GPT-3.5 and Llama2 models by a clear margin, particularly in providing natural and engaging responses. Although GPT-4 remains state-of-the-art in terms of performance, its cost continues to be a major hindrance to wide-scale deployment in cost-sensitive markets.

What is the MMLU Benchmark?

MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pre-training by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.
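
For concreteness, here is a sketch of how a k-shot MMLU-style prompt can be assembled: a handful of solved development examples are prepended to the unsolved test question, and setting k = 0 yields a zero-shot prompt. The template below is an assumed example format; the exact prompt layout used in the evaluations above is not specified here.

```python
# Illustrative k-shot prompt construction for an MMLU-style multiple-choice item.
# The template is an assumed example format, not the exact one used for SUTRA.
LETTERS = "ABCD"

def format_item(question: str, choices: list, answer: int = None) -> str:
    """Render one question with lettered choices; include the gold letter
    only for in-context (solved) examples."""
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    prompt = f"Question: {question}\n{options}\nAnswer:"
    if answer is not None:
        prompt += f" {LETTERS[answer]}"
    return prompt

def build_few_shot_prompt(dev_examples: list, test_item: dict) -> str:
    """Prepend k solved dev examples (k = 0 gives a zero-shot prompt),
    then append the unsolved test question."""
    parts = [format_item(ex["question"], ex["choices"], ex["answer"]) for ex in dev_examples]
    parts.append(format_item(test_item["question"], test_item["choices"]))
    return "\n\n".join(parts)

# Example: one in-context shot followed by the test question.
dev = [{"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1}]
test = {"question": "Which gas do plants absorb during photosynthesis?",
        "choices": ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"]}
print(build_few_shot_prompt(dev, test))
```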