In our evaluation process, we use human-generated pairwise preference rankings to compute Elo scores for each model on a per-task basis. The Elo rating system, traditionally used to rank individuals in competitive games, gauges a model's relative effectiveness by predicting the likelihood that human evaluators will prefer its output over another model's. Unlike traditional uses, where Elo ratings track a player's evolving skill, our focus is on assessing how strongly users prefer responses from different models.
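
To make the prediction and update step concrete, the sketch below shows a standard Elo update from a single pairwise human preference. The logistic expected-score formula is the conventional one; the K-factor of 32 and starting rating of 1000 are common defaults assumed for illustration, not values reported here.

```python
def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0) -> tuple[float, float]:
    """Update Elo ratings for models A and B given one pairwise preference.

    outcome is 1.0 if raters preferred A's response, 0.0 if they preferred
    B's, and 0.5 for a tie. The K-factor and starting rating used below are
    conventional defaults, assumed for illustration.
    """
    # Predicted probability that A's output is preferred over B's.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (outcome - expected_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: human raters prefer model A's response in one comparison.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], outcome=1.0
)
```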

Evaluation results from more than 5,000 tests with human raters using the BotChat framework demonstrate that SUTRA models can match and surpass leading models such as GPT-3.5 and Llama-7B in quantitative Elo scores, particularly in providing accurate (factuality) and up-to-date (freshness) responses. Because sequential Elo scores depend on the order in which comparisons are processed, we generate scores over numerous random orderings of the comparisons; we processed over 5,000 such permutations to derive Elo score distributions, enabling us to establish 95% confidence intervals for our assessments.
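
As an illustration of this permutation procedure, the sketch below reshuffles the order of pairwise comparisons many times, recomputes sequential Elo scores for each ordering, and reads a 95% interval off the resulting distribution for each model. The K-factor, starting rating, and data layout are assumptions for the example, not details specified in this report.

```python
import random
import statistics

def compute_elo(battles, models, k=4.0, init=1000.0):
    """Run sequential Elo updates over one ordering of pairwise battles.

    Each battle is (model_a, model_b, outcome), with outcome 1.0 if A's
    response was preferred, 0.0 if B's was preferred, and 0.5 for a tie.
    """
    ratings = {m: init for m in models}
    for a, b, outcome in battles:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        ratings[a] += k * (outcome - expected_a)
        ratings[b] += k * ((1.0 - outcome) - (1.0 - expected_a))
    return ratings

def elo_confidence_intervals(battles, models, n_permutations=5000, alpha=0.05):
    """Estimate per-model Elo distributions by shuffling the battle order.

    Sequential Elo depends on comparison order, so repeating the updates
    over many random permutations yields a distribution of final scores;
    the 95% interval is taken from its percentiles.
    """
    samples = {m: [] for m in models}
    for _ in range(n_permutations):
        shuffled = battles[:]
        random.shuffle(shuffled)
        ratings = compute_elo(shuffled, models)
        for m in models:
            samples[m].append(ratings[m])
    intervals = {}
    for m in models:
        scores = sorted(samples[m])
        lo = scores[int(alpha / 2 * len(scores))]
        hi = scores[int((1 - alpha / 2) * len(scores)) - 1]
        intervals[m] = (statistics.median(scores), lo, hi)
    return intervals
```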

Freshness and Factuality of Responses