Shipping AI features based on gut feeling is a common mistake. Instead, use systematic evaluation frameworks ('evals') built on golden datasets and regression tests. This helps your AI perform reliably and consistently and catches failures before they reach production, which matters all the more given India's diverse user base and varied device landscape.
A practical, jargon-free guide for Indian engineering teams and founders — part of the Learn AI with Reeturaj series on InBharat AI.
Many teams, especially those new to AI, ship features the way they've always shipped traditional software: build, test with a few internal users, iterate, release. [2] This approach falls apart with AI, particularly with large language models (LLMs). Why?
I've seen teams in Bengaluru spend weeks fine-tuning a model, only to find in production that it fails on basic queries from Tier-2 cities because their test data was too narrow. This is where systematic AI evaluations, or 'evals', come in. [1]
AI evaluations are systematic frameworks for measuring whether your AI system performs the way you need it to. [1] They are not just about checking for bugs; they are about ensuring the AI meets specific performance criteria and user expectations.
Think of it like this: for traditional software, you write unit tests and integration tests. For AI, you build evals. They answer questions like:
At InBharat AI, our eval system relies on two critical components: golden sets and regression evals.
A golden dataset (or 'golden set') is a collection of carefully curated inputs and their expected, correct outputs. These are human-verified examples that represent the desired behavior of your AI. They are the 'ground truth'.
For example, if you're building a feature that extracts entities (like names, locations, dates) from unstructured text:
{"date": "15th August", "person": "Reeturaj Goswami", "organization": "InBharat AI", "location": "Pune"}

We build golden sets for every critical AI feature. These sets are not static; they grow and evolve as we discover new edge cases or expand our product's capabilities. For instance, when we added support for more regional Indian languages in UniAssist, our golden sets expanded to include examples in Tamil, Bengali, and Gujarati, ensuring our models understood the nuances.
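A golden set can be as simple as a list of input/expected pairs kept in version control. The sketch below is one possible shape for it in Python, not InBharat AI's actual format: the `GoldenExample` class, the `add_example` and `to_jsonl` helpers, and the input sentence are all hypothetical. Only the expected entities come from the example above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class GoldenExample:
    """One human-verified input and its expected output (hypothetical schema)."""
    example_id: str
    text: str        # raw input shown to the AI system
    expected: dict   # human-verified 'ground truth' entities


def add_example(golden_set: list, example: GoldenExample) -> None:
    """Append a new case, refusing duplicate ids so the set stays auditable."""
    if any(e.example_id == example.example_id for e in golden_set):
        raise ValueError(f"duplicate example_id: {example.example_id}")
    golden_set.append(example)


def to_jsonl(golden_set: list) -> str:
    """Serialize to JSON Lines (one example per line) for storage in git."""
    return "\n".join(
        json.dumps(asdict(e), ensure_ascii=False, sort_keys=True)
        for e in golden_set
    )


golden_set = []
add_example(golden_set, GoldenExample(
    example_id="entities-001",
    # Invented input sentence; only the expected output is from the article.
    text="Reeturaj Goswami of InBharat AI will speak in Pune on 15th August.",
    expected={
        "date": "15th August",
        "person": "Reeturaj Goswami",
        "organization": "InBharat AI",
        "location": "Pune",
    },
))
print(to_jsonl(golden_set))
```

`ensure_ascii=False` keeps Tamil, Bengali, or Gujarati text readable in the stored file instead of turning it into escape sequences, which makes human review of new examples easier.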
Once you have a golden set, you run your AI system against it and compare its output to the expected output. This is a regression eval. The goal is to ensure that new code changes, model updates, or data refreshes don't negatively impact existing functionality. Just as CI/CD ensures code quality, regression evals ensure AI quality.
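How you "compare" output to expected output is a design choice. The following sketch assumes entity-style outputs (flat dictionaries of strings) and scores them field by field after light normalization. The function names `normalize` and `score_example` are hypothetical, and the "Mumbai" output is an invented wrong answer used only to show a failure.

```python
def normalize(value) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(value).lower().split())


def score_example(expected: dict, actual: dict) -> dict:
    """Compare one model output against its golden expected output."""
    fields = sorted(expected)
    correct = [
        f for f in fields
        if f in actual and normalize(actual[f]) == normalize(expected[f])
    ]
    return {
        # Exact match also fails if the model invents extra fields.
        "exact_match": len(correct) == len(fields) and set(actual) == set(expected),
        "field_accuracy": len(correct) / len(fields) if fields else 1.0,
        "wrong_fields": [f for f in fields if f not in correct],
    }


expected = {
    "date": "15th August",
    "person": "Reeturaj Goswami",
    "organization": "InBharat AI",
    "location": "Pune",
}
# Invented model output with one wrong entity.
actual = dict(expected, location="Mumbai")

result = score_example(expected, actual)
print(result["field_accuracy"], result["wrong_fields"])  # 0.75 ['location']
```

Reporting which fields were wrong, rather than only a pass/fail flag, makes it easier to see what a regression actually broke. For free-form text outputs, exact comparison is usually too strict and a different scoring method would be needed.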
Here’s a simplified flow:
We automate this process. Before any AI feature goes live, it must pass its regression evals with a predefined accuracy threshold. If a change causes a drop in performance on the golden set, the deployment is blocked. This is non-negotiable.
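A deployment gate of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not our production pipeline: `run_regression_eval`, the toy `predict` function, the three-example golden set, and the 0.9 threshold are all hypothetical stand-ins.

```python
from typing import Callable


def run_regression_eval(
    golden_set: list[tuple[str, dict]],
    predict: Callable[[str], dict],
    threshold: float,
) -> tuple[bool, float, list[str]]:
    """Return (passed, accuracy, failing_inputs) for one candidate build."""
    if not golden_set:
        raise ValueError("golden set is empty; nothing to evaluate")
    failures = [text for text, expected in golden_set if predict(text) != expected]
    accuracy = (len(golden_set) - len(failures)) / len(golden_set)
    return accuracy >= threshold, accuracy, failures


def main() -> int:
    # Hypothetical stand-in for the real model call: it only knows about Pune.
    def predict(text: str) -> dict:
        return {"location": "Pune"} if "Pune" in text else {}

    golden_set = [
        ("The meetup is in Pune.", {"location": "Pune"}),
        ("We are hiring in Pune.", {"location": "Pune"}),
        ("The office moved to Indore.", {"location": "Indore"}),
    ]
    passed, accuracy, failures = run_regression_eval(golden_set, predict, threshold=0.9)
    print(f"accuracy={accuracy:.2f} threshold=0.90 passed={passed}")
    for text in failures:
        print(f"FAILED: {text}")
    # A non-zero return value is what lets a CI job block the deployment.
    return 0 if passed else 1

# In a CI script, the last line would typically be: raise SystemExit(main())
```

Here the toy model gets two of three examples right, so accuracy is about 0.67, the gate fails, and `main()` returns 1. Most CI systems treat a non-zero exit code as a failed step, which is what stops the release.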
While golden sets and regression evals are foundational, a comprehensive eval strategy includes more:
For us in India, evals are even more critical due to unique challenges:
This is why building for Bharat, as we discuss in Desh Ka AI, requires a disciplined approach to quality.
Starting small is key. Don't try to build the perfect system overnight.
Shipping AI features without a robust evaluation system is akin to driving a car without a speedometer or fuel gauge. You might get where you're going, but you're constantly at risk of running out of gas or crashing. For us at InBharat AI, golden sets and regression evals are non-negotiable. They are the guardrails that ensure our AI products, from UniAssist to Sahayaak Seva, actually deliver on their promise, consistently and reliably, for every user in India. Don't rely on "it looks fine to me"; build a system that proves it's fine. For more on how we build reliable AI, check out our insights on What Agentic AI Really Means.
Author: Reeturaj Goswami