Large Language Models (LLMs) have moved into clinical workflows quickly. They now draft clinical notes, triage patient queries, surface drug information, and guide protocol decisions across a range of healthcare products. The adoption is real, and it is running ahead of the conversation about how to make these systems safe enough for those settings.
Healthcare doesn't tolerate mistakes the way most domains do. If a chatbot hallucinates a restaurant recommendation, you end up with a slightly disappointing dinner. If it hallucinates a drug dose, someone gets hurt. That asymmetry matters. It changes what "good enough" looks like, and it argues for a different engineering posture: clinical claims shouldn't come from a model's fuzzy compression of its training data — they need to be grounded in sources that are actually authoritative.
A common approach to grounding is to give models structured access to authoritative knowledge at inference time through tools, MCP servers, or retrieval pipelines, rather than relying entirely on what they absorbed during pretraining. The approach works. But how much it helps, on which tasks, and whether models actually use these tools well are empirical questions. Without benchmarks designed for those questions, "we have grounding" can become a checkbox with little behind it.
These questions get harder when the context is Indian healthcare.
India's clinical reality is not a regional flavor on top of Western medicine. It is, for AI purposes, a different problem space.
The country has more than 500,000 branded drugs in active circulation, most with no US or EU equivalent. Hundreds of manufacturers produce variants of the same generic under different brand names, and the same brand name occasionally maps to different compositions across manufacturers. An LLM that can confidently identify Tylenol or Lipitor often cannot tell you what Jorolac or Cefotel Suspension contains. That is not a model weakness; it is a data density problem. These molecules do not appear with enough frequency in pretraining corpora for parametric memory to handle them reliably.
Clinical protocols are similarly local. ICMR, RSSDI, IAP, MoHFW, and various state health ministries publish guidelines that diverge from FDA or NICE recommendations, partly because of population phenotypic differences and partly because the operational realities of Indian healthcare delivery are different. The National Formulary of India defines its own pharmacological standards. eGFR calculations need attention to Indian-specific patient parameters. Type 2 diabetes management follows RSSDI guidance, not ADA.
A model that scores well on USMLE is, in this context, being measured on a test that does not reflect the conditions it would be deployed into. Most medical AI benchmarks are built for Western healthcare; performance on those benchmarks tells you how a model handles Western drugs, Western protocols, Western clinical conventions. The distance between that and Indian clinical practice is real, consistent, and not closed by simply scaling up the model.
Given the scale of healthcare delivery in India, that gap is not a small one to leave unaddressed.
To address this, we built two things at EkaCare: a remote MCP server that gives any compatible LLM client structured access to Indian medical knowledge, and four open datasets that let the community measure how well models actually use that access.
The MCP server exposes four knowledge domains through six tools.
Indian Branded Drug Search indexes 500,000+ branded medications drawn from real EMR prescription data — the long tail of drugs Indian doctors actually prescribe, not textbook examples. The tool turns a coverage problem into a lookup problem.
Indian Treatment Protocol Search indexes hundreds of published guidelines from ICMR, RSSDI, IAP, MoHFW, and international publishers in a vector knowledge base. Results return as the actual protocol pages, preserving table structure and clinical notation rather than text re-rendered through another model. The tool exposes two intents: publishers (discover available guideline sources) and search (retrieve by clinical query and publisher name). That two-step design turns out to matter a lot; more on this below.
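To make the two-intent shape concrete, here is a minimal sketch of driving the protocol tool from the official MCP Python SDK. The tool name, intent values, and endpoint below are illustrative assumptions, not the server's exact schema; the tool reference linked later documents the real contract.

```python
# Minimal sketch of the two-step protocol flow over MCP, using the official
# Python SDK's streamable-HTTP client. Tool and argument names here are
# ASSUMPTIONS for illustration; consult the tool reference for the real schema.
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

SERVER_URL = "https://<eka-mcp-endpoint>"  # placeholder; see the quickstart docs


async def main() -> None:
    async with streamablehttp_client(SERVER_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Step 1: discover the canonical publisher names.
            publishers = await session.call_tool(
                "indian_treatment_protocol_search", {"intent": "publishers"}
            )
            print(publishers)
            # Step 2: search with a publisher name copied verbatim from step 1.
            protocols = await session.call_tool(
                "indian_treatment_protocol_search",
                {
                    "intent": "search",
                    "query": "first-line therapy for type 2 diabetes",
                    "publisher": "RSSDI",
                },
            )
            print(protocols)


asyncio.run(main())
```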
Indian Pharmacology Details provides structured lookups from the National Formulary of India 2011: indications, contraindications, dosage, adverse effects, and pregnancy safety for generic drugs.
Medical Calculators expose 403 clinical calculators across 26 categories — cardiovascular, nephrology, hematology, OB-GYN, pulmonology, diabetes, pediatrics, and more — as three separate tools that mirror how a clinician approaches an unfamiliar formula: discover available calculators, fetch the input schema, then compute. The model cannot silently misremember the Parkland equation or approximate CKD-EPI. It either runs the right calculator with the right inputs, or it does not.
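Continuing the same session idiom as the sketch above, the three-step calculator flow might look like the following. The tool names, calculator id, and input fields are hypothetical placeholders, not the server's actual schema.

```python
# Sketch of the discover -> fetch-schema -> compute calculator flow.
# All tool names, ids, and fields below are HYPOTHETICAL placeholders.
from mcp import ClientSession


async def compute_egfr(session: ClientSession):
    # 1. Discover: which calculators match the clinical question?
    listing = await session.call_tool(
        "list_medical_calculators",
        {"query": "estimated GFR", "category": "nephrology"},
    )
    calculator_id = "ckd_epi_2021"  # in practice, picked from `listing`

    # 2. Fetch the input schema so units and enums are mapped exactly.
    schema = await session.call_tool(
        "get_calculator_schema", {"calculator_id": calculator_id}
    )

    # 3. Compute deterministically; the model never does the arithmetic itself.
    return await session.call_tool(
        "compute_calculator",
        {
            "calculator_id": calculator_id,
            "inputs": {"age": 54, "sex": "female", "serum_creatinine_mg_dl": 1.1},
        },
    )
```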
To make this evaluable, not just usable, we open-sourced four datasets on HuggingFace under the ekacare organization. Each is also integrated into KARMA, our open-source medical AI evaluation framework.
We chose to release these openly because closed benchmarks tend to reflect the priorities of one team, while shared benchmarks let a community converge on what works.
We benchmarked frontier LLMs against all four datasets, with and without tool access. The pattern that emerges is more interesting than a simple "tools help."
On the drug identification task, both frontier models score in the high 70s to low 80s without tools: reasonable, but well short of what a clinical drug identification application would need. With tool access, both clear 99%.
The gap here is not pharmacological understanding. It is data density. 500,000 brand names and manufacturer variants do not appear often enough in pretraining for parametric memory to handle them reliably. The tool closes the gap by making real, accurate, continually-growing data accessible at inference time. No amount of model scaling fixes a coverage problem.
GPT-5-mini is the outlier, and a useful one. Even with the same tool access, it gains only 4.4 points. Smaller models struggle to reliably invoke tools and parse structured outputs; tool access helps only when the model uses the tool well. That pattern recurs across the rest of the evaluation.
The pharmacology dataset tells a different story: pharmacological principles transfer better across training corpora, and a model that learned general pharmacology from Western literature is not starting from zero on mechanisms and indications. But the no-tool ceiling still falls meaningfully below what clinical use requires. The NFI tool moves both frontier models into the mid-80s, recovering accuracy on India-specific dosing conventions and prescription patterns, the kind of detail that matters for handling local antimicrobial resistance (AMR) realities.
Calculators show the largest lift in the entire evaluation: 38 percentage points for Claude Sonnet 4.6. Models do not fail calculator questions because they do not know what BMI means. They fail because they misremember formula variants, mishandle unit conversions, or compound arithmetic errors across multi-step chains. Deterministic computation eliminates all of that.
But aggregate accuracy hides where models are still failing. Error analysis across 1,066 samples reveals two failure classes that persist even with tool access. The first is wrong calculator selected — picking a semantically adjacent but distinct tool, like the generic BMI calculator where a sex-specific variant is required. The second, affecting 12–13% of samples across all three models, is right calculator, wrong inputs: either enum misselection (mapping "sedentary" to "very_active" in a TDEE activity field) or unit-conversion overreach, where the model receives a correct result, decides it should be in a different unit, and modifies the answer — introducing an error the calculator did not make.
GPT-5-mini exhibits one additional failure: it passes natural-language category names instead of the required snake_case slugs, gets back empty lists, and falls back to self-computation. That single bug accounts for most of its 7.8% "listed but never computed" rate, versus 1–2.4% for the frontier models.
The takeaway: aggregate accuracy hides whether a failure is at tool invocation, calculator selection, input mapping, or output handling. Each is a different problem with a different fix.
Protocol retrieval is where the evaluation gets most interesting.
Claude Sonnet 4.6 and GPT-5.2 start with nearly identical no-tool baselines: 76.9% and 78.1%. With tool access, Claude reaches 92.0%. GPT-5.2 reaches 78.9%. Less than a percentage point of gain, from the same tools, on the same dataset.
The difference is not access. It is strategy.
The protocol tool, recall, has two intents: publishers (enumerate guideline sources) and search (retrieve by query and publisher). Claude calls publishers first in 99% of samples before searching. GPT-5.2 goes straight to search in 65% of samples, often with an imprecise or guessed publisher name.
This matters because the metadata filter is strict. "RSSDI" and "Research Society for the Study of Diabetes in India" both work. "Indian Diabetes Association" retrieves nothing. A model that calls the tool but supplies a wrong publisher name gets no useful result, and is no better than a model that did not call the tool at all. GPT-5.2 also skips tool use entirely on 4% of samples.
Claude consistently makes at least two calls and often three or more when a question spans multiple guidelines. More calls means more relevant protocol text in context, which translates directly to better answers.
The pattern is consistent with what the calculator error analysis surfaced from a different angle: how a model uses tools matters as much as whether it has access to them. On protocols, skipping discovery means guessing publisher names and getting empty results. On calculators, passing natural-language slugs means getting empty category lists and falling back to self-computation. The mechanism differs, but the underlying failure is similar: the model is not using the tool's structure as intended.
Discovery before retrieval is a one-call overhead, not an advanced technique. The models that skip it leave accuracy on the table — in our evaluation, 13 percentage points between the two best-performing models on the protocol task.
The error analysis points to concrete interventions, each tied to a specific failure mode.
Explicit sequencing instructions in system prompts. GPT-5.2 uses the protocol tool; it just skips the discovery call. A single instruction specifying call order would likely recover most of its gap. Specifying that calculator category slugs must use exact snake_case formatting would eliminate GPT-5-mini's most damaging failure mode. These are the lowest-cost fixes and should be tested first.
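As a sketch of what that could look like, here are a couple of system-prompt lines targeting both failures. The intent names and slug examples are placeholders to adapt to the server's published schema.

```python
# Illustrative system-prompt additions targeting the two observed failures.
# Intent names and slug formats are placeholders; match your server's schema.
SEQUENCING_INSTRUCTIONS = """\
When answering protocol questions, FIRST call the protocol tool with
intent="publishers", then copy a publisher name verbatim from that response
into your intent="search" call. Never guess publisher names.
When listing calculators, pass category names as exact snake_case slugs
(for example "ob_gyn", never "OB-GYN"); do not reformat them.
"""
```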
Few-shot examples of correct tool sequences. A worked example of discovery-then-search teaches the pattern without fine-tuning. For calculators, a few-shot example demonstrating the full discover-fetch-compute flow addresses cases where models short-circuit after seeing the formula.
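One hedged sketch of such a demonstration, written as a simplified tool-calling transcript. The message shape and tool name are illustrative, not any specific provider's wire format.

```python
# Hypothetical few-shot demonstration of discovery-then-search, written as a
# simplified tool-calling transcript. Shapes and names are illustrative only.
FEW_SHOT_TRAJECTORY = [
    {"role": "user",
     "content": "What do RSSDI guidelines say about initiating insulin?"},
    {"role": "assistant",  # step 1: discover canonical publisher names
     "tool_call": {"name": "indian_treatment_protocol_search",
                   "arguments": {"intent": "publishers"}}},
    {"role": "tool",
     "content": '["RSSDI", "ICMR", "IAP", "MoHFW", "..."]'},
    {"role": "assistant",  # step 2: search with an exact name from step 1
     "tool_call": {"name": "indian_treatment_protocol_search",
                   "arguments": {"intent": "search",
                                 "query": "insulin initiation in type 2 diabetes",
                                 "publisher": "RSSDI"}}},
]
```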
Fine-tuning on tool-calling trajectories. For production deployments, fine-tuning on demonstrations of correct multi-step tool use changes the model's default behavior rather than relying on prompt engineering that can be overridden or degraded.
Benchmark retrieval and tool-calling behavior, not just final accuracy. The query rubrics in our Protocol Retrieval dataset directly measure publisher selection accuracy — a leading indicator of answer quality. The calculator error breakdown by failure class is more actionable than aggregate accuracy alone. If your evaluation only measures final accuracy, you cannot distinguish a model that retrieved the right source from one that got lucky on parametric knowledge.
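If your harness logs tool-call traces, even a first-pass classifier along these lines makes those distinctions measurable. The trace fields and class names here are assumptions about your own logging, not a KARMA API.

```python
# First-pass failure-class tagging over a logged tool-call trace.
# The trace structure is an ASSUMPTION about your own harness's logs,
# not a KARMA or MCP API.
def classify_calculator_trace(trace: list[dict], expected_id: str) -> str:
    computes = [c for c in trace if c.get("tool") == "compute_calculator"]
    if not computes:
        # Listed calculators (or nothing) but fell back to self-computation.
        return "never_computed"
    final = computes[-1]
    if final.get("arguments", {}).get("calculator_id") != expected_id:
        return "wrong_calculator_selected"
    if final.get("answer_modified_after_tool"):
        # Model received a correct result, then "converted" units on its own.
        return "unit_conversion_overreach"
    return "correct_tool_use"
```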
The MCP server is remote and hosted, and speaks standard MCP over HTTP. No SDK to install, no local process to run.
The quickstart documentation has per-client setup with config snippets, the tool reference covers all six tools and the three-step calculator workflow, and the authentication page documents both the OIDC flow and the direct-token path.
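For clients that accept remote MCP servers, the config entry typically looks something like the following. The endpoint URL and auth header are placeholders; use the quickstart's per-client snippets for the real values.

```json
{
  "mcpServers": {
    "eka-medai": {
      "url": "https://<eka-mcp-endpoint>",
      "headers": { "Authorization": "Bearer <your-token>" }
    }
  }
}
```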
Both the datasets and the KARMA evaluation framework are public. KARMA (Knowledge Assessment and Reasoning for Medical Applications) is an open-source toolkit for evaluating medical AI systems across text, image, and audio, with particular focus on India's healthcare environment. It includes standardized metrics, out-of-the-box support for major model providers, and a registry system that lets researchers integrate their own models and datasets with minimal friction.
```bash
pip install karma-medeval
karma eval --model <your-model> --datasets ekacare/indian_drug_mcqa
karma eval --model <your-model> --datasets ekacare/Eka_NFI_MCQA
karma eval --model <your-model> --datasets ekacare/medical_calculator_eval
karma eval --model <your-model> --datasets ekacare/protocol_retrieval_rubrics
```

Run both tracks, with tools and without. The lift between them is often more diagnostic than the absolute score. If you benchmark a model not listed here, we would like to see the results. Contribute at karma.eka.care or join the conversation on GitHub.
The drug and calculator results establish that tool access produces gains large enough to shift from clinically inadequate to clinically viable — 80% to 99% on drug identification, 43% to 82% on calculators. These are tasks where parametric memory does not compete well with authoritative lookup, regardless of model size or training quality. For Indian healthcare AI products deployed without grounding for these capabilities, the gap between model output and clinical acceptability is large enough to warrant attention before scaling further.
The protocol results establish something more nuanced and, for decision-makers, more important. The infrastructure is necessary but insufficient. Two models with equivalent baselines diverge by 13 percentage points based entirely on how they sequence tool calls. The calculator error analysis shows the same dynamic from a different angle — models that nominally use tools but mishandle enum fields, unit conventions, or category slug formats leave substantial accuracy on the table in ways aggregate benchmarks do not surface.
For teams deploying AI in Indian healthcare contexts, this points to two parallel workstreams. First, invest in tool infrastructure — drug databases, calculators, protocol search — built for Indian clinical reality, not adapted from Western equivalents. Second, invest in evaluating how your models use those tools, not just whether they have access. Treat tool-calling strategy as a variable you measure and optimize, not an implementation detail you assume is handled.
For senior stakeholders evaluating frontier models for clinical deployment, two gaps deserve attention before procurement decisions are finalized. The gap between "this model passes USMLE" and "this model handles Indian clinical context reliably" is real and measurable. So is the gap between "this model has tools" and "this model uses them well." We have tried to make both gaps visible, and to put the infrastructure (the tools, the datasets, the evaluation framework) into the public domain so the Indian medical AI community can run, extend, and improve them together.
Rigorous evaluation, covering not only accuracy but also retrieval behavior, tool-calling strategy, and error-class breakdowns, is the part of this work we think is most undersupplied today. We hope the resources released here are useful starting points, and we welcome contributions from teams working on similar problems.
This work would not have come together without Nikhil Kasukurthi, who led much of the engineering and evaluation effort behind MedAI Tools and KARMA. The tool design, the dataset curation, the benchmarking pipeline, and most of the analysis surfaced in this post are the result of his work. Thanks, Nikhil.