Large Language Models (LLMs) have moved into clinical workflows quickly. They now draft clinical notes, triage patient queries, surface drug information, and guide protocol decisions across a range of healthcare products. The adoption is real, and it is running ahead of the conversation about how to make these systems safe enough for those settings.
Healthcare doesn't tolerate mistakes the way most domains do. If a chatbot hallucinates a restaurant recommendation, you end up with a slightly disappointing dinner. If it hallucinates a drug dose, someone gets hurt. That asymmetry matters. It changes what "good enough" looks like, and it argues for a different engineering posture: clinical claims shouldn't come from a model's fuzzy compression of its training data — they need to be grounded in sources that are actually authoritative.
A common approach to grounding is to give models structured access to authoritative knowledge at inference time through tools, MCP servers, or retrieval pipelines, rather than relying entirely on what they absorbed during pretraining. The approach works. But how much it helps, on which tasks, and whether models actually use these tools well are empirical questions. Without benchmarks designed for those questions, "we have grounding" can become a checkbox with little behind it.
These questions get harder when the context is Indian healthcare.
India's clinical reality is not a regional flavor on top of Western medicine. It is, for AI purposes, a different problem space.
The country has more than 500,000 branded drugs in active circulation, most with no US or EU equivalent. Hundreds of manufacturers produce variants of the same generic under different brand names, and the same brand name occasionally maps to different compositions across manufacturers. An LLM that can confidently identify Tylenol or Lipitor often cannot tell you what Jorolac or Cefotel Suspension contains. That is not a model weakness; it is a data density problem. These molecules do not appear with enough frequency in pretraining corpora for parametric memory to handle them reliably.
Clinical protocols are similarly local. ICMR, RSSDI, IAP, MoHFW, and various state health ministries publish guidelines that diverge from FDA or NICE recommendations, partly because of population phenotypic differences and partly because the operational realities of Indian healthcare delivery are different. The National Formulary of India defines its own pharmacological standards. eGFR calculations need attention to Indian-specific patient parameters. Type 2 diabetes management follows RSSDI guidance, not ADA.
A model that scores well on USMLE is, in this context, being measured on a test that does not reflect the conditions it would be deployed into. Most medical AI benchmarks are built for Western healthcare; performance on those benchmarks tells you how a model handles Western drugs, Western protocols, Western clinical conventions. The distance between that and Indian clinical practice is real, consistent, and not closed by simply scaling up the model.
Given the scale of healthcare delivery in India, that gap is not a small one to leave unaddressed.
To address this, we built two things at EkaCare: a remote MCP server that gives any compatible LLM client structured access to Indian medical knowledge, and four open datasets that let the community measure how well models actually use that access.
The MCP server exposes four knowledge domains through six tools.
Indian Branded Drug Search indexes 500,000+ branded medications drawn from real EMR prescription data — the long tail of drugs Indian doctors actually prescribe, not textbook examples. The tool turns a coverage problem into a lookup problem.
Indian Treatment Protocol Search indexes hundreds of published guidelines from ICMR, RSSDI, IAP, MoHFW, and international publishers in a vector knowledge base. Results return as the actual protocol pages, preserving table structure and clinical notation rather than text re-rendered through another model. The tool exposes two intents: publishers (discover available guideline sources) and search (retrieve by clinical query and publisher name). That two-step design turns out to matter a lot; more on this below.
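To make the two-intent shape concrete, here is a minimal sketch of driving the protocol tool from the official MCP Python SDK. The tool name, intent values, and endpoint below are illustrative assumptions, not the server's exact schema; the tool reference linked later documents the real contract.

```python
# Minimal sketch of the two-step protocol flow over MCP, using the official
# Python SDK's streamable-HTTP client. Tool and argument names here are
# ASSUMPTIONS for illustration; consult the tool reference for the real schema.
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

SERVER_URL = "https://<eka-mcp-endpoint>"  # placeholder; see the quickstart docs


async def main() -> None:
    async with streamablehttp_client(SERVER_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Step 1: discover the canonical publisher names.
            publishers = await session.call_tool(
                "indian_treatment_protocol_search", {"intent": "publishers"}
            )
            print(publishers)
            # Step 2: search with a publisher name copied verbatim from step 1.
            protocols = await session.call_tool(
                "indian_treatment_protocol_search",
                {
                    "intent": "search",
                    "query": "first-line therapy for type 2 diabetes",
                    "publisher": "RSSDI",
                },
            )
            print(protocols)


asyncio.run(main())
```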
Indian Pharmacology Details provides structured lookups from the National Formulary of India 2011: indications, contraindications, dosage, adverse effects, and pregnancy safety for generic drugs.
Medical Calculators expose 403 clinical calculators across 26 categories — cardiovascular, nephrology, hematology, OB-GYN, pulmonology, diabetes, pediatrics, and more — as three separate tools that mirror how a clinician approaches an unfamiliar formula: discover available calculators, fetch the input schema, then compute. The model cannot silently misremember the Parkland equation or approximate CKD-EPI. It either runs the right calculator with the right inputs, or it does not.
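Continuing the same session idiom as the sketch above, the three-step calculator flow might look like the following. The tool names, calculator id, and input fields are hypothetical placeholders, not the server's actual schema.

```python
# Sketch of the discover -> fetch-schema -> compute calculator flow.
# All tool names, ids, and fields below are HYPOTHETICAL placeholders.
from mcp import ClientSession


async def compute_egfr(session: ClientSession):
    # 1. Discover: which calculators match the clinical question?
    listing = await session.call_tool(
        "list_medical_calculators",
        {"query": "estimated GFR", "category": "nephrology"},
    )
    calculator_id = "ckd_epi_2021"  # in practice, picked from `listing`

    # 2. Fetch the input schema so units and enums are mapped exactly.
    schema = await session.call_tool(
        "get_calculator_schema", {"calculator_id": calculator_id}
    )

    # 3. Compute deterministically; the model never does the arithmetic itself.
    return await session.call_tool(
        "compute_calculator",
        {
            "calculator_id": calculator_id,
            "inputs": {"age": 54, "sex": "female", "serum_creatinine_mg_dl": 1.1},
        },
    )
```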
To make this evaluable, not just usable, we open-sourced four datasets on HuggingFace under the ekacare organization. Each is also integrated into KARMA, our open-source medical AI evaluation framework.
We chose to release these openly because closed benchmarks tend to reflect the priorities of one team, while shared benchmarks let a community converge on what works.
We benchmarked frontier LLMs against all four datasets, with and without tool access. The pattern that emerges is more interesting than a simple "tools help."
On the drug identification task, both frontier models score in the high 70s to low 80s without tools: reasonable, but well short of what a clinical drug identification application would need. With tool access, both clear 99%.
The gap here is not pharmacological understanding. It is data density. 500,000 brand names and manufacturer variants do not appear often enough in pretraining for parametric memory to handle them reliably. The tool closes the gap by making real, accurate, continually-growing data accessible at inference time. No amount of model scaling fixes a coverage problem.
GPT-5-mini is the outlier, and a useful one. Even with the same tool access, it gains only 4.4 points. Smaller models struggle to reliably invoke tools and parse structured outputs; tool access helps only when the model uses the tool well. That pattern recurs across the rest of the evaluation.
The pharmacology dataset tells a different story: pharmacological principles transfer better across training corpora, and a model that learned general pharmacology from Western literature is not starting from zero on mechanisms and indications. But the no-tool ceiling still falls meaningfully below what clinical use requires. The NFI tool moves both frontier models into the mid-80s, recovering accuracy on India-specific dosing conventions and prescription patterns, the kind of detail that matters for handling local antimicrobial resistance (AMR) realities.
Calculators show the largest lift in the entire evaluation: 38 percentage points for Claude Sonnet 4.6. Models do not fail calculator questions because they do not know what BMI means. They fail because they misremember formula variants, mishandle unit conversions, or compound arithmetic errors across multi-step chains. Deterministic computation eliminates all of that.
But aggregate accuracy hides where models are still failing. Error analysis across 1,066 samples reveals two failure classes that persist even with tool access. The first is wrong calculator selected — picking a semantically adjacent but distinct tool, like the generic BMI calculator where a sex-specific variant is required. The second, affecting 12–13% of samples across all three models, is right calculator, wrong inputs: either enum misselection (mapping "sedentary" to "very_active" in a TDEE activity field) or unit-conversion overreach, where the model receives a correct result, decides it should be in a different unit, and modifies the answer — introducing an error the calculator did not make.
GPT-5-mini exhibits one additional failure: it passes natural-language category names instead of the required snake_case slugs, gets back empty lists, and falls back to self-computation. That single bug accounts for most of its 7.8% "listed but never computed" rate, versus 1–2.4% for the frontier models.
The takeaway: aggregate accuracy hides whether a failure is at tool invocation, calculator selection, input mapping, or output handling. Each is a different problem with a different fix.
Protocol retrieval is where the evaluation gets most interesting.
Claude Sonnet 4.6 and GPT-5.2 start with nearly identical no-tool baselines: 76.9% and 78.1%. With tool access, Claude reaches 92.0%. GPT-5.2 reaches 78.9%. Less than a percentage point of gain, from the same tools, on the same dataset.
The difference is not access. It is strategy.
The protocol tool, recall, has two intents: publishers (enumerate guideline sources) and search (retrieve by query and publisher). Claude calls publishers first in 99% of samples before searching. GPT-5.2 goes straight to search in 65% of samples, often with an imprecise or guessed publisher name.
This matters because the metadata filter is strict. "RSSDI" and "Research Society for the Study of Diabetes in India" both work. "Indian Diabetes Association" retrieves nothing. A model that calls the tool but supplies a wrong publisher name gets no useful result, and is no better than a model that did not call the tool at all. GPT-5.2 also skips tool use entirely on 4% of samples.
Claude consistently makes at least two calls and often three or more when a question spans multiple guidelines. More calls means more relevant protocol text in context, which translates directly to better answers.
The pattern is consistent with what the calculator error analysis surfaced from a different angle: how a model uses tools matters as much as whether it has access to them. On protocols, skipping discovery means guessing publisher names and getting empty results. On calculators, passing natural-language slugs means getting empty category lists and falling back to self-computation. The mechanism differs, but the underlying failure is similar: the model is not using the tool's structure as intended.
Discovery before retrieval is a one-call overhead, not an advanced technique. The models that skip it leave accuracy on the table — in our evaluation, 13 percentage points between the two best-performing models on the protocol task.
The error analysis points to concrete interventions, each tied to a specific failure mode.
Explicit sequencing instructions in system prompts. GPT-5.2 uses the protocol tool; it just skips the discovery call. A single instruction specifying call order would likely recover most of its gap. Specifying that calculator category slugs must use exact snake_case formatting would eliminate GPT-5-mini's most damaging failure mode. These are the lowest-cost fixes and should be tested first.
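As a sketch of what that could look like, here are a couple of system-prompt lines targeting both failures. The intent names and slug examples are placeholders to adapt to the server's published schema.

```python
# Illustrative system-prompt additions targeting the two observed failures.
# Intent names and slug formats are placeholders; match your server's schema.
SEQUENCING_INSTRUCTIONS = """\
When answering protocol questions, FIRST call the protocol tool with
intent="publishers", then copy a publisher name verbatim from that response
into your intent="search" call. Never guess publisher names.
When listing calculators, pass category names as exact snake_case slugs
(for example "ob_gyn", never "OB-GYN"); do not reformat them.
"""
```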
Few-shot examples of correct tool sequences. A worked example of discovery-then-search teaches the pattern without fine-tuning. For calculators, a few-shot example demonstrating the full discover-fetch-compute flow addresses cases where models short-circuit after seeing the formula.
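One hedged sketch of such a demonstration, written as a simplified tool-calling transcript. The message shape and tool name are illustrative, not any specific provider's wire format.

```python
# Hypothetical few-shot demonstration of discovery-then-search, written as a
# simplified tool-calling transcript. Shapes and names are illustrative only.
FEW_SHOT_TRAJECTORY = [
    {"role": "user",
     "content": "What do RSSDI guidelines say about initiating insulin?"},
    {"role": "assistant",  # step 1: discover canonical publisher names
     "tool_call": {"name": "indian_treatment_protocol_search",
                   "arguments": {"intent": "publishers"}}},
    {"role": "tool",
     "content": '["RSSDI", "ICMR", "IAP", "MoHFW", "..."]'},
    {"role": "assistant",  # step 2: search with an exact name from step 1
     "tool_call": {"name": "indian_treatment_protocol_search",
                   "arguments": {"intent": "search",
                                 "query": "insulin initiation in type 2 diabetes",
                                 "publisher": "RSSDI"}}},
]
```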
Fine-tuning on tool-calling trajectories. For production deployments, fine-tuning on demonstrations of correct multi-step tool use changes the model's default behavior rather than relying on prompt engineering that can be overridden or degraded.
Benchmark retrieval and tool-calling behavior, not just final accuracy. The query rubrics in our Protocol Retrieval dataset directly measure publisher selection accuracy — a leading indicator of answer quality. The calculator error breakdown by failure class is more actionable than aggregate accuracy alone. If your evaluation only measures final accuracy, you cannot distinguish a model that retrieved the right source from one that got lucky on parametric knowledge.
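If your harness logs tool-call traces, even a first-pass classifier along these lines makes those distinctions measurable. The trace fields and class names here are assumptions about your own logging, not a KARMA API.

```python
# First-pass failure-class tagging over a logged tool-call trace.
# The trace structure is an ASSUMPTION about your own harness's logs,
# not a KARMA or MCP API.
def classify_calculator_trace(trace: list[dict], expected_id: str) -> str:
    computes = [c for c in trace if c.get("tool") == "compute_calculator"]
    if not computes:
        # Listed calculators (or nothing) but fell back to self-computation.
        return "never_computed"
    final = computes[-1]
    if final.get("arguments", {}).get("calculator_id") != expected_id:
        return "wrong_calculator_selected"
    if final.get("answer_modified_after_tool"):
        # Model received a correct result, then "converted" units on its own.
        return "unit_conversion_overreach"
    return "correct_tool_use"
```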
The MCP server is remote and hosted, and speaks standard MCP over HTTP. No SDK to install, no local process to run.
The quickstart documentation has per-client setup with config snippets, the tool reference covers all six tools and the three-step calculator workflow, and the authentication page documents both the OIDC flow and the direct-token path.
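For clients that accept remote MCP servers, the config entry typically looks something like the following. The endpoint URL and auth header are placeholders; use the quickstart's per-client snippets for the real values.

```json
{
  "mcpServers": {
    "eka-medai": {
      "url": "https://<eka-mcp-endpoint>",
      "headers": { "Authorization": "Bearer <your-token>" }
    }
  }
}
```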
Both the datasets and the KARMA evaluation framework are public. KARMA (Knowledge Assessment and Reasoning for Medical Applications) is an open-source toolkit for evaluating medical AI systems across text, image, and audio, with particular focus on India's healthcare environment. It includes standardized metrics, out-of-the-box support for major model providers, and a registry system that lets researchers integrate their own models and datasets with minimal friction.
```bash
pip install karma-medeval
karma eval --model <your-model> --datasets ekacare/indian_drug_mcqa
karma eval --model <your-model> --datasets ekacare/Eka_NFI_MCQA
karma eval --model <your-model> --datasets ekacare/medical_calculator_eval
karma eval --model <your-model> --datasets ekacare/protocol_retrieval_rubrics
```

Run both tracks, with tools and without. The lift between them is often more diagnostic than the absolute score. If you benchmark a model not listed here, we would like to see the results. Contribute at karma.eka.care or join the conversation on GitHub.
The drug and calculator results establish that tool access produces gains large enough to shift from clinically inadequate to clinically viable — 80% to 99% on drug identification, 43% to 82% on calculators. These are tasks where parametric memory does not compete well with authoritative lookup, regardless of model size or training quality. For Indian healthcare AI products deployed without grounding for these capabilities, the gap between model output and clinical acceptability is large enough to warrant attention before scaling further.
The protocol results establish something more nuanced and, for decision-makers, more important. The infrastructure is necessary but insufficient. Two models with equivalent baselines diverge by 13 percentage points based entirely on how they sequence tool calls. The calculator error analysis shows the same dynamic from a different angle — models that nominally use tools but mishandle enum fields, unit conventions, or category slug formats leave substantial accuracy on the table in ways aggregate benchmarks do not surface.
For teams deploying AI in Indian healthcare contexts, this points to two parallel workstreams. First, invest in tool infrastructure — drug databases, calculators, protocol search — built for Indian clinical reality, not adapted from Western equivalents. Second, invest in evaluating how your models use those tools, not just whether they have access. Treat tool-calling strategy as a variable you measure and optimize, not an implementation detail you assume is handled.
For senior stakeholders evaluating frontier models for clinical deployment, two gaps deserve attention before procurement decisions are finalized. The gap between "this model passes USMLE" and "this model handles Indian clinical context reliably" is real and measurable. So is the gap between "this model has tools" and "this model uses them well." We have tried to make both gaps visible, and to put the infrastructure (the tools, the datasets, the evaluation framework) into the public domain so the Indian medical AI community can run, extend, and improve them together.
Rigorous evaluation, covering not only accuracy but also retrieval behavior, tool-calling strategy, and error-class breakdowns, is the part of this work we think is most undersupplied today. We hope the resources released here are useful starting points, and we welcome contributions from teams working on similar problems.
This work would not have come together without Nikhil Kasukurthi, who led much of the engineering and evaluation effort behind MedAI Tools and KARMA. The tool design, the dataset curation, the benchmarking pipeline, and most of the analysis surfaced in this post are the result of his work. Thanks, Nikhil.