Introducing KARMA-OpenMedEvalKit: An Open-Source Framework for Medical AI Evaluation

July 24, 2025

AI models often struggle to address India's distinctive healthcare landscape because they are trained predominantly on Western datasets. As a result, they often miss crucial elements such as India's linguistic diversity, healthcare system dynamics, varied genetic backgrounds, and specific disease prevalence patterns.

To bridge this gap, robust evaluation frameworks are essential: ones that incorporate representative datasets and meaningful metrics tailored to assess AI model performance in diverse healthcare contexts. KARMA-OpenMedEvalKit is a step toward this goal. KARMA is an extensible evaluation toolkit designed specifically for assessing AI models in medical applications, featuring multiple healthcare-focused datasets with particular emphasis on the Indian healthcare environment.

This comprehensive approach ensures that AI systems can be properly evaluated for their effectiveness in real-world medical scenarios, particularly those reflecting the complexities and specificities of Indian healthcare delivery.

Alongside KARMA, we are contributing four novel evaluation datasets to the research community, specifically designed for developing and assessing medical large language models (LLMs). Additionally, we are making publicly available two specialised LLMs: one optimised for Medical Automatic Speech Recognition (ASR) in English, and another tailored for medical document comprehension.

These resources represent our commitment to advancing the field by providing researchers and developers with the tools necessary to create more effective and contextually appropriate medical AI systems.

What is KARMA?

KARMA (/ˈkɑrmə/) stands for Knowledge Assessment and Reasoning for Medical Applications. It provides a unified package for evaluating medical AI systems, supporting text, image, and audio-based models. This extensible framework supports 19 medical datasets and offers standardized evaluation metrics commonly used in healthcare AI research, along with out-of-the-box support for models like Qwen, MedGemma, IndicConformer, OpenAI, and AWS Bedrock (Anthropic models).

KARMA's registry system allows researchers to integrate their own models and datasets. Models are added by inheriting from the framework's base classes and registering them with its decorator system. Similarly, new datasets are registered with their metrics and task types. KARMA also supports custom metrics for datasets whose model output needs specific post-processing, as with Automatic Speech Recognition (ASR) models, where language-specific nuances must be handled before the metrics are calculated. In addition, KARMA caches model outputs, so metrics can be re-computed or outputs re-used as needed.
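The registry-and-decorator pattern described above can be illustrated with a small, framework-agnostic sketch. Every name in it (`BaseModel`, `ModelRegistry`, `register_model`, `EchoModel`) is hypothetical and chosen for illustration only; it is not KARMA's actual API, for which the documentation should be consulted.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class BaseModel(ABC):
    """Hypothetical base class that every registered model inherits from."""

    @abstractmethod
    def run(self, inputs: List[str]) -> List[str]:
        """Return one output string per input string."""


class ModelRegistry:
    """Maps a string name to a model class so it can be looked up later."""

    def __init__(self) -> None:
        self._models: Dict[str, Type[BaseModel]] = {}

    def register_model(
        self, name: str
    ) -> Callable[[Type[BaseModel]], Type[BaseModel]]:
        """Class decorator that records the decorated class under `name`."""

        def decorator(cls: Type[BaseModel]) -> Type[BaseModel]:
            if not issubclass(cls, BaseModel):
                raise TypeError(f"{cls.__name__} must inherit from BaseModel")
            if name in self._models:
                raise ValueError(f"model '{name}' is already registered")
            self._models[name] = cls
            return cls

        return decorator

    def create(self, name: str) -> BaseModel:
        """Instantiate the model class registered under `name`."""
        return self._models[name]()

    def names(self) -> List[str]:
        return sorted(self._models)


registry = ModelRegistry()


@registry.register_model("echo")
class EchoModel(BaseModel):
    """Toy model that returns its inputs unchanged."""

    def run(self, inputs: List[str]) -> List[str]:
        return list(inputs)


model = registry.create("echo")
outputs = model.run(["fever for 3 days"])
```

The point of the pattern is that the evaluation loop only needs a name: adding a model means defining one class and decorating it, without touching the code that runs the evaluation.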

KARMA can be used through the CLI or imported as a Python package. The CLI also has an interactive mode for selecting the model and datasets to evaluate.

Supported Datasets

KARMA includes evaluation support for medical datasets with the following task types:

Question Answering: PubMedQA, MedMCQA, USMLE-style questions
Clinical Tasks: Note generation, medical record parsing, medical history summarisation
Multimodal: Medical VQA, radiology image analysis
Audio: Clinical conversation transcription, medical dictation

View the complete dataset catalog. Adding a new dataset involves adding just a single new file to KARMA; read further in the documentation here.

Supported Models

Open Source: Qwen, MedGemma, and IndicConformer; any Hugging Face model can also be integrated into KARMA
API Services: OpenAI GPT-4, AWS Bedrock, Eleven Labs, Gemini

Browse all supported models. Here’s how you can add your own models to KARMA.

Supported Metrics

All Hugging Face metrics are supported out of the box, and custom metrics like HealthBench's rubric-based evaluation have also been added.

We also introduce two new ASR metrics, semantic WER and entity/keyword WER, to address the challenges of evaluating ASR models in healthcare. Semantic WER (word error rate) measures transcription accuracy based on semantic meaning, handling numerals, synonyms, abbreviations, and code-switching. Keyword WER evaluates the accuracy of critical medical terms like drug names, dosages, and vital signs within transcripts. Read more about these metrics here.
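To make the idea concrete, here is a minimal sketch. It is not KARMA's implementation. Standard WER is the word-level edit distance divided by the reference length. One simple way to approximate a semantic variant is to map both transcripts to a canonical form before scoring, and a keyword variant can be approximated by checking only a list of critical terms. The normalisation table `NORMALISE`, the keyword list, and all function names are assumptions made for illustration. The sketch handles only numerals and abbreviations, not synonyms or code-switching.

```python
from typing import Dict, List, Sequence

# Hypothetical normalisation table: spoken numerals and abbreviations are
# mapped to one canonical form before scoring. A real system would need a
# much larger, language-aware table.
NORMALISE: Dict[str, str] = {
    "five": "5",
    "milligram": "mg",
    "milligrams": "mg",
    "bp": "blood pressure",
}


def tokens(text: str, normalise: bool = True) -> List[str]:
    """Lower-case, split on whitespace, optionally apply the table."""
    words = text.lower().split()
    if not normalise:
        return words
    out: List[str] = []
    for word in words:
        # A table entry may expand to several words, so split again.
        out.extend(NORMALISE.get(word, word).split())
    return out


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str, normalise: bool = False) -> float:
    """Edit distance over reference length, optionally after normalising."""
    ref = tokens(reference, normalise)
    hyp = tokens(hypothesis, normalise)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def keyword_error_rate(
    reference: str, hypothesis: str, keywords: Sequence[str]
) -> float:
    """Fraction of the reference's keywords missing from the hypothesis."""
    ref = set(tokens(reference))
    hyp = set(tokens(hypothesis))
    present = [k for k in keywords if k in ref]
    if not present:
        return 0.0
    return sum(1 for k in present if k not in hyp) / len(present)


reference = "paracetamol 5 mg twice daily"
plain = wer(reference, "paracetamol five milligrams twice daily")
semantic = wer(reference, "paracetamol five milligrams twice daily", normalise=True)
keyword = keyword_error_rate(
    reference, "paracetamal 5 mg twice daily", ["paracetamol", "mg"]
)
```

In this example the plain WER is 0.4, because "five" and "milligrams" count as two substitutions out of five words, while the normalised score is 0.0 because the two transcripts mean the same thing. The keyword score is 0.5: the misspelt drug name is one miss out of two keywords, an error that plain WER would weight the same as any other word.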

Eka care dataset and model releases

In addition to KARMA, we are also releasing two models and four evaluation datasets, already integrated within KARMA.

Datasets

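We are launching these datasets along with KARMA:

Medical ASR Evaluation Dataset
Medical Records Parsing Evaluation Dataset
Structured Clinical Note Generation Dataset
Eka Medical Summarisation Dataset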

All the datasets have been curated by the medical-doctor team at EkaCare. All datapoints undergo rigorous PII removal to ensure complete privacy. Read more about these datasets here.

Models

Along with our datasets, we are also releasing two models from our Parrotlet series publicly under the MIT license.

Parrotlet-a-en-5b: A purpose-built model for English automatic speech recognition in medical contexts (blog post)
Parrotlet-v-lite-4b: A purpose-built model for medical report understanding (blog post)

Getting Started with Karma

The framework is available as a Python package and can be installed via pip.

Installation

pip install karma-medeval

First evaluation

$ karma eval --model "Qwen/Qwen3-0.6B" --datasets openlifescienceai/pubmedqa --max-samples 3

{
  "openlifescienceai/pubmedqa": {
    "metrics": {
      "exact_match": {
        "score": 0.3333333333333333,
        "evaluation_time": 0.9702351093292236,
        "num_samples": "3"
      }
    },
    "task_type": "mcqa",
    "status": "completed",
    "dataset_args": {},
    "evaluation_time": "7.378399848937988"
  },
  "_summary": {
    "model": "Qwen/Qwen3-0.6B",
    "model_path": "Qwen/Qwen3-0.6B",
    "total_datasets": 1,
    "successful_datasets": 1,
    "total_evaluation_time": 7.380354166030884,
    "timestamp": "2025-07-22 18:43:07"
  }
}
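Because the result is plain JSON, it can be post-processed with standard tooling. The sketch below assumes the output above has been captured as a string (abbreviated here to the fields it uses). The helper name `summarise` is ours and is not part of KARMA.

```python
import json
from typing import Dict

# Abbreviated copy of the evaluation output shown above.
RAW = """
{
  "openlifescienceai/pubmedqa": {
    "metrics": {
      "exact_match": {
        "score": 0.3333333333333333,
        "evaluation_time": 0.9702351093292236,
        "num_samples": "3"
      }
    },
    "task_type": "mcqa",
    "status": "completed",
    "dataset_args": {},
    "evaluation_time": "7.378399848937988"
  },
  "_summary": {
    "model": "Qwen/Qwen3-0.6B",
    "total_datasets": 1,
    "successful_datasets": 1
  }
}
"""


def summarise(raw: str) -> Dict[str, Dict[str, float]]:
    """Return {dataset: {metric: score}} for every completed dataset."""
    results = json.loads(raw)
    table: Dict[str, Dict[str, float]] = {}
    for name, entry in results.items():
        if name.startswith("_"):  # skip bookkeeping keys such as "_summary"
            continue
        if entry.get("status") != "completed":
            continue
        table[name] = {
            metric: float(values["score"])
            for metric, values in entry["metrics"].items()
        }
    return table


scores = summarise(RAW)
for dataset, metrics in scores.items():
    for metric, score in metrics.items():
        print(f"{dataset}\t{metric}\t{score:.3f}")
```

With three samples, an exact-match score of 0.333 corresponds to one correct answer out of three, so a run this small is useful as a smoke test rather than as a benchmark.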

Documentation and Resources

Complete documentation, installation guides, and examples are available at https://karma.eka.care. Source code is hosted on GitHub at https://github.com/eka-care/KARMA-OpenMedEvalKit.

KARMA aims to provide researchers and healthcare organizations with consistent evaluation tools for medical AI development, supporting reproducible research and systematic model comparison across different medical domains.

Roadmap

KARMA is released under the MIT License. We look forward to building Indian medical evaluations with the community. The datasets and models supported through KARMA will continue to evolve.

Join the Indian Medical AI Revolution

Get Started

pip install karma-medeval

Run your first evaluation in 5 minutes.

Performance Benchmarking

Both the models released by EkaCare and its internal systems are evaluated and benchmarked using the KARMA framework. Below we present benchmarking results obtained with KARMA on the four datasets mentioned above.

Get Involved

Contribute Data: Integrate your datasets into KARMA by adding just a single file: https://karma.eka.care/user-guide/add-your-own/add-dataset/
Join Discussion: Connect with 500+ researchers in our Developer Discord
Report Issues: Help us improve on GitHub Issues
Get Support: Reach our team at developer@eka.care