Lab-Ready and Prescription-Perfect: Eka Care's Small LLMs vs. Industry Giants

November 26, 2024

In this post, we share our motivation for building custom small LLMs for medical document understanding in the Indian context. We show that by training on high-quality data specific to the Indian healthcare ecosystem, we've outperformed giant state-of-the-art (SOTA) models such as GPT-4o and Claude Sonnet 3.5.

The Healthcare Data Maze

Healthcare in India is at a crucial turning point, with technology holding great potential for transformation. However, several challenges are currently hindering this progress:

  1. Low Interoperability: Medical data in India exists in silos; there is barely any digital exchange of information between hospitals, labs, and other stakeholders. This lack of interoperability impacts patient care, delays decision-making, and adds inefficiencies to an already strained system.
  2. Minimal Digitization: While documents like lab reports largely exist in digital and pseudo-structured formats, prescriptions and other healthcare documents are still transitioning from handwritten to digital formats.
  3. Unstructured and Inconsistent Data: Healthcare records, whether digital or handwritten, are largely unstructured and lack standardization. Lab reports, prescriptions, and discharge summaries exist in diverse formats and use varied medical vocabulary, often combining free text, abbreviations, and non-standardized terms. These inconsistencies make extracting meaningful information a complex task.

LLMs to the Rescue

The challenges mentioned above create ideal ground for Large Language Models (LLMs) to shine. Advanced LLMs can potentially understand, process, and interpret unstructured data, making them uniquely suited to tackle India’s healthcare challenges. Until medical record creation is truly digital and data exchange happens over standards such as FHIR, LLMs can transform unstructured data into structured data with machine-understandable linkages to ontologies such as SNOMED CT and LOINC. This would pave the way for seamless interoperability and better clinical decision-making. With the right high-quality data and models tuned to specific needs, LLMs present a unique opportunity to streamline processes and improve outcomes.
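To make the ontology-linkage idea concrete, here is a minimal sketch of what a structured lab observation with a LOINC linkage might look like. The field names and the FHIR-like rendering are illustrative assumptions on our part, not Eka Care's actual schema; `718-7` is the LOINC code for hemoglobin mass concentration in blood.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical structured record for one lab observation. The schema here
# is an illustration of ontology linkage, not Eka Care's production format.
@dataclass
class LabObservation:
    test_name: str
    value: float
    unit: str
    loinc_code: Optional[str]  # machine-understandable link to LOINC

def to_fhir_like(obs: LabObservation) -> dict:
    """Render the observation as a minimal FHIR-Observation-like dict."""
    return {
        "resourceType": "Observation",
        "code": {"coding": [{"system": "http://loinc.org",
                             "code": obs.loinc_code,
                             "display": obs.test_name}]},
        "valueQuantity": {"value": obs.value, "unit": obs.unit},
    }

obs = LabObservation("Hemoglobin", 13.2, "g/dL", "718-7")
print(to_fhir_like(obs)["code"]["coding"][0]["code"])  # prints 718-7
```

Once free text is lifted into records like this, downstream systems can exchange and query it without caring which lab or format it originally came from.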

Tailored Solutions for India’s Healthcare Needs

The unique challenges of India’s healthcare ecosystem demand GenAI solutions specifically designed to address its needs. While general-purpose LLMs offer impressive capabilities, they often fall short when applied to the specific requirements of Indian medical data. This is where customized LLMs come into play, offering precision, efficiency, and scalability to bridge the gap.

1. Understanding India-specific formats and terminologies

General-purpose LLMs are typically trained on broad datasets, but healthcare data in India has its own complexities:

  • India-specific entities: Named entities such as branded drugs are often sold only in India and are therefore underrepresented in the global text corpus. As a result, these entities are not well understood by SOTA LLMs.

Consider the example of the Zin 10 mg tablet, whose main component is cetirizine. The screenshot below clearly shows that even the best SOTA models understand it incorrectly.

Conversation with a SOTA model highlighting the incorrect understanding of the model about a drug sold in India.
  • Localized Terminologies: Indian healthcare uses a blend of vernacular, medical abbreviations, and colloquial terms (e.g., "Sugar" for diabetes, “Ghera” for dizziness, “?” for suspected, etc.).
  • Diverse Formats: Lab reports, prescriptions, and other medical documents vary widely across regions and institutions. The examples below show how different labs prefer different formats for printing their lab reports.
Screenshot highlighting three different formats of lab reports generated by different facilities

Training LLMs on high-quality, India-specific datasets ensures that these models understand and process the unique nuances present in Indian healthcare documents with high accuracy.

2. Frugal Small Models for Scalable Operations

Training and deploying massive, resource-intensive models at scale isn't always practical due to cost constraints. Smaller, task-specific models are more computationally efficient, allowing them to be trained and deployed even in resource-limited environments. Customized small models can enable scalable adoption without compromising performance.

3. Minimizing Hallucinations and Errors

General-purpose LLMs might be more prone to generating hallucinated outputs, especially when dealing with unfamiliar formats and contexts. In healthcare, such errors can have serious consequences. Custom LLMs trained on task-specific data reduce the likelihood of inaccuracies by narrowing their operational scope.

Eka Care's Small LLM vs the Giants

While large-scale models like Sonnet 3.5 and GPT-4o are renowned for their capabilities, they often struggle with the specific challenges of Indian healthcare data. We perform extensive evaluations to understand their shortcomings and subsequently develop small, specialized LLMs. Our small LLM, which we name Parrotlet-V, excels in tasks like lab report parsing, prescription extraction, PII redaction, and document classification, consistently outperforming these industry giants.

Our benchmarking dataset includes thousands of meticulously annotated medical documents spanning lab reports, digital prescriptions, discharge summaries, health insurance policies, and radiology scans. We employ standardized evaluation methodologies, comparing structured outputs entity by entity for precise assessment. To maximize the performance of SOTA LLMs for a justified comparison, we also experimented with prompt tuning, carefully guiding the models to produce outputs in the desired format.

Below, we present the results of our benchmarking experiments. The score here represents the fuzzy matching score at the entity level, aggregated over the corpus.
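A metric of this shape can be sketched as follows, using character-level similarity from Python's standard library. This is our guess at how such a score could be computed, not Eka Care's actual evaluation code; the averaging scheme and field handling are assumptions.

```python
from difflib import SequenceMatcher

def fuzzy(a: str, b: str) -> float:
    # Character-level similarity ratio in [0, 1].
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def entity_score(gold: dict, pred: dict) -> float:
    # Average fuzzy similarity over the gold entity fields; a field the
    # model misses entirely scores 0 (illustrative scheme, an assumption).
    if not gold:
        return 1.0
    return sum(fuzzy(v, pred.get(k, "")) for k, v in gold.items()) / len(gold)

def corpus_score(pairs) -> float:
    # Aggregate: mean entity-level score over all (gold, pred) document pairs.
    return sum(entity_score(g, p) for g, p in pairs) / len(pairs)

gold = {"test_name": "Hemoglobin", "value": "13.2", "unit": "g/dL"}
pred = {"test_name": "Haemoglobin", "value": "13.2", "unit": "g/dL"}
print(round(entity_score(gold, pred), 2))  # close to 1.0 despite the spelling variant
```

Fuzzy matching is what lets a prediction like "Haemoglobin" still earn near-full credit against a gold "Hemoglobin", while outright wrong or missing values are penalized.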

Comparison of accuracy across models for four different tasks

Detailed entity-level results of Parrotlet-V vs SOTA models for different documents are presented below.

Lab-report parsing:
Prescription parsing:
PII extraction:

The results highlight the Parrotlet-V model's superior performance compared to state-of-the-art (SOTA) models. Parrotlet-V not only outperforms models with a comparable number of parameters, such as Qwen2-VL 7B, Llama 3.2 Vision 11B, and Phi-3.5 Vision 4.2B, but also surpasses the largest SOTA models in performance.

On further investigation, we find that one of the major reasons for the lower accuracy of the models mentioned above is hallucination: models often invent a value in the structured output even when that field is empty in the medical document. Examples of these hallucinations are given below. We overcome them by carefully balancing the dataset so that it contains ample instances of blank fields (sparse tables), and by using techniques such as DPO alongside SFT. In addition, some models at times fail to follow the provided schema and instead generate results as free text.
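A crude way to surface this failure mode is a grounding check: any extracted value that never appears in the source text is suspect. The sketch below is our illustration of the idea, not Eka Care's pipeline; a real check would normalize units, numbers, and OCR noise.

```python
def flag_hallucinations(document_text: str, extracted: dict) -> list:
    """Flag extracted field values that never appear in the source document.

    Simple substring grounding check (illustrative): a value the model
    invents for a field that is blank in the document will not be found
    in the text and gets flagged.
    """
    text = document_text.lower()
    return [field for field, value in extracted.items()
            if value and str(value).lower() not in text]

doc = "Hemoglobin  13.2 g/dL\nWBC count  ----"
pred = {"hemoglobin": "13.2", "wbc_count": "7500"}  # WBC value is invented
print(flag_hallucinations(doc, pred))  # ['wbc_count']
```

Flagged fields can then be blanked or routed for review instead of silently entering a patient's record.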

Other types of errors can be attributed to challenges in the contextual inference of fields such as specimen, panel, and method. These fields have to be inferred carefully from the headings and subheadings in the lab report. Even though our prompts include this hint, SOTA models fail to consistently infer these fields from the surrounding context.

Hallucinations: In the following example, a SOTA model has generated tests that are not even present in the document.

Output of a SOTA model highlighting hallucinated tests which are not present in the report

Contextual understanding: In the following document, our model accurately understands and extracts details like specimen and method, along with all the tests, while the GPT-4 and Sonnet models struggle to grasp these visual and medical nuances.

Lab report section showing how the information about specimen and panel has to be contextually inferred

Our experiments show encouraging results; however, there is still a long way to go to completely mitigate hallucinations and ensure high recall of the extracted entities.