Releasing Parrotlet-a v2: The Gold Standard for Medical Speech Recognition in Indian Healthcare

February 12, 2026
Engineering & AI Research

The most performant medical speech recognition model built specifically for the chaos of Indian healthcare.

Automatic Speech Recognition (ASR) is the invisible backbone of any AI scribe solution. It is the bridge between the doctor's intent and the digital health record. If the bridge is shaky, the medical record is dangerous.

In this blog, we strip away the marketing hype and evaluate State-of-the-Art (SOTA) ASR models—ranging from "Small" to "Large," from Indian-specific to International Generalists. We benchmark these models on proprietary datasets comprising over 5,000 hours of real-world doctor-patient conversations.

Our findings are counter-intuitive: purpose-built models trained on domain data significantly outperform massive generalist models, despite being 10x smaller.

Why we need a specialized ASR for India

Why build Parrotlet-a when Whisper and Google Chirp exist? Because the operating environment of an Indian clinic is hostile to standard ASR. It is not a quiet room with a Dictaphone; it is a live, unpredictable acoustic environment.

1. The "Acoustic Chaos"

Indian clinics are noisy. Ceiling fans, open windows with traffic noise, and crowded waiting rooms create a unique noise floor. Standard noise reduction often strips away the spectral details of soft-spoken patients.

2. The "Hinglish" Code-Mix

Conversations are rarely monolingual. A sentence often starts in Hindi, switches to English for medical terms, and ends in a regional dialect: "Yeh tablet subah empty stomach lena." ("Take this tablet in the morning on an empty stomach.")

3. The Terminology Minefield

Generalist models frequently "hallucinate" plausible words to smooth out grammar. In medicine, hallucinating a drug name is a critical failure. We need precision, not just fluency.

Benchmarking Results

We evaluated 10 models across 3 datasets using three key metrics:

  • CER (Character Error Rate): Phonetic accuracy.
  • WER (Word Error Rate): Lexical accuracy.
  • Semantic WER (sWER): A weighted metric we developed that penalizes errors in medical entities (drugs, dosages, symptoms) 5x more than common stopwords (a scoring sketch follows this list).
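
For context on how such a metric can be scored, here is a minimal sketch in Python. The uniform 5x weighting and token-level Levenshtein alignment are illustrative assumptions; the production sWER scorer is not published.

```python
from typing import List, Set

ENTITY_WEIGHT = 5.0    # medical entities (drugs, dosages, symptoms)
DEFAULT_WEIGHT = 1.0   # everything else, including stopwords

def semantic_wer(ref: List[str], hyp: List[str], entities: Set[str]) -> float:
    """Weighted word error rate via Levenshtein alignment.

    Each edit (substitution, deletion, insertion) costs the weight of
    the token involved, so mangling a drug name hurts 5x more than
    dropping a stopword.
    """
    def w(token: str) -> float:
        return ENTITY_WEIGHT if token.lower() in entities else DEFAULT_WEIGHT

    # dp[i][j] = minimum weighted edit cost aligning ref[:i] with hyp[:j]
    dp = [[0.0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = dp[i - 1][0] + w(ref[i - 1])            # deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = dp[0][j - 1] + w(hyp[j - 1])            # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (
                0.0 if ref[i - 1].lower() == hyp[j - 1].lower()
                else w(ref[i - 1]))
            dp[i][j] = min(sub,
                           dp[i - 1][j] + w(ref[i - 1]),   # deletion
                           dp[i][j - 1] + w(hyp[j - 1]))   # insertion
    return dp[len(ref)][len(hyp)] / sum(w(t) for t in ref)

# A mangled drug name dominates the score:
entities = {"pan-d"}
ref = "prescribed pan-d for two days".split()
hyp = "prescribed pain d for two days".split()
print(f"sWER: {semantic_wer(ref, hyp, entities):.1%}")  # ~66.7%
```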

Comparison on the 'Eka-RealWorld-1K' Dataset (Noisy OPD Audio)

Breakdown of Results

Parrotlet-a v2 achieves a Semantic WER of 4.1%, significantly outperforming Whisper Large v3 (12.2%). While Whisper is excellent at generating grammatically beautiful English sentences, it often "corrects" medical shorthand into common English words (e.g., hearing "T. Bactrim" as "Tea back trim").

Parrotlet-a v2, trained on 2.5 billion tokens of Indian medical context, understands that "T." stands for "Tablet" and that "Bactrim" is an antibiotic.
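
Evaluation pipelines typically normalize this kind of shorthand before scoring so models are not penalized for legitimate expansions. Here is a minimal sketch of such a pass; the abbreviation map below is a small illustrative assumption, not our full dictionary.

```python
# Illustrative subset of prescription shorthand; a production map
# would cover far more Indian clinical abbreviations.
SHORTHAND = {
    "t.": "tablet",
    "tab.": "tablet",
    "cap.": "capsule",
    "syp.": "syrup",
    "inj.": "injection",
}

def normalize_shorthand(text: str) -> str:
    """Expand prescription shorthand token by token before scoring."""
    return " ".join(SHORTHAND.get(tok.lower(), tok) for tok in text.split())

print(normalize_shorthand("Prescribed T. Bactrim twice daily"))
# -> "Prescribed tablet Bactrim twice daily"
```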

Listen to the Difference

Numbers are one thing; reality is another. Use the interactive player below to hear how models handle real, noisy clinical audio.

  • Whisper Large v3 (Generalist): "Patient complains of gastric issue. Prescribed pain D for two days."
  • Parrotlet-a v2 (Ours): "Patient complains of gastritis issue. Prescribed Pan-D for two days."

Analysis of Medical Entity Accuracy

The most critical metric for a scribe is Entity Extraction. If the ASR misses the drug name, the downstream LLM cannot structure the prescription.

We ran an extensive Named Entity Recognition (NER) test on the transcripts. Parrotlet-a v2 captured 96.8% of Indian brand-name drugs correctly. The next best model, Google Chirp, captured only 84.2%. This 12.6-percentage-point gap represents thousands of potential medication errors prevented daily.
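
To make the metric concrete, here is a minimal sketch of how drug-name recall can be measured against a lexicon. This simple token-matching approximation stands in for the full NER pipeline described above; the `drug_lexicon` set is a hypothetical stand-in for an Indian brand-name dictionary.

```python
import re
from typing import Iterable, Set

def entity_recall(refs: Iterable[str], hyps: Iterable[str],
                  drug_lexicon: Set[str]) -> float:
    """Fraction of lexicon drug names in the reference transcripts
    that also survive in the model's hypothesis transcripts."""
    found = total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = set(re.findall(r"[a-z0-9\-]+", ref.lower()))
        hyp_tokens = set(re.findall(r"[a-z0-9\-]+", hyp.lower()))
        ref_drugs = ref_tokens & drug_lexicon
        total += len(ref_drugs)
        found += len(ref_drugs & hyp_tokens)
    return found / total if total else 0.0

# Example: the generalist transcript loses "pan-d" entirely.
lexicon = {"pan-d", "bactrim"}
refs = ["prescribed pan-d for two days"]
hyps = ["prescribed pain d for two days"]
print(f"recall: {entity_recall(refs, hyps, lexicon):.0%}")  # 0%
```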

Test with your own data

We believe in transparency. Upload a challenging medical audio clip (.wav or .mp3, up to 10 MB) to see raw inference results from Parrotlet-a v2.
