The most performant medical speech recognition model, built specifically for the chaos of Indian healthcare.
Automatic Speech Recognition (ASR) is the invisible backbone of any AI scribe solution. It is the bridge between the doctor's intent and the digital health record. If the bridge is shaky, the medical record is dangerous.
In this blog, we strip away the marketing hype and evaluate State-of-the-Art (SOTA) ASR models—ranging from "Small" to "Large," from Indian-specific to International Generalists. We benchmark these models on proprietary datasets comprising over 5,000 hours of real-world doctor-patient conversations.
Our findings are counter-intuitive: purpose-built models trained on domain data significantly outperform massive generalist models, despite being 10x smaller.
Why build Parrotlet-a when Whisper and Google Chirp exist? Because the operating environment of an Indian clinic is hostile to standard ASR. It is not a quiet room with a Dictaphone; it is a live, unpredictable acoustic environment.
Indian clinics are noisy. Ceiling fans, open windows with traffic noise, and crowded waiting rooms create a unique noise floor. Standard noise reduction often strips away the spectral details of soft-spoken patients.
Conversations are rarely mono-lingual. A sentence often starts in Hindi, switches to English for medical terms, and ends in a regional dialect. "Yeh tablet subah empty stomach lena."
Generalist models frequently "hallucinate" to fix grammar. In medicine, hallucinating a drug name is a critical failure. We need precision, not just fluency.
We evaluated 10 models across 3 datasets using three key metrics.
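At a high level, an evaluation like this reduces to running every (model, dataset) pair and averaging a metric over the examples. The sketch below illustrates that loop; all names here (the `Result` record, the `metric` callable, the stub model interface) are illustrative placeholders, not our internal benchmarking tooling.

```python
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    dataset: str
    score: float

def benchmark(models, datasets, metric):
    """Run every (model, dataset) pair and average the metric over examples.

    models:   dict mapping model name -> transcribe(audio) callable
    datasets: dict mapping dataset name -> list of (audio, reference) pairs
    metric:   callable(reference, hypothesis) -> float
    """
    results = []
    for model_name, transcribe in models.items():
        for ds_name, examples in datasets.items():
            scores = [metric(ref, transcribe(audio)) for audio, ref in examples]
            results.append(Result(model_name, ds_name, sum(scores) / len(scores)))
    return results
```

In practice each metric (such as Semantic WER or entity capture) plugs into the same loop, so adding a model or dataset is a one-line change.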
Comparison on the 'Eka-RealWorld-1K' Dataset (Noisy OPD Audio)
Parrotlet-a v2 achieves a Semantic WER of 4.1%, significantly outperforming Whisper Large v3 (12.2%). While Whisper is excellent at generating grammatically beautiful English sentences, it often "corrects" medical shorthand into common English words (e.g., hearing "T. Bactrim" as "Tea back trim").
Parrotlet-a v2, being trained on 2.5 Billion tokens of Indian medical context, understands that "T." stands for "Tablet" and "Bactrim" is an antibiotic.
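To make the Semantic WER comparison concrete, here is a minimal sketch of how such a metric can be computed: a standard word-level Levenshtein WER, preceded by a normalization step that maps medical shorthand to canonical forms so that "T." and "Tablet" are not counted as errors. The shorthand map below is a tiny hypothetical example, not our production vocabulary.

```python
def normalize(words, shorthand=None):
    # Hypothetical shorthand map; a real system would use a large medical vocabulary.
    shorthand = shorthand or {"t.": "tablet", "tab.": "tablet", "cap.": "capsule"}
    return [shorthand.get(w.lower(), w.lower()) for w in words]

def wer(ref, hyp):
    """Word error rate = word-level Levenshtein distance / reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

def semantic_wer(ref_text, hyp_text):
    """WER after normalizing shorthand, so equivalent phrasings score zero."""
    return wer(normalize(ref_text.split()), normalize(hyp_text.split()))
```

Under this scheme, "T. Bactrim" transcribed as "Tablet Bactrim" incurs no penalty, while "Tea back trim" counts as multiple errors on a short utterance, which is why hallucinated "corrections" inflate generalist models' scores so sharply.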
Numbers are one thing; reality is another. Use the interactive player below to hear how models handle real, noisy clinical audio.
The most critical metric for a scribe is Entity Extraction. If the ASR misses the drug name, the downstream LLM cannot structure the prescription.
We ran an extensive Named Entity Recognition (NER) test on the transcripts. Parrotlet-a v2 captured 96.8% of Indian brand-name drugs correctly. The next best model, Google Chirp, captured only 84.2%. This gap of 12.6 percentage points represents thousands of potential medication errors prevented daily.
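The capture rate above is essentially entity recall against a gold list of drug names. The sketch below approximates it with plain substring matching; the actual benchmark used a trained NER model over the transcripts, so treat this as illustrative only.

```python
def entity_recall(gold_entities, transcript):
    """Fraction of gold-standard entities that appear verbatim in the transcript."""
    text = transcript.lower()
    found = sum(1 for entity in gold_entities if entity.lower() in text)
    return found / max(len(gold_entities), 1)
```

A transcript that renders "T. Bactrim" as "Tea back trim" scores zero for that drug, which is exactly the failure mode that breaks downstream prescription structuring.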
We believe in transparency. Upload a challenging medical audio clip to see raw inference results from Parrotlet-a v2.