Scaling deep learning models for millions of users

Nikhil Kasukurthi
October 30, 2023

At EkaCare, deep learning powers key capabilities that improve healthcare experiences for millions of users. Across all our verticals, we have deeply integrated multiple DNN models to enrich the user experience.

In Eka’s journey so far, we have tried multiple approaches to the development and deployment of DNN models, using third-party services as well as developing in-house capabilities. In this article, we explore different use cases of machine learning at EkaCare, along with the nuances of each pipeline. We also dive into the considerations involved in scaling these pipelines while keeping costs in check.

Deep learning at Eka

Personal Health Record (PHR) application

EkaCare is one of India’s leading PHR applications, where users can securely store, access, and share different types of health records such as their lab reports, prescriptions, discharge summaries, and medical history. Users can track their longitudinal health journey and get medically meaningful insights. A few screenshots of our PHR application are shown in the image below.

Medical records section in EkaCare's PHR application

Our Gmail sync feature helps users readily gather and organize their medical records. Using our transformer-based document understanding models, we automatically classify different types of medical records, ranging from prescriptions and lab reports to MRI scans, and differentiate them from documents that are medically irrelevant. The entire process involves no human intervention, which preserves the privacy of our users.

Lab reports and digital prescriptions are further parsed to convert unstructured information into structured data. We identify different types of entities and link them to their unified identifiers. In addition, we identify user profiles based on name, age, and gender. This is what enables EkaCare to automatically consolidate a user's lab vitals across different laboratories in India, and lets our users see longitudinal trends.

We have an intricate pipeline of multiple deep learning models to enable the features mentioned above. Furthermore, since the source of these documents varies, from self-uploads to Gmail sync to documents fetched within the ABDM ecosystem, the SLA expectations vary considerably. Owing to reasonably high traffic (north of 100K documents per day at the time of writing) and the wide range of SLA expectations, we have set up multiple AWS SQS queues to create a priority order for parsing. Messages are pushed to a common queue by multiple producers and are then routed to their respective priority queue based on source.
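The routing step described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the source names, the queue names, and the priority assigned to each source are assumptions made for the example, and the `send` callable stands in for a wrapper around the SQS `send_message` call.

```python
from typing import Callable, Dict, Iterable

# Hypothetical mapping from document source to priority queue.
# Source names, queue names, and the priority order are assumptions.
PRIORITY_QUEUE_BY_SOURCE: Dict[str, str] = {
    "self_upload": "parse-high-priority",
    "abdm": "parse-medium-priority",
    "gmail_sync": "parse-low-priority",
}
DEFAULT_QUEUE = "parse-low-priority"


def pick_queue(message: dict) -> str:
    """Return the priority queue name for a message based on its source."""
    return PRIORITY_QUEUE_BY_SOURCE.get(message.get("source"), DEFAULT_QUEUE)


def route(messages: Iterable[dict], send: Callable[[str, dict], None]) -> Dict[str, int]:
    """Forward each message from the common queue to its priority queue.

    `send(queue_name, message)` is injected so the routing logic can run
    without AWS; in a real deployment it could wrap an SQS send call.
    Returns the number of messages forwarded to each queue.
    """
    counts: Dict[str, int] = {}
    for message in messages:
        queue = pick_queue(message)
        send(queue, message)
        counts[queue] = counts.get(queue, 0) + 1
    return counts
```

Keeping the source-to-queue mapping in one table means a new producer only needs a new entry, and unknown sources fall back to the lowest priority rather than being dropped.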

Electronic medical record application 

EkaCare is a connected healthcare ecosystem for patients and doctors. We provide an EMR (Electronic Medical Record) tool for doctors to create prescriptions and manage their practice. While creating prescriptions, doctors can search for medical entities in Eka’s medical knowledge base, which is built on the SNOMED and LOINC ontologies. These standard terminologies help ensure interoperability across systems and consolidated analytics for doctors.

EkaCare EMR application

Due to factors such as high OPD traffic, the descriptive nature of symptoms and diagnoses, and prior behavioral patterns, doctors at times prefer to write free text beyond auto-suggest. This free text could be cases where symptoms are written the way the patient has described them (e.g., Head is spinning) or where the doctor would like to capture a more nuanced diagnosis with properties (e.g., No h/o of Dm2, htn or example in ). To make sense of this textual content, Eka has developed its own NER / NEL system (called Alchemist), tuned for medical entities often used in prescriptions. Alchemist is based on a transformer model and is trained on both openly available. We find that cloud solutions for this problem still do not work as well in the Indian context due to the presence of shorthand, colloquial language, and nuances specific to medical specialties.

Another challenge we faced was that, while prescribing medications, doctors tend to write advice or devices (like a blood pressure monitor or thumb splint) in the same input meant for medications. This is a problem because product flows like the medication reminder are built for drugs, and an advice entry there creates a bad user experience. To alleviate this, we have a BERT-based classification model that is invoked at every search selection to check whether the entity is an advice or a device.

Customer support and helpdesk

On our EMR tool, we have a support chat integration that helps doctors with their queries and lets new doctors express interest in using the tool. These queries range from feature requests and explanations to bugs in the tool, sales, and leads. Initially, maintaining good visibility over the conversations was taxing, since these exchanges had to be manually examined and tagged with one of the classes.

To give our customer success team better visibility, we used the Claude-V2 LLM by Anthropic through Amazon Bedrock. Following the precedent of OpenAI’s function_calling, we developed a similar approach with Claude. Each conversation is rated on overall satisfaction and anger quotient, and a summary is generated for the team to peruse. We benchmarked these generated metrics against a sample dataset annotated by team members, and the accuracy was very good.
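One way to approximate function calling with Claude v2 is to ask for a single JSON object and validate it on the way back. The sketch below is illustrative only: the field names, the 1 to 5 scales, the prompt wording, and the helper names are assumptions, and `client` is expected to behave like a Bedrock runtime client (`boto3.client("bedrock-runtime")`).

```python
import json

# Hypothetical output schema; field names and scales are assumptions.
SCHEMA = {
    "satisfaction": "integer 1-5",
    "anger": "integer 1-5",
    "summary": "string, at most two sentences",
}


def build_prompt(conversation: str) -> str:
    """Build a Claude v2 text-completion prompt that asks for JSON only."""
    return (
        "\n\nHuman: Rate the support conversation below. "
        "Reply with a single JSON object with exactly these keys: "
        f"{json.dumps(SCHEMA)}\n\n<conversation>\n{conversation}\n</conversation>"
        "\n\nAssistant:"
    )


def parse_rating(completion: str) -> dict:
    """Extract and validate the JSON object from the model's reply."""
    start, end = completion.find("{"), completion.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in completion")
    rating = json.loads(completion[start : end + 1])
    missing = set(SCHEMA) - set(rating)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key in ("satisfaction", "anger"):
        if not isinstance(rating[key], int) or not 1 <= rating[key] <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5")
    return rating


def rate_conversation(client, conversation: str) -> dict:
    """Call Claude v2 through Bedrock and return the validated rating."""
    response = client.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps(
            {"prompt": build_prompt(conversation), "max_tokens_to_sample": 300}
        ),
    )
    completion = json.loads(response["body"].read())["completion"]
    return parse_rating(completion)
```

Validating the reply before storing it keeps malformed generations out of downstream dashboards; a failed validation can simply trigger a retry.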

One of the largest contributors to support requests is feature explanation. Although we have an extensive library of articles on using different features, surfacing the right article to the user in the correct context is challenging. We are planning to use the Amazon Titan embedding model to surface relevant articles based on semantic similarity to the query.
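The ranking step of such a retrieval flow can be sketched as below, assuming the query and the articles have already been converted to embedding vectors by some model. The function names and the toy three-dimensional vectors in the usage note are illustrative; real embeddings have far more dimensions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_articles(
    query_vec: Sequence[float],
    article_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[str]:
    """Return the titles of the k articles most similar to the query."""
    ranked = sorted(
        article_vecs,
        key=lambda title: cosine_similarity(query_vec, article_vecs[title]),
        reverse=True,
    )
    return ranked[:k]
```

For example, with articles `{"print": [1, 0, 0], "billing": [0, 1, 0], "templates": [0.9, 0.1, 0]}` and a query vector `[1, 0, 0]`, `top_articles(..., k=2)` returns `["print", "templates"]`. For a library of a few hundred articles a linear scan like this is usually sufficient; a vector index only becomes necessary at much larger scale.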

Bedrock enabled us to prototype very quickly and especially helped us benchmark multiple different LLMs without integrating their APIs specifically. We will be writing another article soon about our experiments with LLMs, which will also include our efforts on fine-tuning them for our specific use-cases.

Background on our training approach

Our training pipelines primarily use SageMaker training notebooks. All our models are in PyTorch, and for serving we optimize through quantization and TorchServe. Model versioning is maintained on Weights & Biases. After experimentation, we export the model manually to the respective deployment environment. At the time of writing, we had around 8 models, so an automated approach for this model export was not required.

Model deployment 

To deploy these models, we have heavily leveraged SageMaker from AWS.

We find that the Boto3 Python SDK is very mature. Although at times you need to hunt for clues in the documentation about specific nuances, overall it’s easy to set up and use. All the different parameters can be driven through the Python SDK itself.

The community support for SageMaker is also fantastic. Third-party companies like HuggingFace contribute to the SageMaker SDK, which makes it very easy to deploy HF models.

We have set up Continuous Deployment (CD) on a GitHub Actions runner to deploy the models to SageMaker. The catch here is that, when deploying for the first time, we need to create the endpoint configuration and also the endpoint itself. Models trained outside SageMaker need to be uploaded to S3. If the endpoint already exists, we only need to update it.
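The create-or-update branch can be sketched with the Boto3 SageMaker client as follows. This is a simplified illustration rather than our actual CD script: the function name, the default instance type, and the single-variant configuration are assumptions, and the model is assumed to have been registered already (via `create_model`, pointing at the artifact in S3).

```python
def deploy_endpoint(sm, endpoint_name, model_name, config_name,
                    instance_type="ml.c5.xlarge"):
    """Create the endpoint on first deploy, otherwise update it in place.

    `sm` is expected to behave like boto3.client("sagemaker").
    Returns "created" or "updated".
    """
    # Each deploy registers a fresh endpoint configuration, so
    # `config_name` should be unique per release (for example, include
    # the git commit SHA). The instance type here is an assumed default.
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
    )
    try:
        sm.describe_endpoint(EndpointName=endpoint_name)
    except sm.exceptions.ClientError as err:
        # Assumption: a missing endpoint surfaces as a ValidationException;
        # any other error is re-raised so the CD job fails loudly.
        if err.response["Error"]["Code"] != "ValidationException":
            raise
        sm.create_endpoint(EndpointName=endpoint_name,
                           EndpointConfigName=config_name)
        return "created"
    sm.update_endpoint(EndpointName=endpoint_name,
                       EndpointConfigName=config_name)
    return "updated"
```

Passing the client in as an argument keeps the function easy to exercise with a stub in CI before it touches a real account.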

Currently, there are two different types of deployments at Eka: one for HuggingFace-based models and the other for pure PyTorch models. This separation lets us use the Hugging Face support in the SageMaker Python SDK for easier deployment and benefit from the optimizations made by Hugging Face.

Our CD pipeline is also very easy to debug, since the boto3 exceptions are precise and informative. This is not the case with some of the alternative model hosting platforms.


Rightsizing resources is very important from a cost perspective. Good monitoring is key to finding the correct balance between infrastructure resources and load. In this section, we detail a few catches that need to be accounted for.

Another key metric is GPU utilization. In the advice/device classification use case, GPU memory usage was very low. This led us to optimize the model to run well on CPU through quantization and TorchServe. When we load tested the endpoint without a GPU, using a sample payload and Apache Bench, the latency was within an acceptable range.

Load handling and GPU availability 

Per regulation, all our data is required to reside in India, so the GPU availability of cloud vendors in India is an important factor when choosing a vendor. On SageMaker, all the NVIDIA GPUs are available in India for real-time inference, with very competitive pricing. The scale-up and scale-in times are also very consistent and predictable. Considering the current global GPU shortage, some of the other vendors could not promise us the GPU provisioning that we needed.

Costs beyond compute

The network in (ingress) and network out (egress) costs can climb very quickly as load increases, especially for image and PDF payloads.

By default, SageMaker routes traffic through the Internet. When the model predictions are consumed by internal services (our current Kubernetes infrastructure is on EKS), routing this traffic through the Internet is redundant. Setting up a VPC endpoint to route that traffic can yield significant cost savings. CloudWatch logs also factor into the overall cost, so logging large payloads (like base64 images) should be avoided.
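One simple way to keep large payloads out of the logs is to replace oversized string fields with a short fingerprint before logging. The sketch below is an illustration under assumptions: the 256-character threshold, the function name, and the field names in the usage note are hypothetical.

```python
import hashlib
import json

# Assumed threshold; strings longer than this are summarized, not logged.
MAX_LOGGED_CHARS = 256


def loggable(payload: dict, max_chars: int = MAX_LOGGED_CHARS) -> str:
    """Return a compact JSON string that is cheap to write to the logs.

    Top-level string values longer than `max_chars` (for example base64
    images) are replaced by their length and a short SHA-256 prefix, so
    a request can still be correlated without storing its content.
    """
    summary = {}
    for key, value in payload.items():
        if isinstance(value, str) and len(value) > max_chars:
            digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
            summary[key] = f"<{len(value)} chars, sha256:{digest}>"
        else:
            summary[key] = value
    # default=str keeps the logger from failing on non-JSON values.
    return json.dumps(summary, sort_keys=True, default=str)
```

Calling `loggable({"doc_id": "abc", "image_b64": "A" * 5000})` produces a line well under 200 characters instead of a 5 KB log entry. Beyond the cost benefit, this also keeps document content out of log storage.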

Another factor beyond the cloud bill is the effort required to maintain these pipelines and to identify errors once deployed. Here again, the Amazon ecosystem is very handy through CloudWatch alarms. This cohesive integration with other services is currently lacking in other providers.

Closing thoughts

Our setup evolved with cost being a key consideration throughout. To achieve that, we have optimized model inference on CPU only and delved deeper into monitoring and costs that are incurred apart from the compute. Having a strong partnership with our cloud providers helped us resolve issues and understand our billing better.