Next up we have Manas, who is the data science and NLP lead at Episource, a US healthcare company, and today he will be talking about building a decentralized data collection platform for deep learning in healthcare.

Good morning. Before we start off, I just want to quickly introduce myself, in addition to what she mentioned. I have been working with machine learning and data science at Episource, and our major target has been to build automated platforms for analyzing medical charts. You have lengthy medical charts and you want to analyze what content they contain, so you have to do a fair bit of machine learning as well as natural language processing to extract entities — entities being labs, procedures and diseases — and then map them to the corresponding ICD codes. That has been a massive exercise, both from the point of view of data collection and of creating machine learning pipelines, and one of the major challenges of creating machine learning pipelines in healthcare is that you have to abide by certain regulations. How those regulations add complications, and how we essentially worked around them, is what this talk is going to be about.

This is the broad agenda. We will talk about the problems and challenges, and I will take you through three architectures that we are using in production now and how they allow us to be HIPAA compliant while ensuring that our machine learning pipelines work efficiently.

This is the problem statement. Last year, when I joined Episource and was brought on board to lead the data science team, Episource was struggling with a fair amount of competitive market pressure, and that pressure was mostly about the number of people you had to hire to deliver projects. The relationship between projects and resources — that is, employees — was very linear: if you wanted to deliver more projects, you had to hire more people, and with such a linear relationship, scalability suffers massively. So what you want to do is add augmented intelligence using machine learning and NLP and ensure those things are taken care of. And since Episource is in the healthcare risk management business, it becomes even more important to adopt new ML- and NLP-based platforms and stay at the bleeding edge of healthcare machine learning research.

When I was brought on board, I was given a very clear directive: we have to build a scalable information extraction pipeline for healthcare documents, and that's what we started off with. There were three major challenges in creating this machine learning pipeline: the first being that it has to be scalable, and the second that it has to be low-cost. As with any machine learning platform or pipeline you create, there are three things — it can be good, fast or cheap — and you cannot have all three; you can only choose two. You can have cheap and good, but then it won't be fast, and so on. It's a triangle where you only get a couple of the corners, and that was a major challenge while creating this platform. And of course, we are talking about a platform at a scale where you are processing not hundreds or thousands of documents, but millions of documents.
You have to create an architecture that supports both optimized machine learning and optimized training-data collection. And anyone who has worked with machine learning in healthcare will tell you that compliance is a big issue. Privacy is a big factor nowadays, whether you are working in finance, in healthcare, or anywhere else — when you are using social networks, or when your medical data is out there, privacy becomes a lot more important. HIPAA is the rule that guides this: it gives you a compliance checklist you have to take care of. At its core, it says that you have to secure the data both while it is in transit and while it is at rest — so whether you are transferring the data or storing it, you have to secure it in the best possible manner. That is what HIPAA compliance is about.

The KPI for us was to minimize the cost of processing — both the data cleaning and preprocessing and the machine learning inference — and bring the USD-per-chart cost down as far as possible. I'll talk later about what that cost has been brought down to.

This was the philosophy when I started off with the team, and I think it is a philosophy that holds for the majority of machine learning teams out there. We want scalable, fault-tolerant, cost-effective machine learning pipelines, and we want immutable configurations. I do not want to be giving excuses to my boss like "it works on my machine, I don't know why it's not working on the cloud" — I don't want to give that kind of reasoning to my peers or my superiors. And at the end of the day, I did not want a team split between people who are very good at DevOps and people who are very good at ML; I wanted people with more of an MLOps profile, who have a fair understanding of machine learning and can also deploy their own algorithms. Because if I create my own algorithms and let someone else deploy them, Chinese whispers will essentially dictate that by the time my machine learning pipeline is deployed, it has become something other than what I originally devised. That's what I wanted to avoid in the first place: anyone on my team should be able to create an algorithm as well as deploy it on their own.

So why is HIPAA a roadblock for us? You may say, OK, HIPAA dictates that you undertake certain security measures in terms of securing ML pipelines, but the major reason it becomes a roadblock is that there are many machine learning and backend architectures that you now can't use. For example, if you are working on the AWS cloud, there are many AWS services which are not HIPAA compliant, which means that you can't use them.
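To make the at-rest and in-transit requirement mentioned above concrete, here is a minimal sketch (not shown in the talk) of how a chart could be pushed to S3 with server-side KMS encryption using boto3; the bucket name and KMS key alias are hypothetical placeholders, and boto3 itself already talks to S3 over TLS, covering the in-transit side:

```python
import boto3

# Hypothetical bucket and KMS key; substitute your own HIPAA-eligible resources.
BUCKET = "my-phi-charts"
KMS_KEY_ID = "alias/phi-chart-key"

s3 = boto3.client("s3")  # requests go over TLS, covering encryption in transit

def upload_chart(path: str, key: str) -> None:
    """Upload a medical chart to S3 with server-side KMS encryption (at rest)."""
    with open(path, "rb") as f:
        s3.put_object(
            Bucket=BUCKET,
            Key=key,
            Body=f,
            ServerSideEncryption="aws:kms",
            SSEKMSKeyId=KMS_KEY_ID,
        )

upload_chart("chart_0001.pdf", "incoming/chart_0001.pdf")
```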
Now, there are services like S3 that are HIPAA eligible, and then there are services like AWS Lambda that are not explicitly HIPAA compliant — you have to make sure you are launching them inside your own private VPC to make them compliant. And you can't have a public IP for a server handling this data, which means that if I have to run my entire machine learning pipeline, monitoring it and figuring out how to debug it becomes an issue. So I have to create an automated pipeline where the logs, the debugging, and the way information is processed are a lot more streamlined. And as I said, the data has to be secure and data privacy has to be of the utmost importance; that was going to shape our entire machine learning architecture. That's when we decided to go serverless.

There are three components to machine learning, and the first is training-data collection. Anyone who has worked in machine learning will tell you that no matter how complicated your algorithms are, if you don't have quality training data it doesn't matter. With high-quality training data, something as simple as logistic regression is going to work very well, whereas a 10-layer deep neural network built in PyTorch with very low-quality training data will not. So the first target was to collect that data in the first place. Being in healthcare, there is no annotated public data available that says: this is a medical chart, these are the annotated diseases, these are the annotated labs, these are the annotated procedures. That is why we like to call Episource a machine learning heaven: we have roughly 3,000-plus people, and our data science and NLP team can get access to anywhere from 50 to 200 people to create training data for three months at very short notice. If I want to create training data, I can call up the senior management who handle ops and ask for a certain number of people for a certain amount of time. But even then, they are going to be annotating private medical data, which means the entire architecture used to annotate that data and collect the training data has to be encrypted, with a backup plan. I'll talk about a couple of the methods we use to do that.

While we are going serverless, there are platforms like Docker which allow us to create those immutable configurations, and anyone who has used Docker will tell you that once you learn the basics, it's extremely easy to deploy and to use in machine learning — it's a very nifty tool for any amount of DevOps deployment. These are our major pillars: Docker, Ansible, Boto — which is the Python wrapper around the AWS cloud — and the AWS platform itself. Our belief is that if I'm not spending time maintaining my platform — essentially "no ops" — that's the best kind of DevOps. I did not want my DevOps people coming in just to work on maintenance, but rather to create a HIPAA-compliant platform which is self-healing as well as self-reliant.

From here I'll take you through three architectures. The first is the data collection architecture, which we built in a serverless manner while ensuring it is HIPAA compliant.
The second is data processing, and the third is machine learning model deployment. These are the three separate architectures I'll be showing you, and how we are creating these machine learning platforms with them.

This is the broad overview: how do I make sure that, at a bare minimum, my environments are well maintained? I use PyCharm as my preferred IDE, I use Python to create all my machine learning models, and I now use PyTorch exclusively. What I've done is created a Docker container as a remote interpreter for PyCharm, so every time I compile a bunch of code I'm testing against this Docker container, which means my training, preprocessing and prediction code bases are tested in real time. Then all I have to do is push this Docker image to AWS ECR — ECR being the equivalent of Docker Hub for private Docker images. Once I've done this, I have Lambdas triggering these architectures, and we use services like AWS STS, the Security Token Service, to generate temporary credentials — so we are not plugging in the core AWS security keys, but rather generating keys on the fly. And since Docker containers can't be relied on for persistent data, we don't use them where persistence matters on their own; where we want to save logs and load our pickle files, we mount an encrypted EBS disk from EC2 as the Docker volume, ensure the models and logs are saved in real time, and push them to S3 at the end of the run.

This is the first architecture, the one we use to collect data, and it is a fairly skeletal framework of what we are doing. We have around 300 people working on a platform called brat, an open-source NLP annotation tool, and we have created multiple images of brat on multiple servers. The issue is that I'm trying to create a three-level QA system, which means the same chart has to be annotated by at least three coders before I finalize it as training data — which also means I cannot have three separate copies; I need the same copy across servers. With 300 people logging in, I don't want brat to become slow, so I place it behind a load balancer with three or four servers attached, but the content — the training data itself — is pulled from something called Elastic File System. Has anyone worked with AWS before? OK — in AWS there is the concept of attaching your own disks, called EBS, which you can encrypt yourself. In a typical scenario, you would attach three separate EBS volumes to these three separate servers, but that would mean three separate copies that are not synced, and my use case dictates that the copies stay in sync. EFS, to put it simply, is an advanced form of EBS that you can attach to multiple servers: you can think of it as a remote disk that you mount across your machines, like a remote FTP server — if you make a change on your machine, it is reflected on someone else's machine as well. EFS works in that manner, which means that by the time a person makes a change on one server, it gets automatically synced through EFS and reflected across the other servers in real time.
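On the credentials piece mentioned above, here is a minimal sketch (not from the talk) of generating short-lived credentials with AWS STS via boto3, so that no long-lived keys end up baked into the Docker image; the role ARN and session name are hypothetical placeholders:

```python
import boto3

def temporary_session(role_arn: str, session_name: str = "ml-pipeline-run") -> boto3.Session:
    """Exchange the caller's identity for short-lived credentials via AWS STS."""
    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=3600,  # credentials expire after an hour
    )
    creds = resp["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# Hypothetical role ARN; a real pipeline would scope the role to the buckets it needs.
session = temporary_session("arn:aws:iam::123456789012:role/ml-pipeline-role")
s3 = session.client("s3")
```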
On top of that, we have a Lambda job that takes regular six-hour backups of the EFS volume, so we run Lambda jobs to back up EFS into S3. This relatively simplistic architecture has also allowed us to create chatbots. In healthcare you have to collect two kinds of data: the annotated data itself, where you say this refers to this disease or this procedure, but you also have to collect taxonomies. You have to collect data that says, for example, that "diabetes mellitus" also means "diabetes". While I would know that diabetes and diabetes mellitus are similar, the machine would not necessarily know, unless I use techniques like cosine similarity — and even then it is going to fail in many cases. So I also need to create these taxonomies and ontologies on the fly. For that, we have created a simplistic chatbot that collects these taxonomies: it asks the person, "what is the nearest synonym of diabetes mellitus?", and they provide a list of synonyms. Here too we have three-level QA — we don't take the answer from only one person but aggregate it across people. So that is our data collection methodology. All these servers are in private VPCs and private subnets — the reason being that we cannot have a public IP for a server with private healthcare data — and the load balancer then routes requests via the NAT gateways to these servers.

This next one is a slightly more complicated version: the data processing architecture, and we have actually created two new components using it. The first part is how the data processing pipeline starts. We are provided with lots of healthcare data — anywhere from 10,000 charts to even half a million charts. When the data is uploaded to S3, the storage bucket service on AWS, we have a threshold that triggers a Lambda: if the number of charts in a certain bucket is greater than X — in our case typically anywhere from 1,000 to 2,000 — then the pipeline starts. AWS Lambda is the function-as-a-service offering on AWS, and we use it to launch our architecture using Ansible. You may ask why I'm not using Lambda itself to do my complete processing; that's because Lambda puts two limitations on my processing pipeline. One is memory: the minimum is 128 MB and the maximum is one and a half GB, and it is a lot more expensive — I cannot have my machine learning pipeline operate in one and a half GB of RAM. The second, and the most business-critical limitation, is that the execution time of one Lambda function cannot be more than 5 minutes, which means I cannot use Lambda to build my machine learning pipeline end to end. So with Lambda, all I do is launch the architecture, using Ansible.
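As an illustration of that trigger step — this code is not from the talk; the bucket, prefix, queue URL and threshold are all hypothetical — a Lambda handler might count the charts under a prefix and, once the threshold is crossed, hand off to whatever launches the Ansible playbook, modelled here as an SQS message:

```python
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Hypothetical names; the talk mentions thresholds of roughly 1,000 to 2,000 charts.
BUCKET = "incoming-charts"
PREFIX = "batch-001/"
THRESHOLD = 1000
PROVISION_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/provision-pipeline"

def handler(event, context):
    """Triggered on S3 uploads; once enough charts accumulate, signal the
    provisioning step (the piece that runs the Ansible playbook)."""
    count = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        count += page.get("KeyCount", 0)

    if count >= THRESHOLD:
        sqs.send_message(
            QueueUrl=PROVISION_QUEUE_URL,
            MessageBody=json.dumps({"bucket": BUCKET, "prefix": PREFIX, "charts": count}),
        )
    return {"charts_seen": count, "triggered": count >= THRESHOLD}
```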
Ansible is essentially a method by which you can run through multiple levels of steps; I like to look at it as a cleaner form of bash scripting. It's a YAML file where you specify a list of steps, and that allows you to launch your architecture the way you want. I could also have used services like Terraform or AWS CloudFormation, but Ansible is a lot more readable, easier to debug, and easier to implement. Once this is done, we launch the service. There is a master-slave kind of architecture, where the master is launched in the public subnet and the slaves are launched in private subnets. The task of the master is simply to pull tasks from a queue — an AWS SQS queue — and hand them to the slaves; that is all it needs to do. What we have since moved on to is replacing this master-slave setup with EMR clusters: instead of launching masters and slaves in public and private subnets ourselves, we now do this with EMR.

Finally, once the data processing is done, the output is saved to S3. We could have saved our data processing outputs in a database, and while that would give us good flexibility to go out and share them with others, it doesn't take care of the cost. If you are saving this data in something like DynamoDB, the costs are very prohibitive, and the maximum amount of data you can save is something around 64 KB per item — and a longer chart is never going to fit in 64 KB; it is more than 1 MB at the bare minimum. So I can't save it in DynamoDB, or any database like that for that matter. Instead, we have created a serverless database API: we built an API endpoint with its own resources in a RESTful manner, we save the data itself as JSONs in S3, and when you hit the API endpoint it triggers a Lambda, which goes and fetches the JSON from S3 and returns the data in JSON format. The response time is not stellar — somewhere around 100 to 200 ms rather than instantaneous — but we don't require real-time responses anyway. What this does is let me create my own API endpoints in my own scalable manner: I can ask, for this chart, how many diseases were there, how many procedures with body parts are mentioned — was a leg X-ray mentioned, or only an X-ray? So what we have essentially created is a serverless, S3-based database — not a real database, more of a pseudo-database as we put it inside our team. If you are working anywhere you have larger documents to store and analyze, it's preferable that you also use this method: store the documents in S3, put a Lambda in the middle, back it with API Gateway, and create your own APIs on top.
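A minimal sketch of what such a Lambda behind API Gateway could look like — not the talk's actual code; the bucket name and the /charts/{chart_id} route are hypothetical, and a Lambda proxy integration is assumed:

```python
import json
import boto3

s3 = boto3.client("s3")
RESULTS_BUCKET = "processed-charts"  # hypothetical bucket of per-chart JSON outputs

def handler(event, context):
    """Backs an API Gateway route such as GET /charts/{chart_id}:
    look up the processed JSON for one chart in S3 and return it."""
    chart_id = event["pathParameters"]["chart_id"]
    try:
        obj = s3.get_object(Bucket=RESULTS_BUCKET, Key=f"{chart_id}.json")
    except s3.exceptions.NoSuchKey:
        return {"statusCode": 404, "body": json.dumps({"error": "chart not found"})}

    chart = json.loads(obj["Body"].read())
    # The caller can then answer questions like "how many diseases were
    # extracted from this chart?" from the returned JSON.
    return {"statusCode": 200, "body": json.dumps(chart)}
```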
The final architecture is how we deploy our models. We save our models in an S3 bucket, and this has to be a place where we can understand whether or not our model has improved, and whether we have been able to create a proper environment for the model to be deployed in the first place. So every time there is new data in our S3 bucket — the annotated data coming from the first architecture, after three-level QA has been performed and the data saved to S3 — we again trigger a Lambda-and-Ansible-based architecture. The same happens every time new code is pushed to our GitHub master: we have a separate GitHub repo where we push our data processing and training pipeline code, and every time new code is pushed to the master branch — after the relevant tests, obviously — a Lambda is triggered. So this Lambda can be triggered by only two events: new data or updated code. Once that happens, we start training the model using our GPUs on AWS. We keep the runs from the last few weeks saved, so we know what our previous best was; every time a new model comes in on new data, we automatically check whether it has improved on the earlier performance, and if it has, we remove the previous model from the Docker image, rebuild the image, and re-upload it. All of this is automated — no one looks after or debugs that step; all we record is when the model was run and what the precision, recall, F-score and accuracy were. We typically monitor recall: if the current recall is better than the previous recall, we deploy the new model. That is how this architecture runs.
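The promotion check could be as simple as the following sketch — again not the talk's code; the bucket, keys and metric values are made up for illustration:

```python
import json
import boto3

s3 = boto3.client("s3")
METRICS_BUCKET = "model-runs"   # hypothetical bucket holding per-run metrics
BEST_KEY = "best/metrics.json"  # metrics of the current best model

def should_promote(new_metrics: dict) -> bool:
    """Compare the freshly trained model's recall against the stored best run."""
    try:
        obj = s3.get_object(Bucket=METRICS_BUCKET, Key=BEST_KEY)
        previous = json.loads(obj["Body"].read())
    except s3.exceptions.NoSuchKey:
        return True  # first run, nothing to compare against
    return new_metrics["recall"] > previous["recall"]

def promote(new_metrics: dict, model_key: str) -> None:
    """Record the new best run; a separate step would rebuild the Docker image
    with the new model artifact and push it to ECR."""
    s3.put_object(
        Bucket=METRICS_BUCKET,
        Key=BEST_KEY,
        Body=json.dumps({**new_metrics, "model_key": model_key}),
    )

# Illustrative numbers only.
new_run = {"precision": 0.91, "recall": 0.88, "f1": 0.89}
if should_promote(new_run):
    promote(new_run, "models/latest/model.pt")
```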
In all three of these architectures, the one common underlying point is that no one is monitoring them, and that in itself is a big challenge. But with this, the processing cost — whether machine learning inference or data processing — is extremely low: we can process a million documents for roughly 500 US dollars, which brings us to a very low cost-per-chart rate. And of course, all of these architectures are HIPAA compliant, which means I have an architecture that processes lots of documents in a very scalable fashion without breaking the bank. To put this in perspective: if we had processed a million documents with servers running 24/7, plus proper sysadmins and monitoring platforms, it would have cost roughly 25,000 dollars — and that is for just one month of processing one million charts. With the three architectures I have just shown you, which are purely serverless and require very limited monitoring, the cost comes down to 500 dollars, which is a lot less and a lot more effective when you are trying to create a scalable machine learning platform.

Of course, there are challenges that come with creating your own serverless backend. There are two major questions everyone will ask. What do you do if something fails — do you even know that something failed? You are launching everything in private VPCs, which means you do not necessarily know what failed unless you log into that server. And how do you monitor each step? In each of the architectures I showed, there are multiple moving parts: launching the architecture, pulling data from S3, deploying the code, pulling the Docker image, and several such steps. These are the two challenges that come with a low-cost but scalable architecture.

In the interest of time — I just have a couple of slides left — what we have essentially done is decouple our tasks using SQS queues, and set up Lambdas to check whether there are any pending tasks and re-initiate them. There are also dead-letter queues, which re-initiate the processing after a certain number of tries. Finally, we use CloudWatch metrics to monitor our complete pipeline, and we have automated alerting through Slack: every time a task fails or succeeds, an alert is sent to Slack and emails go out to all the stakeholders. I think that's it — thank you all.

Unfortunately, we have run out of time for questions, but I am sure you can take them offline.