That was an amazing talk by Ben, and now we have Sagar on the screen. Hey Sagar, can you hear us? Yes, I can. Am I audible? Yeah, you're audible. Great. Okay, let me just quickly add your slide. Yeah, I guess the slide is visible as well, and people, get ready for the next talk. Sagar, this stage is all yours.

Perfect. Thank you so much. Hi folks, I wish you all a very good evening, and thank you for coming to my talk at PyCon 2020. The focus of this presentation is going to be on combining natural language processing with tabular features, and the context is specifically related to mapping clinical entities to the relevant section headers.

So, a bit about me. My name is Sagar Dawda. I work as a data scientist with Episource. Most of the work that we do is about designing and implementing solutions with the help of data, and I would say a variety of data. The majority of the data that we deal with is clinical text, so we perform a lot of MLP, I would say medical language processing, rather than colloquial natural language processing. I'm also a data science mentor at an ed-tech startup. Some of the projects that I've been working on recently: the one that we are going to discuss today is finding section headers and mapping an entity to its relevant section header. Another problem, which was very tough and also pretty rewarding, was breaking an entire PDF chart into different clinical encounters, and we've also built an ontology and a search engine for our entire healthcare domain, which is very specific to our corpus.

To give you a heads-up about what Episource is and about my team: Episource is a healthcare risk management firm, and we cater to big insurance providers and hospitals. We render our services in a very strict HIPAA-compliant environment, because being in healthcare, you are governed by HIPAA. We have a strong team of 20 folks, which includes data scientists and data engineers, and we also have subject matter experts, because a domain like healthcare requires a lot of expertise, and someone from a purely technical background would definitely not be good at bones and flesh. So we also have strong subject matter experts, and together we build robust and scalable healthcare solutions, definitely leveraging machine learning, and deep learning obviously comes as a part of it.

Now, before we move further, I would like to give you some background on what medical coding is, what this entire cycle of medical coding looks like, and also help you understand its importance. This is how it normally works in the US, because we are a US-based firm and we cater to clients and healthcare firms in the US. A patient checks in at a general physician, or maybe some specialist like a cardiologist or an ophthalmologist. There will be an insurance verification and eligibility check done to ensure that the healthcare plan supports the treatment the patient is seeking. Once that verification is done and the patient qualifies, the documents come to us: this patient check-in interaction, with its documentation, comes to one of the vendors like Episource for medical coding.
So this is where we capture all the coded diagnoses, the procedures, and a lot of other medical or clinical entities that are required. After capturing this, it is passed on for claim submission, and upon verification the claims are forwarded for payment processing. So we as an organization fit into step three. The kind of charts that we deal with have these patient check-in interactions for the last one or two years, depending on the requirements, so we normally deal with pretty huge charts, varying from 10 pages to almost eight and a half thousand pages depending on the complexity. That is why scalability is a very important factor for any solution that we build. People who are more curious about how this medical coding works, and who really want to jump into machine learning specifically in the healthcare domain, can go to this link and read more about it.

Now, coming to defining what a section header is, why it is so important, and what the business objective behind it was. Normally, a section header gives you a gist of the content that follows, and it also marks a change in section within a given document. If I were to relate it to a resume, you will normally see things like B.Com, M.Com, or maybe HSC or SSC; these will come under a section called "educational background", "academics", or something like "academic information", and then there will be another section that talks about your recent projects. That is what a section header tells you: it gives you a gist of what you can expect within a specific section, and it also indicates that there is now a change in the content.

Let me emphasize the importance of section headers in our case. When it comes to the clinical domain, section headers play a pivotal role in identifying the relevance of a disease. Whether a disease is relevant or not depends to a great extent on the section header it belongs to, and it also identifies the current state of the disease. There are certain treatments that require the condition to be in an active state for reimbursement. Let's take an example: you have insurance and you are suffering from a fracture. Yes, your claim will be reimbursed, because you are currently suffering from a fracture. But if someone identifies that this information about the fracture is from the past and it is not in an active state, no one is going to reimburse you for a past fracture. This is where section headers play a vital role in determining the current state of the disease or diagnosis that the interaction is all about, and this in turn has a direct impact on the claims or the reimbursements. What kind of impact? I have a separate slide on it.
So we will talk more about that. Having understood what section headers are and why they are so important for us, this led us to identify two separate objectives. The first objective was to identify what the section headers within a chart are, and the second objective was how to map an entity to its relevant section header. Now, section headers can be very tricky, and so can the entities; it totally depends on your chart format, the layout, and the kind of text that you have. So it boils down to what section one has identified, and how you ensure that the disease we are talking about is mapped to the relevant section.

This is an example. If you look at this chart, "Additional Risk Factors" is one of the major sections, but there are certain minor sections as well: suicide risk, anxiety, depression, review of systems. These are all different kinds of section headers. So the first objective was obviously to identify the section headers, and then to map them to the relevant entities.

Now, just like every other machine learning or deep learning project, this comes with its own challenges, and our problem is no different. These are some of the challenges that we faced. The first is that there is no standard chart format. When I say no standard chart format, this is in terms of presentation or layout. There are certain standards available that tell you what is supposed to be included in the chart. They describe a specific kind of structure called SOAP, the SOAP format. SOAP tells you what should be included: there is the Subjective part, which covers the past medical history, the current state of the disease, and the symptoms; then there is the Objective part, which covers any associated labs or any procedures that were done; then there is an Assessment, to identify what the condition looks like after the previous treatment; and then there is a Plan, which talks about treatment and any further diagnostics that are required. These standards dictate what is supposed to be included, but there will never be a "how"; no one can force you to provide information in a specific format. So you cannot expect everything to be tabulated, it is not necessary that everything is going to be in proper paragraphs or sentences, and there is no guarantee that it is going to be structured into those very specific sections. The providers are free to choose a format, a layout, and a division, and this creates a lot of problems.

That is one of the problems; the second one is multiple formats within a single chart. Like I mentioned earlier, a patient's interactions are captured for the entire year. If the patient has 25 visits in a year, they are not necessarily to the same doctor or the same provider; the patient can visit multiple doctors.
So let's say the first visit is to a general physician, the second visit maybe to a cardiologist, who is a specialist, and the third visit maybe to a lab to get some lab tests done. Every provider is going to be different, and the way every interaction is captured is also going to be different. When this information is aggregated for the last one or two years, we encounter multiple chart formats within a single document, and that really adds more complexity, because you cannot build custom rules that are very specific to a certain format.

Another challenge, where the previous models that we were using got confused, was that certain diseases were identified as section headers. Sometimes there are genuine cases where diseases are section headers, but there is always the problem of misclassification, and this is what the previous model was prone to. Another challenge was that the content of a section can span multiple pages, and in this situation, if you have a header and a footer in your chart, irrelevant sections may be picked up from there.

Barring those format- or layout-related errors, there are certain obvious errors that can give a lot of headaches, one of them being OCR errors, where the chart quality or the PDF image quality may be so bad that when it is converted into text, it comes out as gibberish or with misspelled words, and that can lead to a lot of problems as well. Then there is text alignment in multi-column layouts. We may have a chart that is divided into three columns, a left column, a center column, and a right column, and there is no guarantee that the text will come out in the sequence a human would expect, because an OCR-based solution can only give you the text; the sequence is not guaranteed. That definitely adds more complexity. And then there are contradicting table headers. This happens when the information about a disease is tabulated: you will have a section header called "Current Medical Problems", and below that you will have table headers like the name of the problem or disease, what the current state is, and whether there is an ongoing treatment, a yes/no kind of thing. Such additional information creates challenges for the project.

Now, to stress the kind of impact section headers can have, we classified the impact into four different buckets, two of which are acceptable and the other two unacceptable. This is very similar to a confusion matrix. The first one, which we can allow, is the ideal situation: the section identified is correct and the mapping is also valid for a given disease. What is the outcome of this situation? Accurate and valid diseases are captured for reimbursement, and this is exactly what we want for every chart and every disease. But what happens if we have identified the correct section, but the mapping is not accurate? So I identify that page two has a section header called "Active Problems", but the entity that is on page two is not mapped to that actual section header; it is mapped to some different section header. A claim that should have been reimbursed will not be reimbursed. Why? Because the section header defines whether the state of the current disease is active or not. This can lead to something called undercoding.
Now, when undercoding comes into the picture, the reimbursement amount also lessens, so a patient will not be reimbursed for what he or she should actually be reimbursed, and this definitely leads to a lot of penalties. The other situation, which is definitely unacceptable, is when the section identified is incorrect; the mapping may or may not be incorrect, but a valid disease is mapped to a wrong section header. What happens in this case? A disease that was part of a non-codable section header is mapped to a codable section. Let's take the example of a fracture again. The fracture was a situation of the past, but somehow the section header captured for it is "Current Problems". In this case, that leads to something called overcoding, which means the medical coders end up capturing additional codes, so while processing the claims, the insurance company ultimately pays out more money than what should have been reimbursed, and that is directly a revenue loss for the organization. In these two situations, where either your section identification is incorrect or your section mapping is incorrect, any one trigger within this combination can directly impact the revenue, lead to losses, and may also attract penalties. This is not a good situation to be in, so you have to be very careful about dealing with these things. And obviously there is the case where the section identified is correct, but the mapping is for an invalid disease, which means the disease is of the past but it is mapped to a section header that also indicates that it is a section for past diseases. When I say invalid disease, these are diseases for which there is no reimbursement as per the healthcare plan.

Now, how do we solve this problem, since we know it is of prime importance? When we started off, we took three approaches. The first was a typical sentence classification approach, the second was a linear-chain CRF approach, and the third was a pure metadata approach. I'll go over the approach overviews and we'll also talk about the pros and cons, and this slide should give you a fair idea that if you have to solve a specific problem that is highly dependent on your domain-related features, you may want to try a similar approach for your problem as well.

If you look at a typical sentence classification or text classification approach, we usually tokenize the text, lowercase it, perform stemming or lemmatization if required, remove the stop words, vectorize them using either a count vectorizer or a TF-IDF vectorizer, and then train the model and test it on a holdout dataset. In the CRF approach, you would similarly perform all the steps from tokenization to vectorization, but then instead of a plain machine learning model you would pass it on to a CRF model, something that spaCy is very good at, so we used spaCy. A metadata approach, on the other hand, has only tabular features: you convert categories into integers if there are any categories or strings in your data, perform the necessary steps of feature selection, stop words can be removed if you are going to add more text to it, and then you perform model training and testing.
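To make the first of those pipelines concrete, here is a minimal sketch of a TF-IDF based line classifier. This is not the speaker's actual code; the example lines, labels, and choice of classifier are illustrative assumptions only.

```python
# Minimal sketch of the first baseline: a TF-IDF bag-of-words classifier over
# candidate lines from a chart. Lines, labels, and model choice are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Candidate lines, labelled 1 if the line is a section header, 0 otherwise.
lines = [
    "CHIEF COMPLAINT",
    "Patient denies chest pain or shortness of breath",
    "REVIEW OF SYSTEMS",
    "Blood pressure 120/80, pulse 72",
]
labels = [1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(lines, labels)

# Predict on unseen lines; in practice you would evaluate on a held-out set.
print(pipeline.predict(["PAST MEDICAL HISTORY", "Patient reports mild headache"]))
```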
Now, what are the advantages of all these approaches? First, if we use TF-IDF as a vectorizer, it will give importance to rare words, and it is easy on computation as well. But the problem with this solution is that it is a simple bag-of-words approach, and it is going to ignore sequences and semantics. On the other hand, if we follow the CRF approach, which is, by the way, the shortlisted approach, the one in place prior to what we have implemented right now, it gives more importance to the sequence, ultimately captures the semantics, and is also not sensitive to imbalance, so if you have an imbalanced dataset, the CRF will take care of it. But the problem here is that it is computationally expensive, and it doesn't work well with unknown words or out-of-vocabulary data. Your typical TF-IDF vectorizers can be used for identifying the most important keywords, while an approach like CRF is best for named-entity-recognition kinds of problems. And if you look at the metadata approach, the good thing about it is that it works well with a limited set of data and there is no need for you to go for a complex deep learning approach. But the problem is that it ignores the text totally, unless you include a column of text, and there can be a lot of feature overlap. We witnessed a lot of feature overlap, which is why this approach was immediately discarded. This metadata kind of approach is useful for any tabular or structured data that you may have.

Now, to the drawbacks of the existing approach; the existing approach is the shortlisted CRF approach. The first is that section headers are not really a sequence problem. I've given a couple of examples to look at: if I say "chief complaint", then this looks like a relevant section header, but if there is a sentence in the document that says "patient is taking medications for all the conditions listed in chief complaint", a CRF, being a sequence model, is still going to pick up "chief complaint" as a section header, even though in reality, for this specific context, it is not a section header. This leads to something called Type I errors: we end up identifying a lot of irrelevant section headers, and this can lead to unwanted mappings. We know that unwanted mappings can lead to either overcoding or undercoding, and in both situations the company ends up losing money, either as a penalty or as an overpayment. The third problem is that if you have a tabular layout or a multi-column layout, the text sequence is not guaranteed to be proper. This is definitely an issue, because once the sequence is broken, the CRF approach, which relies so heavily on your sequences and learns from them, is definitely not going to work well.

This is an example of where things went wrong. Look at this: anxiety and depression here are actually section headers. We have three use cases for the CRF: we identify section headers, we identify diseases, and we also identify time attributes. The CRF picked up anxiety and depression as diseases here, but if you look at this chart format, fall risk, abuse and neglect, suicide risk, anxiety, and depression are all section headers, and then within each section there is information about that condition; so this part gives information about depression. This is not really a disease; it acts more like a section header. And here, HPI stands for history of present illness, and this was an incorrect section header that was identified.
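For reference, the previously shortlisted linear-chain CRF baseline can be sketched roughly as below. This uses sklearn-crfsuite rather than the spaCy-based setup the speaker mentions, and the sentence, features, and tags are made-up illustrations, not the production feature set.

```python
# Rough sketch of a linear-chain CRF sequence labeller for section headers.
# Uses sklearn-crfsuite for illustration; features and tags are assumptions.
import sklearn_crfsuite

def token_features(tokens, i):
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_upper": token.isupper(),
        "is_title": token.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One training "sentence": tokens tagged as section header (SEC) or other (O).
sentences = [["CHIEF", "COMPLAINT", "Patient", "reports", "anxiety"]]
tags = [["SEC", "SEC", "O", "O", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, tags)
print(crf.predict(X))
```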
So the solution is to combine your text features along with metadata. You normalize your text, match it against well-known headers, perform spell correction, and add text characteristics and positional information to get the combined features, like this: you have vectorized operations, and your vectorized matrix is stacked horizontally with your metadata, and that ultimately gives you the solution. This is what is passed on to the machine learning model, which does the section header classification, and then we have a probabilistic mapping on top of it. The other additional challenges that came along with this project were memory-efficient XML parsing, the processing time per chart, and the complexity versus accuracy trade-off.

So this is how we were able to map things properly: obesity is the actual disease, and it does not belong to Problem 3 or Problem 2 or Orders; it belongs to Impression and Recommendations. This is what we were able to achieve with the help of this hybrid approach. Some of the constraints that we took into account were the identification of the section, its position, its isolation, and the page layout. We take all six of these into account, and this gave us a massive boost of 19% in the overall mapping accuracy, so we went from 67% to 86%. That concludes the formal presentation, but I hope it was informative enough for you guys to understand how you can combine your text features with metadata and how that can boost your results. I'm open to any questions that you may have.

Great, thanks, Sagar. We have one question coming in: how much training data was used for each category, I mean, in the CRF approach? Yeah, so we had taken data from approximately 55,000 charts. We go in terms of charts, and if I have to translate that to pages, on average we have around 20 pages, so 55,000 charts multiplied by 20 pages; that many pages were the training data for the CRF.

We have another question: what was the approach to training for this? How was the data labeled? Yeah, that's a pretty good question. In fact, this is where our subject matter experts come into the picture. We had the data labeled in terms of offsets, so we first captured the offsets and then generated the features. The offsets were mapped to our table, and ultimately we classified them, or rather programmatically labeled them as one or zero depending on whether something is a valid section header or not. So the annotation was still a proper text-based annotation before we were able to extract the features.

We have another question: did you try word embeddings? Yeah, absolutely, a very good question. We did try word embeddings. In fact, the CRF approach that we are using already has 2 GB of PubMed word embeddings, which are trained on billions of data points. But sometimes these complex word embeddings may not really be a good enough solution, because they can miss important features which your vectorizers, like TF-IDF or a count vectorizer, can capture.

Another question: what was the final model used for the hybrid features? We are using an ensemble of a linear model and a forest-based model, a tree-based model rather.
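A minimal sketch of the hybrid idea described above: TF-IDF text vectors stacked horizontally with tabular and positional metadata, fed to an ensemble of a linear and a tree-based model with probabilistic outputs. The metadata columns, example lines, and model choices here are illustrative assumptions, not the actual production features.

```python
# Hybrid features sketch: TF-IDF text vectors hstacked with tabular metadata,
# classified by a soft-voting ensemble of a linear and a tree-based model.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

lines = [
    "ACTIVE PROBLEMS",
    "Hypertension, well controlled",
    "REVIEW OF SYSTEMS",
    "No acute distress",
]
labels = [1, 0, 1, 0]  # 1 = section header, 0 = other text

# Hypothetical metadata columns: [is_all_caps, relative_position_on_page, token_count]
metadata = np.array([
    [1, 0.05, 2],
    [0, 0.10, 3],
    [1, 0.40, 3],
    [0, 0.45, 3],
])

tfidf = TfidfVectorizer(lowercase=True)
X_text = tfidf.fit_transform(lines)
X = hstack([X_text, csr_matrix(metadata)]).tocsr()  # combined text + metadata features

model = VotingClassifier(
    estimators=[
        ("linear", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",  # probabilistic outputs support a downstream mapping step
)
model.fit(X, labels)
print(model.predict_proba(X))
```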
Could you talk more about the application of the metadata approach for medical texts? Yes, in fact, not just for medical but for any specific text-based challenge that you're working on, if it relies on domain-specific information, you always have to include the metadata. For healthcare in particular, we work on healthcare data that involves a lot of medical charts, but if you're working on medical images, then there is specific metadata related to the imaging part as well, and that is only going to boost your performance.

Was this data annotated by humans? Yes, yes, it was, by our own team; we haven't outsourced it.

Okay, and were any data annotation tools used? There is Amazon Ground Truth; that tool is used heavily by us. We also tried Prodigy, and there are a couple of other tools, like doccano, that we've tried in the past. They work pretty well as well.

Great, thanks a lot, Sagar. So, attendees, if you have any other question, go to the 2020 stage Delhi channel and you can ask Sagar any question. And there were a few requests, Sagar, from people asking about links to the presentation and the project you were describing, so whenever you get time, please share those links there. Yes. There's one last question if you want to take it. Okay, thank you, I'm just going to add that question here.

Was the CRF trained like an NER? Yes, that is correct, the CRF was trained like an NER. If you look at a medical chart, a section header is treated as an entity, a disease is also treated as an entity, and time attributes like your date of service or date of birth are also treated as entities. So we treat them as medical entities and try to extract them as in an NER.

Okay, thanks a lot, Sagar. Attendees, if you have any other question or you just want to connect with Sagar on the platform, feel free to reach out to him on Zulip; he'll be present on the 2020 stage Delhi channel. Sagar, take care. Thank you so much. Good luck, guys.

Okay, so we are closing in on the Delhi stage. This was the last presentation, the last session for the Delhi stage, and I guess the last session for the Delhi stage in 2020. That's pretty sad, but we'll be coming back again next year. So if you enjoyed it, feel free to share your feedback and reach out to the speakers in the Zulip chat. I'm just going to make a quick announcement about what's going to happen next. It's not going to happen on this stage, obviously; we will be closing this stage now. Only one stage will be available, and that is the Bangalore stage, and on that stage we have a keynote session. Now, that's interesting. Make sure you go to the Bangalore stage after the Delhi session. I have just a few quick videos lined up for you, so, yeah, maybe after the videos. Thank you.