Hi everybody, good afternoon. I'm Nishchil, VP of Engineering and Data Science at Omnius. Omnius is a Berlin-based company building AI products for the insurance industry to support the claims process. Today I'm going to talk about document digitization and how we are rethinking it with machine learning.

What this talk is not about is OCR. OCR does not mean you have digitized documents; OCR just enables your machines to read documents. What we are really solving at Omnius is understanding unstructured, complex documents and identifying semantic information in order to automate the claims handling process. If you put Omnius in the picture, we are looking at PDFs, emails, scans, WhatsApp images, photographs, and a whole lot of noise coming into our system, and Omnius makes sense of all of it to support claims handling.

I would really like to thank Sandhya for covering a lot of things in the earlier session today. I would have loved to speak after her, because I'm picking up what she said about data sets, accuracies on data sets, and how things fail in the real world. In an ideal world, you have a curated data set with lots of different documents and you expect your models to perform well. What we actually see and receive once the systems are deployed, to pick just one of the use cases, is forms and documents arriving in every possible state: crumpled, shot at odd angles, taken with shadows and prints and everything else. And the expectation is that we identify what's important, process it, and hand it to a claims adjuster so he can make the decision: can this claim be settled or not?

I'll be going through the journey of the last two and a half years at Omnius. At the very beginning, when we decided to get into the insurance industry, we were a team of five people on the tech side, and we said: okay, we're smart and we know why we're smart, so let's get this rolling. The coolest part is that we wrote a lot of rules. We didn't yet know exactly what we wanted to solve, so we wrote a lot of rules, compared them against the data set we had, and said: looks great. We were able to extract a lot of information, understand the noise, classify, and evaluate it. Then we pushed it to our very first customer, landed squarely in the 87% of failed projects that Sandhya was talking about, and started seeing real-world data.

One of the challenges of working in the insurance industry, especially with customers in Europe and the US, is that there are a lot of regulations on how they can provide data, and most of the time they don't: either the data doesn't exist on the cloud, or it can't be shared because it contains a lot of private information. If they gave us a medical insurance claim, we would know exactly what a person had gone through. Which means the data sets we work with are a fragment of what exists in the real world, and of course our systems are bound to fail. However, the beautiful thing about writing rules, about starting with something very simple, is that you have a baseline. Now you can start working your way up from there.
Now we had a baseline, and we said: okay, if we want to solve this problem, the worst we can do is about 58% accuracy. I'm using accuracy loosely here, because how we actually evaluate the solution changes a little later on.

The next thing we did was ask ourselves whether we had the courage to ask questions, and if yes, what kind of questions we wanted to ask. So we asked: how does a human solve this problem? How does a claims adjuster look at these documents when they arrive? What is he trying to do? Some of the documents he's familiar with, some he isn't, but he still reaches a conclusion. We came to the understanding that it's a combination of what a person sees and, with domain knowledge, the sense he makes of the text and the images in front of him.

Once we started asking these questions, the next step was to have the courage to ask more. There is so much you can do in any of these problem spaces, but given a small team and the focus and impact we wanted to have, we asked: which algorithms do we want to use? What kind of data do we want to feed into them? What are we actually evaluating: are we solving a data set here, or do we want to make a real impact with our product? And if we want to do all of this, what kind of human resources do we need? Do we rethink how we set up teams, or do we just hire a bunch of data scientists and one engineer and expect that engineer to build everything around the data science team?

Then we started to choose our battles. The first constraint was that we didn't have a lot of data, so different approaches were open to us. We could have used unsupervised learning to automate some of the training-data generation, but we were very, very clear that we wanted supervised learning at the core from the very beginning. Another early learning: before we jumped on the NLP wagon, we did a lot of computer vision, because our core team had a computer vision background. We tried object detection on documents, then message passing networks, and we built a lot of custom CNNs ourselves. What we realized was: yes, for understanding layouts, for recognizing that something is a table even when there are no lines or column identifiers, in invoices full of other information, vision started to make sense. But if you cannot group these elements without understanding the information that exists in the documents, you are bound to fail. So we used a combination of natural language processing and computer vision: sequence tagging frameworks, domain adaptation, and language modeling, combined with computer vision. Only when we combined both did we truly start to see that we could solve a problem the insurance industry could actually use. And to feed the models initially, given how little training data we had, we used some unsupervised learning to generate a data set.

Specifically, we did a little bit of work on something called deep clustering, which has become more visible now that Facebook has published a blog post about it. We took an already trained network, fed documents into it as images, and used the features for clustering, to see what kinds of document spaces we had and how we could use them for further processing.
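To make that concrete, here is a minimal sketch of the deep-clustering idea, not our actual code: embed each document page with a pretrained image backbone (torchvision's ResNet-50 here, using the newer `weights=` API) and cluster the features. The variable `document_image_paths` and the cluster count are placeholders.

```python
import torch
from torchvision import models, transforms
from sklearn.cluster import KMeans
from PIL import Image

# Pretrained backbone with the classification head removed: we only
# want feature vectors, not ImageNet class labels.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    # Each document page has been rendered to an image file beforehand.
    batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                         for p in paths])
    return backbone(batch).numpy()

features = embed(document_image_paths)  # placeholder list of page images
labels = KMeans(n_clusters=16, random_state=0).fit_predict(features)
# Each cluster is then inspected by hand and named after the document
# type it appears to capture (invoices, claim forms, letters, ...).
```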
Of course, the emphasis stays on supervised learning. I think everybody here is aware of Richard Tucker, and we at Omnius also believe that most of the work we want to do sits in the supervised learning space, although we now see some unsupervised learning coming up that can have a big impact.

The next question was: we have some sort of data, and we want to generate good, well-annotated training data. When we started about two and a half years ago, the systems we looked at had no mechanism to annotate for computer vision and for text at the same time. Everybody was busy with autonomous cars, autonomous vehicles, autonomous drones, so a lot of companies had built annotation platforms, but mostly for computer vision. We decided to combine both and built tooling in-house where we could label the text, identify which text is important, build relationships between pieces of text and their hierarchy, and also group the layouts. With this in-house annotation system, we took the publicly available RVL-CDIP dataset and worked with an annotation company: we gave them the tools, we gave them the data, and we started getting enough training data to see which algorithms we could actually use.

Now, building a team. I'm sure a bunch of you have heard this before, but from the very beginning we have been a very MLOps-focused team. One thing we convinced the data scientists of is that if they can write deep learning networks, they can definitely build Docker containers. We did not want to build an engineering team around the data scientists; if that happens, taking things into production, running proofs of concept, and running on different environments becomes a big issue.

The other thing that worked out great for us was offering master's thesis projects to students in Germany. We had some core research problems, a thesis runs about six months, and it gave us a way to identify talent we could eventually bring into the team. Last year we had two thesis students: one working on generative adversarial networks to generate training samples for us, and another who did fantastic work on natural language processing and how we could actually use it. The cool part was that not only did their research go into our product, and to some extent saved it, but the students loved being part of the team. Once they finished their theses, we were able to convince them to join, and no onboarding was required, because they already knew the problems we were solving and the research questions we could take forward.

And of course, if you're building a startup and aren't aware of this: cloud companies like Google, AWS, and Microsoft are friendly enough to give you credits. Google in particular helped us a lot initially, because the credits they gave us came with absolutely no strings attached.
So we could use multiple GPUs, spin up different environments, and do whatever we wanted. This also saves you a lot of the initial investment you might otherwise make in infrastructure. You should definitely get in touch with these teams, and in case you don't know anybody, feel free to reach out; I can connect you with a bunch of them.

The next challenge for us was: how do we evaluate the solution we are building? We had a pipeline of AI models: natural language processing based classification, computer vision based classification, sequence tagging systems, and some heuristics-based systems, all put together to solve the problem. We decided that while it is good for us to understand how each model performs individually, the end customer, especially someone who is not a data scientist, evaluates the problem differently. So we moved to end-to-end evaluation. The most important thing in claims adjustment or claims processing is, in the end, how much of it can be automated. Customers do not care that you are using state-of-the-art computer vision or state-of-the-art natural language processing. All they care about is: if I have 100,000 claims to settle in a week, how much of that can I do with as little human intervention as possible? Which is why, once we deployed some of our new changes, our evaluation changed with them. We started looking at a different dimension: can we automate a document at a certain confidence such that the accepted errors stay below a certain percentage? That is something a business user can understand. If you go to a business user and tell him you'll give him an F1 score of 95%, it means nothing to him. If you tell him you'll give him 98% accuracy, what impact does that have on him? Once we rethought evaluation this way and made it transparent what the customer gets as return on investment for investing in AI and an entire product, it started making sense to them.
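To make this evaluation concrete, here is a small sketch, with my own function and variable names: given per-document model confidences and whether each prediction was correct on a validation set, find the largest share of documents that can skip human review while the error rate among the automated ones stays within an agreed budget.

```python
import numpy as np

def automation_rate(confidences, is_correct, error_budget=0.02):
    """Largest fraction of documents that can be processed without human
    review while the error rate among those documents stays in budget.
    `is_correct` is a boolean array aligned with `confidences`."""
    order = np.argsort(confidences)[::-1]        # most confident first
    errors = np.cumsum(~is_correct[order])       # running error count
    counts = np.arange(1, len(order) + 1)
    in_budget = np.nonzero(errors / counts <= error_budget)[0]
    if len(in_budget) == 0:
        return 0.0, None
    k = in_budget[-1] + 1                        # documents we automate
    threshold = confidences[order[k - 1]]        # confidence cut-off to deploy
    return k / len(order), threshold

# Reads as: "x% of documents can be settled automatically while keeping
# accepted errors below 2%", e.g. on hypothetical validation arrays:
rate, threshold = automation_rate(val_confidences, val_is_correct)
```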
Now, to build all this: in the first year a lot of it ran on Google Cloud, because we were part of the startup program. Everything was containerized so we could run it anywhere, on our own infrastructure, on our laptops, and of course on the cloud. But once we won a few customers with our AI results, we had to go live. I don't know how many of you work in the enterprise world, but going live there is not like B2C, where you deploy on the cloud, you have your cool DevOps team, everybody is monitoring, and you can make changes on the fly. We were dealing with one of the oldest industries in the world; you can find traces of insurance as far back as Hammurabi's code. Insurance has existed for a very long time, which means these companies are not on the cloud, they are completely on-premise, and you have absolutely no way of understanding or knowing what's going on. Yet you need to ship something quite new, built on algorithms whose real-world behavior you don't fully know, and the entire control is in the hands of a customer, a business user who also does not know what the system can do.

So it was go live or go home. We started by thinking about giving customers the capability to train their own models: we had the annotation tools, we had the models, we knew which problem we could solve. So we brought the human into the loop. With AI assistance, where they used to spend about 30 minutes understanding a claim, they could now do it in five. And we built a loop around it: trained models predict, the predictions assist the human, the human corrects the errors, and the corrections are used for retraining. The model could continuously improve, but it could also deteriorate, which meant we had to stay focused on our domain. We were very sure we were not going to build a horizontal AI system; we were not going to engineer a system that can solve any problem. We stayed very focused on what we could do in this domain.

And we had to educate our customers, and in fact our own team. When we talked to our C-level team and to marketing and sales, you could see that their understanding of AI came from Google. They would google "what is AI, what is deep learning", and all they could see were beautiful, fancy things where AI works out of the box with 95 to 98% accuracy, one AI tool that does everything. Of course your customers think the same way. If you tell them your AI can only do so much, the question they will ask is: then why should we be your customer? If your AI cannot solve the problem, you must be doing something wrong, because Google, Facebook, LinkedIn, and Microsoft do all of this so much better. So you have to educate your customers: make them understand the problems of the real world, how it differs from looking at papers and data sets, and what the true long-term impact is for them.

The third thing we had to pay attention to was solving this end to end, closing the entire engineering product around it, because we were not selling AI. We were selling a product that works with AI. If you sell AI, you are selling consultancy, not a product. This gave rise to the Omnius platform, which consists of three major components. One is the training platform: when we deploy for a customer, they can use the annotation system, train their models on the fly on their own GPUs or CPUs, on the infrastructure they have, and then move these trained models into a pipeline that processes documents and returns predictions. They can validate the predictions and feed them back for retraining. And of course, when you ship enterprise software, you need to make sure that at any given point people can understand what is actually happening inside it. Did the training job go well? How much CPU and memory is being used? Are services going down and coming back up? All the infrastructure logs, the application logs, the monitoring: we had to build all of this into our platform.

As I already touched upon, we had our annotation system, and the challenge was: what should the schema of the annotation system be?
As we started working with insurance customers and their documents, and saw more and more problems, defining what the schema should be turned out to be a very big problem. Initially, of course, we said: okay, let's put everything in and convert. We had an XML model at first, and we said XML is so old school, nobody works with XML anymore, let's do JSON. In this conflict between engineering fashion and what we really needed, we stayed with XML, and one thing I want to point out is that it's very important to have these engineering discussions, because choosing whatever is latest and greatest is not going to help you solve a problem if you don't understand what problem you want to solve. We had deep hierarchies of information, and both computer vision algorithms and natural language processing algorithms writing to and working with the same schema. An XML schema gives you a beautiful notation for as many levels of hierarchy as you need and for connecting different sorts of annotations to each other, which plain JSON normally does not let you do. So we built our core annotation system with XML as the core schema, and for visual representation we had a way to convert XML to JSON and back into XML.
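As a toy illustration of why that cross-referencing mattered, here is a sketch using Python's standard library; the element and attribute names are invented, not our actual schema. The point is that ID and reference attributes let a semantic annotation point at word-level annotations elsewhere in the tree without duplicating or re-nesting them, which a plain JSON tree does not give you for free.

```python
import xml.etree.ElementTree as ET

# Toy document: a page with two word-level annotations, plus semantic
# annotations that *reference* those words by ID instead of nesting them.
doc = ET.fromstring("""
<document>
  <page number="1">
    <word id="w1" bbox="50 120 180 140">Invoice</word>
    <word id="w2" bbox="400 560 470 580">249.90</word>
  </page>
  <entities>
    <entity type="invoice_total" refs="w2"/>
    <group type="header" refs="w1 w2"/>
  </entities>
</document>
""")

words = {w.get("id"): w for w in doc.iter("word")}
for ann in doc.find("entities"):
    linked = [words[ref].text for ref in ann.get("refs").split()]
    print(ann.tag, ann.get("type"), "->", linked)
```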
The next thing was the ability to train models on the fly. One of the architecture decisions we made at the very beginning was to adopt Kubernetes. The reason was not that it is one of the biggest Cloud Native projects, or that it is Google's open-source offshoot of Borg. The reason was that we were a small team supporting customers across Europe and the US, and we could not build a service engineering team to be on pager duty whenever something went down or came up. Kubernetes was the choice that let us run on any infrastructure, including our laptops, with the same capability for pushing our software and having something manage it for us. Kubernetes takes care of services going down and coming back up, lets you assign this much CPU and this much memory to each service, and lets you declare which services need a GPU and which don't, so you can mount GPUs on the fly. This was actually one of the best decisions we made, given the wide range of customers we started seeing, and we did not shy away from on-premise: Kubernetes being open source, you can deploy it on-premise as is, or you can use Red Hat's enterprise Kubernetes distribution if the customer is willing to pay the license. It keeps your costs at a minimum, has a system manage all your services, and still lets you push code that is not legacy in any manner.

Then we built out the prediction side. We built async APIs to handle the throughput of incoming predictions: whether 100,000 documents arrive in a day, or 10,000, or 50, it does not matter to the system, because everything is throttled. Async APIs also suit the downstream systems, and when I say downstream systems I am not talking about new-age systems with callbacks; I am talking about mainframe systems from the 1970s and 1980s that hold a lot of data and connect to our systems to work with them. Microservices also played a huge role: they gave us the flexibility to change our pipelines and to pull AI services in and out as required, so we didn't have to build one pipeline or one product that does it all. Depending on what the customer wants to solve, they can choose the modules they really need.

One other aspect that was important from the word go was how we looked at the data engineering part of the product. Initially we built a Java workflow engine ourselves, which was one of our first mistakes; it set us back about two to three months, because building something from scratch for a problem that's already solved, just to be proprietary, does not make sense. Then we evaluated data pipeline frameworks you can use out of the box: Luigi from Spotify, Airflow from Airbnb, and Apache NiFi. We looked at how we wanted to connect the dots between our microservices and decided to go with Airflow, partly because Airflow supports a few features that Apache NiFi currently does not. I would definitely suggest evaluating a lot of open source systems before writing something of your own. And now, given the throughput we want to support, we are evaluating Kafka as a replacement for Airflow; that's probably one of the next updates we will push with our product.
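As a sketch of what connecting microservices through Airflow can look like (Airflow 2 import paths; the task names and service endpoints are invented for illustration):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def call_service(endpoint, **context):
    # In the real pipeline each task would hand the document batch to one
    # containerized microservice; here we only sketch the wiring.
    print(f"would call {endpoint}")

with DAG(dag_id="claims_document_pipeline",
         start_date=datetime(2019, 1, 1),
         schedule_interval=None,      # triggered per incoming batch
         catchup=False) as dag:
    ocr = PythonOperator(task_id="ocr", python_callable=call_service,
                         op_kwargs={"endpoint": "http://ocr:8080"})
    classify = PythonOperator(task_id="classify", python_callable=call_service,
                              op_kwargs={"endpoint": "http://classifier:8080"})
    extract = PythonOperator(task_id="extract", python_callable=call_service,
                             op_kwargs={"endpoint": "http://extractor:8080"})
    ocr >> classify >> extract
```

Because each stage is just a task calling a service, pulling an AI module in or out of a customer's pipeline is a change to the DAG, not to the services themselves.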
Once all of this is running, you need to be able to monitor it. Is your infrastructure up? Are your services up? Are some of your services misfiring? What's happening in the application logs and the infrastructure logs? Can you do user management through the platform? A lot of these enterprise companies have their own users, their Active Directories, their LDAPs; how do they connect to your system? And the biggest challenge in all of this was configuration management, because every customer needs a different configuration: no two customers have the same environment, let alone the same configuration. How do we make all of this transparent to the customer and usable for them? For user and role management we used Keycloak. For configuration management we used Helm with Kubernetes, which lets you inject configuration on the fly into your containers, your services, and your Kubernetes jobs. For infrastructure logs we used Prometheus, visualized with Grafana, which shows everything your infrastructure is using at any given point in time, with alarms and notifications set up where required. And for application logs, we ship them to an ELK stack that also ships with our product, with the flexibility that if a customer already has a central logging server, they can use something like a Fluentd or Filebeat forwarder to send all the logs there. So our tech stack exploded, and we started using a lot of other tooling.

One thing I forgot to mention: how many of you here are working with version control of your machine learning models? This is actually very, very important. Think about it: you've given your customer, or even your own team, a certain data set, and you've changed hyperparameters, changed your algorithm, changed the architecture. Version-controlling code has been prevalent for quite some time, but with machine learning, deep learning, and lots of data, you also need to version-control your data and your models, because you need to know which code generated which model, what the output of that model was, and which data went in. Think about the retraining loop: you're using AI in production and you do not know whether your model deteriorated, whether this is really the model you want in production. And what if you want to roll back? You can roll back code, you can roll back services; can you roll back your models?

When we first evaluated this, saner minds did not prevail: we spent about a month and a half writing our own tooling, thinking we were doing something great. Then I presented that tooling at a conference, which is why I really like attending conferences, because someone in the audience said: but there's DVC, there's MLflow, why are you building something of your own? We realized we had been using the wrong Google search queries to find the open source systems that already exist. We stumbled upon DVC, which is actually a cool team building fantastic stuff for model versioning, and MLflow by Databricks. Currently we use MLflow, for a few reasons, among them that it also gives us MLflow Tracking and experiments. Whatever you do with your PyTorch, scikit-learn, Keras, or TensorFlow models, you can version the model and get a dashboard saying this model ran at this time with these hyperparameters, together with all the artifacts that were generated, your confusion matrix and your evaluation curves. A business user can actually scroll through the history of models over the entire lifetime.
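A minimal sketch of what such a tracked run looks like with MLflow's Python API; the experiment name, hyperparameters, metric value, and paths are placeholders, and tagging the run with Git commit hashes is one way of tying the model back to the code and data snapshots discussed here.

```python
import subprocess
import mlflow

def git_commit(repo_path="."):
    # Hash of the current code or data snapshot, so the run can be traced.
    return subprocess.check_output(
        ["git", "-C", repo_path, "rev-parse", "HEAD"], text=True).strip()

mlflow.set_experiment("sequence-tagging")
with mlflow.start_run():
    mlflow.set_tag("code_commit", git_commit("."))
    mlflow.set_tag("data_commit", git_commit("/data/annotations"))  # hypothetical data repo
    mlflow.log_params({"lr": 3e-4, "epochs": 20, "architecture": "bilstm-crf"})

    # ... training loop would run here ...

    mlflow.log_metric("val_f1", 0.91)  # placeholder for the real metric
    mlflow.log_artifact("reports/confusion_matrix.png")  # hypothetical artifact
```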
The next thing: we could not expect our customers to buy Tableau licenses just to understand business metrics, which are also important. Raghotam, who is currently at Ericsson, suggested Metabase, which they were using when he was at TiVo; it's a very cool BI visualization tool, and we started shipping it with our platform so business users get relevant metrics on what their system is currently doing.

The other part, which I touched on just now, is data versioning. How many of you version your data, version your databases, version what exactly is going on? A very big suggestion: start doing it. You need to version your data, because if your data changes, your model changes, your hyperparameters change, and you've lost track of what your model is actually good at and what it is really learning. Because we deal with a lot of unstructured data, and with XMLs, the best fit for us was GitLab with Git LFS (Large File Storage), which we use to track all our XMLs. If someone changed the data, if a human in the loop modified it, if a system changed something, we can track everything: it is version control over our entire data history. Before we run an experiment, we take a snapshot, so we do releases of data, releases of code, and releases of models, all tied together. There is no such thing as a failed experiment; an experiment gives you results. If two years from now you have a brainwave that says this experiment was relevant for the data set you had, you need to be able to reproduce it without guesswork. Git versioning with Git LFS gives you the capability to version your data as well, and if you're dealing with databases, you can take a dump of the entire database, release it, push it to an object store, and build reports on top of it.

The next thing, which I don't have on the slide but which we also do, is releases of reports: we version-control our reports of experiments, of end-to-end evaluations, and of the evaluation of our machine learning models, because you need to know what actually changed when you run the same experiment again. Some of the data scientists on the team have been cool enough to automate reporting, so reports are generated on the fly after an experiment runs, pushed to our internal Git repository, and version-controlled as well.

Overall, what I would emphasize, and I told Xenob this when we talked about the talk, is that I'm not going to talk about deep learning architectures, and I'm not going to talk about the math, because there is so much content available out there. I would really like to talk about what it means to take AI into production, and the MLOps involved, because people don't talk about what it means to take AI into production. So please think about this, and not just about taking AI into production. One of the learnings we've had is that if you think engineering into your AI, you actually start gaining time, which sounds very weird, to run a whole lot of experiments in parallel, because once the framework is set up, you can run your experiments and be sure the results are something you can evaluate. Also, building AI models just to solve a data set is a very big problem. You need to think about the return on investment: why are you building this AI model, and what is it really solving for the customer? Which means it's important to focus on your domain and on the problem at hand, and please don't assume that only deep learning or machine learning algorithms can solve your problem. We use a lot of heuristic measures.

For example, a big thing that is very relevant for us in generating annotated data is something called the Levenshtein distance. There is often data already present in an Excel file or a database that we can use. Expecting 10,000 documents to be annotated by 50 people over a span of six months takes far too long, so we semi-automate the entire annotation job by matching the data that is already present in databases and Excel exports against the documents.
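A sketch of that matching idea, with my own helper names: for each field value we already know from the customer's database or Excel export, find the OCR token with the smallest Levenshtein distance and, if it is close enough, emit it as a weak annotation. The edit-distance function is written out so the sketch has no dependencies.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def weak_label(ocr_tokens, known_value, max_dist=2):
    """Return the OCR token closest to a value we already know, if it is
    close enough to trust as an automatically generated annotation."""
    best = min(ocr_tokens,
               key=lambda t: levenshtein(t.lower(), known_value.lower()))
    if levenshtein(best.lower(), known_value.lower()) <= max_dist:
        return best
    return None

# e.g. matching a known invoice number against noisy OCR output:
print(weak_label(["RE-2O491", "Betrag", "249,90"], "RE-20491"))  # "RE-2O491"
```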
So please think about this. At least for us, something that really helped is that the entire engineering team and the data science team work together and question each other: do we really need to write a deep learning network for this? Do we really need to train a classifier for this? Can we not use heuristic measures or algorithms that are already tried and tested?

One other important thing, which Sandhya also touched on, is that at this point in time assisted AI matters much more than claiming AI can solve everything. The people using your systems are people who know what they're doing. So bring the human into the loop early, when you're testing the models or when someone is validating what comes out. It also builds trust, because now they can see what your models can do and what they are not capable of doing, which in turn gives you an understanding of whether your model is merely running on correlations or is actually capable of what you hope.

Visualization is a very important feature. We built a lot of visualization around our computer vision work and our natural language processing work. Most of the time, data scientists hand you a graph that says confusion matrix, they hand you a loss curve, and they say the model is good. But do you know what exactly the model is learning? Even for sequence tagging, for BERT, for language modeling, you can visualize things in a very nice way; people have published open source utilities for exactly this. In the end, you are assigning probabilities to events. So if you're doing information extraction or language modeling, you can show what the system has learned: take the probabilities of what a certain token can be and associate them with what you're actually looking for. Visualization helped us a lot in understanding what our models were trying to do, and it is also helping us generate synthetic data that teaches the models what they're supposed to do, based on the data set we have.

Last but not least, something we've introduced now, and I know a bunch of you work in startups and a few in core AI teams at enterprise companies: bring in the idea of release process engineering. What you're really trying to do there is automate, and build tools for, your engineering team, your DevOps, your MLOps, and your data scientists, so they can release their code as software that can actually be used and tested across your entire infrastructure. That gives you more time to build new things, more time to improve quality, and you know exactly what you're shipping to your customer.

With that, I'm done. Open for questions.
So I think we have five minutes.

Question: Hi. You mentioned you allow customers to do retraining of models. What are the safeguards there? There can be class imbalance, label noise...

Of course. So we have a lot of safeguards, and in some places none at all. Some safeguards are easy to build. We have data reports that are generated before a model is trained: we actually tell the customer, if they're training a classification algorithm, that their data is imbalanced and why, and we do the same for the more complex natural language processing and computer vision models. So as part of the retraining loop, before they retrain, and as an artifact of the entire process, they get data reports describing their data. And given that the models are versioned, they know whether the model deteriorated and whether it was tested against the right thing.

Question: Hi Nishchil, this is Jain. Great talk. My question is about model versioning and data versioning. It's something we're doing too, but there is one caveat: mapping the model architecture, the model, and the data. Updating the data doesn't only mean adding new rows; it also means adding new features. And when the features change, the model architecture changes, and the hyperparameter tuning changes too. So how do you map the versions between the model architecture, the model, the data, and the source code?

Okay, wow. I'll try to keep it short. One of the things we do, or try to do, every time the data changes and a version or release is cut, is generate a report of what changed. In the same way, when the model architecture changes, we have a report and an artifact of what actually changed. So you can look at what happened between the last two versions, go all the way back in time, and keep track of the hyperparameters as well. We document everything: which hyperparameters went in, what architecture was changed, why it was changed, and generate a report for it.