Hi, good morning, everyone. My name is Aman Sena, and I'll be presenting the topic of artificial-intelligence-based container monitoring, healing, and optimizing. There are two other speakers with me, Sri Krishna and Sachin Joshi, but unfortunately, due to some visa issues, they were not able to make it to the conference. They are available over the online meeting, though, and they'll be able to take any questions that come up here. So let's get started. First, the agenda for this presentation. We will start by trying to understand the monitoring capabilities we have on containers. Then we will look at the possible causes of failure in a container ecosystem. When I talk about a container ecosystem, I mean a system managed by a container orchestration engine like Docker Swarm or Kubernetes. We will see what kinds of problems can happen in such an ecosystem when we deploy a large-scale application on top of it. Then we will move on to the need for AI to analyze the problems that are happening there and solve them automatically, without any human intervention. We will go through the key concepts of AI with respect to container monitoring, and we will end with integrating the AI engine back into the container ecosystem and seeing how we solve some of these problems. This idea came from a project we were working on. There were some recurring problems that came up every time, and every time we engineers had to log into the system and go through the logs of each and every microservice, only to find out in the end that it was the same problem that had happened last time. We found that there were four or five problems that kept recurring, and they needed to be handled differently than the way we were handling them then.
That is where the idea for this presentation came from. Throughout the presentation, we will first try to understand what kinds of problems I'm talking about, and then we will move to a model through which we can solve them. A very brief description of containers, since we all understand what containers are: containers are a method of operating-system virtualization that allows us to run an application and its dependencies in a resource-isolated process. Traditionally, when we talk about monitoring, we monitor the processors, the memory, the network, and the storage. Monitoring is a systematic review process through which we watch these aspects of the system, and if there is any issue, we raise an alert or a notification. There are many monitoring tools available for containers. There are tools like Docker stats or cAdvisor, which work very well with the Docker container system. Lately, Prometheus is one tool that has gained a lot of traction in the market, and we are also using Prometheus for our microservice-based monitoring. There are many challenges in monitoring a container ecosystem. I will not go into the details of these challenges, as they are a huge topic in themselves. What we are trying to do is first understand what kind of problem statement we are trying to solve, and then approach a solution for it. Some of the challenges: we need to maintain container security, and the infrastructure needs to be optimized for containers. There are specific tools for different kinds of monitoring, and there is not a single tool that can monitor every aspect of a container ecosystem. Moreover, when we have so many containers running, huge data sets of logs are generated, and even for small issues, human intervention is required. That is what we want to avoid.
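As a small illustration of the basic monitoring the talk starts from, here is a sketch of turning per-container CPU and memory readings (such as the output of `docker stats --no-stream --format "{{.Name}} {{.CPUPerc}} {{.MemPerc}}"`) into alert candidates. The sample lines and the threshold values are illustrative assumptions, not part of the talk's project.

```python
# Sketch: flag containers whose CPU or memory usage crosses a limit.
# The input format and thresholds below are illustrative assumptions.

def parse_stats(lines, cpu_limit=80.0, mem_limit=90.0):
    """Return the names of containers whose CPU or memory usage crosses a limit."""
    alerts = []
    for line in lines:
        name, cpu, mem = line.split()
        if float(cpu.rstrip("%")) > cpu_limit or float(mem.rstrip("%")) > mem_limit:
            alerts.append(name)
    return alerts

sample = ["web-1 12.3% 41.0%", "elasticsearch 91.5% 88.2%"]
print(parse_stats(sample))  # only elasticsearch crosses a limit
```

This is the kind of threshold-based check that tools like cAdvisor or Prometheus alerting rules do at scale; the rest of the talk is about what happens when thresholds alone are not enough.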
So we tried to categorize some of the issues that happen in a container ecosystem. Again, when I talk about a container ecosystem, I mean a system managed by a container orchestration engine like Docker Swarm or Kubernetes. At the first level, there are system-level issues. What are these? We all know that these container orchestration systems have their own monitoring capabilities. They keep monitoring the containers running in the system, the load on those containers, and the network. Whenever there is load on a container, the orchestrator tries to load-balance, or it creates a new instance of the same container and then load-balances. So some problems are handled by the container orchestration system itself: it monitors the different metrics of the containers and manages their lifecycle. Then there are application-dependency issues. When we talk about microservices running in this kind of environment, there are many dependencies between those microservices. There are business microservices, and there are infrastructure microservices such as databases and caches, and all of them are interdependent. That kind of scenario is our focus today: an error happens at some infrastructure level, it creates a cascading effect, and errors are generated in all the business microservices as well. Lastly, there are code-oriented issues that generate errors in the logs and have to be handled at the programming level itself. Our main focus will be the application-dependency issues. These are the issues that are not handled by the container orchestration system; they need to be monitored differently, and there are many tools available to monitor such systems.
When we talk about a large application, it is made up of not only the business microservices but also a lot of infra services. These infra services may be monitoring tools, or there can be databases, message queues, and so on. As I already said, a failure in any of these places can cause a cascading effect, and failures are then seen everywhere. Let's try to understand this with a very small use case. In this use case, there are a few microservices running, and then there is Elasticsearch, which is a RESTful search and analytics engine. The logs generated by these microservices are stored in indexes within Elasticsearch. The importance of Elasticsearch is that we can write queries on these indexes and filter and retrieve the data that we want. Logstash is basically an agent that collects the data from these microservices; finally, the data is stored in Elasticsearch indexes. So, overall, the process is: there are microservices running that generate logs; Logstash, as an agent, collects the logs from those microservices; and finally the logs are stored in Elasticsearch indexes. Now let's say the container on which Elasticsearch is running becomes full: there is no more space to store logs. What used to happen is that as soon as this kind of event occurred, errors were generated in all the microservices, and suddenly all ongoing operations stopped. Nothing worked. Every time, an engineer had to go in and look into the issue, and finally we would find out that the Elasticsearch volume was full and that was why such errors were coming. We used to delete the older indexes, the older logs that were not so useful to us, and then the system would come back into a running state.
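The manual fix described above, deleting the oldest Elasticsearch indexes when the volume fills up, can be sketched as a small retention policy. The date-stamped index naming (`logs-YYYY.MM.DD`, the usual Logstash convention) and the retention count are assumptions for illustration; the actual delete would go through the Elasticsearch delete-index API.

```python
# Sketch: choose which date-stamped log indexes to drop, oldest first,
# keeping only the most recent `keep` of them. Index names and the
# retention count are illustrative assumptions.

def indexes_to_delete(index_names, keep=3):
    """Return the date-stamped indexes to delete, oldest first."""
    dated = sorted(n for n in index_names if n.startswith("logs-"))
    return dated[:-keep] if len(dated) > keep else []

indexes = ["logs-2019.01.01", "logs-2019.01.02", "logs-2019.01.03",
           "logs-2019.01.04", ".kibana"]
print(indexes_to_delete(indexes))  # oldest index beyond the retention window
```

Automating exactly this decision, instead of waiting for an engineer to log in and run it by hand, is the motivating example for the rest of the talk.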
There can be other kinds of infrastructure issues as well. Sometimes the message queue is overloaded, and that causes a cascading effect. There are connection issues to the database server, or the database is not yet up and some operation is stuck in between. These are recurring issues; they are not a fixed set, and they can come up time and again in this kind of environment. This is the kind of issue we are trying to solve here. So, why AI? This brings us to the question of why we need AI to solve such issues. Since huge data sets are generated, AI is what can analyze them and provide active runtime monitoring on those containers. Recurring issues can be solved by predictive models. And there won't be any human interference required: the model will find the issues, and automatic remedial scripts can just go ahead and fix them. A person sitting outside the container world will not even know that such issues came up. Now that most of us understand the problem statement, I will compare the traditional way of container monitoring with the machine-learning way. Traditionally, we used to look into the logs of each and every container, identify the problems, and finally decide how to fix them. But with AI, even before such a problem happens, we will know the possible error logs. The AI engine will be trained well enough that it can do predictive analysis on those problems, and we will let the AI engine decide what to do when that situation happens the very next time. The mantra here is: detect the problem, heal the problem, and, as we go on, optimize around the problem. There is a difference between detecting and optimizing.
When we say optimize, we mean that even before the problem can occur, we try to solve it, so there won't be any kind of stop in the business application; it will be seamlessly running. In the detect cycle, we try to detect the issue by recognizing recurring patterns in the logs. As the model gets more trained, it will try to identify unknown issues as well and categorize them into a different bucket. The next action is the heal action, where, as and when we detect an issue, we have some kind of script or some kind of DevOps tool that can go and talk to the container system and heal it. The final thing is to optimize. As these models get more and more trained, they will try to figure out, even before a problem happens, that such a problem may happen in the future, and fix it in advance. That is where we want our AI engine to get to: optimizing based on its learnings. I already talked about two modules: the container ecosystem, and the Elasticsearch and Logstash module. The container ecosystem generates the logs, which are collected by Logstash and stored in Elasticsearch. Now we bring in two more modules. One is the AI core module, built using machine-learning libraries. The other is an action center module, which holds user-defined rules and actions and can trigger remedial events on those containers. So this is the final set of modules: there is a container ecosystem generating logs stored in these indexes, and the AI core picks up the logs from the indexes, first cleans up the data, molds it, and takes out the parameters that are not required.
This data is then passed to the data classifier module, where our actual algorithm sits and does the predictive analysis. The AI core and action center modules are tied together by a module inside the action center called the pattern matcher. This is the one that actually matches the patterns that the data classifier has produced. Based on those matched patterns, there is a set of rules and a set of actions defined, and the pattern matcher can talk to the event trigger, which finally takes a remedial action on the container ecosystem. Now we will go through each module and try to understand what exactly is done in each one. The first module is the historical data store. Its task is to prepare the data. As we know, in machine learning the data is never present in the desired format; a lot of cleanup and molding is required so that the data can be consumed by the AI algorithms for predictive analysis. This module picks the data from the Elasticsearch indexes, then inspects and cleans it. The data is collected in the form of CSVs, and cleaning removes the unused columns and any columns full of null values. Then we explore the data: we look at the different data structures and the different parameters present. Finally, we mold the data, which means adjusting the data types and possibly adding more columns needed for our requirement. What we end up preparing is called tidy data.
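The cleanup steps just described can be sketched with the Pandas library, which the talk names as the data-cleanup tool. The column names below ("service", "level", "message", "debug_info") are hypothetical examples, not the project's real log schema.

```python
# Sketch of the historical-data-store cleanup: drop all-null columns,
# drop incomplete rows, and add a label column, ending with tidy data
# (one variable per column, one observation per row).
# Column names are illustrative assumptions.
import pandas as pd

raw = pd.DataFrame({
    "service": ["orders", "orders", None],
    "level": ["ERROR", "INFO", "ERROR"],
    "message": ["disk full", "ok", "disk full"],
    "debug_info": [None, None, None],   # entirely null -> not useful
})

tidy = (raw
        .dropna(axis=1, how="all")      # remove columns with only null values
        .dropna(axis=0)                 # remove incomplete observations
        .assign(is_error=lambda d: d["level"].eq("ERROR")))  # mold: add a label

print(tidy.shape)  # two complete rows, four columns remain
```

The `is_error` column is the kind of supervised label the next section relies on.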
Tidy data sets are prepared in a normalized form, where each variable is a column and each observation is a row; each type of observational unit forms a table. Once the data is prepared, it is passed through the data classifier. The data we use for training this algorithm is supervised data. That means the actual value we are trying to predict is already present in the data. So, with supervised learning, if we are trying to predict whether a particular log is an error log or not, that error scenario will be present in the data in a binary format, either true or false. The algorithm we are using for this data classifier is a classification algorithm, which classifies the problems into different buckets. This model works in correlation with the rules defined in the Action Center. Initially, all the predictions happen based on the rules defined in the Action Center. These rules are nothing but the recurring problems we were seeing in our system. Once the model is trained enough, it will start doing predictive learning and categorizing the outputs into different buckets. In the Action Center, the first module is rules and actions. This is the initial rule set, which has to be defined based on the problems we see in the system. It also defines the actions to be taken on such events. As and when new patterns are observed, it provides options to add new rules and actions; any rule or action can be added, modified, or deleted. The next module is the pattern matcher module, which works in correlation with the AI core.
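The data classifier described above can be sketched with a decision tree, the algorithm the talk says the project started with. The two numeric features used here, an error-line rate and a disk-usage percentage, are illustrative assumptions; the talk does not specify the real feature set.

```python
# Sketch of the data classifier: a decision tree trained on supervised
# (already labelled) observations, classifying new ones into buckets.
# Features and bucket names are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

# features: [error_lines_per_minute, disk_usage_percent]
X = [[0, 10], [1, 20], [50, 95], [40, 97], [2, 30], [60, 99]]
# labels: the bucket each training observation belongs to
y = ["healthy", "healthy", "disk_full", "disk_full", "healthy", "disk_full"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# classify a new observation that sits near the disk-full cluster
print(clf.predict([[45, 96]]))
```

The bucket name the classifier emits is what the pattern matcher then looks up against the Action Center's rule set.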
As and when the AI core engine finds the patterns that are occurring, this module matches the pattern against the rules defined in the Action Center module. Whenever there is a matching rule, there is a corresponding action written with that rule, and based on that action, an event can be triggered. That event is basically our remedial solution: it talks to the container ecosystem and performs the remedy steps. Our final module is the event trigger module. This is the module that holds all the scripts. Whenever a pattern is matched, identified by the pattern matcher, based on the action defined in the rule set, there are events stored in this event trigger module. The event trigger module can talk to the container ecosystem through a set of DevOps scripts, or they can be very simple user-defined scripts as well. These scripts are used as remedial measures, and finally, based on the predictive decisions that have been taken, these remedies can be applied to the container ecosystem.

Yeah, I think I wound up quite early. I don't have a demo, because I would not be able to accommodate a demo in these 40 minutes, so I just came with the modules. If you have any questions, I'm here, and my colleagues are also available over the online meeting, so they can also answer any questions raised here.

Do you have a classifier that detects different types of situations? And then how do the rules apply? Do they apply to classes?

Yeah, so the rules are defined by the user. Initially, the user knows what kinds of problems are happening within the system, and based on that, the rules are defined in the rule set. Now, when the classification algorithm classifies the problem into different buckets, the pattern matcher tries to match the problem with the rule set defined.
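The pattern matcher and event trigger working together, as described here, can be sketched as a lookup from classified log messages to remedial action names. The rule patterns and action names below are hypothetical examples of user-defined rules, not the project's real rule set.

```python
# Sketch: user-defined rules (pattern -> remedial action name) and the
# pattern matcher that applies them to a classified message.
# Patterns and action names are illustrative assumptions.
import re

RULES = {
    r"disk.?full|no space left": "clean_old_indexes",
    r"connection refused.*database": "restart_db_container",
}

def match_action(classified_message):
    """Return the remedial action for a classified message, if any."""
    for pattern, action in RULES.items():
        if re.search(pattern, classified_message, re.IGNORECASE):
            return action
    return None  # unmatched -> candidate for a new rule in the Action Center

print(match_action("Elasticsearch: no space left on device"))
```

In the full design, the returned action name would select a stored DevOps script in the event trigger module, which in turn acts on the container ecosystem.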
If it finds an actual rule that matches the classification bucket, and there is a corresponding action defined on it, it tries to invoke that event and solve the issue.

So the pattern matcher matches the data within the bucket?

No, it is a module that picks up the issue from the classification bucket and tries to match it with the rule set.

Because classes are a finite set, right? A finite number.

Sorry?

The classes that you obtain from the classifier are typically just a finite number. 12 or 13, or maybe 100. That's the output of the classifier. So how do the rules apply to these 12 classes?

No, the pattern matcher, based on that classification module... it won't be just a number in this case. That is the work of the pattern matcher, which actually takes from the classification bucket. I would have to go deeper into that, but we can meet and talk about this. Yeah, yeah.

Is the mic working? Yeah, okay. I have two questions. First question: did you do any measurements analyzing the number of false positives raised by the system?

The number of?

False positives raised by the system.

Yeah, so currently we are in a state where we are training our model, so I don't have the complete number as of now. We have some training data with which we are training the system, and we will get to know that in some time. Based on the current predictions after training the model, we will get to know what false predictions are happening on the data.

Okay, second question: do you also have a feedback loop from your development team feeding the patterns and the log messages back into the system?

Yes, we do have.

Okay. Okay, thanks. Thank you. Thanks, everyone. Thank you.

I have a quick question.
Do you have any example online, some code we can have a look at?

We have a GitHub repository, but it is not yet public.

Okay. And do you have a date for when it will be public?

Sorry?

A date for when it will be public. Do you know?

So this is not a full-blown project; it was started by a few of my colleagues. We can make it public, and it is in our hands when to make it public. As and when we make a good amount of progress and there is something working properly, we will make it public.

Since the code is not public yet, can you share some details as to what ML libraries and tools you use?

For data cleaning, we are using the Pandas library, and we are working with a decision tree algorithm for training the model.

Could you talk a bit about the tools and the software you're using to build the AI engine?

Yeah, so there are basically two modules at the core of the AI engine. The first is the data cleanup module; there we are using the Python Pandas library for most of the data cleanup, and we are storing the results in a NoSQL database like Cassandra or InfluxDB. For the AI model, we have started with a decision tree algorithm to make the classifications, but we are not yet sure whether we will achieve 100% with the decision tree; as and when needed, we can try Naive Bayes or other algorithms as well. But we started with the decision tree algorithm.

That confirms my understanding, because you are using a very old technology in decision trees. With modern learning approaches, you would build, with deep learning for example, an end-to-end system. It doesn't have to have two pieces, one for the classifier and one for the action. You do everything in one single system that learns both from the data and what the output should be, rather than two separate pieces which are completely different.
One piece is machine learning and the other is traditional pattern matching, which is very brittle, because then you have to go through regular expressions, and if you make a change and make a mistake, it gets very complicated. Modern techniques do everything at once: instead of using rules for pattern matching, you select the features from the data that are relevant, and the system learns directly what action should be applied, without going through these two stages.

So what I can say here is that this is not a project we have completed yet. We are working on it, and we are open to any kind of suggestions. I'm very happy to talk to you, and I'll talk to you after this session as well so that we can get more insights into that. Thank you, thank you everyone.