Hello everybody. Today I would like to share our experience of how to use machine learning algorithms in practice, integrated with Kubernetes, and which algorithms we applied. First of all, I'd like to introduce ourselves. My name is Igor Gustav Myazov, and I'm head of the integration department at Sberbank Technology. I'd also like to introduce my colleague, Maxim Chernovsky, who is a chief software development manager and who will provide some technical insight into the main topic of today's presentation.

So, let's start. In the beginning, I'd like to say a few words about our organization. We are a big financial and ecosystem organization, and our focus is the best client experience and technological leadership. Our company takes first place in financial services: we have a lot of retail and corporate clients and a significant market share in Russia. Last year we also made a great step toward developing non-financial services and our ecosystem. At the same time, last year we introduced a new IT platform with a focus on reliability, zero downtime, and applying machine learning algorithms across all services of the platform.

Why did we do it, and why do we keep doing it? Because we did a lot of work migrating our legacy systems to a cloud ecosystem, and consequently this produces a huge amount of application telemetry, which has to be collected, transferred, and processed in an appropriate way. Of course, we have some monitoring issues, because our IT landscape is rather large and complex, and we have common problems like alert and metric noise and the lack of a clear topology, that is, clear integration interactions and dependencies that we understand in real time. We also have issues connected to elasticity: inefficient allocation and provisioning of enough resources for our workloads, autoscaling issues, application start-time issues, and so on. And of course, for us the main focus is on real-time interaction, so the overall latency through a complex topology is also a problem for us.

If we look at this at the level of high-level concepts, I can say that we work with two types of telemetry. The basic one we collect from the Kubernetes layer, and at this level we are most interested in the performance metrics and alerts that we can collect from containers and then process in our models. The second level of telemetry we collect through the service mesh. We use this information to create a graph of individual workloads, to understand all parameters of workload interaction: duration, latency, amount of data transferred, errors, and so on.

At the first stage, all our monitoring data is collected in an intermediate metric storage, then this information is aggregated and put into long-term storage. We use the long-term storage to prepare datasets for training our machine learning models, and then feed this data to the engine of our machine learning operations system. So the system can be divided into two main parts: the first part is the preparation and evaluation of models, and the second part is the real-time execution of the trained models. Speaking of technologies, we use Istio as the service mesh to collect service mesh metrics and Prometheus as the data layer. But I'd like to note that, technologically, we could use any other service mesh implementation and any other time-series database to collect data for the models. Great.
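As a minimal sketch of this data layer, assuming a hypothetical Prometheus-compatible endpoint and an illustrative query and time range, the training datasets can be pulled from the long-term metric storage through the standard HTTP API:

```python
# A minimal sketch of pulling telemetry from the long-term metric storage to
# build a training dataset. The endpoint URL, query, time range, and step are
# hypothetical; any Prometheus-compatible API would work the same way.
import requests

PROM_URL = "http://prometheus.example.local:9090"  # hypothetical endpoint

def fetch_range(query, start, end, step="30s"):
    """Query a range of samples from the Prometheus HTTP API."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": query, "start": start, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Per-edge request rate from the service mesh, grouped by source and destination.
series = fetch_range(
    'sum(rate(istio_requests_total[5m])) by (source_workload, destination_workload)',
    start="2021-05-01T00:00:00Z", end="2021-05-01T06:00:00Z",
)
for s in series:
    print(s["metric"], len(s["values"]), "samples")
```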
Thank you, Igor, for the awesome intro and the high-level explanation of the system. So, we have figured out the conceptual approach and technologies; let's talk about specific tasks. We have our own private cloud at Sber, in which we host many Kubernetes clusters. In general, we can say that each of these clusters is something like an on-premise hosted cluster, and each of them can have several hundred nodes. For launching pods we use the standard scheduler. Workloads are completely independent in terms of deployment and can belong to different teams. So here is an assumption: we can reduce network resource consumption, as well as optimize overall latency, by combining workloads into what is known as a scheduling group. Let's see how it can be done.

As I said, we use Istio as a service mesh, so I will describe all the methods using the example of the well-known Bookinfo application. I'm sure many of you are familiar with it. Very briefly, it is a simple application with six interacting services in total, and of course the Istio service mesh is used to collect their metrics and telemetry. Using the istio_requests_total metric, we built a directed call graph for our application. The vertices of this graph are the interacting services; I use the service names S1 through S6 for simplicity, and they correspond to the services of the Bookinfo application. The edges of the graph are the observed network interactions, and this is also quite simple. Now you can notice that the graph is weighted, and yes, you are right: as an example, I added the average latency of request processing as the weight, with values taken from the istio_request_duration_milliseconds metric. This metric is chosen because in the example I want to group applications and minimize network latency; in general, any metric from your service mesh can be used as an edge weight.

As a result, we get a simple data array, which is shown on this slide. The rows of the table correspond to the edges of our graph, and the columns correspond to the vertices and weights. If we sort this data set in descending order of edge weights, we can group our vertices in one pass. For the first pair of services the delay is the maximum, so the first group is created and the two services are added to it. Next, the second pair of services: since S5 is already in group one, S3 is added to it. Next, S1 and S6 create a new group, since no related services have been found yet. And here there is an important point: the size of the groups acts as a parameter of this method, and this allows us to limit the connectivity of services and not combine the entire chain into one group. In this case the size is three, so when we process edges four and five nothing happens: the services are already allocated, and since the limit for group one has been reached, the first and second groups will not merge. At the end of the method, service S2 is assigned to the second group.

And here the work is done, so let's see what we got. We have two groups of Bookinfo services in our cluster, the yellow and the green team. The productpage, reviews version one, and details services are included in group one, and the reviews version two and three and ratings services fell into the second group. Great, the approach works, so let's see what we can do with it.
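The grouping pass just described can be sketched in a few lines of Python; the edge list, the latency weights, and the group size below are illustrative stand-ins rather than the real Bookinfo values:

```python
# A sketch of the grouping pass: sort edges by weight (descending) and greedily
# assign services to groups with a maximum size. The edge weights below are
# hypothetical latencies, not the real Bookinfo values.

def group_services(edges, max_group_size=3):
    """edges: list of (src, dst, weight). Returns {service: group_id}."""
    groups = {}        # group_id -> set of services
    assignment = {}    # service -> group_id
    next_group = 1

    for src, dst, _ in sorted(edges, key=lambda e: e[2], reverse=True):
        g_src, g_dst = assignment.get(src), assignment.get(dst)
        if g_src is None and g_dst is None:
            # Neither service is allocated yet: open a new group.
            groups[next_group] = {src, dst}
            assignment[src] = assignment[dst] = next_group
            next_group += 1
        elif g_src is not None and g_dst is None and len(groups[g_src]) < max_group_size:
            groups[g_src].add(dst)
            assignment[dst] = g_src
        elif g_dst is not None and g_src is None and len(groups[g_dst]) < max_group_size:
            groups[g_dst].add(src)
            assignment[src] = g_dst
        # If both ends are already allocated, nothing happens: groups never merge.

    return assignment

edges = [("S4", "S5", 120), ("S5", "S3", 90), ("S1", "S6", 70),
         ("S6", "S4", 40), ("S6", "S5", 30), ("S2", "S1", 20)]
print(group_services(edges))  # -> {'S4': 1, 'S5': 1, 'S3': 1, 'S1': 2, 'S6': 2, 'S2': 2}
```

The max_group_size parameter plays the same role as the size limit in the walkthrough: it keeps the whole call chain from collapsing into a single group.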
In the simplest case, we can use node selectors to place a group of applications in a specific zone of the cluster that provides the most comfortable conditions for the grouping criteria we have chosen; for example, a zone where the network between hosts is very fast. We can also use the Service Topology feature to localize traffic within the selected zones, but I have to say that this mechanism is now deprecated, and it is more correct to use Topology Aware Hints here. Additionally, we can use pod affinity rules and place replicas of intensively connected applications on the same node to ensure maximum performance. However, we have to be careful here, because pod affinity can lead to scheduling performance issues in large clusters, and we should not forget about fault tolerance and the other requirements of HA and DR processes. Finally, you can use this grouping mechanism for, I would say, smarter topology spread constraints. As for us at Sber, there are a large number of independent applications in our clusters that can interact with each other, so we use collocation in a recommendation mode: it automatically investigates the relationships between workloads in the clusters, and then we manually use this information to mark up the cluster infrastructure and configure the cluster scheduler.

Okay, that's all about workload collocation; let's talk about anomaly detection. We have a large set of metrics here, and these metrics can be analyzed to find anomalies in them. And here is an assumption: we can recognize abnormal values in the system and determine the root cause of these anomalies through the correlation of network and system metrics. Let's see how it can be done. I will show the Bookinfo application again, but with one addition: there is one more ratings service, which I added here to build a more complex example.

Let's define an anomaly as a point that is sufficiently distant from the main density of points, so that the probability of such an event can be considered very small. In this case, the problem of detecting anomalies can be reduced to training a model on historical data, predicting the next value of the time series, and classifying a point as anomalous depending on whether the real value of the time series fits into the predicted confidence interval. Here we can rely on different models, for example a Gaussian process, ARIMA, a moving average, and so on. We use a model based on a clustering mechanism using DBSCAN; this model shows good results, but it turns out to be sensitive to data normalization, and you have to keep that in mind. The results of the models in terms of average time are presented in the table on this slide.

Okay, now we know how to automatically find anomalies in metrics; let's see what we can do with this information. For KubeCon, we checked the MicroRCA algorithm on the example of our favorite Bookinfo application, so let's see what came of it. The anomaly detector notices response time deviations in requests, and the root cause detection method can start with this data. The first thing we need to do in this approach is to build a directed application graph, where the nodes are services and the arrows between them are the corresponding requests from one service to another; we have already seen it in the previous task. From the complete graph of the application, a subgraph is extracted which contains the abnormal nodes plus the adjacent normal ones.
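Before moving on to that subgraph, here is a minimal sketch of the "predict the next value and check the confidence interval" idea mentioned a moment ago. It uses a plain moving-average predictor rather than our DBSCAN-based model, and the window size, the three-sigma interval, and the synthetic latency series are illustrative assumptions:

```python
# A toy version of the prediction-based detector: predict the next value with a
# moving average and flag points that fall outside a 3-sigma confidence interval.
# The window size, interval width, and synthetic series are assumptions.
import numpy as np

def detect_anomalies(series, window=20, n_sigma=3.0):
    """Return the indices of points outside the predicted confidence interval."""
    anomalies = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        predicted = history.mean()          # next-value prediction
        sigma = history.std() + 1e-9        # avoid a zero-width interval
        if abs(series[t] - predicted) > n_sigma * sigma:
            anomalies.append(t)
    return anomalies

# Synthetic request-duration series (ms) with one injected spike.
rng = np.random.default_rng(0)
latency = rng.normal(100.0, 5.0, 300)
latency[200] = 250.0
print(detect_anomalies(latency))            # expected to include index 200
```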
Here we've got an anomalous subgraph, so let's do a few steps on it. First, we calculate the weights of the edges as the correlation between the response time and the average anomalous response time. Next, we calculate the anomaly scores of the nodes as the product of the average node weight and the maximum correlation with a utilization metric. And then we can use the personalized PageRank algorithm to determine the probable cause of the anomaly. This method produces a ranked list where, in first place as the root cause of the anomaly, we find the reviews version two service. The utilization metric most correlated with the anomalous response time is the rate of container_memory_working_set_bytes, which reflects the essence of this anomaly: a heap overflow.

So, we looked at anomaly detection methods and an algorithm for finding root causes. How can this information be useful? First, I want to speak about alerting. In modern enterprise systems there are thousands of alerting rules, and a single problem can cause dozens of notifications, while anomaly detection responds automatically to suspicious events in the infrastructure. Methods of finding root causes can significantly reduce the flow of notifications, which is a perfect solution for system administration. It is also important to note that the anomaly detector can be applied to establish monitoring thresholds: since threshold values are most often selected individually based on experimental data, anomaly detection allows you to automate this process. And here is the third important aspect: based on the automatic detection of anomalies, we can form flexible rules for automatic traffic management when we make changes in our infrastructure, for example when we perform releases and so on. As for us at Sber, we use the anomaly and root cause detectors in the process of chaos engineering: we achieve a deep understanding of various aspects of our systems and then use this information to manually configure our monitoring solutions. Unfortunately, the concrete root cause analysis models in our infrastructure are not open sourced yet, so I will not cover them in this talk, but maybe in the future we will talk about them.

And last but not least, predictive autoscaling. When migrating to Kubernetes, many of our colleagues, especially those who write code in Java, faced the problem of slow application startup. Given that elasticity of applications in our environment is achieved through automatic scaling, the slow start of an application became an unpleasant problem that we really had to fight. The question arose: what if we try to scale the application not reactively but proactively, always having the necessary number of replicas at hand? We have an impressive set of container metrics here, so that is just what we can try. The method is designed to predict the required number of service pods. Two methods are used to predict the number of pods in a service: the standard HPA and the predictive autoscaler. As input, the method uses the time series of a specific metric value, for example CPU, as well as the pod resource limit and the desired ratio of resource utilization. The calculation of the number of pods in the HPA is performed in a very simple manner: it is necessary to divide the total demand for the resource by the limit of a single pod, taking the ratio into account. A limitation is also imposed on the minimum and maximum number of pods.
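Coming back to the root cause step for a moment: the ranking can be sketched with the personalized PageRank implementation in networkx. The anomalous subgraph, the correlation-based edge weights, and the node anomaly scores below are toy values rather than our production data, and this is a sketch of the general idea, not the exact MicroRCA procedure:

```python
# A toy anomalous subgraph of Bookinfo: edges go from caller to callee and
# carry correlation-based weights; node scores stand in for the anomaly scores.
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([
    ("productpage", "reviews-v2", 0.9),   # strongly correlated anomalous call
    ("reviews-v2", "ratings", 0.1),       # weakly correlated downstream call
    ("productpage", "details", 0.1),      # adjacent normal service
])

# Node anomaly scores (average edge weight times the maximum correlation of the
# node's utilization metrics with the anomalous response time); toy values.
anomaly_score = {"productpage": 0.3, "reviews-v2": 0.95,
                 "ratings": 0.05, "details": 0.05}

# Personalized PageRank biased towards the most anomalous nodes.
rank = nx.pagerank(G, alpha=0.85, personalization=anomaly_score, weight="weight")
for service, score in sorted(rank.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{service}: {score:.3f}")
# With these toy numbers, reviews-v2 comes out on top of the ranked list.
```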
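And here is a minimal sketch of that replica calculation. It also shows the max-of-two-predictions idea that comes up in a moment; the metric values, pod limit, target ratio, and the gradient-boosting prediction are all hypothetical placeholders:

```python
# A sketch of the replica calculation. desired_replicas mirrors the HPA-style
# formula above; predicted_demand shows the max-of-two-predictions idea used by
# the predictive autoscaler. All numbers here are illustrative placeholders.
import math

def desired_replicas(total_demand, pod_limit, target_ratio, min_pods, max_pods):
    """Divide the total demand by the per-pod limit, taking the target
    utilization ratio into account, then clamp to the min/max pod count."""
    replicas = math.ceil(total_demand / (pod_limit * target_ratio))
    return max(min_pods, min(max_pods, replicas))

def moving_average(series, window=20):
    """Slow predictor: dampens oscillations, reacts to drops with a delay."""
    tail = series[-window:]
    return sum(tail) / len(tail)

def predicted_demand(series, fast_prediction, window=20):
    """Take the maximum of the fast (e.g. CatBoost) and slow (moving average)
    predictions, so we scale up quickly and scale down slowly."""
    return max(fast_prediction, moving_average(series, window))

# Total CPU demand of the service in millicores over the recent samples,
# plus a hypothetical prediction from the gradient boosting model.
cpu_history = [400, 420, 480, 900, 950, 930, 600, 580]
fast_prediction = 990
demand = predicted_demand(cpu_history, fast_prediction)
print(desired_replicas(demand, pod_limit=500, target_ratio=0.7,
                       min_pods=2, max_pods=10))   # -> 3 replicas
```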
The predictive autoscaler calculates the number of pods in the same way, but instead of the last actual value of the metric, the predicted value is used. For training and prediction, the time series data is supplemented with some features: lags for the last 20 minutes, as well as the hour and the day, are used as features, but this list is not exhaustive and can be supplemented. On the slide is an example of the data transformation and feature generation. The predictive autoscaler uses two models to predict values: a CatBoost model (CBM) and a moving average with a window of 20 values. CatBoost is a gradient boosting model over decision trees, and it allows us to track load growth with a small time delay. Of the two values obtained, the maximum is selected.

The CBM model allows us to quickly respond to an increase in load, while the moving average in turn partially dampens oscillations and does not allow the replica count to be lowered too quickly, which is important with a sharply variable load. The moving average responds to load growth with a long delay, which makes it impossible to use as the main prediction model; however, its response to load reduction is also slow, which is very good and which is used here. The CBM model can predict the value with a small delay, which is an advantage when detecting increasing load, but it reacts at the same speed to a decrease, which can be harmful with a sharply variable load; it can be seen from the graph that the model reacted to the decrease at the moment when the load began to grow again. So we combine the two models, and this allows us to quickly respond to an increase in load and at the same time slowly respond to a decrease, which is very effective for the predictive autoscaler.

I suppose that's all about the concrete features of the MLOps pipelines in our environment, and now we can talk about our future plans. Igor, it's your turn. Unfortunately, we don't hear you. You are on mute.

Thank you, Maxim, thank you for your brilliant speech. As Maxim said, we use our machine learning cluster management now in a recommendation mode, and of course, after collecting a sufficient amount of statistics, we are going to switch it to operation mode and implement all our features in a real-time manner. Second, in the future we would like to move from evaluating the efficiency of our models by experts to developing some type of feedback channel from the model runtime, to improve the efficiency of our models and the speed at which we can train them in the production zone. And the third point, which is directly related to the second: in addition to continuously assessing our models, we'd like to create a continuous delivery pipeline for them, so that we can deliver new features quickly, bring actual training sets and actual statistics from the runtime to our developers, and make the model lifecycle more efficient. I suppose that's all for today. Thank you for your attendance, and I think that in the future we will provide some additional information about how we developed our machine learning cluster management. Thank you.

Yeah, thank you guys. Thank you all for joining, and now we are ready for your questions. Thank you.