Maybe this topic has already appeared many times in presentations given by different speakers. Kubeflow helps us make the deployment of machine learning workflows on Kubernetes simple, portable, and scalable. From the name, you can tell that at the very beginning it was built around TensorFlow. If you attended the previous session of this forum, TensorFlow and Kubeflow have already been elaborated there. With TensorFlow alone, running a machine learning job is complicated. First, I need to understand, for example, that there are 10 machines and 10 nodes; I need to know their roles, I need to know their resources, and when I submit the job I need to specify which node is the PS (parameter server) node and which one is the worker node. With Kubeflow, it is just like the figure shown on the right-hand side: it hides the complexity of the environment. For example, in model serving as well as in monitoring, it provides different kinds of plug-in tools to hide this complexity, which lets people focus on the key machine learning code. In our presentation today we use TensorFlow as our framework, and every day we collect a large number of traces, about 700,000 per day.

Next, I would like to share with you the architecture behind our topic. On the left-hand side, in blue, is the Kubernetes cluster. The dashed box is a namespace, which holds the microservices managed by Istio; each pod has an Envoy sidecar. Below that we have the istio-system namespace, and we also have the jaeger and elastic namespaces. We don't use the Jaeger that Istio deploys by default, because it is an all-in-one instance: it cannot store a lot of traces and cannot support much parallelism. That's why we need to introduce an architecture that supports a large degree of parallelism. With Istio's Envoy sidecars we collect the traces, and the traces are sent to the Jaeger cluster. Jaeger does not use its own data store; it supports Cassandra and Elasticsearch as backends. We chose Elasticsearch, so the data lands in Elasticsearch, where it is available as a storage service for the rest of the pipeline.

The main part is the right-hand side, which is the machine learning. First comes data preparation: during this step, the data is pulled from ES and then flattened, because the traces we collect can be nested in different layers. Here we select only the fields we pay attention to, generate a flat document, and write it out; a minimal sketch of this step follows below. All of this data is consumed by Kubeflow and trained into a model over many iterations. The model is stored in object storage, and after that it is served by Kubeflow, running on the Kubernetes cluster. This is the overview of the architecture. Now I would like to give the floor to my co-worker, Mr. Zhang Muntao.

Thank you, Ms. Yang. Next, I would like to share with you the machine learning in our case. As just introduced, a trace carries a lot of information: among other fields, the operation, the duration, and so on. From each trace we know which service it comes from and where it goes: the source and destination IP addresses and service ports, and more. All of this is stored in the trace and can be reflected onto the service mesh.
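A minimal sketch of the data-preparation step described above, assuming the traces were written by Jaeger into an Elasticsearch index such as "jaeger-span-*". The field names (operationName, duration, startTimeMillis, process.serviceName) follow the usual Jaeger span schema; the endpoint, time window, and selected fields are illustrative assumptions, not the speakers' exact code.

```python
# Sketch: pull recent Jaeger spans from Elasticsearch and flatten them,
# keeping only the fields we care about for training.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")  # hypothetical endpoint

resp = es.search(
    index="jaeger-span-*",              # assumed Jaeger index pattern
    size=1000,
    query={"range": {"startTimeMillis": {"gte": "now-30m"}}},
)

rows = []
for hit in resp["hits"]["hits"]:
    span = hit["_source"]
    rows.append({
        "service": span["process"]["serviceName"],
        "operation": span["operationName"],
        "start": span["startTimeMillis"],   # epoch milliseconds
        "duration_us": span["duration"],    # microseconds
    })
```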
And then we know how healthy the microservices are. How can we use this data efficiently to mine the patterns in it? As we just told you, the data is time-series data. To keep it simple, we select the duration as the target of the machine learning.

So let's look at the first step, EDA. EDA (exploratory data analysis) is a concept from statistics and data mining, which means that in order to process the data, you first need to perceive the data, getting to know its features at different granularities and in different dimensions. Once we have a perception of the data, we can decide what kind of model to choose to process it. In this example, we take the service duration as the object of analysis, and we observe it over different time ranges and time frames, because in different time frames the features the data shows are quite different.

Let's look at the diagram here. Each frame lasts about 30 minutes, and we can see that within this 30-minute window the duration fluctuates back and forth, so the request time keeps fluctuating, and we can tell it is related to the total number of requests. We don't need to analyze the causes of the fluctuation here. Now let's look at another dimension, the last 12 hours. Over the last 12 hours the curve has flattened: within a 30-minute range the fluctuation of the duration is really violent, but when we expand the range to 12 hours, it still fluctuates and still shows its features, just not as violently as before. If we further widen the time frame to the last 7 days, from June 14th to June 20th, the short-term characteristics become less obvious, but after June 17th there was an abrupt change: the duration increased significantly. The last diagram shows the total requests over the last 7 days, and we can see a periodic fluctuation. What is the reason behind it? There are different numbers of requests in the daytime and at night: in the daytime the workload is bigger and there are more requests, and at night there are few requests, which is what the chart shows. A small resampling sketch that reproduces these views follows below.

Now we have an initial perception of the data for machine learning, so we are going to process it with a model. As I said, the duration is time-series data, so we pick a model suitable for time-series treatment; here we use LSTM. First of all, LSTM is based on the traditional RNN, and an RNN can share or transmit characteristics along the time axis: if there is a periodic fluctuation, for example, this fluctuation can be transmitted, and the network can learn that characteristic. But the traditional RNN has its own challenge: as the sequence gets long, the gradients may vanish. That is where LSTM comes into play, because it has three gates for control, which help decide whether a feature should be carried forward or forgotten. Therefore, we use LSTM to deal with this data. It is just like using LSTM to process language: the model generates a new sequence based on existing samples, the way a language model predicts the next word from the words that came before. In our case, based on the previous samples, we predict the upcoming duration; a minimal model sketch also follows below. This is the result of our model training: on the left is the dataset, on the top right is the simulation of the model, and on the bottom right is the diagram that verifies our trained model.
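A small EDA sketch for the "different time frames" observation above: the same duration series is resampled at several granularities, assuming the `rows` list from the preparation sketch earlier (timestamps in milliseconds, durations in microseconds). The exact bin sizes are illustrative.

```python
# Sketch: view the same duration series at 30-minute, 12-hour, and
# 7-day granularities; the wider the frame, the flatter the curve looks.
import pandas as pd

df = pd.DataFrame(rows)
df["ts"] = pd.to_datetime(df["start"], unit="ms")
series = df.set_index("ts")["duration_us"].sort_index()

print(series.resample("1min").mean().tail(30))      # ~30-minute view
print(series.resample("10min").mean().tail(72))     # ~12-hour view
print(series.resample("1h").mean().tail(24 * 7))    # ~7-day view
```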
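A minimal LSTM sketch for the duration sequence, assuming a 1-D numpy array `durations` (for example, the per-minute means above, scaled to [0, 1]). The window size, layer width, and training settings are illustrative assumptions, not the speakers' actual hyperparameters.

```python
# Sketch: train an LSTM to predict the next duration from a sliding
# window of the previous WINDOW durations.
import numpy as np
from tensorflow import keras

WINDOW = 10  # predict the next point from the previous 10 (assumed)

def make_windows(seq, window):
    # Build (samples, window) inputs and next-point targets.
    X = np.array([seq[i:i + window] for i in range(len(seq) - window)])
    y = np.array(seq[window:])
    return X[..., None], y  # add a feature dimension for the LSTM

X, y = make_windows(durations, WINDOW)

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(WINDOW, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=32)
```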
Here is the question: in our case, in terms of tracing, the machine learning uses traces for anomaly detection. That means that through time-sequence prediction, based on the previous time sequence, I can generate a future time sequence. I then compare this estimated sequence with the actual one, and if the deviation exceeds a tolerance, I can tell whether a point is an anomaly or not. On the bottom right, there are 990 data points in the test dataset, as indicated in this diagram, and 196 of them are anomalies, indicated by the red points. So how do we tell whether a point is an anomaly or not? We used a specific algorithm combined with our own experience; a sketch of the comparison step follows below.

On this slide, I want to show you that when it comes to observing the data, if we use different time frames, the features or characteristics are different. The values are the same, the numbers are the same, but the characteristics they display are different. It is like the old story in China about the blind men touching an elephant: when one reaches the elephant's leg, he says it's a pipe, and when another touches the elephant's body, he says it's a wall. So what is the elephant? The elephant is the more abstract idea, and we are trying to get at these abstract ideas. To do so, we need to observe the object in a bigger picture, and the same goes for our model. Let's go back to this slide: if we view the data in a short time frame, you will see huge fluctuations in duration, which is unacceptable and will be flagged as an anomaly; but if we extend the time frame, the fluctuation is still there, yet it looks less abrupt, and the model will classify it as a normal phenomenon, because it has been trained over a longer span and has learned to master this temporal characteristic. So we want the model to capture this anomaly in the short time frame, while on the other hand we do not want the model to mistake normal long-term behavior for an anomaly.

After we train the model, we can't just put it aside; the purpose is to put it to use. So we release the model as an API, and then applications can call this model. In this case, we also released the model: it uses 10 points as a time reference, and with these 10 points we can estimate the future time sequence, or time window, and estimate its duration. In terms of prediction, there are two approaches. One is the precise approach: for example, I feed the real-time duration into the estimation. The other relies on a complete sequence: I estimate the future time sequence based on the previously predicted sequence. The first approach is more accurate, and the sequence-based approach is less accurate; however, if you don't need minute-by-minute real-time accuracy and just want to know the likely trend, the sequence-based approach is better. Both are sketched below.

Okay, coming back to our topic: what do we want to do? First, we want to detect anomalies. When I use 30 minutes or 1 hour of data as the sample input, I get a small-time-frame model, which we call model 1. With this model, I can estimate what happened in the last 30 minutes and whether the ending of those 30 minutes is abnormal. But we can also extend the time frame to get a bigger model. This model helps us identify the characteristics of a long period, and it helps us with scaling. For example, in the previous instance, there was an abrupt change over the last seven days.
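A sketch of the comparison step described above, reusing `model`, `X`, and `y` from the LSTM sketch earlier: predict each point from the previous window and flag it when the deviation exceeds a tolerance. The threshold rule here (mean plus three standard deviations of the residuals) is one common choice and an assumption on my part; the talk only says a specific algorithm plus experience was used.

```python
# Sketch: flag points whose prediction error exceeds a tolerance.
import numpy as np

pred = model.predict(X).ravel()
residual = np.abs(pred - y)

threshold = residual.mean() + 3 * residual.std()  # assumed rule
anomalies = np.where(residual > threshold)[0]
print(f"{len(anomalies)} anomalous points out of {len(y)}")
```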
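The two forecasting approaches mentioned above, sketched side by side under the same assumptions (`model` and `WINDOW` from the LSTM sketch). The first slides the window over real, freshly collected points, so each prediction is one step ahead and more accurate; the second feeds its own predictions back in, trading accuracy for a longer trend forecast.

```python
import numpy as np

def predict_from_real(model, recent):
    # Approach 1: `recent` holds the latest WINDOW real duration values.
    return float(model.predict(np.array(recent)[None, :, None])[0, 0])

def predict_sequence(model, seed, steps):
    # Approach 2: start from WINDOW real values, then feed each
    # prediction back into the window to forecast a whole sequence.
    window = list(seed)
    out = []
    for _ in range(steps):
        nxt = float(
            model.predict(np.array(window[-WINDOW:])[None, :, None])[0, 0]
        )
        out.append(nxt)
        window.append(nxt)
    return out
```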
That would indicate that the total requests may reach a peak, and as a result the duration shows frequent anomalies. If we can use the model to predict that over a long duration there may be abnormal characteristics, then we can use some auto-scaling mechanism. Take the ingress as an example: when there is a long delay, we can use that latency as a metric, and in the previous instance we can also get the memory usage of the ingress. If these two signals are consistent with the duration data collected here and the duration rises suddenly, I can scale it horizontally, which expands its serving capacity. After the duration peak, the requests should decline, and then we can scale the ingress down in order to save resources. A hedged sketch of this decision follows below.

In terms of model training and deployment, we use a CronJob in Kubernetes to retrain the model regularly. With a bigger time frame, we need a larger dataset, which requires longer training; but for a small model like the 30-minute model, the dataset is smaller and the training can be shorter, which means we can iterate the model quickly and launch it. In our example, we use Istio, and Istio helps us publish different versions easily.

Our presentation is about using the traces in Jaeger, training a model with Kubeflow, and identifying the anomalies in our microservices. After an anomaly is detected, we can notify the operations team for intervention, and we also help them do some auto-scaling so that the microservices can serve more traffic. This is what we wanted to share. If you have any questions, please raise your hand and use the microphone.

When you do the anomaly detection, you mentioned classification. I get your question: how does the model distinguish anomalies from recurring characteristics? Here, in this time frame, it fluctuates, it keeps coming back and forth. If I just used the long-term model, it would treat these values as normal, because it predicts that these characteristics will recur. In this case, I would use a clustering algorithm to identify the points.

My second question is: for those abnormal points, when you detect them, or even before, why don't you use some engineering work to identify them? Have you ever tried traditional machine learning, like feature engineering, to identify and characterize them first? Wouldn't that be better? Maybe we will think about it; we will give it a try.

The third question: going forward, will you use this together with a scheduler? Yes. Because this will involve distributed model training at a bigger scale, a single pod or a single job training the data can be problematic. Therefore, when we have a lot of jobs submitted, we want to use a distributed approach. The traditional scheduler faces some challenges when it comes to parallelism, but later we will consider using kube-batch as the scheduler. My purpose in asking this question is that, if you try it, I would like to know how the scheduler affects usability. When we deploy, it takes time, and the time cannot be fixed; it depends on the load and the network. But for the scheduler, I hope jobs can be scheduled as soon as possible, which requires good performance. I would like to know whether there is an imbalance between these two that would make it unusable.

When you deploy the model, can you still use TF Serving and other kinds of model optimization methods? Well, so far we haven't encountered that kind of problem.

All right, thank you. I have a question.
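A hedged sketch of the scaling idea described above: if the long-window model forecasts a sustained rise in duration (and the ingress latency and memory signals agree), scale the ingress deployment up, and scale it back down once the forecast declines. The deployment name, namespace, and thresholds are all hypothetical; the talk does not specify the exact mechanism.

```python
# Sketch: adjust ingress replicas based on a duration forecast.
from kubernetes import client, config

config.load_incluster_config()
apps = client.AppsV1Api()

def rescale_ingress(forecast, up_threshold, down_threshold):
    # "istio-ingressgateway" / "istio-system" are assumed names.
    dep = apps.read_namespaced_deployment(
        "istio-ingressgateway", "istio-system"
    )
    replicas = dep.spec.replicas
    if max(forecast) > up_threshold:
        replicas += 1           # expand capacity ahead of the peak
    elif max(forecast) < down_threshold and replicas > 1:
        replicas -= 1           # release resources after the peak
    apps.patch_namespaced_deployment(
        "istio-ingressgateway", "istio-system",
        {"spec": {"replicas": replicas}},
    )
```

Using the long-window forecast here, rather than the volatile short-window one, is what the speakers later describe as the way to keep auto-scaling from oscillating.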
How can I ensure that what I train is the one I would like to use in practice? In other words, is the trained model only applicable to one service, or is it general? Well, what we train is tied to a specific service; you can understand it the way models are used in the recommender-system field. Okay, so after I train it, I can only use it on one service, right? Yes, that's right; it is not a general model.

And in the end it will do the auto-scaling, right? Yes. I would like to know the timing of the auto-scaling: when it decides to scale up, when it decides to scale down, and what the criteria are. Or, if I want to scale up but the process is really slow: for example, I actually need to handle 80 requests, but I can only support 50. If it scales up from 50 to 80 but the capacity still falls short, what would happen and what would you do? Well, in our case we don't have such a large number of requests. Yes, this kind of problem may happen, but we won't scale between extremes as you describe.

Is there any possibility that when you do the auto-scaling there are some oscillations? Yes, oscillation can happen. So how can you ensure that this kind of oscillation is avoided? Well, when I scale up, the number of replicas increases, so the processing capacity increases, and in this way the duration is reduced, right? Yes, that's right. That's why we have to use at least two models, and for the auto-scaling I use the long-term model, so it won't react to the violent short-term fluctuation and then fail. So another model is responsible for the judging, not for the same service, right? No. So this one is for compensation, right? Yes, this one is for compensation.

All right, due to the time limitation, thank you very much for your attention. If you still have any questions, you are welcome to talk to us afterwards.