Okay, we're going to start now. Our topic today is 1-5-10, which is a kind of standard for us. The background is that we are entering the era of cloud computing, the scale of containerized applications keeps expanding, and so failures are becoming more and more frequent; there are many causes, such as software mistakes and hardware issues, that can lead to container failures. So is there a solution that helps us recover from container failures quickly, improve stability, and identify and fix failures without adding too much cost or too many challenges? We operate containers at scale, and by analyzing failure cases we proposed the idea of 1-5-10.

There are four parts I'm going to talk about. First, the definition of 1-5-10 — if you don't know what 1-5-10 is, I'll explain it. Second, a real failure case: we'll review its timeline, what operations we performed, and whether we reached our goals. The third and fourth parts are about the measures — in terms of process, tools, and technology — that help us reach the 1-5-10 goal, and we work in two directions: offline and online.

So what is the definition of 1-5-10? These may just look like three numbers. "1" means we need to detect a failure within one minute — when dealing with problems, finding them at an early stage is extremely helpful. "5" means we need to identify the problem within five minutes — only by identifying the problem do we know how to fix it. "10" means we need to recover from the failure within ten minutes. So: detect within one minute, diagnose within five minutes, recover within ten minutes.

These goals are based on our historical analysis: some of our failures already met 1-5-10, but others exceeded it, so we consider 1-5-10 a reasonably good standard for failure recovery. Once this goal is set, it helps our different teams improve their failure-handling capability, shorten the duration of failures, and reduce their impact, so that in the end the whole platform has few or no major failures.
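As a rough illustration of how these targets can be checked against a real incident, here is a minimal sketch — not from the talk, with hypothetical timestamps and field names — that computes the detection, diagnosis, and recovery durations and compares them with 1-5-10.

```python
from datetime import datetime, timedelta

# Hypothetical incident timestamps; in practice these come from the incident record.
incident = {
    "occurred":   datetime(2020, 3, 17, 15, 10),  # failure starts
    "detected":   datetime(2020, 3, 17, 15, 11),  # alarm fires
    "identified": datetime(2020, 3, 17, 15, 15),  # root cause located
    "recovered":  datetime(2020, 3, 17, 15, 20),  # service restored
}

# The 1-5-10 targets, measured from the moment the failure occurred.
targets = {
    "detected":   timedelta(minutes=1),
    "identified": timedelta(minutes=5),
    "recovered":  timedelta(minutes=10),
}

for phase, limit in targets.items():
    elapsed = incident[phase] - incident["occurred"]
    status = "OK" if elapsed <= limit else "MISSED"
    print(f"{phase:<10} {elapsed} (target {limit}): {status}")
```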
Now let's look at one failure: what mistakes did we make, and why didn't we achieve 1-5-10? This is the incident timeline, with the unrelated details removed. An agent running on the host machines was being released, and the release caused problems. We released the first batch at 3:10 p.m. — releases are delivered in batches because a release needs to be a gradual process; if you push everything at once, the consequences cannot be controlled. At 4:30 p.m. we released the second batch. At 4:55 p.m. the new feature of the newly released agent took effect and we received an alarm, and the third batch had also gone out; some calls were already affected, and the release was stopped at that point. I won't go through the remaining events; let me just tell you about our analysis.

Before analyzing the timeline of this failure, I want to introduce a very good concept: the postmortem. It's a good idea from Google's SRE practice — we need to learn from failures and take lessons from them, so we can avoid making similar mistakes next time. Using this concept to look at the problem: we released at 3:10 in the afternoon, but we only received the alarm at 5:50, at 6 we still hadn't identified the problem, and only at 7 p.m. did we recover from the failure by rolling back. As the underlying infrastructure, we need to be very cautious in every operation, because a small problem in the underlying layer can trigger big problems at the top. I think you may have had similar feelings: we must be especially careful with any operation. For an infrastructure team, stability is the most important element; your features, your performance, and how much business value you support all come afterwards — without the foundation of stability, the other values don't exist.

Looking at this failure — a very typical problem caused by a release — you can see that we did not achieve 1-5-10: it happened at 3:10 in the afternoon and was only resolved at 7 p.m. So what have we learned, what problems were exposed, and how do we fix them?

First, some of our staff were not sensitive to potential failures: even though they received alarms, they did not deal with them at the right time. Second, we lacked risk awareness: a new feature that had not been fully validated went directly online, which is a big problem. Third, our progressive deployment was not good enough: we could not quickly tell whether the problem was caused by this release, so we need to be able to find the root cause. Fourth, we lacked observability: we released at 3:10 and only got the alarm around 5 o'clock, which is really not good — if you want to achieve 1-5-10, the alarm should surface the problem within a few minutes. Fifth, we lacked measures to stop the loss in time: we need to contain problems quickly, and we were a long way from the 1-5-10 goal.

How can we improve? In two directions: offline and online. On the offline side: first, we cannot avoid having failures, but we can identify our weaknesses through them, learn from them, and avoid making similar mistakes again — this requires a review after every failure. Second, how can we solve these issues before they actually happen? This requires us to treat stability as a very important factor in our daily work. Third, we need to run failure drills: through drills we can see problems that we don't notice in daily work, and everyone becomes familiar with the whole procedure, so we won't be too nervous when we really face this kind of problem.
On the online side, we can improve at four layers: first, improve observability; second, release progressively with grayscale; third, support rollback; and fourth, auto-healing — some small problems happen every day, and if we don't deal with them in time, these small issues can accumulate into big ones.

Now let's come to the third chapter: how we improve stability through the offline approach to reach the 1-5-10 goal. (My English is not great and some of the terms I only know how to express in Chinese, so I'll spend a bit more time on this part.)

When we review a failure, we break the process into four phases: the first is occurrence, the second is detection, the third is identification, and the fourth is recovery. For each phase we look at the key time points and analyze them to find out what we can improve.

The first phase is occurrence: when did the failure occur, when did it surface, what was the root cause, and can we avoid this kind of failure in the future?

The second phase is detection: when did we observe the failure, and how did we find it — manually or automatically, from the business side or from monitoring? Then we should ask ourselves: could we have found it earlier, since our goal is to detect it within one minute? Going a step further, could we even have noticed the failure in advance?

The third phase is identification, which is usually the longest one. The key points here are: who was the responsible person and when did that person take over the failure; when did we find the cause and when did we start working on the fix; did the responsible person handle it effectively and in time, and if not, what was the reason; and what method identified the problem — was it monitoring, inspection, or manual investigation that found the root cause and the solution?

The fourth phase is recovery. Here we focus on two time points: system recovery and business recovery. During recovery, what method was used — a rollback, a relaunch, or something else — and how long did the execution take? And finally, can we recover even faster in the future? That means lowering the cost of response, reacting more quickly, and even preparing precautions in advance.
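To make these review points concrete, here is a minimal sketch of a post-incident review record — the field names are hypothetical, not the speaker's actual template — that captures the key time points and facts of the four phases.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# A hypothetical review record covering the four phases:
# occurrence, detection, identification, recovery.
@dataclass
class FailureReview:
    # Occurrence
    occurred_at: datetime
    root_cause: str
    avoidable_in_future: bool
    # Detection
    detected_at: datetime
    detection_method: str          # e.g. "alert", "business report", "manual"
    could_detect_earlier: bool
    # Identification
    taken_over_at: datetime
    responsible_person: str
    cause_found_at: datetime
    handled_in_time: bool
    # Recovery
    recovery_method: str           # e.g. "rollback", "restart"
    system_recovered_at: datetime
    business_recovered_at: Optional[datetime] = None

    def phase_durations(self):
        """Return how long detection, identification, and recovery took."""
        return {
            "detection": self.detected_at - self.occurred_at,
            "identification": self.cause_found_at - self.detected_at,
            "recovery": self.system_recovered_at - self.cause_found_at,
        }
```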
This is actually close to failure prevention: during development we need to think about stability ahead of time. What does stability mean? It doesn't mean no failures; it means you need to minimize the impact of failures. You need automated processes for failure detection and failure treatment so that the impact is minimized, and where manual intervention is still needed, you need rehearsals so that people are prepared in advance, and you keep optimizing. If we never run these kinds of drills, we will never find where the weaknesses are until the failures actually occur; training exercises increase our capability to handle these emergencies.

Next I'm going to talk about how we achieve the 1-5-10 goal online. The first point is grayscale release. Is there a perfect grayscale plan? Actually no — the grayscale has to be designed according to your specific situation. We have the online environments and the daily and testing environments, where the risk is lower. We also divide releases by business — for instance some offline businesses versus businesses that deal directly with users — and it's important for us to get the latter right, so we start from non-key businesses and work all the way up to the key ones. We reserve enough observation time between release batches, we register the release in the release system, which can then help us roll back, and we make sure the online stop-loss strategies are in place and can halt the release before enforcing any further change.

For stopping the loss we do have methods, for instance rolling back the release, and there are other approaches as well — if you have operations experience you may restart, reload, release again, and so on. When we actually want to stop the loss, the process needs to be simplified: a low-cost procedure that can be adopted to stop the loss, because if it is complicated, executing it online may introduce yet another failure.

Then there is observability. Good observability means we need a good monitoring pipeline. Our internal monitoring system consists of two parts: Prometheus at the core, and an interface project on top of it that enhances Prometheus. It covers our control-plane components and platform scenarios, and we require all components to provide metrics; the metrics need to cover all the key interfaces and be able to measure whether the component is working well. The second layer is the node: we have exporters for node-level metrics, for instance CPU-based metrics, the container runtime, and the health of node daemons, among others.
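As an illustration of the kind of interface-level metrics described here, below is a minimal sketch — my own example, not the speaker's code — using the Python prometheus_client library to expose request and failure counters plus latency for a key interface such as container start; success rate and latency alerts can then be computed from these series in Prometheus.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics for one key interface ("start container"); the names are illustrative.
START_REQUESTS = Counter("container_start_requests_total",
                         "Total container start requests")
START_FAILURES = Counter("container_start_failures_total",
                         "Failed container start requests")
START_LATENCY = Histogram("container_start_duration_seconds",
                          "Latency of container start requests")

def start_container():
    START_REQUESTS.inc()
    with START_LATENCY.time():
        time.sleep(random.uniform(0.05, 0.2))   # stand-in for the real work
        if random.random() < 0.01:              # simulate an occasional failure
            START_FAILURES.inc()
            raise RuntimeError("start failed")

if __name__ == "__main__":
    start_http_server(9100)       # expose /metrics for Prometheus to scrape
    while True:
        try:
            start_container()
        except RuntimeError:
            pass
```

The success rate over a window can then be expressed in PromQL as something like 1 - rate(container_start_failures_total[5m]) / rate(container_start_requests_total[5m]) and alerted on when it drops below the agreed threshold (the Q&A later mentions roughly 99%).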
But it is not enough to have only metrics, because those are mostly resource and runtime metrics; we also need to observe anomalies — things that are simply not normal. For instance, systemd may have a problem, or a daemon or a Kubernetes component may be logging errors, and if we wait until users receive the alerts, it is already too late. On that basis we use the node-problem-detector, NPD, with our own enhancements. NPD is a DaemonSet that detects problems on the node side and reports them to the API server. The community version of NPD does not meet our needs, because, as I mentioned, our monitoring pipeline is based on Prometheus, while the community version only reports the data to the API server. So we strengthened it by adding a Prometheus adapter that turns the detection results into metrics that Prometheus can pull. In addition to Prometheus we also have other monitoring systems — Ant Financial, for instance, uses a different system to monitor its data — so we need an interface that can connect to different platforms.

This is the NPD structure we have: there is an exporter, and the exporter sends the metrics to the Prometheus adapter, which connects to our back-end monitoring and storage systems. Our maintainer has communicated with the community — we think this is quite necessary — and the contributors have already added a PR in the community. Before that, NPD could only upload those alerts to dashboards.

But only uploading the data is not enough; we also built automated recovery. This small system can identify some potential problems and perform simple recovery, so it is a kind of auto-healing. Many kinds of failures that occur when we run containers can be identified in this process and detected by NPD. NPD has several components — for instance custom detectors, the system-log detector, and filters — and the results are pushed to Prometheus; the healing component then finds the healing method for each failure and executes it through a remote call.

A simple example: we all know a container has a veth pair — one end inside the container, the other on the host. For the host-side end we noticed a problem: the device may become detached, which causes a network failure inside the container. For this failure we have a detection that checks whether each network interface is still attached; if there is a failure, we can see it through Prometheus, find the corresponding node, and run the auto-healing, and the healing process re-attaches the network interface. These are only the simple cases; for more complicated issues we cannot just run auto-healing, because the fix may involve deeper operations on the node, such as migration.

This is the simple structure of the container debugger: there are two parts, one is the healer and the other is the prophet. Let's look at the healer first: it pulls our metrics from Prometheus, and for each metric there is a corresponding case — each metric maps to one specific failure, and each failure has a specific healing or recovery process. We do this matching and then run the recovery through a remote call. This is how simple failures are recovered.
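To make the healer's matching idea concrete, here is a minimal sketch — with hypothetical metric names, failure cases, and remediation commands, not the speaker's actual implementation — of a loop that polls Prometheus, maps each firing metric to a known failure case, and triggers the corresponding recovery action on the affected node.

```python
import time
import requests  # assumes the standard `requests` package is available

PROMETHEUS = "http://prometheus.example.com:9090"  # hypothetical endpoint

# Each known failure case: a PromQL expression that selects unhealthy nodes,
# and the remediation to run on them. Names and commands are illustrative.
FAILURE_CASES = [
    {"name": "veth_detached",
     "query": 'node_problem{type="VethDetached"} == 1',
     "remediation": "reattach-veth"},
    {"name": "dockerd_hung",
     "query": 'node_problem{type="DockerdHung"} == 1',
     "remediation": "restart-dockerd"},
]

def firing_nodes(query):
    """Ask Prometheus which nodes currently match the failure query."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    return [r["metric"].get("node", "unknown")
            for r in resp.json()["data"]["result"]]

def remediate(node, action):
    # Placeholder for the remote call (SSH, agent RPC, etc.) that runs the fix.
    print(f"running {action} on {node}")

if __name__ == "__main__":
    while True:
        for case in FAILURE_CASES:
            for node in firing_nodes(case["query"]):
                remediate(node, case["remediation"])
        time.sleep(60)  # poll once a minute, in line with the one-minute target
```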
The second part, the prophet, is also simple — we have only just started it. It makes simple predictions: for some metrics we cannot make an immediate judgment, so we cannot catch the failure quickly; instead, through some algorithms we generate fairly conservative warnings, and if something is detected we send an alert and ask our developers to look at it manually, because we have only just started this part and don't have many results to show yet.

That's all of my sharing with you — thank you for your attention. Do we have any questions?

Q: Hello, I want to ask: since you want to identify problems within one minute, what is the frequency of collecting the metrics? If you collect at a high frequency, how do you deal with the noise that high-frequency collection can bring?

A: When we collect metrics we do it in different tiers. Some metrics — for example a memory leak — won't have too much impact even if they have been accumulating for a while, so we collect those every hour or two. Other metrics that may have a bigger impact we collect every minute or two. And for the anomaly signals, once we have identified such an anomaly it means there is already a real problem, so there is very little noise associated with them.

Q: I'd like to ask another question. The system can have many different kinds of problems; how does your NPD monitor all these different kinds of failures? Do you keep a summary of them? When I first tried this kind of anomaly monitoring of the kernel I didn't have a good solution — I just kept asking colleagues in that area, for example whether there are keywords like "alarm" or "warning" in the kernel log that I should match. But sometimes this requires understanding the context; for example, a panic may have shown its first signs several days earlier. How do you handle that?

A: For us, we don't do very complicated matching — only relatively simple matching. Once a keyword is identified, we assume there may be a problem; once keywords like "error" appear, it means something is wrong in the kernel. So perhaps we are still not doing this in a very in-depth way.
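As a rough illustration of the simple keyword matching described in this answer, here is a minimal sketch — my own example; the keywords and log source are hypothetical — that scans kernel log lines for suspicious keywords and reports a problem without any deeper context analysis.

```python
import re

# Keywords whose appearance in the kernel log we treat as a sign of trouble.
# Deliberately simple: no context analysis, just pattern matching.
SUSPICIOUS = re.compile(r"\b(panic|oops|error|soft lockup|out of memory)\b",
                        re.IGNORECASE)

def scan_kernel_log(lines):
    """Yield (line_number, line) for every log line that matches a keyword."""
    for i, line in enumerate(lines, start=1):
        if SUSPICIOUS.search(line):
            yield i, line.rstrip()

if __name__ == "__main__":
    # In practice the source could be /dev/kmsg or `journalctl -k`; a file here.
    with open("/var/log/kern.log") as f:
        for lineno, line in scan_kernel_log(f):
            print(f"possible kernel problem at line {lineno}: {line}")
```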
Q: Another question: Prometheus has its own TSDB. Have you also customized the TSDB interface? Alibaba Cloud and your systems must have accumulated a huge amount of data.

A: For Prometheus and our cloud product's long-term storage, we are considering internally whether to use our own storage or to push the data directly to a message queue, but for now we are still using the Prometheus TSDB.

Q: I want to ask about the service monitoring you mentioned — there is an access standard for Prometheus. For every service and microservice, are there standardized metrics targeting each service, for example the failure rate?

A: Based on the problems we have met, we do have a requirement on the success rate: once something is scheduled, the success rate must be around 99%. There are also requirements on time — you cannot take too long.

Q: Is that for every interface?

A: Mainly the key interfaces. For example, in the container area we focus on interfaces such as start, delete, and stop; interfaces like update get less attention, because the key interfaces directly affect what the customers can feel, so those are where we go into detail.

Q: OK, I understand. Let me ask one more: you said the monitoring is mostly at the relatively low, infrastructure level. But if I develop, say, a database-related business on top of Kubernetes and I have a business-level failure — I may want to handle it by restarting the container, but that authority is in your hands. How do you combine this kind of failure handling with the business, on top of Kubernetes?

A: In our auto-healing we do a kind of handover. Before a Kubernetes migration we do it in two phases: when starting the container, we schedule through our relevant interface to start it, and inside the container there is another system that does the business-related scheduling, monitoring, and healing. When the business side's monitoring finds a problem, it is coupled with health checks, and once the failure is delivered down to the underlying layer, we take care of eliminating it. So the business only needs to find the failure and deliver the signal to the underlying system, which recovers it — that is, for failures the business itself cannot solve. The business has its own logic, and we don't understand much of that logic, so at our layer we can only do basic failure handling.

Q: You also mentioned the debugger component. During its release and usage, does it have problems of its own, like bugs or availability issues?

A: We run many instances of it to address availability. As for bugs in the healing itself: some problems don't come from just one or two machines, so we also apply policies — some healing is fully automated, some is manual, and every healing action can be finished in different batches; some of them are already running online.

Q: OK, thank you. One more question about the failure drills you talked about: usually the testing environment is not exactly the same as production, so how do you run the drills?

A: Usually we do it in the online environment. From the initial stage of a drill we prepare a kind of test application — for every business we deploy a similar application — and once it is running, we perform destructive operations on the machine, for example cutting the power or the network. If it cannot be recovered, then we know the exposed problems need thorough improvement.

OK, time is limited — thank you.