I'm very happy to attend the CNCF conference today, and I want to thank the previous speakers for giving some background for us. Today we are going to talk about how we use Thanos to achieve high availability and scalability. A short self-introduction first: my name is Ting Guan, I joined Alibaba in 2016, where I first worked on containers and the container host, and I now work on Kubernetes and SRE. My colleague joined our team in 2018 and is mainly responsible for the data and deployment ecosystem. Today we will go through four parts. First, a brief background introduction: why we do this and what it means. Second, the construction process of our node agent. Third, how we handle the centralized data and how it evolved. Fourth, our vision for the future. I will cover the first two parts and my colleague Li Tao will cover the last two.

First, the background. Before we start I want to do some scenario analysis of how we use this data. The first scenario is the policy engine, which is part of a project we are working on in our clusters: when there is resource contention, we can adjust QoS to ensure the stability of our resources and our business. The second is central scheduling. In many scenarios we need to run online and offline deployments on the same machines at the same time, and we need this data to decide how to place the offline workloads well; we also have an algorithm team that analyzes whether the algorithms we use are reasonable. Next is resource operation: based on business growth and business change we can plan our servers. Every year Alibaba buys a lot of servers, and resource operation uses this data to decide what kind of servers we need to buy to meet our future needs. Finally, the dashboard is mainly responsible for data presentation and management. That is the use case and scenario analysis.

Next, what is our plan and what capabilities are we trying to build? First, we need instrumentation: the capability to identify problems, and then to analyze and solve them. On the left you can see that we have a lot of time series data, including hardware metrics, OS metrics, and application metrics, and they are all stored in a time series server. On that basis we build monitoring dashboards, broad data coverage, automation, monitoring and alerting, and we use this data to optimize our containers and improve efficiency and turnover. How can we use this data to solve our problems? As you may know, all of these problems are large scale. The time series database has aggregation capability, and it stores not only metrics but also events, so Kubernetes-native system metrics and events are stored there as well; it serves both online queries and offline analysis. On top of that we build further capabilities.
Our algorithm engineers do their work on that basis, our data analysis team also works on it, and our DevOps engineers can perform operations based on the analysis. This deployment system and these key capabilities are what we want to have.

Next, given these capabilities, what is the architecture? It has two parts. The first part is the client end, the node agent, which our team calls Wali, named after the robot in the movie WALL·E; that is where the name comes from. It provides local storage and persistent data: the data is persisted into an observability database on the node and exposed as metrics. We need to ensure a long data lifecycle, so that if there is a problem the data stays there. Even if containers and other components on the node fail, we must still be able to retrieve the data successfully. On top of it we have an adapter, and the metrics server and the VPA are connected to it as well.

That was the first part, the background. The second part is the node agent: how we actually built Wali. We have several design principles. The first is availability: the agent must be highly available and must not fail easily. There are three smaller principles under it: metrics collection should be independent, the agent should have low external dependencies so that outside failures do not affect it, and barring accidents it should never need to restart. The second principle is high performance. If a policy engine needs to use our data for QoS adjustment, we need high-performance local query capability and future-projection capability, that is, real-time queries and prediction of future trends. The third principle is low cost, because an agent that runs on every node and consumes a lot of resources is simply not reasonable, so we have to control its cost. There are two sub-principles: use as little CPU as possible, and collect each metric only once. There may be many consumers of the same metric, but in Alibaba we pull it from the data source only one time and share it. First, for the various data sources, the agent does the collection once on behalf of everyone. Second, we want to support aggregation: on the host, for example, CPU usage time is a counter, and when we present it we need to do the accumulated computation once, centrally, rather than having every consumer repeat it. Those are the design principles we thought about.

Next, the overall architecture. It is mainly divided into two parts. The first is the main process, which collects data from the different data sources and has four components under it. The first is the aggregator, which is mainly responsible for aggregating many series into a new one, for example turning a counter into a gauge, or averaging across many containers; we can do that here.
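To make the counter-to-gauge idea concrete, here is a minimal sketch in Go, under my own assumptions about names and units (it is not the agent's actual code): it turns two samples of a cumulative CPU-time counter into a per-second rate and then averages the rates of several containers.

```go
package main

import "fmt"

// sample is one scrape of a cumulative counter, e.g. container CPU usage in seconds.
type sample struct {
	timestampSec int64   // unix time of the scrape
	value        float64 // cumulative counter value at that time
}

// counterToRate turns two consecutive counter samples into a gauge-style rate
// (units per second), which is what an aggregator would expose instead of the raw counter.
func counterToRate(prev, cur sample) float64 {
	dt := float64(cur.timestampSec - prev.timestampSec)
	if dt <= 0 || cur.value < prev.value { // counter reset or bad clock: skip
		return 0
	}
	return (cur.value - prev.value) / dt
}

// averageAcrossContainers aggregates one rate per container into a single gauge,
// e.g. the average CPU usage of all containers in a pod.
func averageAcrossContainers(rates []float64) float64 {
	if len(rates) == 0 {
		return 0
	}
	var sum float64
	for _, r := range rates {
		sum += r
	}
	return sum / float64(len(rates))
}

func main() {
	prev := sample{timestampSec: 1000, value: 120.0}
	cur := sample{timestampSec: 1010, value: 123.5}
	rate := counterToRate(prev, cur) // 0.35 cores used on average over these 10 seconds
	fmt.Println(rate, averageAcrossContainers([]float64{rate, 0.7, 0.15}))
}
```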
The exporter API is mainly used for our centralized storage: it exposes the data in Prometheus format, for example, so that the central storage can scrape it easily. The query API can check how the data changed in the past or predict how it will change in the future. The inspector is mainly used to collect different kinds of data through our various plugins.

The second part is the plugins. We support many kinds of plugins, such as node, kernel, container, and storage, and later we will add network and others. Let me use the node plugin as an example. It implements several interfaces. The first is the interval: as you can guess from my earlier introduction, this is the interval the scraper uses, so every time the scraper wants to collect, it uses this interval to decide how long to wait before collecting again. The rules interface cooperates with the aggregator: it returns all the rules we are using today, and the aggregator applies them against the database to do the aggregation. The export interface defines what gets exported, whether the original series or the aggregated ones; for example, I may have 20 CPU metrics but only two or three of them are actually needed today.

So those are the two parts, the main process and the plugins, and this is how they work together: when the main process needs to visit a plugin, the plugin is started, and once the visit is finished the plugin is closed again. You can imagine that each time, a plugin is launched, accessed over a socket, and then closed once the operation is done; this reduces the system overhead to the maximum extent. The plugins can also be upgraded on the fly: once you have upgraded one, the next time it is used the new version takes effect automatically.

That was the introduction; next I would like to introduce how we implemented it in practice. First of all, we chose the Prometheus TSDB engine and use Prometheus storage: we embedded it in our single-machine agent with some changes, so as to guarantee that the data we collect is persisted. The Prometheus TSDB engine has quite excellent performance, it has many sampling functions built in, and it implements the TSDB construction and compression. For our plugins we chose Go plugins, which you can look up online, and we standardized the interfaces of all the plugins. We also implemented a combined TSDB API, that is, Prometheus query compatibility, because it is needed both for local queries and so that our centralized storage can scrape the data it needs from us. In addition we implemented event storage: we know that some data formats, such as events, cannot be stored as metrics, so we store them locally in another way. These are the practices we are using now. Lastly I would like to thank the open source community and all the engineers who contribute to it; we could develop and aggregate this quickly because we started from a very high starting point: we used Golang, we used Go plugins for the plugin mechanism, and we used PromQL and related technologies.
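Since Go plugins with standardized interfaces are mentioned here, the following is a minimal sketch of how such a plugin might be loaded with the standard library plugin package. The Collector interface and the NewCollector symbol are my own placeholder names, not the agent's real interfaces (Interval, Rules, Export, and so on are not published in the talk).

```go
package main

import (
	"fmt"
	"plugin"
)

// Collector is a hypothetical plugin interface. For Go plugins, an interface
// like this would have to live in a package shared by the host process and
// every plugin so that the types match at load time.
type Collector interface {
	Interval() int                        // seconds until the next collection
	Collect() (map[string]float64, error) // metric name -> value
}

// loadCollector opens a plugin built with `go build -buildmode=plugin` and
// looks up an assumed `NewCollector` factory symbol.
func loadCollector(path string) (Collector, error) {
	p, err := plugin.Open(path)
	if err != nil {
		return nil, err
	}
	sym, err := p.Lookup("NewCollector")
	if err != nil {
		return nil, err
	}
	factory, ok := sym.(func() Collector)
	if !ok {
		return nil, fmt.Errorf("NewCollector has unexpected type %T", sym)
	}
	return factory(), nil
}

func main() {
	c, err := loadCollector("./node_plugin.so") // placeholder path
	if err != nil {
		fmt.Println("load failed:", err)
		return
	}
	metrics, _ := c.Collect()
	fmt.Println("next collection in", c.Interval(), "seconds:", metrics)
}
```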
In this slide I would like to briefly explain why you should have data locally and what it means. This graph shows how our policy engine adjusts QoS to guarantee the stability of the business. You can see that the response time of the business is normal at first, but once resource contention appears it starts to change. We can use this data to discover these irregularities and differences and adjust QoS so the business keeps functioning normally and stays stable. Those are the results we have obtained; if you are interested in the policy engine, you can follow the open source project.

Next I would like to invite my partner Li Tao to introduce how we use Prometheus and Thanos in our centralized storage.

Hello everyone, I would like to introduce some of our experience at Alibaba in using Prometheus and Thanos. First of all, a bit about Prometheus. It started as a system monitoring and alerting toolkit, and in 2016 it joined CNCF as an independent open source project. Many companies and organizations have adopted it, and it has a very active community of developers and users. Through Kubernetes, Consul, or files it can discover its targets, and it pulls the target data over HTTP. Prometheus also supports alerting and can send alert data to an independent alert management tool. It supports a very flexible query language that can be used for many dashboards. With its multi-dimensional data model, Prometheus is suitable not only for monitoring machines but also for microservice architectures.
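To make the pull-and-query model concrete, here is a minimal sketch of running a PromQL instant query through the client_golang API client; the address and the metric name are placeholders for illustration, not anything specific to our deployment.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address: any Prometheus-compatible query endpoint works here.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// A PromQL instant query, e.g. the per-second request rate over the last 5 minutes
	// (http_requests_total is just an example metric name).
	result, warnings, err := promAPI.Query(ctx, `sum(rate(http_requests_total[5m]))`, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```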
Next, from the point of view of scaling Prometheus, let me go through the commonly used deployment patterns. The most common one is to deploy a Prometheus per cluster and schedule separate resources for each; this is suitable when a cluster has up to a few thousand machines. In this mode the scrapes do not need to cross the public network, which avoids unnecessary network overhead, and this deployment method can satisfy many small and medium companies. But in Alibaba's scenario we have thousands of clusters and more than 100,000 pods, so we cannot maintain it this way, and a very big problem is that if we want to check how one service is currently behaving across all machines, we cannot do it: there is no global view. Another way to scale is, if your cluster is not big but a single Prometheus is still under load pressure, to build a Prometheus for every application. If your applications keep growing this method is not suitable either; Alibaba has tens of thousands of applications, so it cannot be used in our scenario.

To solve the global view problem, Prometheus offers federation: one Prometheus can ingest data from another and form a hierarchical monitoring system. For example, if we want to see the requests per second of a service on every machine, we can collect all of the data inside each machine room, deploy a Prometheus in the core machine room that pulls the data after aggregation, and then query that global Prometheus to get the sum and know the requests per second of the service. The problems are, first, that you have to manage the aggregation rules, and when there are too many aggregation rules maintaining and managing them becomes a burden. Federation is also more useful for pulling data after aggregation, so it has performance and scalability issues: the raw data is simply huge, and when you pull it out there is a bottleneck in computing over it, the queries pile up and fall behind, and you cannot support every query you would like to serve. Additionally, federation has reliability issues. We said earlier that we want to lower the dependency on outside components, and introducing another Prometheus is another point that can fail; if the data is transferred over the public network the risk is even higher. So when you use federation, you also have to consider how to monitor the higher layer itself rather than just stacking on more Prometheus. Another problem is that the data you pull may lag behind, and you may lose data: for instance, over the public network, when a scrape times out the data may be lost. If it is just a demo you can tolerate this kind of error, but if you pull out all the data the error is magnified, and that is not acceptable.

Another way to scale Prometheus is still a form of federation: each Prometheus pulls the metrics of a subset of the nodes, each with its own aggregation rules, and a central Prometheus federates from all of them. The biggest problem with this plan is that the maintenance cost is extremely high and the topology becomes super complicated; across our machine rooms we would have tens of thousands of similar structures, so the whole thing would be very complicated.

Then there is high availability. In our environment we cannot accept losing visibility for even a minute, so we need higher availability: we run multiple Prometheus replicas with the same configuration pulling the same data. The biggest issue here is that each replica scrapes independently, so there is slightly different latency in each one and minor differences in the values. If you build a global view and query on top of that, the data differs between replicas; for instance, refreshing the page can return a different result. So we have to align the Prometheus replicas by connecting them and always sending the same query to the same Prometheus. But that brings another problem: it wastes the resources of the other replicas, because all queries go to one Prometheus while the others are only responsible for pulling data, so their resources are not fully utilized.
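As an illustration only, not our production code: pinning a query to one replica can be as simple as hashing a stable key (for example the dashboard or tenant) and always picking the same backend, which is the workaround described above.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickReplica deterministically maps a query key to one of several
// identically-configured Prometheus replicas, so repeated refreshes of the
// same dashboard always hit the same replica and see consistent data.
func pickReplica(key string, replicas []string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return replicas[int(h.Sum32())%len(replicas)]
}

func main() {
	replicas := []string{"http://prom-0:9090", "http://prom-1:9090"} // placeholder addresses
	fmt.Println(pickReplica("dashboard:cluster-overview", replicas))
}
```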
So Prometheus has a number of mechanisms that can be used for scalability, but each of them has problems, and the community does have solutions. In the end we decided to use Thanos. Thanos exposes a Prometheus-compatible API, it provides a long-term metrics retention engine, and with object storage it supports effectively unlimited data retention. I will not go into the details; I just want to tell you how we use Thanos for our global view and what our experience has been.

This is how Thanos works with Prometheus in the most common deployment. It has two main components here: the Sidecar and the Querier. The Sidecar is deployed next to each Prometheus, and Queriers can be placed in the different clusters; they are all compatible with each other. When you send a query to the Querier, it is fanned out to the Sidecars, which pull the data out of Prometheus, and the results are merged centrally and sent back to the client. Compared with the instability of federation, this effectively solves our availability problems and lets us get at all the data. I also want to add that when we used plain Prometheus there was a problem with services that span different network segments: federation could not handle that case, but Thanos can.

That is our global view deployment: in the main machine room we deploy a Querier that connects to the Queriers in the sub-clusters in the different machine rooms, the whole thing forms one system, and the data flows back from the sub-clusters through the main cluster to the client. This addressed Alibaba's problem in most of our cases, and Thanos is able to support our data storage requirements.

There is some other work we did. The first is custom target discovery. For instance, when we do not have enough storage or hit some other issue, we need Prometheus to automatically stop pulling certain targets. Prometheus and Thanos support Kubernetes service discovery, but in Alibaba, because of our business and some historical reasons, we have our own service discovery systems besides Kubernetes. So we developed a component for custom-made target discovery: it verifies the targets and emits them in the same format that Prometheus consumes, so Prometheus can then scrape according to our requirements (a sketch follows at the end of this part).

The second is configuration management. With just one or two Prometheus instances you can manage the configuration manually, but we have a great many Prometheus instances and their configurations keep changing, and there is no centralized configuration tool for Prometheus, so we developed one ourselves. We have a central management tool where users submit configuration changes, and a sidecar that sits next to each Prometheus synchronizes the configuration and puts it on the local disk, and Prometheus loads it on request. This solved the problem for us.
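The talk does not say which format the custom target discovery emits, only that it is the same format Prometheus consumes. Assuming it is the standard file-based service discovery (file_sd) JSON, a minimal sketch of producing it could look like this; the targets, labels, and file path are placeholders.

```go
package main

import (
	"encoding/json"
	"os"
)

// targetGroup mirrors the JSON shape Prometheus expects from file-based
// service discovery: a list of {"targets": [...], "labels": {...}} objects.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels,omitempty"`
}

func main() {
	groups := []targetGroup{
		{
			Targets: []string{"10.0.0.1:9100", "10.0.0.2:9100"}, // placeholder node-agent endpoints
			Labels:  map[string]string{"cluster": "cluster-a", "job": "node-agent"},
		},
	}
	data, err := json.MarshalIndent(groups, "", "  ")
	if err != nil {
		panic(err)
	}
	// Prometheus would watch this file via a file_sd_configs entry and pick up
	// changes without a restart.
	if err := os.WriteFile("custom_sd.json", data, 0o644); err != nil {
		panic(err)
	}
}
```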
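Similarly, a minimal sketch of the configuration-management sidecar idea, under my own assumptions about the endpoints involved: fetch a configuration from a central service (placeholder URL), write it next to Prometheus, and ask Prometheus to reload through its /-/reload endpoint, which is only enabled when Prometheus runs with --web.enable-lifecycle.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// syncAndReload pulls the latest configuration from a central config service,
// writes it to the local path that Prometheus reads, and then triggers a reload.
func syncAndReload(configURL, configPath, promURL string) error {
	resp, err := http.Get(configURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	cfg, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	if err := os.WriteFile(configPath, cfg, 0o644); err != nil {
		return err
	}
	reload, err := http.Post(promURL+"/-/reload", "", nil)
	if err != nil {
		return err
	}
	defer reload.Body.Close()
	if reload.StatusCode != http.StatusOK {
		return fmt.Errorf("reload failed: %s", reload.Status)
	}
	return nil
}

func main() {
	// All three arguments are placeholders for illustration.
	if err := syncAndReload("http://config-center.internal/prometheus.yml",
		"/etc/prometheus/prometheus.yml", "http://localhost:9090"); err != nil {
		fmt.Println(err)
	}
}
```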
The last part is thinking about the future. We have been using Thanos and Prometheus for a while now, and we have been thinking about what enhancements they could use. We think the most important one is authentication, because security is our priority: we need an authentication process so we know who the users are, can authorize requests, can restrict a user to only part of the data, and can apply targeted management methods with more precise targeting and flow control. The second is query caching (a small sketch follows below). Once data has been rewritten and compacted it should be immutable, so whenever you query it the result should be the same; for those time ranges we want to return the data once, store it locally, and be able to go back to it quickly. The third is query safeguards: I think both Thanos and Prometheus already have some of this, but we need more precise management. Fourth, we want integration with Apache Flink so that our computation capability can be enhanced; one problem with Prometheus is that it runs on a single host, so heavy queries hit computation bottlenecks, and integrating with Apache Flink would support more computation.
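On the query-cache point, here is a purely illustrative sketch of the idea, not an existing component: cache range-query results keyed by the query and its time range, and only cache ranges that lie entirely in the past so the answer can no longer change.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cacheKey identifies a range query; once the underlying blocks are compacted
// and immutable, the same key should always produce the same result, which is
// what makes caching safe.
type cacheKey struct {
	query      string
	start, end int64 // unix seconds
	stepSec    int64
}

type queryCache struct {
	mu      sync.Mutex
	entries map[cacheKey][]byte // serialized query results
}

func newQueryCache() *queryCache {
	return &queryCache{entries: make(map[cacheKey][]byte)}
}

func (c *queryCache) get(k cacheKey) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.entries[k]
	return v, ok
}

func (c *queryCache) put(k cacheKey, result []byte) {
	// Only cache ranges that end in the past, so late-arriving samples
	// cannot make the cached answer stale.
	if k.end >= time.Now().Unix() {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[k] = result
}

func main() {
	c := newQueryCache()
	k := cacheKey{query: "up", start: 1000, end: 2000, stepSec: 30}
	c.put(k, []byte(`{"status":"success"}`))
	if v, ok := c.get(k); ok {
		fmt.Println(string(v))
	}
}
```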
That is all we wanted to share, thank you very much. Any questions?

Question from the audience: since you are doing monitoring, what are your strategies for cold, warm, and hot data, and what are the differences between online and offline data?

For the first question, about the strategies for the data we keep: the collection intervals actually range from 10 seconds to 2 minutes and even to 2 hours; they are different, but we make sure that for the important data the interval is small and the monitoring is fine-grained, so the granularity is finer than for one-off data or alerting data. For data that is far in the past we may compact and downsample it, and retention may eventually delete it. The strategies are actually quite complicated, so if you have a specific case we can talk later in detail; I cannot summarize it in one or two sentences. I cannot remember the second question, could you please repeat it?

The second question was about the granularity and the response time for online versus offline data. Online data is real-time data: we pull it every 10 seconds, so the latency is quite low. We will have the opportunity for another question, so I will just finish this one: when we send the data onward we still rely on Prometheus's local retention, so it is within seconds and the latency is not long; and with Apache Flink we will have more alerting and aggregation rules in the future.

One more question: I noticed that you do not use object storage with Thanos, you store the data locally with Prometheus. For every data center, how do you store the data, how is it partitioned, and how is it deployed: on Kubernetes, or on physical or virtual hosts?

For long-term storage our cloud actually has its own object storage and it does provide long-term retention, but the interface is not connected yet; we have the Pangu interface to that object storage, so for now we want to store this data locally. We do have some problems with that: with replication, every replica stores a lot of data, and when a replica is broken we have to copy it back one by one; we do not have a lot of spare resources, but we do keep long-term retention locally. As for Kubernetes versus physical hosts: in many scenarios, if it is a small deployment, Prometheus can be managed on Kubernetes even as you scale up, and Thanos can also be used to manage the data that way. But in our case the data volume is so large, with thousands or tens of thousands of instances, that the capacity would be too small to run it that way. Second, this monitoring link is tied to the stability of the whole system: if there is a failure, we have to make sure the link itself is not affected, so we do not put it on the business-side Kubernetes. Instead we run independent Kubernetes clusters that we adapted to the large-scale scenario. We are still evolving it and still running it; if you have any questions please contact us. Thank you.