Good morning, everyone. My name is Zhao Yifan. Today my colleague Dominic and I are going to talk about building a global, cross-cloud monitoring platform. First, a brief self-introduction: I have been with Improbable for almost six years, from 2013 until now. I am the founder of Improbable China, bringing Improbable's technology from the UK to China, and I am involved in core technology development and software engineering.

Before we start, I want to ask you a small, slightly abstract question. Suppose there is a tree in a forest: how can we monitor the state of the tree and its leaves? In a real-world scenario, the tree might be a virtual host on a cloud, a physical host, or a container, and we want to know everything about it: its status, how its leaves wave and move, even its smell. If there is only one tree, the question is simple: we can check on it regularly. But if there is a whole forest, we need a system to observe and monitor every tree in it. Go further: several forests, with different types of trees, each needing a different monitoring method. In the end, the different kinds of forests correspond to different cloud providers, each with its own platform, and we need to monitor all of that. Those of you familiar with cross-cloud monitoring will know that the tools and technologies provided by different cloud providers are all different, so building a unified monitoring approach is quite complicated. So in today's talk we are going to tell you how, at Improbable, we went from a single cluster, to multiple clusters of the same type, all the way to multi-cloud, meaning multiple clusters on different providers, and how we monitor that kind of distributed system.

A brief introduction to Improbable. We are a startup founded in 2012 in the UK, so we are about seven years old now. We have 370 employees, and we established our Shanghai office this year. We specialise in the games industry: we are building a hosting platform for games, so that developers can create more and better games for players without having to worry about the back end, a platform that empowers game developers. Those of you familiar with game deployments, MMOs, Minecraft-style games, and so on, will know how big these deployments can get, and monitoring them is a genuinely complicated problem. Working across multiple clouds is also a commercial necessity for us: overseas we use GCP, but Google has no deployment in China, so in this complex environment we have to work with different cloud providers and monitor deployments across all of them.

I'll now give the floor to my colleague Dominic Green, who will go through the technical side: Thanos, and the whole story of how we developed it. The floor is yours, Dom.

Thank you for the introduction. Let me briefly introduce myself: my name is Dominic Green, and I am a software engineer at Improbable.
I work mainly on the team responsible for our observability stack, including Thanos and its public-facing interfaces. I also help organise a number of meetups, including Prometheus London and London Gophers.

Today I'm going to talk to you about monitoring. Here is a definition you may have seen before, or at least heard: monitoring is collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts, and how long the system spends processing requests. With that data we can build a control panel like this one. At Improbable we mainly use the RED method, created by Tom Wilkie, to decide which metrics to collect for each service: Rate, Errors, and Duration. On this dashboard you can see the request rate in the top-left corner; errors would appear on the right, and that panel is blank because we are not currently collecting error data for this service; and at the bottom is our latency.

So how can we expand this to the whole world, into one global control panel? We can't tell that story without its first hero: Prometheus. Many of you will already know it, so the background is brief: Prometheus is an open-source monitoring project, and at Improbable we have always used Prometheus to monitor all kinds of data. When Prometheus monitors a service, it deals with several kinds of metrics. On the screen, for example, there is a counter, which only ever goes up; a gauge, which may go up and down; and a histogram, which buckets observations such as request durations. We scrape these metrics out of the workloads, and from there the data fans out to four different parts, as in the diagram: recording rules and alerting rules are evaluated; Alertmanager, in the right corner, receives the alerts, so our engineers are warned by the system; Grafana queries Prometheus's API to draw the dashboards; and the query API lets us read the data back for as long as Prometheus keeps it. Today we will look at how to expand Prometheus to a global scale and put all these pieces together, but this picture is our foundation: we will take this same group of elements and grow it out to the whole world.

Our second hero is Kubernetes. Kubernetes is the platform that all of our software runs on, so it is fundamental for us. This diagram shows the principle of how Prometheus and Kubernetes operate together. We have a scrape configuration, the data-capture configuration you can see on the left: Prometheus collects data from each pod, by default from its /metrics endpoint, and pulls all of those series in. In a single cluster it's very simple: you run Grafana and Prometheus together inside the cluster, just as we described, and that is the whole price of entry on Kubernetes. Kubernetes is the foundation our workloads run on, and at the same time we use Kubernetes service discovery to find those workloads, the entire set of them, so Prometheus discovers and collects from everything automatically. It very quickly became the case that Prometheus is the one place we collect everything; you could say we use Prometheus for it all.
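To make that concrete, here is a minimal sketch of a Prometheus scrape configuration using Kubernetes service discovery. It is illustrative rather than our actual configuration; the annotation convention shown is a common community one, not something the talk prescribes:

```yaml
# prometheus.yml (fragment) - a minimal sketch, not Improbable's production config.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                 # discover every pod through the Kubernetes API
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true"
      # (a widely used convention; real annotation names may differ).
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Record the pod name as a label on every scraped series.
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
    # metrics_path defaults to /metrics, as mentioned above.
```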
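And here is a similar sketch of the recording and alerting rules mentioned a moment ago; the metric names and thresholds are hypothetical examples, not our real rules:

```yaml
# rules.yml - illustrative recording and alerting rules (names are made up).
groups:
  - name: example
    rules:
      # Recording rule: precompute the per-job request rate.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Alerting rule: page when the 5xx ratio stays above 5% for 10 minutes.
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "HTTP 5xx error ratio above 5%"
```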
So, from a single cluster to multiple clusters. What are the advantages of a single cluster? First of all, it is very simple: monitoring is easy and convenient, and we can use Kubernetes service discovery directly. But there are disadvantages as well, and the big one is latency. Latency matters enormously to us because Improbable is a cloud platform for the games industry, and when we evaluate cloud platforms it is the most important criterion: if there is lag, the player's experience suffers badly. If you don't need multiple clusters or cross-cluster monitoring, you don't need to consider any of what follows; if a single cluster is all you need, these disadvantages may be perfectly acceptable.

Our headquarters is in England. From London to Manchester, the round trip is about 7 ms; to Munich in Germany, about 30 ms. That latency is not too high. But in a worldwide layout it gets much worse: to China it is about 300 ms, and to Sydney about 280 ms. In a game we cannot accept delays like that, so we move clusters to places closer to the users. That is why we run multiple clusters: players connect to a nearby cluster, and their in-game latency stays short.

With multiple clusters, the setup looks like this. We still use Grafana to display the data. In our management layer there is a hub cluster, and below it are the different game clusters, for example separate clusters in Europe and the United States. If I'm an engineer and I want the detailed metrics of one particular game cluster, I can go to the Prometheus at the bottom of that cluster and query it directly. But that means hopping from cluster to cluster; what I really want is one control panel that lets me observe the whole world at once.

Prometheus has an answer for this: federation. With Prometheus federation, you connect different Prometheus servers together: as you can see, a parent Prometheus federates data up from the Prometheus in each cluster. Federation is mainly for aggregated data, so the parent pulls a selected, aggregated subset rather than every raw series, while the clusters in each region still collect their own detailed data locally. And if there are many clusters, we have to decide which Prometheus federates from which; I'll show a sketch of this configuration in a moment. But some problems remain with this approach, and that brings us to the third hero of our story.

The third hero is Thanos. It was started in 2017 by Bartek Płotka, one of our platform engineers, together with a colleague, and it is now a fully open-source project; if you are interested, you can go and check it out. What does it do for us? We wanted a global query view across the whole world. We also wanted long-term storage, for example keeping data on object storage for more than a year, rather than only what Prometheus retains locally. At the same time, we wanted to guarantee high availability for Prometheus itself, which I'll come back to later. And finally there is downsampling, which I'll also get to, along with some notes on data accuracy.
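Before we get to the Thanos components, here is a rough sketch of the federation job the parent Prometheus would run; the target hostnames and the match expression are placeholders, not our real topology:

```yaml
# Parent Prometheus (fragment) - hedged sketch; targets and matchers are placeholders.
scrape_configs:
  - job_name: federate
    honor_labels: true          # keep the leaf's own labels rather than overwriting them
    metrics_path: /federate     # the endpoint Prometheus exposes for federation
    params:
      "match[]":
        - '{__name__=~"job:.*"}'    # typically only pre-aggregated series, not raw ones
    static_configs:
      - targets:
          - prometheus-eu.example.internal:9090
          - prometheus-us.example.internal:9090
```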
We will also touch on data accuracy along the way. So let's take a look at the components. The Thanos sidecar is a very core component. It runs right next to each Prometheus, on the same network, and it exposes a gRPC Store API through which that Prometheus's data can be served to the rest of the system. So every Prometheus gets a sidecar, and we spread its data outward through that sidecar.

The second component is the querier. It speaks the same gRPC Store API, and from the outside it looks very much like a Prometheus query endpoint. What it lets us do is replace federation: instead of a federation hierarchy, the querier fans each query out to all of the Store APIs. And because everything speaks the same API, a querier can also talk to another querier: in this example, one querier communicates with another's API and integrates the data below it, so all the data from every sidecar can be combined into one view.

This chart shows a dashboard built on a series of Thanos components: you can see data coming from three clusters. It is still collected on each cluster's machines, but it is served back through Thanos and merged into the panels on one board.

Thanos supports a range of object storage backends; we also cooperated with Aliyun to achieve this on their technology. Today every sidecar can upload into a GCS bucket, which gives us retention over a long period. To read that historical data back, we also need a Thanos component with a Store API over the bucket: the store gateway, which serves the data in object storage through the same API. It is very interesting that every Store API component in Thanos can tell the querier what data it holds, which label sets and which time ranges, so the querier only fans out to the stores that are relevant, and the processing time drops.

The sidecar uploads each two-hour block that Prometheus produces; Prometheus itself keeps roughly 15 days of data locally. Gathering the blocks together in the bucket also reduces processing time: the data goes straight into object storage, and the store gateway effectively builds a map of where our time series live in the bucket, so if the querier asks for a given series it can go directly to it instead of scanning, saving all the time otherwise spent in that process.

Look at this last picture: it is the same dashboard we just saw, but the time range is the past 12 weeks. This query is being answered out of object storage, which holds the related data for those 12 weeks.

Another problem with plain Prometheus is that it is very difficult to make Prometheus itself highly available. Think of the top-most federation Prometheus: more data depends on it than on anything else, yet we still have to run it somewhere. Kubernetes will restart a Prometheus that dies, but while it is down the scrapes are missed, and we want our monitoring workflow to keep working through exactly those moments. So we run replicas: for example Prometheus replica 0 and replica 1, each with its own Thanos sidecar. The querier can then reduce the replicas back into a single series. What we would normally see from two Prometheus servers is two time series, identical except for a replica label; at the bottom they sit on the same line, and the querier can deduplicate them.
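Before moving on, two sketches of what this looks like in practice. First, attaching a sidecar to Prometheus and pointing it at a bucket; the image tag, paths, and bucket name are illustrative assumptions, not our exact manifests:

```yaml
# Sidecar container running alongside Prometheus in the same pod (illustrative).
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.5.0          # hypothetical version tag
  args:
    - sidecar
    - --prometheus.url=http://localhost:9090   # the Prometheus it sits next to
    - --tsdb.path=/var/prometheus              # volume shared with Prometheus's TSDB
    - --objstore.config-file=/etc/thanos/bucket.yml
    - --grpc-address=0.0.0.0:10901             # the gRPC Store API
---
# bucket.yml - object storage client config; GCS here, and ALIYUNOSS is also a
# supported type.
type: GCS
config:
  bucket: thanos-metrics-blocks                # placeholder bucket name
```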
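Second, the high-availability wiring just described: each Prometheus replica carries identical external labels apart from a replica label, and the querier deduplicates on that label. The label names and endpoints here are conventional placeholders:

```yaml
# prometheus.yml on each HA replica - identical except for the replica value.
global:
  external_labels:
    cluster: eu-west     # which cluster these series come from
    replica: "0"         # the other replica sets replica: "1"
---
# Querier container args (illustrative endpoints).
- name: thanos-query
  image: quay.io/thanos/thanos:v0.5.0
  args:
    - query
    - --store=thanos-sidecar-0:10901
    - --store=thanos-sidecar-1:10901
    - --query.replica-label=replica   # merge the two replicas into one series
```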
Deduplication on the querier also means that less data has to travel from each sidecar. So I think deduplication is doubly useful: it merges our data into a single series, and it reduces transfer.

The next thing I'd like to talk about is the compactor. Just as Prometheus compacts its local data, the compactor looks at object storage and compacts the blocks there, so we store less data and can afford much longer retention. It also downsamples the data, producing, for example, five-minute and one-hour resolution windows, which reduces errors and latency on long-range queries; I'll show a sketch of its flags shortly. You can see it below: the querier chooses for you how to resolve each query, reading raw data straight from Prometheus for recent ranges, or the downsampled windows for longer ones. With multiple resolution layers we cut the time spent on the network during a query. Both of those downsampled windows, like the raw blocks from Prometheus, live in object storage; this is what we run on GCP today.

So, the three heroes of our story. Kubernetes gives us a consistent substrate in all of these scenarios, plus service discovery, and when a Prometheus has problems it is restarted for us. Prometheus itself we did not have to change at all. And Thanos gives us a global query view with high availability, can downsample our data, gets the data into object storage for long-term retention through its sidecar, and reuses what Prometheus already does well. The advantages of this multi-cluster setup are that we get better, higher-quality data, a global query view, and long-term retention. The disadvantages are that it is harder to observe and more complex, and we have to consider how much automation is needed to keep all of these Thanos components maintained, and at versions that tolerate each other, over the long term.

Now, at first we had no plans to run on more than one cloud. But once you have a customer who needs a particular service provider, and given that we cannot use GCP in China, we had to run somewhere different. And we did not want to redo everything: the Thanos setup is already involved enough, and we did not want to change it. So let's look at how we connected many clusters. Our GCP clusters are stood up with Terraform, and I can share that with you: every cluster has its own VPC. Within one provider we can peer those networks into the same address ranges so the clusters can communicate easily. I admit it gets complicated and detailed, but for a single cloud provider it is relatively simple. For multiple providers, such as Alibaba Cloud, how do we connect between two clouds? You have to keep the Terraform correct and maintainable over time, so we looked for a simple way to solve it. Envoy is the basic building block we chose; we believed it would let our clusters communicate across cloud providers. This was the complicated part, so we use Envoy at the edge of each cluster to stitch together our multi-cloud network. That means our communication is secured with TLS and can cross between different cloud providers, and all of the providers have DNS for these endpoints, so each side knows how to reach the other. Every cluster's Envoy is connected over the cloud links.
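Here is the promised sketch of the compactor flags; the retention values are examples, not our production policy:

```yaml
# Compactor - run one per bucket (illustrative args and retentions).
- name: thanos-compact
  image: quay.io/thanos/thanos:v0.5.0
  args:
    - compact
    - --wait                                 # keep running instead of a one-shot pass
    - --data-dir=/var/thanos/compact         # local scratch space for compaction
    - --objstore.config-file=/etc/thanos/bucket.yml
    - --retention.resolution-raw=30d         # raw blocks
    - --retention.resolution-5m=180d         # 5-minute downsampled blocks
    - --retention.resolution-1h=1y           # 1-hour downsampled blocks
```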
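And a much-simplified sketch of the Envoy edge just described: a TLS-terminating TCP proxy in front of the local Store API, so a querier in another cloud can dial a stable DNS name. Certificate handling, mutual TLS, and the service names here are all illustrative assumptions, not our real deployment:

```yaml
# envoy.yaml (fragment) - simplified cross-cloud edge proxy sketch.
static_resources:
  listeners:
    - name: thanos_grpc_ingress
      address:
        socket_address: { address: 0.0.0.0, port_value: 10901 }
      filter_chains:
        - transport_socket:       # terminate TLS from the remote cluster's Envoy
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates:
                  - certificate_chain: { filename: /certs/tls.crt }
                    private_key: { filename: /certs/tls.key }
          filters:
            - name: envoy.filters.network.tcp_proxy
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
                stat_prefix: thanos_grpc
                cluster: local_store_api
  clusters:
    - name: local_store_api
      connect_timeout: 1s
      type: STRICT_DNS
      load_assignment:
        cluster_name: local_store_api
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: thanos-sidecar.monitoring.svc.cluster.local
                      port_value: 10901
```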
So with Envoy in place, our clusters can communicate across cloud providers, even though each provider uses different underlying technology around the world. Now that we've reached the cross-cloud stage, the pattern stays the same: we use Kubernetes everywhere, so cross-cloud and cross-cluster environments are operated and maintained in the same way. As I mentioned earlier, Prometheus and Thanos make sure the same Store API can be used in all of these scenarios, and Envoy makes sure we keep that same approach across clusters and across clouds. The advantages are obvious and much the same as before. The drawbacks are there too: we still have to think about automation and tooling, which might be the topic for a next talk.

Let's summarise a little. You've seen how we went from a single cluster, to multiple clusters, and on to cross-cloud. If you want a closer look at Thanos, stay in this room: engineers from Alibaba will talk next about how they use Thanos at Alibaba. You can also visit the Improbable website, where we link to the KubeCon + CloudNativeCon Barcelona talk, which presents this material in more depth. If you want to migrate an existing Prometheus setup, or to improve Thanos itself, contact me and I can walk you through the process. And at the end: we are recruiting. If you are interested, come build a powerful cross-cluster, cross-cloud platform with us, with influence not only in China but around the world. If you have any questions, you can contact me; I suspect we don't have enough time to answer them all today, but we'll be here all day, so please come find us. Thank you.

Audience: I have a few questions. The first one is: if you use Thanos, how much do you have to change from the original Prometheus design, and how do you deploy it?

Dominic: You don't need much: each component just needs to run as a pod and speak the Store API. The Thanos team has also done research on improving the Store API, so the overall performance of Thanos keeps getting better, and in terms of deployment it is mature.

Audience: The second question is whether this data can be preserved automatically, whether it manages itself.

Dominic: For the time being, no; that may be added as a new feature in the future.

Audience: And the last question: do you have a reference for the Envoy design? Can you share it?

Dominic: The whole design is very simple: it is just an Envoy sitting between each pair of clusters. I can write it up as a post and publish it, and if you're interested you can check it out.

Host: Our time is up. If you have any questions, you can contact us. For anyone interested, please scan the QR code to get in touch in the WeChat group, and we'll send the slides in the group chat as well. Okay, thank you.