The topic of today's presentation is multiple canary releases and stress testing in production. Let's first start with multiple canary releases, and after that move on to stress testing in production.

Let's start with the simplest case. Canary release is a technique used to reduce the risk associated with releasing new versions of software. The idea is to first release a new version of the software to a small number of users and then gradually roll out the upgrade. For example, in this diagram, we test with 10% of the traffic first, then gradually move more traffic to the new version, and finally the old version is cleared and taken offline. Throughout the testing process, we can label the traffic with various business tags, such as Android devices, a location of Beijing, and so on. Also note that user tags should not be based on IP addresses, which are inaccurate and inconsistent. We can then specify canary traffic rules to route a certain part of the user traffic to a certain canary. For example, Android users from Beijing are routed to the 2.0 canary version of service A.

The scenario of a single-service canary is still limited. In reality, it is more common to have full-stack canary testing. For example, a user client cannot be forwarded directly through the router to the canary version of a service, because that service sits far back in the whole chain, separated from the entry by other services. As shown in the figure, we have published a canary for the delivery service, with the user and order services sitting in between. In this case, we need to do two things to ensure the traffic is scheduled correctly: first, pass the user tags through the whole chain; second, route traffic to the correct version of the next service at every hop of the chain. If you think about the implementation a little, you will notice that there are two categories of approaches to full-stack canary release: either changing the code, or using a non-intrusive platform-level solution. Changing the code can be very cumbersome, verbose and prone to bugs.

Let's use another practical example to illustrate the difficulty of multiple canary releases. Suppose there are two services, order v1.0 and email v2.0. The order service calls the email service, and the email service uses the third-party email provider Tencent. We decide to add some information to the order entity as a test version for Android users only; since the change to the order entity affects both services, both need to be changed, so we add order v1.1 and email v2.1 to apply this change. Another team decides to replace the Tencent provider with the Google provider in the email service, so they add email v2.0.1 to test with users from Beijing only. At this point there is a dilemma: the Android users and the Beijing users overlap, namely the traffic from Android phones in Beijing. Considering the complexity of the traffic and the inconsistency felt by those users, this leads to a heavy operational burden.

Let's take one step back and analyze different cases of canary release, starting with the figure on the left. For example, services a and b both rely on z; a tests Android traffic and b tests iPhone traffic, so they test two different user groups. If they both release canaries that involve z, z has to carry two different canary traffic rules and becomes a source of confusion. There are two better approaches, shown in the figures in the middle and on the right: the first is to schedule the traffic from a to z within one canary release, and the second is to schedule the traffic from a to z and from b to z as separate canary releases. This gives us the principle that makes many things easier: one canary release, one traffic rule.
Even if the canary conflict problem is solved, there is still the problem that traffic rules may overlap. The previous example used Android and iPhone user traffic, but if one canary tests Android and the other tests Beijing, there will be a common subset of traffic matched by both canary releases. So if the traffic falls into that common subset, such as traffic from Android devices in Beijing, how should we route it? In fact, this ends up being a mathematical abstraction with two set problems. First, set matching: the user traffic is a set, and a canary release's traffic rule is a set. Second, the multi-matching problem: multiple canary traffic rules are matched at the same time, so which one should be selected?

Let's use an example to demonstrate all the problems mentioned before. Suppose there is the back-end service stack of a food delivery application. The system consists of three microservices: the order service, the restaurant service and the delivery service. The order service has no canary. The restaurant service has two canaries: the first is for Android traffic from Beijing and the second is for all Android traffic. When the system receives traffic with the Beijing user tag, it matches the routing rule of the delivery service's Beijing canary and the traffic follows the green path. On the other hand, traffic with the Android user tag matches the routing rules of the Android canary for both the restaurant and the delivery service, so that traffic follows the blue path. But what happens for traffic that carries both the Android and Beijing tags? It matches all three canaries and there is no unambiguous way to route the traffic. This is how it looks in terms of sets: mathematically, a canary rule is matched if the canary's traffic-rule set is a subset of the user traffic set. So how should we handle the multiple matching problem?

A simple and easy way to solve this problem is to specify the priority of each canary. Each traffic rule has a number indicating its priority, from small to large. As you can see from the figure, traffic tagged Beijing and Android matches three canaries, but because the restaurant Beijing-and-Android canary has the highest priority, priority one, the red canary is selected. Even though priority solves the problem of multiple matching, there is still a problem of misused configuration, the traffic shadow problem. In this example, the red rule is shadowed by the blue rule: because blue has the higher priority, no traffic is ever routed to the Beijing-and-Android canary.
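To make the priority mechanism concrete, the three canaries in this food-delivery example could be described by rules roughly like the sketch below. This is only an illustration: the field names mirror what the demo walks through later (priority, matched service, instance labels, header rules), not the exact EaseMesh schema, and the X-Location and X-Phone-Os headers are simply the business tags used in the demo.

```yaml
# Illustrative sketch only: three canary rules ordered by priority
# (smaller number = higher priority). Field names are assumptions,
# not the exact EaseMesh resource schema.
- name: restaurant-beijing-android        # red canary: most specific rule
  priority: 1
  services: [restaurant]
  instanceLabels:
    release: restaurant-mesh-beijing-android
  headers:
    X-Location: Beijing
    X-Phone-Os: Android
- name: android                           # blue canary: spans restaurant and delivery
  priority: 2
  services: [restaurant, delivery]
  instanceLabels:
    release: mesh-android
  headers:
    X-Phone-Os: Android
- name: delivery-beijing                  # green canary: delivery only
  priority: 3
  services: [delivery]
  instanceLabels:
    release: delivery-mesh-beijing
  headers:
    X-Location: Beijing
```

With these priorities, Beijing-and-Android traffic stops at the priority-1 rule. If the blue rule were given priority 1 instead, the red rule would be shadowed, which is exactly the misconfiguration shown later in the demo.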
Finally, let's explain the technical details of EaseMesh and give an overview of how we implement multiple canary releases. First of all, all our services run in Kubernetes pods. The three services here correspond to three services in EaseMesh, and even different versions of the same service belong to one mesh service, so a mesh service can have multiple versions running at the same time. In order to route the traffic to the correct canaries, two things need to be accomplished.

The first is to pass the user tags through, without losing any user information along the service chain. This involves both the sidecar and the business application. The sidecar naturally knows all the canary traffic rules and user tags, such as specific HTTP headers, and it forwards them with the traffic. The business application itself also needs to pass the user tags through, which can be done by our officially supported Java agent in cooperation with the sidecar, without requiring any user awareness: the sidecar notifies the Java agent to pass through all the information. As for other languages such as Golang, since there is no bytecode-injection technology, a simple SDK is enough to forward the user tags. So EaseMesh supports multiple languages for this advanced feature too, as long as the user tags are available.

The second requirement for EaseMesh canary releases is traffic routing. All components, including the ingress controller of EaseMesh and the sidecar in each service pod, have the ability to route canary traffic to the corresponding canary version of the next service. You can see that all the service components in this figure, whether receiving or sending requests, pass through the sidecar. When sending requests outbound, the sidecar observes the traffic characteristics and decides whether the traffic needs to be dispatched to one of the canary versions of the next service. This is all done by the sidecar, without involving the agent or the SDK.

Now let's look at the YAML. For example, here is the YAML of the restaurant service. Inside it is a Deployment, and through an annotation from EaseMesh the operator knows which mesh service this Deployment belongs to, along with some other specific configuration. This demo is open source; if you want to see the details, you can check our source repository.

OK, let's take a look at the pods; they should be up by now. Let's test the primary flow first with the parameters we gave, for example the order ID and the food I ordered, a bread. Let me send the request. It is a bit slow at first because the restaurant service sits in the middle, but now the response is back; the details are almost the same as above, plus how long it takes to deliver the order. The request went through all three services, so the primary flow has passed.

Let's set up the delivery service's Beijing canary first. We deploy the pod and can see it running, and then we configure the traffic for it. Through EaseMesh we can add a canary branch for the same order flow. In the details you can see that I set the priority to 3, as mentioned before. The service it matches is delivery. The instance label is release: delivery-mesh-beijing; this label is what distinguishes the delivery Beijing instances from, say, the delivery blue instances. Next come the traffic rules: for the HTTP protocol, if the request carries an X-Location header whose value exactly matches Beijing, the traffic is matched. We then apply this traffic rule and send Beijing traffic to see whether the delivery canary's new feature shows up. You can see that, compared with the previous delivery response, there is now a road duration: the delivery Beijing canary returns the time spent on the road. This means the Beijing canary is working.
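On the Kubernetes side, the Beijing canary we just deployed is simply another Deployment of the same mesh service, distinguished by its instance label and registered with the mesh through an annotation like the one mentioned at the start of the demo. A rough sketch follows; the annotation key, label keys and image are assumptions for illustration, and the real manifests are in the open-source demo repository.

```yaml
# Rough sketch: the Beijing canary is another Deployment of the same mesh
# service, distinguished only by its instance label. Annotation key, label
# keys and image are illustrative assumptions; see the demo repo for the
# real manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: delivery-mesh-beijing
  annotations:
    mesh.megaease.com/service-name: delivery-mesh      # assumed key: mesh service this Deployment belongs to
spec:
  replicas: 1
  selector:
    matchLabels:
      app: delivery-mesh
      release: delivery-mesh-beijing
  template:
    metadata:
      labels:
        app: delivery-mesh
        release: delivery-mesh-beijing                  # instance label the canary rule selects on
    spec:
      containers:
        - name: delivery-mesh
          image: megaease/food-delivery-delivery:beijing   # placeholder image tag for the canary build
          ports:
            - containerPort: 8080
```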
The third step is to deploy the blue Android part, which spans two services. The procedure is exactly the same, and the Android instances are already running. This time the canary needs to match two services: the priority is 2, and it matches both the restaurant and the delivery service. The instance label is again a release label for this Android canary. The new feature it ships is a coupon included in the response. Its traffic rule is: if the request carries an X-Phone-Os header with the value Android, the traffic goes to the blue canary. We apply it, and the application is successful. Now let's send traffic with the Android header and see the result. You can see there is more in the response than before: the coupon field comes back with 5 yuan, which is the new feature of the two blue services. So the blue canary also works.

The fourth step is to deploy the red one. It is the same procedure: the restaurant service gets a canary for Android traffic from Beijing, and it should also be up by now. Just like before, we apply the routing rule through the EaseMesh command line: the priority is 1, the matched service is restaurant, and the release is restaurant-mesh-beijing-android; this is the red canary. The application is successful. Now let's send traffic tagged both Beijing and Android and see the result. As you can see, the previous responses carried the road duration, which is the green canary; this time there is a cook duration, the time the restaurant needs to prepare the food. So the red canary returns the cook duration, and exactly as in the diagram, the traffic goes to the red canary. At this point the demo has covered all of these paths.

As we said before, there may be a traffic shadow problem. For example, let's swap the priorities of the blue and red canaries: we change the restaurant Beijing-and-Android canary to priority 2 and the blue all-Android canary to priority 1. The priorities of these two canaries have now been changed. Let's look at all the traffic again. The normal traffic that does not match any canary is unaffected; it behaves the same as before. Beijing traffic is still green and not affected: it still returns the road duration. Android traffic also has not changed from the previous diagram: even though the priorities were swapped, it still goes to the blue part and the result is fine. The most important change is that after we swapped the priorities, the Beijing-and-Android traffic, which previously returned the cook duration from the restaurant canary, now returns the 5-yuan coupon just like the plain Android request. It goes to the blue canary instead, and the route to the red canary is blocked. Looking at the result, you can see it returns the coupon, so the red canary is shadowed by the Android canary. This is something we have to pay attention to in practice: a rule with more conditions is more specific, but if it is given a lower priority it gets blocked. That's about it; that is the whole demo.
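To recap the misconfiguration we just demonstrated, in terms of the earlier rule sketch the priority swap looks roughly like this (illustrative only, not the exact EaseMesh schema):

```yaml
# Illustrative sketch of the swapped priorities that cause the shadow problem.
- name: android                          # blue canary, now priority 1
  priority: 1
  headers:
    X-Phone-Os: Android
- name: restaurant-beijing-android       # red canary, now priority 2
  priority: 2
  headers:
    X-Location: Beijing
    X-Phone-Os: Android
# Beijing-and-Android traffic matches both rules, but the priority-1 blue
# rule wins, so no traffic ever reaches the red canary: it is shadowed.
```

Keeping the smaller-scope rule at the higher priority, as the best practices below recommend, avoids exactly this shadowing.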
OK, let's now summarize the design principles of the platform and the best practices for its operation. The design principles are the following: 1. One canary service version can belong to at most one canary release. 2. One request can be scheduled to at most one canary release. 3. A canary release must be explicitly selected by the incoming traffic. 4. Normal traffic that does not match any canary rule goes through the primary deployments. Here are also a few best practices: 1. Tag the traffic with user-side business information; a client IP address, for example, is not a good choice. 2. When tagged traffic overlaps, use explicit priorities to guide the traffic router. 3. The smaller-scope canary rule should have the higher priority. This is all I wanted to show you today about multiple canary releases.

Let's now move on to stress testing in production. This is the full-stack stress test part, and the topic is how to do stress testing in a production environment. Today's production environments have become very complex, just like the picture on the right. There are many components, ranging from dozens or hundreds to thousands, and these components are developed by different teams and in different languages, which makes the communication between them very complicated; no one can tell the relationships between all of them. This technical complexity makes debugging difficult. In addition, the business side has also changed a lot. For example, during a Black Friday promotion the traffic pressure on an online shopping system is dozens or even hundreds of times higher than usual. In order to know in advance whether our system can withstand such a high traffic load, we need to perform a full-stack stress test to get real performance figures, but due to the complexity mentioned above it is very challenging to perform full-stack stress testing on today's systems.

Now let's look at the problems of traditional stress test methods. The first is to build a test environment identical to the production environment for stress testing. In the era of stand-alone applications this was a very good solution, but in the age of the internet there are at least two problems. The first is money: we can count how many servers there are in our production environment and how much we would need to spend to buy them, and that is just the cost of servers; the cost gets higher once other hardware is counted. Most companies cannot afford such a test environment. Even if duplicating the cloud resources for the test environment is not an issue, is it enough to get reliable results? I think the answer is still no, because it is difficult for a test environment to stay exactly the same as the production environment. There are several reasons. First, because it is a test environment, people keep deploying test versions to it and forget to restore it after testing; over time the test environment drifts further and further away from the production environment. Second, many development teams share this test environment, and without an excellent coordination mechanism the tests conducted by different teams will affect each other's results. But the real trouble is the data, that is, how to ensure the data used in the test is completely consistent with the production environment. For example, in a Twitter-like system, users like me generally have only a few dozen or a few hundred followers, so it is fairly easy to notify all my followers within a second when I post a message, but for a celebrity with millions of followers the situation is very different. Therefore we cannot simply use simulated data for testing. The second point is the proportion of different users: users like me may account for 90%, and celebrities may be only one in hundreds of thousands. Only by simulating the proportions of users with different numbers of followers can we get a reliable test result.
The easiest way is to take the production data to the test system for testing, but that brings the problem of data security. Production data generally contains a lot of sensitive information, and the risk of data leakage increases dramatically if it is brought into the test environment. So people turn their eyes to the production environment and try to use its low-traffic periods for testing. But this is also a huge challenge, because it is an intrusive solution that involves modifying or even redefining business logic. Let's take an example. Assume it is an online shopping system that includes a user module and an order module; to test it, we need to modify these modules. First we need to add the test logic, and then we need to add the logic that detects whether we are in a test or not. This looks very simple, just adding some if-else, but it is much more complicated in practice.

First of all, what exactly does "test" mean, and for what kind of request can we consider it a test request? For the user module we might be able to do this by adding a special prefix to the IDs of test users, or by specifying a range of user IDs in advance; that should do the trick. Then the request comes to the order module. We may still want to use the user ID to determine whether the test logic should be taken, but in reality, after a series of complex processing, the user ID may have been discarded, so the order module cannot see it at all. Then how do we write the judgment logic? The second question is how our test logic differs from the production logic. It is easy to think of accessing different data sets or simulating a third-party service such as payment, because we don't want to actually spend money during testing, but what is really complicated is preparing data for the subsequent components. This relates back to the first problem: because the order module cannot see the user ID, the user module needs to mark the requests it sends to the order module so that the order module knows they are test requests. However, in a complex system it is not easy for the user module to know all the modules that the subsequent process will go through, so we have to spend a lot of effort to ensure the test state is correctly transmitted between modules without disturbing the production logic. Please notice this is just the work required for one function point, and a normal system has thousands of function points. So the big question is how much effort it takes to make all of these modifications, and the bigger question is who can guarantee that all the changes are free of omissions and errors; otherwise the production data will be corrupted.

How do we solve these problems? We believe the key lies in isolation, which is to isolate the production system and the test system along the four dimensions of business, data, traffic and resources, to prevent them from affecting each other. Business isolation means that we should not add conditional judgments to decide whether to use production logic or test logic, but distinguish them clearly from the beginning. Data isolation means the same copy of data cannot be accessed by both the production system and the test system. Traffic isolation means that normal requests and test requests can only enter their corresponding systems. Resource isolation mainly refers to hardware; for example, the test system and the production system cannot be deployed on the same server, so that they do not compete for hardware resources such as CPU and memory.
This is mainly a hardware issue, and Kubernetes already gives a very good solution at the software level. Let's look at the solutions provided by EaseMesh. First, because EaseMesh is implemented on top of Kubernetes, it achieves resource isolation with the help of Kubernetes. For business isolation, EaseMesh can replicate existing services, identical except for an added shadow mark; the replicated copy is exactly the same as the original. EaseMesh can also replace the connection information of various middlewares, including MySQL, Kafka, Redis and so on, according to the configuration, and thus change the target of data requests, thereby realizing data isolation. When creating a service copy, EaseMesh additionally creates a canary rule automatically, which forwards requests carrying the x-mesh-shadow header to the replicated service copy as test requests and forwards other requests to the original service, achieving traffic isolation. The above three isolations are implemented by the shadow service feature of EaseMesh. Note that canary is also a feature of EaseMesh; the canary in the figure only means that shadow service automatically deploys a canary rule in addition to the shadow copy. We also need one more feature of EaseMesh to make a full-stack stress test possible: mock. Because we cannot replicate some third-party services for testing, such as the payment service mentioned above, we need to mock them.

Now let's take a look at what will be demonstrated today. This is a scenario where a user uses a coupon. There are three services in it: the first is the coupon service, the second is the user service, and the third is the verification code service, which sends a verification code to the user's mobile phone. The coupon service and the user service have their own database middlewares. The entire system is deployed in Kubernetes, and you can see that our traffic entry is the mesh ingress, and there is a Java agent and a sidecar alongside each service in the system, which means these services are under the management of EaseMesh. The Java agent mainly hijacks the various requests sent by the application, including both HTTP requests and requests to middlewares. The sidecar is implemented based on Easegress; it mainly handles traffic processing, as well as things like service discovery, monitoring and tracing. It is this management by EaseMesh that makes it possible to hijack the various requests sent by applications and achieve the aforementioned business isolation, data isolation and traffic isolation for stress testing. In this system, when a user request comes in, it first goes to the mesh ingress and then to the coupon service; the coupon service sends a request to the user service to verify the user's identity, and then, if that passes, it asks the verification code service to send a verification code to the user.

So let's look at the steps we need to take for a stress test. As the first step, we need to replicate the two database middlewares. We can simply back up the databases and then restore them, and we do not need to do any desensitization of the data, because all the data stays in the same security domain as the original system; simply backing up and restoring does not increase security risks. After the middlewares are replicated, the second step is to replicate the services through the shadow service feature, which also automatically deploys a canary rule. As we can see, the coupon service and the user service have now been replicated.
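As mentioned above, replicating a service through shadow service also generates a routing rule for traffic isolation automatically. Conceptually it looks roughly like the sketch below; this is purely illustrative, since the rule is generated by EaseMesh and only the x-mesh-shadow header convention and the shadow mark come from the talk, while the field names are assumptions.

```yaml
# Illustrative sketch of the automatically generated shadow-traffic rule.
# Only the x-mesh-shadow header convention is from the talk; the field
# names here are assumptions.
- name: coupon-shadow-traffic
  service: coupon-service
  instanceLabels:
    mesh-shadow: "true"          # assumed shadow mark on the replicated copy
  headers:
    X-Mesh-Shadow: "true"        # requests with this header go to the shadow copy
# Requests without the header keep going to the original coupon-service instances.
```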
During the replication we have also rewritten their connections to the middlewares through the sidecar and Java agent, allowing them to access the replicated middlewares instead of the production ones. This rewriting can be done through the configuration of the shadow service or through a Kubernetes ConfigMap. For the test traffic, we add an x-mesh-shadow header to it; any request with this header goes to the replicated services according to the canary rule we just deployed, following the orange lines, while normal user requests still go to the production services, following the blue lines.

Now we have the coupon service and the user service replicated, but not the verification code service, because it eventually calls a third-party service to send the verification code to the user's mobile phone. Although the cost of each verification code message is not very high, it becomes a big cost if we send a lot of requests in the test, so we would rather not send any verification codes at all. This requires the mock feature we mentioned just now: we mock the verification code service instead of replicating it directly. Generally speaking, we need to mock services like payment; their implementations are complex, involving various verifications and encryptions, which makes them difficult to mock directly. Therefore we usually have a service inside our own system that wraps these third-party services; because the wrapper services are inside our system, their interfaces can be much simpler and skip a lot of security verifications. So what we actually mock during testing are the wrapper services, not the real third-party services.

Now let's start the demonstration. I prepared two scripts for today, one on the left and one on the right side of my screen, with a shadow suffix after the filename on the right side. Now I will run these two scripts; we can see that the output on both sides is exactly the same. In a while I will also show the topology generated by our MegaEase Cloud system; from the graph we will see that the processing of the two requests is exactly the same, but MegaEase Cloud needs a little time to sync data, so let's look at the content of the two scripts first. The two scripts are exactly the same, except that the one on the right carries the x-mesh-shadow header when sending each request. Both scripts execute a get-token call at the beginning, because the demo system requires the user to log in first; after getting the token, they start sending the get-coupon requests. We will also check the pods in Kubernetes: we can see four services from the pod information, and we will focus on three of them, the coupon, user and verification code services. Let's also execute emctl, the EaseMesh control command, to look at the shadow services in the system; no resource is returned, which means we have not deployed any shadow service yet. Now the data synchronization of MegaEase Cloud should be complete, so let me refresh the page. As you can see from this picture, although the requests with and without the shadow header have both been sent just now, we only see one execution path: the coupon service calls the user service and the verification code service, and the coupon and user services also access the two middlewares, MySQL and Redis.
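Before deploying the shadow services, here is roughly the shape of the definitions that are about to be applied, based on what the walkthrough describes: shadow copies of the coupon and user services with rewritten MySQL and Redis connections. The kind, field names and connection values are illustrative assumptions; the real YAML lives in the open-source demo repository.

```yaml
# Illustrative sketch of the two shadow service definitions.
# Kind, field names and connection values are assumptions, not the exact
# EaseMesh schema; see the demo repository for the real file.
apiVersion: mesh.megaease.com/v1alpha1
kind: ShadowService
metadata:
  name: coupon-shadow-service
spec:
  serviceName: coupon-service                          # production service to replicate
  mysql:
    uris: "jdbc:mysql://mysql-shadow:3306/coupon"      # assumed shadow MySQL address
  redis:
    uris: "redis-shadow:6379"                          # assumed shadow Redis address
---
apiVersion: mesh.megaease.com/v1alpha1
kind: ShadowService
metadata:
  name: user-shadow-service
spec:
  serviceName: user-service
  mysql:
    uris: "jdbc:mysql://mysql-shadow:3306/user"
  redis:
    uris: "redis-shadow:6379"
```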
Now I will deploy the shadow services. Please note that in the slides we said replicating the middleware is the first step, but for this demonstration I prepared the middleware replicas in advance, and in order to show the difference I revised the replicated data. In practice, we can just replicate the production data directly without any modification. Now let's create the shadow services by running the emctl apply command. We can see it says that both the coupon shadow service and the user shadow service have been created successfully. Running the kubectl command again, there are two more pods in the system, namely coupon-shadow and user-shadow, and if we run the emctl get shadowservice command again, we can also see two more shadow services in the system. However, although both pods are already in the running state, it still takes a little time for the applications to start, about one to two minutes. So let's use this time to look at the content of the YAML file we just used to create the shadow services. As we can see, there are two shadow services: the first is named coupon shadow service and the second is user shadow service, which are shadow copies of the coupon service and the user service respectively. As mentioned before, the shadow service supports rewriting the middleware configuration directly, and we can see this in the YAML: in the spec of each shadow service, we have rewritten the connection information for MySQL and Redis. In this way we replace the middleware access of these two shadow services.

It should be ready now, so let's execute the command and check the result. Since it is a Java application, the first execution takes a few extra seconds. Okay, the result is out; for a better comparison, I will clear the screen and run the commands again. As you can see, the difference is that the coupon name field has changed from Chinese to English. This is the result of modifying the database connection: the data in the database is different, indicating that they are accessing different databases. Let's take a look at the topology of the system and refresh the page. We can see some gray nodes in the system, which are the replicas of the original services and middlewares, including the coupon service, the user service, MySQL and Redis, and the middlewares accessed by the two replicated services are also the replicated ones. The only problem now is that both coupon services, the original and the replica, access the same verification code service, because we haven't mocked the verification code service yet. Let's do it now with the emctl apply command again; this mocks the verification code service. Now let's execute the command with the shadow header again: you can see that the verification code becomes ABCD, while when executing the command without the shadow header, the verification code is still 123456. Let's take a look at the content of the YAML file: the request path is matched first, and then requests with the x-mesh-shadow header are matched; after a complete match, it directly returns HTTP status code 200 and the verification code ABCD.
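Based on that description, the mock definition looks roughly like the sketch below. The kind and field names are assumptions made for illustration; only the path-plus-header matching and the fixed 200/ABCD response come from the talk, and the path itself is a placeholder.

```yaml
# Illustrative sketch of the mock rule for the verification code service.
# Kind, field names and the path are assumptions; the matching logic and the
# fixed response are what the demo describes.
apiVersion: mesh.megaease.com/v1alpha1
kind: Mock
metadata:
  name: verification-code-mock
spec:
  serviceName: verification-code-service
  rules:
    - match:
        path: /verification-code           # placeholder request path
        headers:
          X-Mesh-Shadow: "true"            # only shadow (test) traffic is mocked
      response:
        code: 200
        body: '{"verificationCode": "ABCD"}'
```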
Well, now that all our preparations are complete, let's actually conduct the stress test. Because it is a demo environment, don't expect particularly high performance. Let's change the test script and replace the last get-coupon command with an ab command, using 10 concurrent connections and 2000 requests, to see what the performance of this demo system looks like. It is a little slow; maybe I should have sent fewer requests. Finally, we get the result: 125 requests per second. This is a system that needs to be optimized for performance. Now let's check the execution path through the topology graph. Since the topology graph aggregates data based on time, I need to adjust the time range a bit to use only the data after we applied the mock. As we can see now, the line from the replicated coupon service to the verification code service is gone, indicating that there is no call between them any more. That's all for our demo today; back to the slides.

What advantages does our shadow service have over traditional testing methods? I think there are five points. First, zero code changes: everything is done through configuration, so no code modification is required and no new bugs are introduced. Second, low cost: with cloud servers, the hardware resources used for testing can be requested before the test and released afterwards, so we only pay for the actual usage period. Third, a clean environment: except for the few services that are mocked, the test system is completely consistent with the production system, which avoids errors caused by differences in business logic to the greatest extent. Fourth, true data: the data of the test system and the production system are completely consistent, which ensures the reliability of the test results. Fifth, security: although production data is used in the test, the test system and the production system are in the same security domain, so there is no increased risk of data leakage. That's all for today's sharing. Welcome to follow our open source project on GitHub, and welcome to join our open source community. Thanks.