Hello, everyone. Thanks for joining. My name is Animesh Singh, and this is my colleague Tommy Lee. Together, we work for the IBM Cloud and Containers Developer Technology Group. Today, we are here to talk about how you can build resilient and fault-tolerant microservices leveraging Istio. Now, before I go into some of the details, how many people are actually using the service mesh Istio here? Very few. I didn't see any hands raised, in fact. So I think it will make sense to go back and talk a bit about the motivation for the project and what led to it. Essentially, in 2014, the Reactive Manifesto came out. More than 20,000 people have signed it. If you haven't read it, I would highly encourage you to go and read it. Very nice read. But the key theme there was that today's application demands simply cannot be met by yesterday's application architectures. Essentially, what we want today are systems which are responsive, resilient, elastic, and message-driven. Now, if we focus on the resiliency piece of this, what it says is that such systems are significantly more tolerant of failure, and when failure does occur, they meet it with elegance rather than disaster. That's essentially one of the defining characteristics of a cloud-native application. Around the same time frame, the microservices revolution was happening. Essentially, as we are all aware, you take a monolithic application and break it down into single functions, typically organized around business capabilities, and you have small teams developing, maintaining, and deploying those microservices. Now, typically, microservices are encapsulated inside containers; there is a one-to-one relationship most of the time. Everyone's container journey starts with one container, and it's easy in the beginning. But soon the growth becomes overwhelming, and we need container and microservices management.
Enter the container orchestrator. Now, if you look at a survey which was done last year where we asked respondents what they expect most from a container orchestrator, scheduling, cluster management, and service discovery were the top three functionalities identified. If you look at the container stack and all its layers, popular container orchestrators like Kubernetes, Swarm, and Mesos sit at layer five of this stack. We are all aware of the Kubernetes architecture: any interaction we do with Kubernetes through the UI, CLI, or API goes to the Kubernetes master, and then we have worker nodes, which are responsible for running your workloads. They run the kubelet, kube-proxy, and the Docker daemon, and they serve your workloads on top of those machines. Within IBM, we have our own IBM Cloud Container Service, which is based on Kubernetes. It's a managed service, so we manage the Kubernetes master. It's single tenant. The worker nodes are managed by the customer. We do provide triggers for OS updates and patches for the hypervisor, Docker daemon, et cetera, and customers can scale up and down based on their need. We use the logging and monitoring data from the worker nodes to provide the management, but we don't have any access to the persistent data residing on the customer's cluster. We also created a lot of developer patterns to show how Kubernetes and microservices work great together. They are essentially on-ramps to technology, and they help developers be productive with cloud, data, and AI. You can find a lot of our developer patterns at developer.ibm.com/code/patterns. Anything you need to get started with the technology, be it the architecture diagram, step-by-step instructions, or the actual code in the GitHub repo, it's all there. So here are some of the microservices patterns we created for all the Java and Spring Boot fans out here.
How can you actually run Spring Boot microservices on top of a Kubernetes cluster? In addition, in this particular pattern, we also show how you can use functions with the very popular open-source OpenWhisk, which is part of the Apache Foundation. There is another Java EE framework which is becoming quite popular: MicroProfile. It's being backed by Red Hat, IBM, Fujitsu, and others, and we'll talk in more detail about this particular MicroProfile microservices framework. There is a developer pattern to actually use that. And if you're in polyglot land, where you have microservices written in multiple languages, there is a pattern to get you started there as well. So what we see is that Kubernetes is great for microservices. Do we need anything else? That's a question a lot of people were asking, because Kubernetes and microservices already handle a lot of your use cases. Now, what we saw on that slide were the top three functionalities which a container orchestrator gives you: scheduling, cluster management, and discovery. But is that enough? We need to build fully reactive and resilient microservices, and for that, we need mechanisms to avoid faults. That means when you are doing canary testing, A/B testing, or rolling out new versions, you need to be able to selectively route traffic. You need fault isolation, so that you have ways to create circuit breakers, enforce bulkheads, et cetera. You need to detect faults when they do happen, and for that, you need great metrics capability. And then if failure does happen, you need to be able to recover from it in a graceful manner. In short, what we need is strong visibility, fault tolerance, traffic control, and a way to enforce security and policies. Enter the service mesh. So what essentially is a service mesh? A way to think about it is as a network of devices: you have a network in your data center which is connecting two machines.
In this case, instead of connecting two machines, you're connecting microservices together, and all the tasks which the routers and switches would typically handle are offloaded to the service mesh. Now how do you actually build a service mesh? Enter sidecars. Sidecars are gatekeepers which sit within your pod alongside your microservice, and they are responsible for intercepting all incoming and outgoing traffic. Because they are able to do that, they can provide you rich routing and load balancing, plus they collect a lot of data which they can pass to the metrics systems. Istio essentially is an implementation of the service mesh, launched earlier in the year when IBM, Google, and Lyft came together. If you look at the architecture of Istio, there are essentially three key components. Pilot is responsible for configuring Istio deployments and ensuring that all the configurations are propagated to all the components in the system; all the routing and resiliency rules which we create go into Istio Pilot. Mixer is responsible for policy decisions like ACLs, rate limiting, authorization, et cetera. Mixer is also responsible for the dashboards which give you the great metrics capability. And then finally there is the proxy, which is essentially the sidecar at the heart of the service mesh architecture. It is based on Envoy, it mediates all the inbound and outbound traffic, and it's responsible for enforcing all the policy, routing, and load balancing decisions. When your application does get deployed, this is how it looks: you have the Istio control plane with all these components, and then the Istio data plane, which is essentially hosting your application. Digging a bit more into the architecture, as I said, all the traffic entering and leaving is being intercepted by Envoy.
Envoy essentially is a layer seven proxy which was developed by Lyft, a very high-performance proxy, able to handle up to five million requests per second. That's essentially the heart of this. You then have the ingress proxy, which is the gateway to your application. You can use Envoy for that as well; a lot of people use their Kubernetes ingress controllers, and NGINX also announced they have added support, so you can use NGINX instead of Envoy. And the different protocols, gRPC, HTTP/1.1, HTTP/2, are all supported by Istio currently. So how do we make microservices resilient with Istio? Let's look at the capabilities which Istio is providing: traffic control and visibility. For example, when you are rolling out a new version of your microservice, you don't want to redirect all the traffic to it; you want to selectively control it. In this case, a rule like "just send 1% of the traffic to this new version" will ensure that if there is some fault in your new version, you can roll it back in a timely manner. You can also steer the traffic based on content: based on whether the user is using an iPhone or an Android or a particular browser, or whether the request is coming from a particular geographic location, you can steer it to a particular version of your microservice. Then I talked about visibility, which is essentially key: how do you actually look at what's going on inside your service mesh? Istio comes by default with Grafana, with a Prometheus backend, as well as Zipkin, which gives you consistent metrics across the whole fleet of microservices you have deployed. Essentially, since Envoy is intercepting every incoming and outgoing request, it is also able to collect that data and transfer it to Mixer, which is based on a plugin-based architecture, with Prometheus and Grafana.
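The "send 1% of the traffic to the new version" idea can be written down as a routing rule. Below is a sketch in the v1alpha2 `RouteRule` syntax that Istio used around the time of this talk (newer Istio releases express the same split with a `VirtualService`); the service and version names are illustrative:

```yaml
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: reviews-canary
spec:
  destination:
    name: reviews        # the target microservice
  precedence: 1
  route:
  - labels:
      version: v1        # 99% of traffic stays on the stable version
    weight: 99
  - labels:
      version: v2        # 1% canary traffic goes to the new version
    weight: 1
```

You would apply a rule like this with `istioctl create -f reviews-canary.yaml`, and rolling back is just a matter of deleting the rule again.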
There are actually plugins for this, and you can bring your own dashboard into the mix. And there is definitely Zipkin, which gives you the request tracing capability, which is very important. You are able to look at the source of a request, look into the request headers, and see which services are lagging and which are slow. All that information you can get from the Zipkin dashboard. So around the same time Istio was launched, we launched a developer pattern on how you can manage microservices traffic using Istio on Kubernetes. We took the sample Bookinfo application, but we also added a relational database component to it. When the sample application came out, it was a static application; everything was being written to the local file system. So what this allowed us to do was, A, make the application dynamic, and B, show how Istio handles egress traffic where a certain protocol is not supported by Istio, for example, in this case, using JDBC to connect to a relational database. This is essentially what's happening in that particular pattern: we are showing how you can selectively route traffic to different versions of a microservice. Now, you have probably seen Bookinfo many times, but if you haven't: Bookinfo gives you a product page, which shows the details of a book. Then there is the reviews microservice, which has three versions and essentially serves reviews of the book. And finally you have ratings, which is how many stars the book is getting. In this case, we are showing that we are splitting the traffic 50-50 across versions two and three, and nothing is going to version one. Similarly, if a particular user is coming in, we can select where to redirect them. The other thing we can do is limit the access to a particular destination microservice.
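The "redirect a particular user" case is content-based routing. In the same v1alpha2 syntax from the Istio 0.x era, a sketch of such a rule might match on a session cookie like this (the rule name, user name, and regex are purely illustrative):

```yaml
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: reviews-test-user
spec:
  destination:
    name: reviews
  precedence: 2            # higher precedence wins over the default weighted rule
  match:
    request:
      headers:
        cookie:
          regex: "^(.*?;)?(user=jason)(;.*)?$"   # only this logged-in user
  route:
  - labels:
      version: v2          # send that user to version two
```

The same `match` block can key off other headers such as `user-agent`, which is how the iPhone-versus-Android steering mentioned earlier would be expressed.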
So in this case, we may want to say that reviews version three cannot talk to the ratings microservice, and you can actually enforce decisions like that. So now let's look into fault tolerance. And if you are still questioning why we need fault tolerance: things will go wrong. How many of you were out on the rainy street last night? I was one of those people, and I was shivering there, right? You can't predict things; things will go wrong. Before going further, let me talk about some of the definitions we will use; I want to make sure everybody's aware of these terms, which are used for the resiliency features. First, circuit breaker, and I'm going to read this verbatim because it's a great definition. The basic idea behind a circuit breaker is very simple: you wrap your function call inside a circuit breaker object, and once failures reach a certain threshold, the circuit breaker trips, so that any further call to the circuit breaker returns with an error without overloading your protected method. That's the idea behind the circuit breaker. Then we have the bulkhead, a pattern which is talked about a lot in the microservices context. In the industrial world, bulkheads were used in ships and aircraft to partition them into sections, so if a hole does open in the hull of the ship, the water will only fill one compartment and the ship won't sink. That's the idea which is carried forward into the microservices patterns. Now let's test the resiliency of the sample application. Within IBM, we have developed a tool, which IBM Research is building and which is in a very early stage, that gives you a great visualization of whether your application is resilient or not. It gives you a control panel to inject faults.
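To make that definition concrete, here is a minimal plain-Java sketch of the circuit breaker idea. This is my own illustration, not Istio's or MicroProfile's actual implementation, and it leaves out the half-open/recovery state a production breaker would have:

```java
import java.util.function.Supplier;

// Minimal circuit breaker: wrap calls, count failures, and trip open
// once failures reach a threshold so later calls fail fast.
public class CircuitBreaker {
    private final int threshold;
    private int failures = 0;
    private boolean open = false;

    public CircuitBreaker(int threshold) {
        this.threshold = threshold;
    }

    public <T> T call(Supplier<T> protectedCall) {
        if (open) {
            // Tripped: fail fast without touching the protected method.
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = protectedCall.get();
            failures = 0;              // a success resets the failure count
            return result;
        } catch (RuntimeException e) {
            if (++failures >= threshold) {
                open = true;           // threshold reached: trip the breaker
            }
            throw e;
        }
    }

    public boolean isOpen() { return open; }

    public static void main(String[] args) {
        CircuitBreaker breaker = new CircuitBreaker(3);
        for (int i = 1; i <= 4; i++) {
            try {
                breaker.call(() -> { throw new RuntimeException("backend down"); });
            } catch (RuntimeException e) {
                System.out.println("call " + i + ": " + e.getMessage());
            }
        }
        // After 3 failures the breaker is open; call 4 failed fast.
        System.out.println("breaker open? " + breaker.isOpen());
    }
}
```

A real breaker would also move to a half-open state after a delay and probe the backend with a trial request before closing again; that is what lets the system recover gracefully once the failure clears.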
You can inject faults into your sample application and then visualize the traffic flow using the view panel. In the first case, what we do is abort the ratings microservice, which is the last microservice in the chain, so the system behaves as if the ratings microservice is not available. What we see is that the reviews microservice is written correctly to come back and just show the reviews when ratings is not available. But if you add some delay, so the ratings microservice is delayed and timeouts are happening, the application breaks: the reviews microservice is not able to handle it and shows that it cannot even process and show you the reviews, which is the wrong behavior. This tool allows you to test fault tolerance; it's pulling the data from Prometheus and Zipkin to give you those visualizations, and you can click into the details and see the response times, where it is lagging, et cetera. With that, Tommy will give a very quick demo of the sample Bookinfo application and this tool. As I said, the tool is in a very early stage, so if it doesn't work, we do have a recording. We have a fallback. And then I'll come back and talk about how we actually handle resiliency in application frameworks themselves, and then how Istio handles it. With that, Tommy.

Hi, I'm Tommy. I'm going to show you the Bookinfo sample and how to use that tool. For the purpose of this demo, I already have Istio and Bookinfo deployed on my cluster. As you can see, this is the Bookinfo page: you have the details microservice showing the details and the reviews microservice showing the reviews. The ratings microservice helps some versions of the reviews microservice show the review stars. As you can see, there are three versions of the reviews. This is version one of the reviews. This is version three of the reviews.
This is version two of the reviews. And right now I'm going to apply an Istio traffic routing rule to route all the traffic to version two of the reviews, so I will run an istioctl command. Now all the traffic is routed to version two of the reviews microservice: as you can see, no matter how many times I refresh, you always see the reviews with the black stars. So now I'm going to show you the tool. This tool is called Istio Analytics; it's developed internally at IBM. What it does is apply fault injection through Istio Pilot and also help you visualize the traffic using the logging data from Prometheus. So let me apply some traffic to the sample Bookinfo product page. It does take some time for the traffic to flow and for the tool to pick up the rules and actually show you the visualization. What it shows right now is all the logging data for the past 30 seconds from Prometheus. You can see all the ingress traffic going to the product page, and the product page calling the details and reviews microservices to get the details and reviews, while reviews also calls ratings to get the rating stars. This is business as usual, right? So now I'm going to abort all the traffic to the ratings microservice. What Tommy is doing here is essentially aborting any request which is going to the ratings microservice. Hopefully, even when the requests aren't able to go through, we're able to see that the application is written correctly to handle the case where a particular destination microservice is not available. The tool does take some time to apply the rules, so give it a second. As you can see, for the past 30 seconds the traffic to ratings is decreasing because we aborted all the connections. And if we refresh, we can see the reviews microservice is good at handling the 500 status code from the ratings service. So it's still working fine.
It's healthy. However, what if we apply delays on the ratings microservice? Essentially, we have aborted every request to ratings and reviews still works fine; the application is written correctly to handle that. Now let's see what happens if we delay the ratings. I'm going to apply a 20-second delay to 100% of the connections to ratings. It does take some time; we are adding 20 seconds of delay, so it takes about 20 to 40 seconds for the tool to actually start showing you the results. In this case, what we will see is that the application is not written to handle that. What you should have expected is that the reviews are still shown even if the user rating stars are not available. As you can see, the red bars are the error messages from reviews back to the product page, and if you go to the product page, you can see that the page keeps loading, because we are putting delays on the ratings microservice and reviews doesn't know how to handle that. That's why it returns the 500 status code back to the product page. And that's the end of the demo; I'll give it back to Animesh.

Thanks. We just showed you how, using this tool, we can test the resiliency of our application and see what's happening, as well as the traffic flow. Now some of you may ask: aren't resiliency features already available in application libraries and frameworks themselves? For example, for anybody who is using the MicroProfile Java EE framework, some of these capabilities come out of the box. If you don't know about MicroProfile, as I mentioned earlier, it's a Java EE framework initially based on JAX-RS, CDI, and JSON-P, with IBM, Fujitsu, Red Hat, and others behind it. And there is a strong focus within MicroProfile on fault tolerance. So let's see how you handle this without Istio, from just the microservices framework itself.
So typically, within MicroProfile, you can add annotations to your Java classes or beans, and that helps you add things like timeouts, retries, and fallbacks. You just add those annotations and it will start giving you those capabilities. If you didn't have this, you would typically code it yourself: if there is an issue connecting to a service, try it five times. With this annotation, you can express directly in the code how to handle the case where something is not responding and you need to retry to reach it. Similarly, you can add an annotation for timeout, saying that there is a timeout of this many seconds. Then there is the bulkhead, where you don't want to overload your service: what you're saying is that only five requests will be allowed at a time, and the rest of the requests go into a waiting queue. From that perspective, a lot of these capabilities come from the framework itself, and that includes the circuit breaker: you can give an annotation for a circuit breaker as well. There is also another annotation which is very useful in the MicroProfile case, which is fallback. So yes, things will go wrong, but what do we do when something does go wrong? Instead of doing X, you can do Y: fallback is an annotation you can add, and in that fallback you define, okay, if this particular thing is not reachable or its response is delayed, let's do this instead. So what happens if we slide Istio into this mix? The application libraries themselves are already giving you these fault-tolerance features, so what will we do? Well, MicroProfile gives you a way to disable the whole annotation-based approach.
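The annotations just described come from the MicroProfile Fault Tolerance spec. Here is a rough sketch of what they look like on a bean; it assumes a MicroProfile runtime (Open Liberty, for example) to actually enforce them, and the class name, method, and all the numeric values are made up for illustration:

```java
import org.eclipse.microprofile.faulttolerance.Bulkhead;
import org.eclipse.microprofile.faulttolerance.CircuitBreaker;
import org.eclipse.microprofile.faulttolerance.Fallback;
import org.eclipse.microprofile.faulttolerance.Retry;
import org.eclipse.microprofile.faulttolerance.Timeout;
import javax.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class RatingsClient {

    @Retry(maxRetries = 5)            // try up to five more times on failure
    @Timeout(2000)                    // give up after 2000 ms
    @Bulkhead(5)                      // at most 5 concurrent calls; the rest wait
    @CircuitBreaker(requestVolumeThreshold = 4, failureRatio = 0.5, delay = 10000)
    @Fallback(fallbackMethod = "ratingsUnavailable")
    public String getRatings() {
        // imagine a remote call to the ratings microservice here
        throw new RuntimeException("ratings service unreachable");
    }

    // Invoked when the call still fails after retries, or the breaker is open.
    public String ratingsUnavailable() {
        return "Ratings are currently unavailable";
    }
}
```

MicroProfile also lets you switch this annotation-driven behavior off globally through configuration, which is what makes it possible to hand these duties over to Istio.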
So even if you have an annotation within your code, you can set a variable, and with that you will be able to disable everything else, and Istio will take over in this particular case. Except for fallback: fallback is something which is more native to the application, because it's about what you want to do if something goes wrong. Now, some of you might have a question: what if both are enabled? If your application is saying three retries and Istio is also saying five retries, how many retries will happen? In this case, it will be 15, because they multiply. And if your application policy is a timeout after 10 seconds while Istio is saying a timeout after 20 seconds, what will happen? The more restrictive one will be taken first. Now, the reason we still need Istio in these cases is this: yes, there are mature, sophisticated frameworks which come with these application libraries, but what if we have a polyglot application? If you have microservices written in multiple languages, rewriting all these fault-tolerance features in each of them would be very cost intensive. Istio allows you to do it in a generic manner: you can add fault tolerance to your application without any changes to the source. You can create simple things like timeouts, retries with timeout budgets, and circuit breakers; you can control the connection pool size; and you can inject faults. So for example, we can create a policy saying that only a certain maximum number of connections are allowed, and anything beyond that should be rejected. There is also load balancing pool ejection: if a particular pod is not responding, you can create a policy to eject it out of the load balancing pool, so that any subsequent requests are not sent to that particular pod.
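A connection-limit-plus-ejection policy like the one just described looked roughly like this in the v1alpha2 `DestinationPolicy` syntax from the Istio 0.x era of this talk (newer releases use a `DestinationRule` with `connectionPool` and `outlierDetection` sections instead); the numbers are illustrative:

```yaml
apiVersion: config.istio.io/v1alpha2
kind: DestinationPolicy
metadata:
  name: ratings-circuit-breaker
spec:
  destination:
    name: ratings
    labels:
      version: v1
  circuitBreaker:
    simpleCb:
      maxConnections: 100          # cap the connection pool; excess is rejected
      httpMaxPendingRequests: 1    # queue at most one pending request
      httpConsecutiveErrors: 3     # after 3 consecutive 5xx errors ...
      httpDetectionInterval: 1s    # ... seen within this scan interval ...
      sleepWindow: 30s             # ... eject the pod for at least 30 seconds
      httpMaxEjectionPercent: 100  # allow ejecting every unhealthy pod
```

Because Envoy enforces this at the proxy level, every microservice behind the mesh gets the same protection regardless of the language it is written in.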
And then definitely you can add timeouts and retries with this, for HTTP and gRPC, and of course you can inject faults. In fact, the Istio Analytics tool which we showed from IBM Research is actually using the fault injection feature of Istio to inject faults and show you what's happening in its visualization. We also created a pattern which goes into the details of this: how do you actually do that? You can get that pattern at the link shown, and Tommy is going to come up and walk through it. It allows you to go step by step and shows how you can use load balancing pool ejection and all these features of Istio. It uses a MicroProfile sample application; this is not the sample Bookinfo application. With that, Tommy will come and show you.

So you can go to our IBM pattern page to see this example. It's using Java MicroProfile microservices. What this example has is five main Java microservices, called speaker, schedule, session, vote, and the web application, and all these microservices communicate through the Istio ingress. The vote microservice also uses a Cloudant database image to store its voting data, and all of them are running on Istio. Here's the GitHub page of this pattern. What we're going to show right now is the circuit breaker feature of Istio, and we will use it to simulate load balancing pool ejection. I have this example running on another cluster. Once you deploy it, you're able to see a page similar to this. This is a sample web conference app where you have speakers and schedules, you can rate the talks, et cetera. You can vote on anything and go to the vote page and see all the voting data; this data is stored in a Cloudant database image. Right now, everything is working properly.
But what happens if you want to roll out a new version of the Cloudant image and that version is broken? Right now, let me inject another version of the Cloudant image. As you can see, this second Cloudant DB is our new version of the image; however, it's not configured properly, so it keeps sending 500 status codes back to the vote microservice. As you can see, it's not working properly and is causing errors in the vote microservice. So what I'm going to do is apply this circuit breaker rule. What it does is, when it detects server errors from any of the pods, it ejects that pod from the load balancing pool, so it keeps all the traffic passing only through the healthy pods. Let me apply the istioctl rule. Now, if I keep voting, the first vote creates an error, but after that error, the circuit breaker rule ejects the broken image out of the load balancing pool. So from now on, every vote is healthy, returning a 200 status code. And that's the end of this demo; I'll give it back to Animesh.

Thanks, Tommy. What Tommy just showed was that using a simple Istio policy, we were able to eject a server which is not performing out of the load balancing pool, and all of a sudden all your errors are gone. The application kept serving even though a broken image was there in that particular case. If you do want to try it out, a lot of our developer patterns around Istio and microservices are at that particular link, and the slides are at this particular link. So I do want to end by saying: let's build reactive microservices, which are responsive, resilient, elastic, and message-driven. Let's use this theme. Thanks again. Thank you. Thank you.