Thanks for coming to my talk. I'm Matt Turner from London in the UK. I guess some of you are using the translation, so I'll try to go a little slow. I'm going to be talking about Istio — specifically about running resilient workloads using Istio. This is looking at a higher level at applications, and at how you can use Istio to solve some of the problems you get with distributed applications and distributed microservices. I am CTO of a managed service provider called Native Wave. We're based a long way away in London, and we offer managed cloud native IT platforms, so if you want to know any more about that, come and find me.

Just by way of a quiz: who is using Kubernetes? Okay, most people — this is KubeCon. Who is using Istio in their production system? Okay, and everybody else has looked at Istio, is interested in Istio? Okay, cool. Hopefully this is some useful information for you.

So, a quick introduction — I can't see the slides from here — about why we're looking at this. Think about what your business wants from the software you write, and the value you're delivering to users. What is the value to the users, and therefore the value to the business, of what you're building? Yes, you want your software to be good, high quality, and to sell for a lot of money — that's your top line, lots of income. You want it to run as cheaply as possible — that's the bottom line, keeping expenses down, and Kubernetes helps you there with bin packing, maximizing utilization of your cloud resources. But what's also important to a business is time to market: the speed of experimentation and iteration.
So: how fast you can develop a new feature, get it out to a user, test it, find out whether it's what the user wants, and then use that to influence your next sprint, getting useful feedback from your actual users on your product. This reduces the risk of building software — the risk of writing code for two years in isolation, releasing it, and finding out it's not what anybody wanted and you can't sell it. This is really important, and there's been a lot of work in software over maybe the last 20 years to get to that point. We had Scrum, which says don't plan more than two weeks in advance, because you don't have the evidence. And we also changed software engineering: we took the monolithic applications we had and we broke them apart. This slide is a rock that is cracked — for anybody who doesn't speak English or Latin, "monolith" means single rock, big rock. It's a bad joke: we took the big rock and we made it into lots of little rocks. So we have microservices, and that means we can develop them in parallel, the code bases are smaller and easier to edit, and we can release the software separately. If I want to make a change to one part of my application — make a change to one microservice — I can release it without waiting for everybody else, so I get my feedback really fast. That's one of the reasons we decided to write microservices.

But there's a joke: now that we've broken the monolith, how do we know we don't just have a *distributed* monolith? The joke goes: "Microservices? No, no, no — what we've got is a completely different pattern: a distributed monolith." I've taken all the code I had before, I've separated my concerns — I'm a good software engineer, I've got dependency injection, I've separated my code into different Java namespaces behind nice strong interfaces, and they call each other. We knew how to write software in the 80s; nothing here is new. But now, with microservices, I split that up, I put each piece in a different pod, and I run them on Kubernetes. So we have the same code, now spread out. The problem is that what used to be a function call between namespace A and namespace C — one that could never fail — is now a call over a network, and there's a whole load of ways that can go wrong. The network could be on fire; the computer that pod C is running on could be on fire; the data center could be on fire. So we have all the same problems as before — all the bugs in this code, and there are always bugs — plus a whole load of new failure modes we didn't have before. If my microservices look like this — one front end calls some backends that all depend on each other — then if one goes wrong, I can start to get cascading failures if I'm not careful. This is very complicated; it's an emergent property of the system. One service fails, nothing else can reach it, some callers don't handle that error and they fail, some of them block while they're waiting and sit there forever, and I get all these cascading failure modes I didn't have before. That's a distributed monolith: a monolith I've just thrown into different pods and scaled up, without really thinking about what that means for the quality of my system.

We have a way of solving this, and most of you know what Istio is, so I won't spend long on it: a service mesh. The service mesh puts a proxy alongside each of these applications, and that proxy handles the network traffic for us. I'll be demonstrating Istio, which uses the Envoy proxy — this is the Envoy logo — but this applies to all service meshes. So on a call from pod A to pod B, if pod B is down, the Envoy next to pod A can say: okay, I have buffered your request, I will find another instance of pod B, I will retry, and I will get you the answer as soon as I get a correct one. If not, I will throw a circuit breaker and return you the last answer we got, or a default answer, or something like that. So service A can keep the business logic it had before, and it doesn't have to worry about the new network environment it's in, because that environment is complicated and it can fail.

To make a service mesh useful, we need a way to configure all of these proxies, because they need to do different things. We'll see this later, but maybe when service A calls service C, that's a database transaction, and it's not safe to retry. If service C is your actual bank account ledger, you need to make sure service A doesn't call it more than once, because you don't know whether the failure you got happened before or after the bank account records were updated. Whereas if you're just fetching a cat GIF and you got 99% of the way through before it failed, you can just try again and hopefully next time you'll get 100% — there's no side effect. That kind of operation is called idempotent. You might want a different timeout for different routes; you might want all kinds of different configuration. Each of these proxies gives you all of these features, but — and this is the point of this talk — they all need to be told what to do, and that depends on your application. So a service mesh comes with a control plane. Istio has a control plane and a CLI, istioctl, which you use to talk to it. You give the control plane a fairly simple definition of how you want the service mesh to act, and the control plane configures all of those proxies for you. This is another bad joke: I would say it is not all *plain* sailing.

Everybody here knows about Kubernetes, so think about how you use Kubernetes when you first start: you just make a Deployment and a Service, two resources. One says "here are the workloads I would like you to run, here is the code to execute", and the Service says "here is how you expose it over the network": all of these pods are listening on port 80, and I would like you to tell AWS to make a load balancer so I can get to all of them. Simple — Kubernetes 101, it's how we start using it, and it works. But if you want to make that application resilient, if you want to use Kubernetes in production, then you're going to need ingress, because your product owner probably wants several services on path-based routes behind the same IP. You want a horizontal pod autoscaler, which gives you resilience against having lots of users — having lots of users is a good thing, but if you have one pod and lots of users, that pod will crash, so you need to tell Kubernetes how to scale your application. You need to set requests and limits. You need to set readiness and liveness probes. You probably also need some config maps and secrets and upgrade strategies — the translators can ignore this list — affinity, anti-affinity, pod disruption budget, pod security policy, network policy, pod preset, service account, RBAC, limit range, cluster autoscaler. There's probably more. My point — apologies to the translator — is that it's not as simple as just telling Kubernetes to run your workload; you need to configure all of these extra things to get the most out of it.

Kubernetes can do some stuff out of the box. If your application actually crashes — if it just segfaults — then Kubernetes knows that's bad and will restart it. But there are lots of other failure modes: maybe it's still listening on the port but never replies, because it's deadlocked. Kubernetes doesn't know how to detect that in the general case, so you have to write a health check — a liveness probe that hits the HTTP port and tries to get a response. Only by doing that can you tell Kubernetes that this service is a network daemon that's meant to reply to HTTP requests, and if it doesn't do that, it's broken. Configuring Kubernetes is about *describing your application* to Kubernetes: hey, we have this Unix process, it's a black box to you, but I'm going to tell you it can only do 1000 requests per second, and when it gets to 1001 you need to make a new instance. It's listening on TCP, so if you send a valid request and don't get a valid response, or you get an HTTP response that says 500, then it's broken. A lot of the Kubernetes configuration is about describing your application above and beyond the lowest common denominator of "it's a Unix process and it should not crash".

As I say, we can use Istio when we have one of these distributed systems and need help with some of these new distributed-systems failure cases. The tagline says that Istio is an open platform to connect, secure, control, and observe microservices.
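To make the "describe your application to Kubernetes" point concrete, here's a minimal sketch of the kind of YAML I mean — a Deployment with probes and resource limits, plus a horizontal pod autoscaler. The names, image, ports, and thresholds are all hypothetical, just for illustration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: productpage          # hypothetical service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: productpage
  template:
    metadata:
      labels:
        app: productpage
    spec:
      containers:
      - name: productpage
        image: example/productpage:v1   # hypothetical image
        ports:
        - containerPort: 80
        resources:            # describe how big the black box is
          requests: {cpu: 100m, memory: 128Mi}
          limits: {cpu: 500m, memory: 256Mi}
        readinessProbe:       # "only send me traffic when I answer HTTP"
          httpGet: {path: /healthz, port: 80}
        livenessProbe:        # "if I stop answering, I'm deadlocked - restart me"
          httpGet: {path: /healthz, port: 80}
          initialDelaySeconds: 10
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: productpage
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: productpage}
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
```

None of this changes the application's code; it's purely a description of how the application behaves so that Kubernetes can keep it healthy.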
All of those words are actually quite important, and I'll go through them in this talk. But basically what this talk is about is that you need to do the same with Istio as you do with Kubernetes: you need to describe your application to it, and tell it what to do to help you the most, so that your application can be properly resilient. I'm not going to set my demo up live — I can't get my VPN to work, so I have a recording, and the observability part is going to be awkward because I can't see the screen.

So, Istio can do all of this stuff, and by default it will *connect* your applications. If you install Istio and then deploy your application, the sidecar is there, it intercepts all of the traffic, and it moves the traffic from one place to another — so pod A to pod B involves the traffic going through two copies of Envoy. And, like Kubernetes, by default it does lowest-common-denominator stuff. If you're trying to talk to service B, it's going to pick the traffic up and move it to service B — of course it is — but it won't retry by default, because in the general case that's not the right thing to do; it's not safe, because it might not be an idempotent operation. You might also think you want to secure all the traffic between pods, but in the general case you can't actually do that either, because it interacts weirdly with some other systems. And what does *control* mean? If one pod breaks, do I want to route traffic around it? Do I want to return a default? Do I want to just sit there and try again? All of these things need to be described to Istio. So I'm going to take you through some of the Istio configuration you should use to get the most benefit and make your application reliable.

The first is observability. Observability here has three main parts: metrics about the traffic; a service graph of the traffic — which pod talks to which; and distributed tracing. With a monolith, you can just attach a debugger: if there's a problem in namespace A, you attach a debugger, step through, and find out it's calling some code in namespace C, and that's where the underlying root cause is. With microservices you can't do that. Service A is returning 500s, but you don't know why — it may be because it's calling service C, but you don't even know it's doing that. The service graph will tell you that. And then distributed tracing — I'm sure everybody knows the Jaeger, Zipkin, LightStep kind of stuff — is another very useful thing. The green tick on the slide is me saying I think you should do these things all the time: they're safe and great and you should always do them.

So I have a demo — I hope you can read it, because I can't make the terminal any bigger. I have some scripts to run it because I don't want to type, and I already have a GKE cluster because they take a long time to spin up. I recorded this over the VPN in the hotel, so the latency... gives me time to talk. It was deliberate, I promise. We have a one-node Kubernetes cluster, I've downloaded the latest Istio, 1.0.2, and I've already downloaded Helm — you need Helm to install Istio. This is me installing Istio into the cluster; it's normally quite fast, but the latency was bad. What's interesting is what this installs: Istio is some pods, some workloads — the Istio control plane is code that runs — but when you install Istio it also installs a lot of configuration. All of this is the default configuration for Istio. It's not baked into the code; it's supplied in Kubernetes config maps. Like I said at the start, that's a Kubernetes best practice — have your configuration managed separately and provided through config maps — and that's exactly what Istio does. This is what Istio does out of the box, and you can go and read all of these files; they're actually quite interesting. I think my cluster was in San Francisco as well, so this is pretty slow — fast forward — it's nearly there. If I were hacking, this would look really good; standing here, less so. You can see that Istio knows how to use Kubernetes: Istio installs horizontal pod autoscalers to describe the Istio control plane to Kubernetes, so that Kubernetes knows how to keep Istio running, so that Istio can keep your application running.

Right, Istio is installed. Now I'm going to deploy the Bookinfo application. It's a sample microservices application — if you've looked at Istio or its documentation, you've probably seen it. I didn't write my own, because I think that actually would confuse things; I'm not lazy, I just figure if it's Bookinfo, you know what Bookinfo is and you don't have to think about it. So we deploy Bookinfo — the lag was really bad here — and we start to see the Bookinfo pods coming up. You can see something I want to point out: this is a standard Kubernetes YAML file, and it only describes one container in the pod, but a pod can have more than one container, and we know that what Istio does is put that sidecar next to our application. This "0/2" here means we're still waiting for two containers to start: one is the application, which is in the pod spec in the YAML, and the second is the Envoy sidecar, which we as a user do not tell Kubernetes about. When we describe our application to Kubernetes, we don't describe the sidecar, because it's not part of the application — Istio injects it automatically. So, as I say, out of the box all traffic goes through that sidecar, because the default Istio configuration injects it. So: six pods with our application, all with a sidecar. This next bit just finds the public IP address of the ingress point — quite boring, but necessary — and then we can go to that URL.

This is Bookinfo: a website telling you about one book. Details are on the left, fake reviews of the book are on the right. As we keep hitting refresh, some of the reviews have a star rating and some do not, because we are A/B testing three different versions: one shows no stars, one shows them in black, and one shows them in red. We just hit refresh a whole bunch of times, so we generated traffic — all of the microservices worked together and gave us that page. I know it's ugly, but six microservices cooperated to give you that page, and we exercised it a bunch of times. What I'm doing now is setting up a port-forward — boring plumbing — to a part of Istio called Kiali, which is an observability dashboard. One of the things we get for free with Istio, and one of the most useful things for running a resilient application, is observability: the ability to get deep insight into your application and how it's running. We log into Kiali, and we can see there's a Kubernetes namespace, default, with four things running in it — six pods, but three of them are different versions of the star-rating service, so four logical applications. And completely for free — we have not described this. We write a load of YAML that describes our application to Kubernetes, and soon we'll be describing it to Istio, but we have not described this graph. It's derived empirically, a posteriori, from the traffic being sent through the cluster. It's amazing. This is a simple application, but you've seen the graph of all the microservices Netflix has — it's huge, it's thousands — and you'd be amazed how many people don't know this about their own set of microservices. The first step to getting the most out of a system and making it the highest quality is to understand it, and this is a really powerful tool for that, which is why I've put it in here. You can get a bunch of different views from this thing — I'm running a little short on time, so I'll skip through.

The next thing you want to do for resiliency is to take advantage of the traffic routing in Istio. This happens at layer 7: it understands the HTTP protocol, and it can do more advanced things because it understands more about the intent of the transaction — it can read the HTTP headers. We can tell Istio about the different versions. You saw there were three versions of the star-rating microservice; Kubernetes doesn't know anything about that — they all share the same labels, so it just round-robins traffic between them, and as far as it's concerned they might as well be the same thing. We know they're three different versions and we should be treating them differently, so we can tell Istio about that and then start routing the traffic. First we'll say the A/B test is not ready yet: we've deployed version two and version three, but they're only internal, so we'll send all the traffic to version one. Then we can start shifting it to versions two and three — an A/B test, basically. You can do canaries — that's a popular word — but really a canary is just an A/B test with a much smaller sample, as far as I'm concerned.
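The "send all the traffic to version one" step I just described looks roughly like this in Istio's v1alpha3 API — a DestinationRule naming the versions as subsets, and a VirtualService pinning all traffic to one of them (this mirrors the stock Bookinfo sample config):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:              # tell Istio the three versions exist,
  - name: v1            # distinguished by their pod labels
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
  - name: v3
    labels:
      version: v3
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1      # the A/B test isn't ready: everyone gets v1
```

Shifting traffic later is just a matter of adding more `destination` entries with `weight` fields, e.g. 90/10 between v1 and v2 for a canary.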
And there are other things Istio can do for you, because we have this powerful Envoy proxy in our network. You can do really clever things: if one application serves XML and another application wants to consume JSON, you don't have to translate in either application — you can just tell Envoy to do it, and the network will transparently do that translation. You can describe to Istio: hey, I just deployed an application, but it's legacy, it's some third-party piece of software I bought and it only talks XML; I want you to make it look like it talks JSON to all of my other services. That avoids writing yet another little microservice that does the translation and sticking it in the middle — which you'd probably get wrong, because there are bugs in all code. Why would you write that when it already exists?

That's another step to making your application more resilient, and I think the demo for this is quite good, so I'll do it quickly. We do the setup — I won't bore you with how the YAML files work. The first thing we do, as we keep hitting refresh, is that traffic pinning: a normal user now just gets version one, which didn't have any star ratings. The star rating is a new feature we want to test — we want to test it fast and get that feedback — but we still want control: we still want to send most users to a code path we know works, because most users don't want a crash; it should be resilient. The new code path is maybe untested, but we need to see the feature. So we can tell Istio to do something more clever than Kubernetes can, because we understand HTTP: we say that if you're logged in as the user "jason" — I couldn't type, hence the pause — then however many times you hit refresh, you always get version two; you always see those star ratings.
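The "always show jason version two" rule is a header match in the VirtualService — this is essentially the Bookinfo sample's test route (the `end-user` header is what Bookinfo sets when you log in):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - match:                  # layer-7 rule: inspect the HTTP header
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2          # jason always sees the new star ratings
  - route:                  # everybody else stays on the known-good path
    - destination:
        host: reviews
        subset: v1
```

The same `match` block can key on any header, so the "users in France" or "alpha browsers" rules I mention are just different match conditions.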
That's not a random A/B test — that's a rule saying: Jason works for us, he wants early access to v2, and he's going to hammer it before we let it loose on the users. Jason gets v2, everybody else gets v1, and that's based on looking at the HTTP header that says who is logged in. You can do much more clever things like that: you can send a new version just to the users in France, or just to the users on alpha browsers, because you think they're the kind of people who like taking risks — and who will maybe blame their browser if your website goes wrong. We don't have time for the next demo, I don't think, which shows RBAC.

So: we can get Istio to add actual resiliency features. There are a few things we might want. We might want a timeout. Say one application is being slow — this is another cascading failure, because everything blocks waiting for it. Obviously you can write asynchronous code, but normally everything will block. So we can tell Istio that if a backend doesn't reply in two seconds, it should just send a 500, as if the other end had done that — because the other service is in a deadlock and it's going to sit there forever, so we tell Istio to give up on our behalf. This is code you don't have to write in your microservices, because it wasn't there before: remember, in the monolith, a call from namespace A to namespace C could never fail, and it would probably never wait that long either — either the whole thing has crashed or it hasn't. Now we've introduced this network, which might be super slow, super congested, maybe on the other side of the world. If service A doesn't get the service level, the quality, it needs from service B, we just get Istio to time it out, and we don't have to add that to our application. We can also have circuit breakers, which are a more brutal version of the same thing.
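A sketch of what the timeout, retry, and circuit-breaker settings discussed here look like, assuming a hypothetical `ratings` backend. The timeout and retries live on the VirtualService route; the circuit breaker is outlier detection on the DestinationRule. (Note: `consecutiveErrors` is the v1alpha3-era field; newer Istio releases split it into `consecutive5xxErrors`.)

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - route:
    - destination:
        host: ratings
    timeout: 2s             # give up on the caller's behalf after 2s
    retries:
      attempts: 3           # only safe because this call is idempotent
      perTryTimeout: 500ms
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: ratings
spec:
  host: ratings
  trafficPolicy:
    outlierDetection:       # the circuit breaker
      consecutiveErrors: 3  # three failures in a window...
      interval: 30s
      baseEjectionTime: 5m  # ...and we stop talking to that instance
```

As the talk stresses, retries are only appropriate for idempotent routes — you would leave `retries` off the route that hits the bank-ledger service.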
A circuit breaker says that if a service fails three times within a certain window, we just decide it's dead and never talk to it again. That makes things resilient when a service is being flaky — maybe the underlying hardware is failing. As I said at the start, we can get Istio to do retries for us, and that's very configurable: we can tell it only certain things should be retried, and how often and how fast to try. We can rate limit things, which is really important for making your system resilient. No software is free of bugs, and no software is infinitely scalable, so you should test your software: you should know how many requests per second your microservice can cope with while still meeting its SLO — still returning 99% of answers within 100 milliseconds, or whatever is acceptable. Say you find out that limit for this release is 1200 RPS; then you set a rate limit: hey Istio, don't ever let more than 1200 RPS hit this. That, again, is another key to resiliency — and you would describe it to Kubernetes as well, so that maybe at 1000 RPS you'd start spinning up more copies of the pod in anticipation.

A counterintuitive thing you can do to increase resiliency is actually to introduce faults. Istio can inject delays and it can inject faults — HTTP 500s. This is chaos testing. If anybody has heard of Chaos Monkey, or the Kubernetes equivalents: these things come along and kill AWS instances and kill pods at random, and you run tests while that's happening, so you know your application can cope with it — that it's resilient to compute failing.
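Fault injection is also just VirtualService configuration. A minimal sketch, again against a hypothetical `ratings` service — delay a tenth of requests by five seconds and fail one in twenty outright (the integer `percent` field is the v1alpha3 form; newer releases use a `percentage: {value: ...}` structure):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        percent: 10         # pretend the network is slow for 10% of calls
        fixedDelay: 5s
      abort:
        percent: 5          # pretend the backend is broken for 5% of calls
        httpStatus: 500
    route:
    - destination:
        host: ratings
```

You'd typically apply something like this in staging while your soak tests and load generators are running, which is exactly the practice described next.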
Now that we've distributed our application, all of these calls cross a network, so we have another failure mode: the network might fail. So what we do — certainly in staging — is get Istio to pretend that the network is slow, or that the network is unreliable, and then we can test whether our application still copes with that, either through code in the application or through other things, like the retries and timeouts we've configured Istio to do. It may seem counterintuitive. There is an argument that you should do this in prod — you should test in prod; see Charity Majors' talks — and an argument you should have it running all the time. But definitely in your staging environment, where you're also doing soak tests and using load generators and all that kind of stuff, you should have this kind of chaos running. As I say, it's counterintuitive, but it does increase your resiliency, and it's a service Istio can offer.

Security: this is your resiliency against being attacked. Think about the internal part of your application — maybe the actual bank-account-records part. In the monolith, it would only have been called by other parts of the code; it would not have had an API directly on it, and it would only have been accessed through other parts of the monolith that were maybe doing authentication. That's not good enough now, because this thing has to listen on an API of its own. If somebody can break into any pod, any other part of your infrastructure, they now have access to it in a way they didn't before. So you want mutual TLS for encryption; you want strong authentication, so you know which services you're talking to; and then you can turn on service authorization, which says: if I am the bank account records, only the UI component can talk to me, because it displays balances and that's fine, but the microservice that onboards new users never needs to see anybody's bank account records, so just ban it from talking to me at all — because if it's ever trying, that's a mistake: an error in the code, or that part has been hacked. We can do that once we have mutual TLS in place.
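For reference, here's a sketch of that policy in the `security.istio.io` API from newer Istio releases (the 1.0-era demo would have used the older ServiceRole/ServiceRoleBinding resources). The `records` label and `ui` service account are hypothetical names standing in for the bank-records and UI services:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT            # all traffic in the namespace must be mutual TLS
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: records-allow-ui-only
  namespace: default
spec:
  selector:
    matchLabels:
      app: records          # hypothetical bank-records service
  action: ALLOW             # with an ALLOW policy present, all else is denied
  rules:
  - from:
    - source:
        # hypothetical UI service account; identity comes from the mTLS cert
        principals: ["cluster.local/ns/default/sa/ui"]
```

The strong service identity that makes the `principals` check trustworthy is exactly why authorization only becomes meaningful once mutual TLS is on.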
That's what the last demo was, but I've kind of run out of time. And then there are other things that you just don't want to re-implement, or don't want to bother pulling in a library for in every application: injecting CORS headers, any other kind of security middleware, JWT validation, XSS mitigation, any kind of web application firewall function — you can have Istio do it. You just describe your application: you say, you can see this thing is listening on port 80; yes, it's talking HTTP, and this isn't a REST API, it's actually talking to a browser, so I want you to do all of the standard stuff — injection attack detection, CORS header injection. This means you don't have to download a library into your application, hook it into your build system, check it's working properly, and upgrade it every time there's a patch; you just let the network do it. And for all that I say it's complicated to have your app do it — of course, a lot of people just don't do it at all, because it's too hard. With one YAML file, Istio will now do it for you.

So Istio has a lot of power. The demos you see are normally around the clever routing, and it's great to have those features, but the network is now a really important part of your application. It has a really important effect on how all of the different microservices work together as one unit, and that unit has to work as one thing so that your users have a good experience. In the same way that you need to tell Kubernetes a lot more than just "please run this pod", you need to tell Istio a lot more than "please just move my traffic", because the network was doing that for you anyway. If it just moves the traffic, you get observability for free, which is a really good step towards a resilient system — because if you can observe it, you can control it and understand it — but you should also configure Istio to do the retries, the timeouts, and everything else we've seen. I went quite fast over some of it. Some of the items had a yellow question mark, meaning *sometimes* you need this — a retry is not always safe. And some had a red cross, which means you really only want to do it in an emergency. You can do things like log all of the traffic — there are privacy and security implications to that — but if something is going really wrong, you can get Istio to keep the service up, keep sending traffic to the production instance, but also send a copy of that traffic to your laptop, so you can read it and try to work out what's going on. So there are a few options you'd leave turned off, but have a YAML file ready to go to turn them on.

I think I'm just about on time, so that's what I wanted to say. Thank you very much — happy to take any questions.

[Audience question] Sure, I'll repeat it. That's a really good question: what if you're not running in Kubernetes? Istio, by the way, runs on Consul and Mesos and the older stuff — but if you're running in serverless, ACI, virtual kubelet, the newer systems, what do we do? I think you do want the same features, and the same technology is actually giving you them. Knative Serving is a serverless framework — it can scale to zero and all of that — and it's based on Istio. So if you run on Knative Serving, even though it's serverless and you don't have to deal with the Deployments or the rolling upgrades, the traffic between your services is still going through Istio: Knative Serving installs Istio and you just don't see it. We're already one layer of abstraction higher. The other thing is that Envoy, the proxy used by Istio, just last week released a library version for iOS and Android. So now, even if everything is hidden behind a virtual kubelet — you just deploy a blob of code to ron.sh or Heroku or something and you can't take control of it — you can at least have the *client* do timeouts and retries, because you can download Envoy as a library and link it against your Android application. Even if the server as a whole just fails and says 500, you can get the Android application to retry and time out, because Envoy is now released as a library. Does that help? At the end of the day, what you have to do is get those Envoys in there next to all of the components in your system. If you're using Fargate or a virtual kubelet, I don't know how you do that, because they take care of the compute fabric and you don't have enough control to inject Envoy — but a surprising number of the systems that present that higher level of abstraction, like Knative Serving, actually have this stuff built in and will do it for you, and in the worst case you just have to do it in the client. I hope that helps. Cool — let's all go get some coffee. Thank you again.