Welcome to my talk about gradually implementing Istio for your platform. In this talk, we'll explore a way to figure out your starting point, your destination, and the steps between those points to get a gradual and safe implementation of Istio. Most of the material in this talk is based on conversations I had with other end users and on our own experience implementing the service mesh for our platform. First, a little bit about me. My name is John. I've been doing platform engineering at Wehkamp in the Netherlands for about seven years. I've also been consulting for a very long time, mostly revolving around DevOps culture and cloud architecture. If you have any questions or any feedback, you can reach me on the CNCF Slack, the Istio Slack, and even the Kubernetes Slack workspace, all using the same handle. So this is likely the gap that we're going to be bridging. I'm going to assume that you already made a choice to use a service mesh, or at least are heavily leaning towards implementing one. And you might already have a non-meshed system running. Figuring out what's inside of this opening, this gap, what we need to put there to get to our end state, that's what this talk is all about. So your starting point is probably Kubernetes, but it might also be something else. Istio does work with various other systems, although it might take a little more work to get it running. For example, virtual machines, servers, and edge devices are all supported. Looking for one or more features in a service mesh deployment is probably also a starting point requirement; if you're not looking for a service mesh, you might not really need to deploy one. And then there are the existing expectations: you probably already have some running applications, some users, or some plans that you need to support and adhere to. Then there's the destination, which is slightly hard to predict.
But from my conversations, it's usually a combination of ingress control, internal communication, and observability, which are all very handy to have. We'll go into detail about those features a little later. Getting the starting point right is pretty important, because throughout the implementation it is going to influence how you perform each step and what the impact is going to be. I've split this into three topics: there's existing knowledge, there's infrastructure context, and there are available resources. Together, they have a pretty huge impact on how you might implement this. Your container, Linux, or Kubernetes experience will come in handy, but it's not the only existing knowledge that will. You might have used other products or something totally different, and one way or another, some of the knowledge you gained using or operating them will come in handy when implementing a service mesh, because a lot of that knowledge, even if things aren't named the same or are slightly off, usually boils down to the same stuff. So taking stock of your existing knowledge is important, because that informs how much of a learning curve you might have. And then there's the infrastructure context. For example, you might have existing systems whose connectivity may not be interruptible, or it might be preferred that they not be interrupted while you're implementing something. The same goes for dependencies the other way around: if you are inside your cluster and you want to talk to something outside of the cluster, maybe you want to make sure that also keeps working. Not everyone requires this. Some services have expected downtime, so that might not bite you. But if it applies to you, it's something to take into consideration. Then there's security and compliance: you might have existing rules regarding that.
And you might have existing users. Making sure that you don't interrupt their work too badly or in unexpected ways is very important. Looking at those topics, and we'll fill in the blanks in a bit, they impact your learning curve, because how much you can reuse also determines how much you will have to learn, and how much you have to change will cause some friction. Fewer changes, or safer changes, will result in less friction. And then there's the indirect impact on your speed and scope: if you have to maintain a lot of existing features or connectivity, the scope might get very broad, and if you have a lot of existing knowledge that you can reapply, your speed might be very high. It also works the other way around, of course. But there's another factor that also seriously impacts the speed and the scope of your implementation, and that is the available resources. If you don't have the people and the time to implement and maintain everything, maybe you want to spend some money on a third-party supplier to help you with that. Or it could be the other way around: maybe your budget is somewhat smaller, but you do have the time to build and support everything. Then you might do everything in-house, or find some balance in between. These resources will generally also impact how many features you deliver at once and how safe that is to do. If you release very often and you have a couple of environments, say a development environment, a sandbox environment, and a production environment, it might be safer and easier to deploy a small set of features many times (not the same feature, obviously). That gives you a relatively safe way to do things. On the other hand, if you have a slower release cadence, perhaps you might bundle more features at once, but you'll have to do more testing and more validation ahead of time. So how do you apply these topics to a real-world scenario?
Well, for us at Wehkamp, and for some of the people I've talked to, it started with some reusable knowledge. We already had platform experience that was relatively close to Istio, including scheduling, orchestration, containers, and generating YAML, because duplicating the same YAML by hand and keeping it in sync is not a lot of fun once you have enough services to configure. Metrics with Prometheus also helped a lot, because much of the observability depends on using the metrics that are available in Istio to act on. You can use Prometheus or other tools, but in this example and in our case, we used Prometheus. Then there's the infrastructure context. We had a relatively high release cadence, with smaller sets of features per release. Because we had multiple environments, it was also relatively safe for us to experiment, and we were using Git and deployment systems, so we could roll backwards and forwards without breaking too many things at once. There were also existing websites and backend services that we could not modify, so those had to be kept alive during all the implementation phases. So that is how it worked for us. Lastly, the resources: we didn't have much of a budget for ongoing support, so we couldn't really have a third party manage the control plane for us, for example, which can be a very useful service to purchase. But we did have a little bit of budget to help us kickstart the implementation. We had some time with some people, but not everyone had the same amount of knowledge, and we figured that having some consultants come in and do some pair programming would be one of the ways to improve our knowledge and also keep our speed at the level we wanted. And because we had full ownership and those frequent releases, that was an option we could choose. But that might, of course, be different in your situation.
So now that we have a starting point and an example of how such a starting point was assessed, let's look at the destination. This diagram is a somewhat simplified version of the architecture; it's available on the architecture overview page on the Istio website. It gives you an idea of the layout of where all the components are. If you look around, you have some rough sections: there is a control plane section with some stuff in it, there's a data plane section with some other stuff in it, and there are, of course, the existing services that you might already have. Now, if we take this idea that we have a couple of locations with different purposes, and we go browse the documentation site (and there are, of course, plenty of sources where you can look for the features that are available, or the ways people are using them), well, the documentation site always has all the features, because it documents everything that's possible with Istio. So just as a reference, it can be very useful to browse it. What we did is look at the concepts and the tasks. The concepts and the tasks cover similar topics, but for example, traffic management and observability were very important to us. We first read through some of the concepts and then decided: well, we have an idea of why we want this and what it does, but let's see how hard it is to actually do these things. And that's where the other section comes in handy. There are tasks for traffic management, policy enforcement, and security, and they inform you a little bit about how much work or how much knowledge it takes to do certain things. Now, if you need a more complete example, those exist as well. And there are a lot of presentations, and past Istio conferences contain a lot of user stories, about how people are using Istio and which features they're using.
But yeah, the examples give you a composed, working system that uses certain Istio features. So if you don't really know yet which features you will be using, and in what combination, this is a very good place to check out. Looking at the documentation, and in our case also talking to a bunch of other end users, we found that these six things (not all of them, but most of them, most of the time) are the topics people are most interested in, and also the ones they most desire to implement as soon as possible. So let's see how we can actually take those six topics and bridge the gap. We have starting points, and of course it's not possible to represent everyone's infrastructure, but let's pretend that it is at least similar to this. We might have a bunch of microservices, and they use an abstraction as a way to package all the resources for a service. When a developer wants to deploy one, they essentially have a Helm chart at a specific version. They supply the values, and the Helm chart will make sure that all the resources they need are deployed in their own namespace. This also requires a working Kubernetes cluster, and in this scenario, we have the ingress connected to a load balancer. So it's not directly exposed to the internet; the load balancer takes care of that. All the ingresses that you have use that load balancer to receive traffic. So if we take this starting point and say we want to start with the ingress gateway, how would you go about that? Well, we want the ingress gateway for a couple of features: central observability and routing. And because the ingress gateway isn't deployed with the application, but inside the cluster on its own, it means that the controls on the ingress gateway apply to everyone, and they cannot really be circumvented.
With an ingress, you can essentially make your own rules and do whatever you want, because it's delivered with your application. This one isn't. On its own, the ingress gateway doesn't do anything, so we also need to start with the basics of traffic management. Now, why would you want traffic management at all? Well, it's right there in the name: to manage your traffic. It gives you service-specific rules, and it is also used to connect a service to a gateway. Traffic will only flow if there is a resource that says: this service would like to receive traffic from this gateway when these conditions are met. And then you also get advanced routing features. We can't go into those in this talk, but that is something you get as soon as you enable the possibility of using traffic management features. So let's look at the diagram again. Where would these two items, the basics of traffic management and the ingress gateway, live? You have the data plane section, and there is your existing service. We want to make sure that ingress traffic works, and we also want to make sure that it actually arrives at a service. Now, to do anything, you first need to have a control plane, and that's where we're going to start. The control plane enables everything else; well, it enables the installation of everything else. So you have your Kubernetes cluster, you start with your control plane, you then add your ingress gateway, and then you can do your traffic management tasks. Let's look at these in a more practical sense. I've added some elements to the diagram. We now have an ingress gateway and an istiod on the side, living in their own location; they're not embedded in the namespace of the application. And the application now also has a virtual service resource. So how it works is: you install the control plane. It's very well documented.
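As a concrete sketch, the Helm-based route from the Istio install documentation looks roughly like this (the repository URL and release names are the ones the docs use; your namespaces may differ):

```shell
# Add the Istio Helm repository
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update

# 1. Base chart: the custom resource definitions
helm install istio-base istio/base -n istio-system --create-namespace

# 2. istiod: the control plane daemon
helm install istiod istio/istiod -n istio-system --wait

# 3. An ingress gateway, here in its own namespace
helm install istio-ingress istio/gateway -n istio-ingress --create-namespace --wait
```

istioctl offers an equivalent one-command installation if you prefer that over Helm.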
It's essentially one command to install the base custom resource definitions and another one to install istiod, the daemon that actually controls everything. And there's another command to then add an ingress gateway. Those you can essentially use out of the box; there's no additional configuration required. That sets you up for everything else. The third step was basic traffic management. In this case, it means that we need to create a virtual service resource, and that is going to be the connection between the gateway and the service where you want to receive the traffic. Now, when you implement it, there are no real side effects. You can actually do this without bugging anyone, without breaking anything. All the current traffic flows will keep working; in a way, you can do this without anyone noticing. There are, of course, multiple ways to install it: there's a Helm-based installation, istioctl can do a lot of it for you, or you can generate manifests and use those, perhaps with Kustomize. So there are multiple options. Now, let's dive into this virtual service for a second. This is essentially a resource that says: if this gateway, in this case the ingress gateway, receives some traffic, check if the URL is prefixed with /cool. If it is, send the traffic to the destination my-cool-app. And in this case, my-cool-app is the name of the service that you deployed. That's all it does. It doesn't say: stop all the other traffic, or break the existing ingress. It doesn't even touch anything else. So if you abstracted your application, you could essentially release a new version of your Helm chart, if that's what you're using, and simply add this resource. And that's it. A very safe way to implement this. However, it doesn't do anything right now. So the question that remains: do you just deploy this and then move on and do something else?
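Before looking at that choice, here is a minimal sketch of the resource just described; the hostname, namespace, and gateway name are hypothetical, and my-cool-app is the service from the example:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-cool-app
  namespace: my-namespace        # hypothetical application namespace
spec:
  hosts:
  - "shop.example.com"           # hypothetical public hostname
  gateways:
  - istio-ingress/my-gateway     # hypothetical Gateway resource in the gateway namespace
  http:
  - match:
    - uri:
        prefix: /cool            # only traffic with this URL prefix is routed
    route:
    - destination:
        host: my-cool-app        # the Kubernetes service that should receive the traffic
```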
Or do you perhaps want to deploy something that actually does something? The first choice is to deploy this as is. It doesn't do much, but it's super safe. It might also mean that you now have like a billion deployments that do nothing before you get to a point where you can do something. The second option at this crossroads is to modify everything at once: we make sure that the traffic immediately goes through the ingress gateway into the virtual service, and the old path is removed. That would be more of a big-bang implementation, perhaps useful if your release cadence is slower, but it does require everyone to move at the same time. There's a third option, though, and the third option lets everyone proceed at their own desired pace. So let's look at option three. We take this existing diagram that we have, and we just add a new load balancer to it. The new load balancer will actually send traffic to the ingress gateway. At this point, you can already use this to send traffic to any service that also has a matching virtual service. This doesn't work for everything, of course: if your developers are using a Helm chart version that doesn't deploy a virtual service with the application, then those services aren't reachable by the ingress gateway yet, because the ingress gateway only knows about services that are connected to it via a virtual service. So we can make this even safer. We can create a virtual service that is a wildcard, or fallback, service. What it does is catch any traffic that couldn't be served by a different virtual service, and in this case, we tell that virtual service: anything you receive, just send it to the old load balancer. Then any service that has been deployed with a virtual service can receive traffic directly from the ingress gateway.
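That fallback could be sketched like this; the old load balancer's hostname and the gateway name are hypothetical, and Istio additionally needs to be told about that external host, which comes up in a moment:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fallback
  namespace: istio-ingress            # hypothetical gateway namespace
spec:
  hosts:
  - "*"                               # wildcard: matches anything no other virtual service claims
  gateways:
  - istio-ingress/my-gateway          # hypothetical Gateway resource
  http:
  - route:
    - destination:
        host: old-lb.internal.example.com   # hypothetical hostname of the old load balancer
```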
And any service that was deployed without a virtual service will still receive its traffic through the old load balancer and the old ingress, and can continue working as is. You can even roll backwards and forwards without breaking anything. This does require an additional resource. You have your virtual service that configures the bypass, or fallback, traffic; it's essentially a wildcard, it accepts anything. And this ensures that the wildcard service sits at the bottom of the priority chain, because you would, of course, not want it to capture traffic that you have an actual service for. But Istio doesn't know anything about your old load balancer, because it's not a service and it's not inside the mesh. So we need to inform Istio about it. What we do is create a service entry, which essentially says: if you see this host, it is located outside of the mesh; here's the port, here's the protocol, and you can find it using DNS. Now Istio knows about this, which means that any traffic you send to the new load balancer will hit your ingress gateway. The ingress gateway will say: yes, I am aware of a service handling this, or no, I'm not. In the latter case, it will be aware that the wildcard service exists, and that will forward the traffic to the old load balancer. This way your developers can deploy an old version of your chart or a new version of your chart, and traffic will always work. There are, of course, a couple of extra hops now. So if your developers are using the old way, their traffic has a little bit of extra latency, but that's about it. So it's a pretty safe and pretty staged way to allow a gradual move towards mesh-integrated traffic. But that, of course, leaves us with not a lot of new features. We have moved traffic over, but that's about it. That's not what we wanted. We wanted to use those sweet service mesh features.
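The service entry for that old load balancer could be sketched like this (hostname and port are hypothetical):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: old-loadbalancer
spec:
  hosts:
  - old-lb.internal.example.com   # hypothetical hostname of the old load balancer
  location: MESH_EXTERNAL         # it lives outside of the mesh
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: DNS                 # resolve the host via DNS
```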
And that, in pretty much all cases, is going to require you to have a sidecar proxy, because that's where all the magic actually happens. Now, I'm talking about sidecar proxies here, but there is a different mode for the mesh, called ambient mode or ambient mesh. It uses a different system, but it achieves the same thing. There are other presentations, at this conference and at a previous conference, explaining ambient mesh and which features are and aren't supported. It's currently in beta. For this example, it doesn't really matter, because once it gets released, the way you work with it will be the same. So let's talk about this sidecar proxy. We have our ingress gateway, that's done, and we have some traffic management, but nothing else. Sidecar injection will enable observability, security, and authorization, because the proxy actually has access to all the traffic coming in and out of your pod. And that's something you need if you want to observe it, secure it, or do anything with it. The observability features you gain are traffic insight, but also who's talking to whom: it will know who the originator of a request was and who the destination is. This is also a precursor to other facilities. For example, let's pick Kiali. That's a great tool to visualize what is actually happening inside your mesh. It's not strictly required, but it's almost a must-have, because this enables a whole world of other features for you. Security-wise, the most important parts are traffic encryption, so it does mutual TLS, and cryptographic identities for all of your pods. It means that the identities cannot be faked, because the sidecar, the proxy that's injected, manages your identity.
So if a container in a pod tries to talk to the outside world, the sidecar proxy will attach that identity to the traffic. You have to implement some of these security features to get access control, and the access control based on that cryptographic identity allows you to say who can talk to whom based on identity instead of just port numbers or network-based policies. It's also protocol-aware, which gives you a lot of fine-grained options that can be very useful. We'll go into that a little later. If you look at it in steps, you can gradually add sidecars, which then enables observability, security, and authorization. Essentially, observability and security could come in either order, but sidecars have to be done first, and authorization only works if security works. So let's go into it. I've removed all the resources from the diagram that are not relevant for this part, and what we're going to do is update the configuration to inject a sidecar into the pod. You can do this cluster-wide, so we can say everything must have a sidecar, but you can also do it per namespace or even per pod, which is, of course, controlled by your replica set and your deployment. When you do this, it is important to keep an eye on your applications, because there are some edge cases where an application might be doing something that needs additional coordination between the sidecar and the application. All it takes is adding a label. The label can be set by your chart if you package your application using one; you can also do it with Kustomize, and if need be, you can do it manually. As soon as you have this label set and you start a pod, Istio will notice and inject the sidecar. It is that simple. And because it's packaged with the application, everyone can do this at their own pace.
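With the standard injection labels from the Istio docs, enabling this might look as follows; the namespace name is hypothetical:

```shell
# Enable sidecar injection for every new pod in a namespace
kubectl label namespace my-namespace istio-injection=enabled

# Or opt in a single workload by adding a label to its pod template:
#   metadata:
#     labels:
#       sidecar.istio.io/inject: "true"
```

Note that already-running pods keep running without a sidecar; they pick one up the next time they are restarted.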
So you don't have to do it for everyone at once. But only the pods that actually have a sidecar get the additional features; that's good to keep in mind. Again: gradual. What this gives you (and this is a somewhat more complicated diagram from the security documentation) is encryption and control points everywhere the proxy is located. Your service doesn't need to implement it; it is implemented for the service inside the proxy. And that's why this injection is so important. Put a little more simply, this means that if your pod talks to another pod, that traffic is both encrypted and identified. That's also where the security comes from. It also means that if you were to enforce this, you can't have any plain-text communication anymore. Now, that does mean that if you enforce it on a workload that is not used to it, say an external workload, you might run into trouble. mTLS is permissive by default, but if you set it to enforced, anyone who hasn't been upgraded yet with a sidecar injection can no longer communicate correctly. So if you want to strictly require it, for compliance reasons for example, that can be quite tricky. You can do it gradually, but if you make it enforced, you might have to wait a little bit to make sure everything has had a sidecar injected. So now you have those features. You have your traffic metrics and your security status, because that is something Istio will also tell you: what the security status of a request was. Was it using mTLS or was it using plain text? You also get an automatic metrics endpoint. If you're already scraping your services, Istio will append the additional metrics to your metrics endpoint, so you don't need to set up any additional configuration in Prometheus.
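Coming back to enforcement for a second: if you do decide to require mTLS, for example per namespace once every workload there has a sidecar, a PeerAuthentication resource is the usual way to express it. A minimal sketch, with a hypothetical namespace:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-namespace   # hypothetical; scope enforcement to one namespace first
spec:
  mtls:
    mode: STRICT            # reject plain-text connections; the default mode is PERMISSIVE
```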
All of this is very nice. Security is also automatically implemented, because by default it gives everyone mTLS, it gives everyone an identity, and it is always an enforcement point for policies. By default, as soon as you enable injection, you get these practically for free. You can of course configure them with more exotic configurations, but it's not required. So let's say we also want to add some access control, some policies. An example scenario would be a REST service where you want to make sure that the metrics endpoint isn't accessible from the internet, but you also want to make sure that nobody can issue a delete command; for example, you need to quickly restrict that without modifying the application. A policy can help you there. A policy for your delete methods might look like this: we essentially say, if traffic is matched based on the label (in this case, the app is called my-cool-api), then we deny the action as soon as we detect that it uses the DELETE method. That is a very simple way to restrict what someone, or something, can do. The metrics part is pretty nifty. Because the traffic is entering your cluster through the ingress gateway, the traffic from outside, which doesn't have an identity, is now given the identity of the ingress gateway. The traffic comes in at your ingress gateway, and the ingress gateway then communicates with your pod. So from a mesh perspective, your ingress gateway is talking to your pod: the source is the ingress gateway and the destination is your pod. That means we can say: if our pod receives traffic from anyone, sure, give them metrics; but if it receives traffic from the ingress gateway, do not give them metrics, deny it instead. Very powerful, but also gradually implementable. That's a real word.
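The two policies just described might be sketched like this; the policy names and the ingress gateway's service account principal are hypothetical:

```yaml
# Deny DELETE requests to the workload labeled app=my-cool-api
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-delete
spec:
  selector:
    matchLabels:
      app: my-cool-api
  action: DENY
  rules:
  - to:
    - operation:
        methods: ["DELETE"]
---
# Deny /metrics when the caller is the ingress gateway, i.e. outside traffic
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-external-metrics
spec:
  selector:
    matchLabels:
      app: my-cool-api
  action: DENY
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/istio-ingress/sa/istio-ingress"]  # hypothetical gateway identity
    to:
    - operation:
        paths: ["/metrics"]
```

DENY policies like these don't change the default-allow behavior for everything else, which is what makes them safe to roll out gradually.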
You can scope these policies to the cluster, so they apply to everyone, but you can also scope them to a namespace or just an application, which again means that if a developer wants such a policy, they can choose to add it to their deployment. If it goes wrong, they can roll it back and it's removed, so the impact is very small. And if you want to require a policy for everyone, you would embed it in your chart and wait until everyone has upgraded to that latest chart version. Now, there is a little bit of a "but" in here. If you do things wrong with your policy, well, maybe your service is no longer reachable. There is a very important graph in the documentation that will help you prevent such errors. By default, everything is allowed, but a custom allow or deny policy changes that behavior. What is important to keep in mind is that as soon as you add a single allow policy, even if it's for one small, specific thing, it replaces the default allow behavior, which means you no longer allow everything: you allow nothing except the thing you explicitly allowed. So yeah, save allow policies as a last option if you can avoid them. And that brings us to the last feature: the egress gateway. Its features are mostly metrics for traffic to the outside world and management for services that you don't own, for example, something that lives outside of the mesh. It is not necessarily the only way to do this; if you remember, we did something similar with a load balancer. But if we look at a slightly more complicated scenario, without diving too deeply into it, you can supply a service entry and a virtual service with your application.
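A sketch of such a pair, with a hypothetical external API and hypothetical egress gateway names (the full egress task in the Istio docs additionally configures a Gateway resource for the egress gateway itself):

```yaml
# Tell Istio the external API exists
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-api
spec:
  hosts:
  - api.partner.example.com     # hypothetical external API
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: DNS
---
# Route mesh traffic for that host through the egress gateway first,
# then from the egress gateway out to the real API
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: external-api-via-egress
spec:
  hosts:
  - api.partner.example.com
  gateways:
  - mesh                        # traffic originating inside the mesh
  - istio-egress/egress-gateway # hypothetical egress Gateway resource
  http:
  - match:
    - gateways: ["mesh"]
      port: 80
    route:
    - destination:
        host: istio-egressgateway.istio-egress.svc.cluster.local  # hypothetical egress gateway service
        port:
          number: 80
  - match:
    - gateways: ["istio-egress/egress-gateway"]
      port: 80
    route:
    - destination:
        host: api.partner.example.com
        port:
          number: 80
```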
So for example, if your application is talking to an external API and you would like to collect metrics on what that API is doing, adding a service entry for the external API and a virtual service that refers to the service entry will allow any of your software that is already making those calls to be instrumented. Because if you add a global egress gateway to your cluster, and you have a virtual service that refers to it, then that is the path that will be taken by Istio when sending traffic out to the internet. Of course, if there is no matching service entry and there is no strict mutual TLS, I think traffic will just egress to the internet by itself and will not be instrumented, which would mean that your developers have to do it manually inside the source code of their service. So this is a very nifty feature as well. There weren't as many people I talked to who were actually using it, but those that were, were super happy about it. So yeah, it's good to know. That was all I could fit into this presentation. I hope you learned something, or at least enjoyed watching. If you have any questions or comments, you can reach me using the handle I mentioned at the beginning. I hope you enjoyed the presentation, and I hope you have a nice conference. Thank you for listening.