Is it okay to start? Yeah, I think we're scheduled to start at twenty past two, so it should be alright. Hopefully we'll be on time.

Okay, hi everyone, thanks for joining me today. In the next half hour or so I'll be taking you through my experience with one of our clients as an early adopter of service meshes, and our journey through Linkerd 1 all the way to Linkerd 2, and what that looked like from that perspective.

Quick intro: my name is Tillen, I'm a senior consultant at OpenCredo. We're based in London, UK, and we're a hands-on consultancy where we help clients of all sizes, from smaller companies to big enterprises, design their cloud-native solutions and make sure those designs are scalable and future-proof. You can find me on WeChat or on Twitter via the handles on the slides; don't worry about noting them down, I'll show them at the end as well.

So what's the agenda for today? I'm going to take you through a brief intro of service meshes and what they are, in case some of you are unfamiliar with them. It'll be more of a practical example, because I think those end up being better than the numerous theoretical ones you'll find online. I'm going to talk about the evolution of service meshes and how that applies to Linkerd; Linkerd is a great example of how service meshes evolved over time as the needs and the landscape changed. I'm going to talk about how we redesigned our architecture at least three times, if not more, and how we took a migration path without much in the way of issues. And then I'm going to talk a little bit about lessons learned: what could we have done better, what did we do well, and what you should avoid when doing this yourselves.

Okay, let's start. Just a quick note on what a service mesh is. You'll find many definitions online, but one I've found is that a service mesh is an approach, and a dedicated infrastructure layer, for operating a secure, fast and reliable microservice ecosystem. Now, that's a loaded statement; there's a lot in there, and it's sometimes difficult to grasp what a service mesh actually is and what it would do for you when you're developing your cloud-native, highly scalable applications.

I've heard a bunch of things from all sorts of people trying to explain to me what it is, and these are just a few of them. It's a whole new paradigm for deploying stuff to the cloud: okay, maybe. It's a low-latency infrastructure layer. It's a layer 7 network exclusively for applications. It's just an extension to Kubernetes. I've even heard someone say it's magic, it just does a lot of stuff. It's not quite magic, but it is interesting indeed.

Let me show you with an example what a service mesh does for you, just a quick one before we dive in. Let's say we have a couple of services. They could be microservices, they could be normal services; for the most part it doesn't really matter what they are. Say we have a payment service and a ledger service, and the ledger connects to a fraud system, an external, third-party fraud system. You make a payment, it's recorded in the ledger, and the info is sent to the fraud system for analysis. Very simple example.
Now, let's say all of those are connected via HTTP or some other network protocol.

So, three simple components. Your system will typically have a lot more, but for simplicity's sake, let's start with these. Now, typically some of your components will start scaling out. For instance, we want more instances of the ledger service because we have a lot more load on it and we need to handle it. That means the payment service now needs to do a few things. First of all, it needs to know that there's more than one instance. It needs to load balance between them, and it needs to handle things going wrong. What if one of the ledger instances has frozen, slowed down, or died completely? How do you deal with that in a fast-paced production environment? Well, it turns out that just doing HTTP requests is often not enough, because your availability will go down and your error rate will spike, so we try to mitigate that by doing a few things.

First of all, we want discovering instances to be dynamic, so we'll implement service discovery, so that we don't need to statically set IP addresses or URLs for our instances. We want to load balance between them; we could use a load balancer, but that's an extra hop, so we might do client-side load balancing instead. That's a good way of doing it. We also want circuit breaking: if an instance is dead, or performs badly, or doesn't perform at all, we want to stop sending messages to it, potentially with a fallback that's local to the calling service. Of course, you may want to implement retries: if a call fails, maybe if I try it again against a different instance it'll succeed, and from the outside looking in, the call would ultimately be fine and successful even though a retry happened internally. And how do we deal with authentication and authorization between the services? A lot of the time, we just don't deal with it. We assume that because it's internal to our network, it's fine. But that might not be the case all the time, especially in a cloud-native environment. And of course, we would like all of this to be automatic. All of these features are nowadays usually grouped under the term resiliency. We want the communication between the services to be resilient and to have all of the features we just mentioned.

Now, we could implement those ourselves. There's a bunch of libraries you can use, whether for Java, for Go, for C#, whatever your services are built in. Hystrix was a popular one, for instance, or Resilience4j is the newer one that supersedes it. We could use Kubernetes for service discovery, we could use Ribbon for client-side load balancing, and we could use some custom code to implement retries and authentication and authorization. Okay, that could work. But doing it only in the payment service is not enough. We're going to have to do it everywhere: in our ledger, in all of our services, at the same time. Now, what if we don't own all of our services? What if our ledger is instead a third-party product that we bought from someone, like a core banking product? Well, we don't have access to the code. We can't add anything into it.
Well, then we're out of luck; we can't use any of that there. So what do we do in this case? And what do we do in the case where we want to add a new capability to our resiliency stack, so to speak? Well, we'd need to go to every single one of the teams in our organization, put a ticket or a feature request on their backlog, and coordinate everything, and it might take months for everyone to get on board with the new feature. And of course it takes away from developer productivity, and so on and so forth.

So we want to take that burden away from the services themselves, and instead rip it out and put it in a proxy. Instead of having all of that inside a service, in code, why don't we have all of that in a single component, a single proxy, deploy that next to our components, route all the traffic through it, and let the proxy handle all of the resiliency concerns: let the proxy handle authentication, circuit breaking, load balancing, service discovery. Why don't we have a dedicated component, which one team develops and maintains, that does all of this for everyone automatically? That sounds like a much better idea, actually. It is another component, true, another hop, but it gives you a lot of features that you would otherwise need to build manually yourself. And the problem of third-party vendors goes away completely: we can now just put a proxy next to our third-party product, which we couldn't change before but can now augment, and route all traffic through it.

So, basically, what is a service mesh? It's something that offers all of those features I mentioned before, service discovery, load balancing, circuit breaking, retries, authentication, authorization and so on, and does it automatically. And we do that by ripping the functionality out, putting it in a proxy, and then creating a mesh of proxies that connects all the services together. If we dumb it down a little bit more, it's just a collective of smart, configurable, autonomous proxies. And to be honest, that's all there is to it; in reality it's not any more complicated than that. There are certain complexities in how you implement it, but generally that's what it boils down to.

So, a proxy: which one do we use? Do we use NGINX, HAProxy? Well, those are usually not built to handle all of these scenarios, so we'll probably go for one of the purpose-built ones. Nowadays you'll find quite a few products that solve the issues we're talking about. Linkerd is a popular one, Istio is another popular one. You could use Consul as a service mesh. Envoy is usually used as the proxy in Istio, for instance, and in other systems. You even have Amazon's managed App Mesh, or Kong, which is primarily an API gateway that has now moved into the service mesh space as well. So there are a lot of established providers, and a lot of up-and-coming ones too.
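Just to make the sidecar idea concrete, here's a minimal sketch of what "deploy a proxy next to each component" tends to look like on Kubernetes. This isn't from any particular product; the image names, ports and environment variable are placeholders for illustration.

```yaml
# Minimal sketch of the sidecar pattern: the app only ever talks to
# localhost, and the proxy container handles service discovery,
# retries, circuit breaking, mTLS, etc. Names and ports are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: payment
spec:
  containers:
  - name: app
    image: example/payment-service:1.0     # your actual service
    env:
    - name: LEDGER_URL
      value: http://localhost:4140/ledger  # traffic goes via the proxy
  - name: proxy
    image: example/mesh-proxy:1.0          # the sidecar proxy
    ports:
    - containerPort: 4140                  # proxy listens here
```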
But let's go back to when it all started, to one of the first ones, which was Linkerd. Let's take the journey that I took, and that the general community of service meshes took, from where we started to where we are right now.

So Linkerd 1 was one of the first service mesh proxies out there. It was basically a single app, an all-in-one network proxy based on Finagle. Finagle came out of Twitter; it was a good HTTP client that supported a lot of the features we needed, and it got bundled into Linkerd with a way of dynamically configuring it and having it act as a proxy in a mesh. Now, it runs on the JVM, which may or may not be okay, but in this case it was okay, as you'll see. It supports routing policies. It supports most of the resiliency requirements we talked about before. It has a pluggable design, which means it doesn't make any assumptions about where it runs: you can run it on a box, on a Raspberry Pi, on Kubernetes, wherever you want. And you can plug different components into it, for instance service discovery against Kubernetes, or service discovery against Consul; there are a lot of plugins to pick from. And it has a single, relatively simple config. You'll see what I mean.

It looks something like this. On the left, just as an example, there's a thing called dtabs, which is a somewhat convoluted way of configuring routing. A request comes in, and you have a sort of table of branching paths the request can take: a request comes in, and I need to forward it to this other service. Okay, sounds like the proxy we were talking about. We do need to learn the dtab language, which is a bit of a pain, but we'll deal with that, at least initially. And on the right, using YAML, we can configure all sorts of parameters. Do you want retries? A retry budget? What type of load balancer do you want to use: latency-based load balancing, or just round-robin, and so on and so forth. Okay. So out of the box, it supports a good set of features.
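From memory, a Linkerd 1 config combining those two halves looked roughly like this. This is a sketch only: the exact keys varied between versions, and the dtab here is a simplified Kubernetes-style delegation.

```yaml
# Rough sketch of a Linkerd 1 config, from memory; exact keys varied
# between releases, so treat this as illustrative.
namers:
- kind: io.l5d.k8s            # service discovery plugin (Kubernetes)
routers:
- protocol: http
  servers:
  - port: 4140                # the port apps send traffic to
  dtab: |
    /svc => /#/io.l5d.k8s/default/http;   # route by service name
  client:
    loadBalancer:
      kind: ewma              # latency-aware load balancing
    retries:
      budget:
        percentCanRetry: 0.2  # allow at most 20% extra retry load
```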
So, how did we use it? One of our first use cases, and this is where it really shines, was when we weren't using Kubernetes yet. This was a few years ago. Most of the client's services were running either on-prem or on virtual machines, and some of them we'd started moving to the cloud, but onto plain instances. This was before the bigger enterprises became comfortable with the idea of running everything in the cloud, or running everything in Kubernetes. But still, we wanted to try it out. We wanted to bring in some of those resiliency features that sounded so promising, into what we had right then, instead of waiting for the promised land later down the line. And Linkerd 1 specifically enabled us to do just that. Because it was just a normal JVM app, we simply ran it on the box next to the service itself. Whether it was a VM, an EC2 instance, a vendor app, or any other cloud app, it didn't matter. We just installed it there, and it worked. We had to work on the configuration a little bit, and make sure all the instances could see each other and knew where to route, but for the most part this worked, and it was very flexible and allowed us to deploy it in a hybrid environment.

So that was a good take one. At this point, we were quite happy with it. Almost nobody was doing this yet, so it was a big step forward, and it allowed us to move a lot of that functionality into the mesh itself and take the burden off the developers. They didn't need to care about implementing all that stuff in every one of their microservices and tweaking each one individually; instead there was a single, unified place for doing it.

Okay, that worked well for a while, and then we started looking at Kubernetes. Well, we actually started using it; we wanted to deploy some of our workloads into Kubernetes. So how do we plug Kubernetes into this? As you can probably guess, we're going to have a Kubernetes cluster, we're going to have to put Linkerd on it somehow, and then hook it up into this mesh. It's not straightforward, and there was a problem: Linkerd 1 is not light on resources. The recommended amount of heap memory is at least a gig, if not more, and you need at least a CPU. That was fine for EC2 instances and virtual machines, because they usually had enough headroom to handle it, but for containers, or rather for pods, running it as a sidecar was probably not the best idea. It consumed way too many resources and would have bloated our Kubernetes installation and made it slow.

So instead, we did something like this, which was the recommended approach at the time anyway: instead of running it as a sidecar, we ran it as a DaemonSet. Effectively, a single Linkerd instance would handle all of the pods on a single node, and all of the communication would go through it. Linkerd would communicate with itself at the node level, and once a request hit the target node, it would be forwarded to the target service. So effectively you had a mesh at the node level, and each node had a collection of services behind it. It took a little bit of time to figure out how to make this work, but eventually we plugged it in and connected it to everything else.

And again, it worked, although our config at this point was really bloated. Luckily, there were only a few of us working on this, so it wasn't that big of a deal, but if we'd had to introduce this to the hundreds of developers working on everything else, it would have been a pain getting everyone to understand how it all comes together. So, in general, it became a pain to manage.

Now, eventually Kubernetes took over the world, and so it took over us as well, and we got rid of all the legacy stuff. No more EC2 instances, no more on-prem; everything ran in Kubernetes, and it looked something like this: we basically standardized that deployment and extended it to all of our workloads in Kubernetes. Again, this worked fine for the most part, but it introduced a lot of problems, as we'll see soon enough. The least of which was that while you get nice dashboards for each Linkerd instance, showing which services it's calling, what the traffic looks like and so on, you only got a dashboard per instance. You did not get a collective dashboard of everything happening in the mesh, which made it quite a pain to figure out what was going on, because you had to go to every instance and work out where the traffic was coming from and where it was going. And if you had hundreds of nodes, that just wasn't possible.
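Going back to the deployment model for a second, here's a rough sketch of that per-node setup. These are not the exact manifests we used; the names, image tag and resource numbers are illustrative.

```yaml
# Sketch of the per-node proxy model: a DaemonSet schedules exactly one
# proxy pod per node, and every pod on that node routes traffic via it.
# Names, image tag and resource figures are illustrative.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: l5d
  namespace: mesh
spec:
  selector:
    matchLabels:
      app: l5d
  template:
    metadata:
      labels:
        app: l5d
    spec:
      containers:
      - name: l5d
        image: buoyantio/linkerd:1.x      # placeholder tag
        ports:
        - name: outgoing
          containerPort: 4140
          hostPort: 4140                  # reachable on the node's IP
        resources:
          requests:
            cpu: "1"                      # JVM proxy: ~1 CPU, 1Gi+ heap
            memory: 1Gi
```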
Now, eventually, yes, we did move to Prometheus and started building Grafana dashboards to get around those monitoring issues, but there were a lot of steps between us wanting to figure out what was going on in our network and us actually seeing it in a single dashboard. We had to set up Prometheus and Grafana, someone had to manage it all and design the graphs, and so on and so forth. There was a lot of work involved.

So what was actually the problem? Well, as I mentioned: a single instance on every node, because of the large resource consumption, which meant you had a proxy per node instead of per instance. The configuration became complex, and so did the updates, and it didn't support dynamic updates. Monitoring was very disjointed, as you saw. And of course we didn't have proper mTLS support, because we could only do node-to-node mTLS, not service-to-service, which was better than nothing but not what we wanted in the end. And developer friction was quite high. We found that developers had a hard time understanding what was going on, and a lot of the time blamed the service mesh for things that were either not the service mesh's fault or that they just didn't understand. So that's where we ended up with Linkerd 1 at the end of the day. Clearly, the architecture it was designed around had reached its end, so to speak.

So where to go next? Where did the community, where did all the different providers, go from there? How would you design a service mesh that worked for a modern, cloud-native, Kubernetes-based environment? Well, you would still have a proxy, but instead of having all of those features inside the proxy, you'd strip them out and move them into a control plane. You'd keep the proxy only as a proxy, and leave all of the configuration, the bloat, the setup and the certificate management to a control plane, which is just a normal service, a normal deployment in your cluster.

Now, what does that control plane look like? It's just a normal deployment, and what it does is manage and configure the actual proxies. It's a standard stateless deployment. It has a public API. It collects the telemetry from all the proxies in your cluster and exposes it through a single interface. That was super attractive to us, because we'd finally get a single place where we could actually monitor our service mesh. It can enforce policies. It can issue certificates to the target services to enable mTLS. And it's fully cloud native, fully integrated with Kubernetes, making full use of custom resource definitions to configure the service mesh and everything around it, as we'll see.

So that was the control plane: we take all the management, all the bloat, out of the single all-in-one product and move it into a control plane. As for the proxy, the proxy has only one job left, which is just to proxy requests. And instead of having it on the JVM, we rewrite it in a language that's a bit more appropriate for the job, one that uses fewer resources, with the ultimate goal of deploying it as a sidecar to every instance and achieving the true one-to-one proxy-to-instance ratio that we wanted. And that is basically the general architecture that the two most popular service meshes took. Both Linkerd 2 and Istio, if you ever work with them, operate the same way.
They're implemented differently and use different technologies under the hood, but ultimately they function roughly the same way and perform roughly the same functions.

Now, this was very exciting for us: where do we go from here? Because these two are so radically different from what we had before, there wasn't a straight migration path, neither to Istio nor to Linkerd 2. So what do we do? Do we go to Linkerd 2, do we go to Istio? Well, we deliberated a lot, and to be frank you couldn't really go wrong either way. But we decided to go with Linkerd 2, and one of the deciding reasons was that we'd had good experiences with Linkerd 1 and wanted to continue that. Second, one of the greatest features of Linkerd 2 is that it's a lot simpler than Istio. And one of the biggest problems we'd had with Linkerd 1 was that it was so daunting to the hundreds of developers in the organization that they didn't even want to learn it. So we wanted something simpler, something everyone could pick up without needing to fully understand it. They can dig in if they want to, but it's something they can grasp relatively quickly. That was very attractive for us in this particular case.

So what we did was start to plan a migration. How do we get from the architecture we had before, with Linkerd 1, to the one I just showed you, with Linkerd 2? This was our architecture, take three, basically: our third redesign of our whole service mesh architecture within the scope of about two years, I think. And here's what we needed to do. Of course we had to plan it first, but ultimately here's what we ended up doing. On the left was the previous architecture I mentioned: a DaemonSet per node which handled our service mesh. And we needed to move to something that looked like this: a control plane, which consists of a few deployments and pods, so it's more than one component, but ultimately it's just normal, standard, stateless Kubernetes deployments that store all their state in Kubernetes config maps, custom resource definitions and so on. It's fully cloud native and fully embraces the platform. No more custom configs, custom reloads, anything like that. And of course, we had to move away from using a DaemonSet to using a sidecar container, which is quite different. So how do we do that, first of all without any downtime, and without completely confusing the developers as to what's going on? Well, there are a few things; we'll get to that in a second.

The second thing we did was make sure we could configure it correctly. This is how the configuration for a service looks in Linkerd 2. As you can see, it's a normal, standard Kubernetes manifest that uses a CRD, in this case a ServiceProfile, which hooks into Linkerd 2's control plane, which then configures the proxies to behave based on what we configured. And the good thing about this is that we could simply migrate all of our existing configs over into Kubernetes manifests and package them up next to our apps, with our deployments or Helm charts or whatever it is you use at the time. So yeah, we can set retries, we can set timeouts, things like that; there's a lot you can configure. I won't go through all of it, but it was quite powerful, and we could configure it all centrally.
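As a concrete sketch, a Linkerd 2 ServiceProfile looks roughly like this. The service name and routes here are made up for illustration, and the exact schema depends on your Linkerd version.

```yaml
# Rough sketch of a Linkerd 2 ServiceProfile; the service name and
# routes are illustrative, and fields vary slightly by version.
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: ledger.default.svc.cluster.local  # keyed by the service FQDN
  namespace: default
spec:
  routes:
  - name: POST /entries
    condition:
      method: POST
      pathRegex: /entries
    timeout: 300ms          # per-route timeout
    isRetryable: true       # allow the proxy to retry this route
  retryBudget:
    retryRatio: 0.2         # at most 20% additional retry load
    minRetriesPerSecond: 10
    ttl: 10s
```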
And when we showed it to the other developers, they were a lot happier with it: "we can just use this the same way we configure anything else in Kubernetes." So yeah, it was positive.

So what were our goals? One of the main things we wanted was no required developer interaction to make the switch. What does that mean? It doesn't mean developers can't interact with it. They can, but we didn't want to require it; if somebody didn't care about it, it should be invisible to them. The same goes for the users and customers of the platform: we didn't want any disruption. And we wanted minimal changes to our current architecture and deployment pipeline.

Number one was a particularly interesting one. How do we avoid requiring anything from developers? Well, if we go back to Linkerd 1, the way we configured the pods to use the proxy was to set the http_proxy environment variable to point to the node the instance was running on, using a snippet like the one on the left. We basically just got the node name and set it as the HTTP proxy; the app would pick up that environment variable and send all its requests to the proxy, and from that point Linkerd 1 took over and knew how to route them to the right place. So that was one thing. And the second thing is that we had agreed on a default, standard way of identifying our services, which was basically just service name dot namespace in Kubernetes, and that was it. Luckily, the service names themselves stayed the same in Linkerd 2, but even if they hadn't, we could have configured them to be the same. In Istio, and even in Linkerd 2, you can always configure what the host names actually are, so we could migrate over without any naming issues.

But how do we solve the problem on the left? Because if we're going to move over, everyone's going to have to remove that snippet and put in whatever it is Linkerd 2 needs. We wanted to remove that manual step of having everyone do it themselves. So we embraced Kubernetes and made use of admission webhooks. We created our own mutating webhook that did this automatically on the cluster. The way it works is that it's a service you deploy, and Kubernetes will call that service before each pod is scheduled. At that point, you can edit the pod definition itself, for instance to add an environment variable or add a container to the pod, without the owner having to do anything. It all happens automatically in the background. And actually, Linkerd 2 uses this mechanism as well, as do Istio and a bunch of other tools. So this was a good solution to our problem, and it's completely transparent to anyone using it. What we did was move that snippet into the admission controller, and then everyone removed it from their manifests. So yes, there was a step that needed developer intervention, but once we moved away from manual setup to an automated setup within the cluster, we could update this in the background without anyone having to change anything.
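I don't have the original slide snippet to hand, but the usual pattern for pointing an app at a node-local proxy uses the Kubernetes downward API, something like this; port 4140 was Linkerd 1's conventional outgoing port.

```yaml
# Sketch of the kind of env-var block the webhook injected: fetch the
# node name via the downward API, then point http_proxy at the
# node-local proxy. status.hostIP is a common alternative to nodeName.
env:
- name: NODE_NAME
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
- name: http_proxy
  value: $(NODE_NAME):4140
```

And registering a mutating webhook so that Kubernetes calls your service on every pod creation looks roughly like this on current Kubernetes versions; all the names here are hypothetical.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: proxy-env-injector          # hypothetical name
webhooks:
- name: inject.proxy-env.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  clientConfig:
    service:
      name: proxy-env-injector      # our in-cluster webhook service
      namespace: platform
      path: /mutate
    # caBundle omitted for brevity
  rules:
  - operations: ["CREATE"]          # called before each pod is scheduled
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
```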
So effectively, what did we do? Our plan was actually fairly simple, looking back at it. We had all of our infrastructure definitions in a Git repository, we versioned everything, and we followed the infrastructure-as-code, DevOps approach that everyone follows. So we built a pipeline to do the migration, and we kicked off the migration by merging the changes into the right branch of the right repository. Of course, we tested it a bunch of times before we moved on to the production environment.

What the job did was deploy both service meshes at the same time, and then gradually move services one by one from Linkerd 1 to Linkerd 2. The way we did that was: first, we deployed the control plane. Now remember, the control plane is just a normal deployment. It doesn't interfere with anything, and there's no harm in it being there if nobody's using it. So that was straightforward: we deployed the control plane, and it was there and ready to go. At that point, we also made sure to apply all the configuration the services would need, which we had prepared beforehand. Most of it was generic, automatically generated configuration that we could tweak later on, but for the initial migration we didn't need to. So far, fairly straightforward.

Then the important step came up: how do we actually move every service from the old service mesh to the new one without causing disruption, without the developers noticing, or us noticing, or the customers noticing, or anyone noticing? That was almost a personal challenge for us. Because we had moved to the admission webhook for configuring our pods, we simply stopped it, which meant any new pod that was deployed would no longer have the Linkerd 1 configuration applied. And at the same time as we stopped it, we enabled the Linkerd 2 admission webhook. Now, the way the Linkerd 2 admission webhook works is that it's automatically there once you deploy the control plane, but it's not activated. You activate it by putting an annotation on the namespace, or on the deployments themselves, wherever you want injection to happen. Again, we didn't want to edit the deployments, so we put it on all the namespaces that were in use, and that way we enabled Linkerd 2's injection everywhere.
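Concretely, that injection switch is just metadata on the namespace, something like this; the namespace name is illustrative.

```yaml
# Turning on Linkerd 2 sidecar injection for everything in a namespace;
# the namespace name is illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  annotations:
    linkerd.io/inject: enabled   # the injector adds the proxy sidecar
```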
So what ended up happening is that any new pod that got deployed would have the Linkerd 2 injection applied, which would add the sidecar container to the pod, which would automatically connect to the Linkerd 2 control plane and start using the new service mesh. And the way Linkerd 2 works was very convenient for us: if the service on the other side is not yet on Linkerd 2, it will skip the mutual TLS check, so we didn't need to worry about the two sides of a request not being able to connect because one hadn't migrated yet. And once everyone had migrated, mTLS would be active and everything would work.

So once we flipped the switch on those two webhooks, we simply initiated a cluster-wide rolling deploy, one service at a time, to destroy the old pods, create the new pods, pick up the new configuration and connect to the new service mesh, all transparently, in the background, while serving requests, without any downtime. We have Kubernetes to thank for that: rolling deployments work well generally, and they worked well for us in this case. It took a few hours to get everything rolled, because we wanted to make sure everything kept working as we went. And then, once everything was migrated, we did a simple cleanup: we removed the old admission webhook and controller, and removed the old Linkerd 1 DaemonSet from the nodes. And that was it.

And again, the reason we spent so much time on the rolling deploy is that if at any point something went wrong, we could always roll back. We could re-enable the old webhook, disable the new one, redeploy the instances we had restarted, and they would go back to using the old service mesh. So we had a backup plan if things didn't go well. And of course, we had tested the whole thing a bunch of times before doing it in production.

So that was the general plan of how we got things across, and how we used automation and the features of Kubernetes to make big architectural changes to our service mesh. And that's just one example of what you can do. You could use the same concepts to make any sort of change in Kubernetes or other cloud-based environments. This is where we are now, and so far it's been working well, and we're looking forward to moving along as the service mesh ecosystem matures.

Before I finish, just a few notes on what you can take away from this. Do fully automate your infrastructure, from the get-go or as soon as you can. As you saw, once we had all the building blocks in place for automating the configuration of infrastructure components, it became quite trivial to switch things over: deploy the new version of a webhook, restart things, and new pods pick up the new setup when the time comes. So don't do it manually. Don't rely on tens of teams and hundreds of developers having to do everything at the same time; it just leads to problems. Take the burden away from them and automate as much as possible.

And to follow up on that: let the platform do the heavy lifting. That's why it's there. The reason Kubernetes is so popular nowadays is that we don't need to deal with where containers are scheduled, how they're used, and how they move around the cluster. It's the same with stuff like this. Don't do it yourselves; you don't even need to think about it. You can let the platform handle it for you, and write the automation you need on top of it.

And one last note. I talked a lot at the beginning about what a service mesh is, and a lot of the time it's difficult for people to grasp. I can explain it three different ways, and it's still difficult to say what it actually is. So I prefer to say what it actually gives you. And what it gives you is a transparent, reliable and autonomous network layer for any service running in your cluster. The important part is that you don't actually need to know about it. As long as you're aware it's there, you can go into it and configure it as much as you want, but you don't need to. It's just there, and it means you don't need to think about the network anymore; the service mesh takes over for you, and there are a lot of cool components working underneath to make that happen.

Cool, right on time. Thank you very much, I hope that was interesting for you. Let me know if you have any questions about anything like that. If not, thanks for coming, and I hope you enjoy the conference.