I'm going to start by saying that the logo at the top of the slide is the company I work for. I'm not going to mention it at all during this presentation, because I'm not trying to sell you anything. What I am trying to do is give you information so you can make good choices about what kind of service mesh technology you want to use to support your Kubernetes clusters. That's the important thing. The other important thing is that the bear is my company's logo. They picked me to represent it because I look like him. Okay.

So this is going to be about service meshes, comparing the architectures of the various types of service mesh technology. And I see a few people who suddenly have an inquisitive look on their face, like, okay, what's a service mesh? We're going to go through that first, and then we'll get into the architectures themselves.

That's me. I'm Bruce Matthews, Bruce Basil Matthews as a matter of fact, which is why you'll find me that way on Google and all that kind of stuff. I've been doing this a very long time. I've worked for very large corporate America, and now I'm with this nice startup that I'm not going to talk about anymore. But that's my email address if you need to get hold of me.

All right, here's the meat of what we're talking about today. The first thing we have to do is figure out what a service mesh is. Technically, the networking layer that sits underneath the Kubernetes framework, things like Calico or Flannel, gives network plumbing to individual containers. It doesn't join containers together at the application level, and it has no inherent facilities for things like load balancing or encryption. As a result, application developers have been left to do all of that however they wanted, so you end up with a helter-skelter approach to things like load balancing and encryption across your applications. A service mesh provides all of those capabilities as an infrastructure layer for the microservice application as a whole. And the beauty of a proper service mesh is that, to take advantage of it, you don't need to change a line of code. That's the brilliant part. It makes all of the communication between services transparent to the application developer, so they don't have to worry about it in their coding techniques, and it provides a whole bunch of additional capabilities that we'll go into in depth.

Okay, here's the meat of what a service mesh provides. The first thing is service discovery. What that means is: if there's a load balancer in your service mesh, every application knows about it automatically. If a new container is added to the mesh, say in a sidecar environment with an Envoy proxy, it's immediately discovered by the service mesh without anybody having to do anything. That makes for a very positive experience for the application developer, because they don't need to know what's out there in order to take advantage of it. The second thing is load balancing: some means by which you can scale, continually adding service instances of the same type and having them immediately recognized and added behind the same load balancing technology, so you can standardize how that's done.
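To make those first two ideas, discovery and load balancing, concrete, here's a minimal sketch in Go of what the mesh is doing on your behalf: a discovery table of endpoints per service, and a round-robin balancer over it. None of these type or service names come from any real mesh; it's just the shape of the thing.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Registry is a toy stand-in for the mesh's service-discovery table.
// A real mesh populates and prunes this automatically as pods come and go.
type Registry struct {
	mu        sync.RWMutex
	endpoints map[string][]string // service name -> endpoint addresses
}

func (r *Registry) Register(service, addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.endpoints[service] = append(r.endpoints[service], addr)
}

// RoundRobin hands out endpoints in rotation; callers never need to
// know which containers exist or where they run.
type RoundRobin struct {
	reg *Registry
	n   uint64
}

func (rr *RoundRobin) Pick(service string) (string, error) {
	rr.reg.mu.RLock()
	defer rr.reg.mu.RUnlock()
	eps := rr.reg.endpoints[service]
	if len(eps) == 0 {
		return "", fmt.Errorf("no endpoints for %q", service)
	}
	i := atomic.AddUint64(&rr.n, 1)
	return eps[i%uint64(len(eps))], nil
}

func main() {
	reg := &Registry{endpoints: map[string][]string{}}
	// In a mesh these registrations happen automatically at pod startup;
	// the addresses here are made up for illustration.
	reg.Register("checkout", "10.0.0.5:8080")
	reg.Register("checkout", "10.0.0.9:8080")

	lb := &RoundRobin{reg: reg}
	for i := 0; i < 4; i++ {
		addr, _ := lb.Pick("checkout")
		fmt.Println("routing request to", addr)
	}
}
```

The point of the mesh is that neither of those pieces lives in your application; the proxy layer does both for you.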
The standardization point matters: one team wouldn't be using NGINX while another uses HAProxy. They'd both be getting load balancing from the service mesh itself. Now, you may provide several options yourself as the administrator, but you're not letting application developers willy-nilly pick their own.

Another aspect is encryption. The sooner you can institute encryption in your environment without forcing each application service to provide it, the better. TLS is the basic framework for it now. There are lots of ways of integrating the encryption capability, but it happens as soon as traffic hits the service mesh and is then distributed internally, and that prevents a lot of potential attacks on your application services.

And authentication, right? I've got a certificate; I'm Bruce; the system knows me. But you, Charlie, wouldn't have that certificate, and your request would fail. Typically it's been up to the application developer to decide how to handle that inside each individual container. The mesh provides that service externally and says, I'm not even going to let you get to the container. You know how Kubernetes has that problem where a container running as root can be a path into the entire system? This prevents somebody coming in from the outside without a valid certificate from ever getting that far. And you can use things like Vault to manage those certificates; all of that is capable of being plugged into the service mesh.

And finally, and this one is more beneficial to the application lifecycle and its management, there's the idea of a circuit breaker. The mesh is polling the health of a container to verify it can still provide its service, and when it no longer can, the mesh cuts off that avenue and reroutes to the next container providing that service, automatically. As a result, you don't get that stutter of, oh my God, I can't find my application service. The mesh prevents that from happening and gives you a new service instance automatically, without you having to do anything.

Okay. Why go to all this trouble of instituting a service mesh in your containerized, Kubernetes-oriented world? The first thing you need to know is that the ability to shape traffic inside the cluster, and externally to your consumers, is paramount, and you have to be able to provide quality of service while doing it. In other words: I think this service needs more oomph than that one. You can establish a rule base that lets you do exactly that within any service mesh you acquire. And again, that circuit breaker feature I described earlier: the application developer just needs to tell it what to do when a container dies. Autoscale? That's the way out. Those are a couple of the key benefits.

Now I want to shift from describing what a service mesh is to describing the ways they've been architected up until now, okay? We started off with libraries. They were sort of like the old C shared libraries, dynamically linked, that you literally had to implement inside your container. Each container had to have a copy of that library available to it, which was fine; it was able to segregate service traffic from container X versus container Y. But you were never quite sure how many teams had actually put the library into their containers. You'd have to do something from the outside to verify that, or force everybody to do it. The sketch below shows the kind of logic we're talking about.
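Here's a hedged sketch of the kind of retry-and-circuit-breaker logic those libraries embedded in every single container. This isn't the actual API of Hystrix or Finagle or any real library; every name in it is hypothetical. It just illustrates what ends up duplicated inside each application image.

```go
// Package meshlib is a toy version of the kind of client library
// (think Netflix Hystrix/Ribbon, Twitter Finagle) that the library
// architecture bakes into every container. All names are hypothetical.
package meshlib

import (
	"errors"
	"net/http"
	"time"
)

// Client wraps HTTP calls with retries and a naive circuit breaker.
// Note the state is per-process: every container carries its own copy.
// (A real library would also make this concurrency-safe.)
type Client struct {
	httpc        http.Client
	maxRetries   int
	failures     int  // consecutive failures seen so far
	failureLimit int  // trip the breaker after this many
	tripped      bool // once open, stop calling the backend
}

func New() *Client {
	return &Client{
		httpc:        http.Client{Timeout: 2 * time.Second},
		maxRetries:   3,
		failureLimit: 5,
	}
}

// Get retries on failure and refuses to call the backend at all once
// the breaker has tripped.
func (c *Client) Get(url string) (*http.Response, error) {
	if c.tripped {
		return nil, errors.New("circuit open: refusing to call backend")
	}
	var lastErr error
	for i := 0; i < c.maxRetries; i++ {
		resp, err := c.httpc.Get(url)
		if err == nil && resp.StatusCode < 500 {
			c.failures = 0 // success resets the breaker
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			err = errors.New(resp.Status)
		}
		lastErr = err
		c.failures++
		if c.failures >= c.failureLimit {
			c.tripped = true
			break
		}
	}
	return nil, lastErr
}
```

Useful logic, but it ships inside every container image, and that's exactly the problem.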
Because if I needed to make a change, to add something to that library, I'd have a bazillion copies out there in thousands of containers, and I'm never sure which ones have it and which don't.

So the node agent was the next iteration. That took the idea that all of the proxied traffic could go through a central point on each of your worker nodes. Every worker node has a node agent on it, talking to the control plane sitting at your control service, and that's a lovely world in which to communicate and control things at the agent level.

Finally, and this is in the last year or so, they figured out it was probably better to spawn a new container adjacent and equivalent to the initiating container. The only job that container has is to act as the proxy: all of the traffic to and from the application container flows between the container and its adjacent proxy, and from there goes out to the services associated with it, the load balancers, the encryption, all of those kinds of things. So those are the three types that are out there.

Oh, I was going to tell you a story. I don't know about any of you, but I've gotten to the point where I work from home, and there are advantages and disadvantages to that. One of the advantages is that I have a four-year-old grandson I get to help watch during the day while his mom's at work. He's a very inquisitive kid, and he comes to me all the time and says, Papa, what are you doing? Because I always ask him what he's doing. And I tell him, you know, I'm doing this or that. One day I was working on these service mesh frameworks, putting together this very presentation, and I told him, I made a "smesh." His eyes got about this wide. Wow! I'm showing him the "smesh" on the computer, and he goes running out to my wife, his grandma; I can hear him yelling down the hall. A little while later, my wife comes back in with a broom and a dustpan and says, where is it? And I said, no, I'm making a service mesh, a "smesh." She just shook her head and walked out. That's my life at home, you know. And that's where my "smesh" nicknames for these service mesh architectures come from, okay?

So let's go through the library architecture. The library is the simplest implementation. It gets injected into each of the containers, and all of the communication goes from library to library, not from container to container anymore. These were pretty successful, and it was the original approach to be adopted. But it had some pain points, the biggest being the difficulty of updating or upgrading the library, because you had to know everyone who used it. Version control was a little dicey. And if there were conflicting demands for performance, you'd have to hunt around using other tools, because the library didn't include anything that let you monitor or shape the traffic. Some of the players involved here were Twitter with Finagle, and Netflix with Hystrix and Ribbon.

Okay, then there was the node agent architecture type that we talked about. Here, all of the traffic for all of the application services, even the ones that occur locally on that worker node, goes through the agent itself. As a result, you've got one consolidated place, and that gets reported up to a control plane service for monitoring and all of that. A rough sketch of what that per-node agent is doing follows.
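Here's a minimal sketch of the node-agent pattern, under a couple of loud assumptions: I'm using a toy protocol where the client names the target service on its first write, and a hard-coded routing table. A real agent intercepts traffic transparently and has its routes pushed down from the control plane.

```go
// A sketch of the node-agent pattern: one proxy process per worker
// node, which every local container dials instead of dialing other
// services directly. Addresses and the protocol are made up.
package main

import (
	"io"
	"log"
	"net"
	"strings"
)

// routes maps a service name to its current backend; in a real agent
// the control plane pushes and updates these entries.
var routes = map[string]string{
	"checkout": "10.0.0.5:8080",
	"catalog":  "10.0.0.9:8080",
}

func main() {
	// One listener for the whole node: the consolidated choke point,
	// and also the single point of failure we'll talk about next.
	ln, err := net.Listen("tcp", "127.0.0.1:15001")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go handle(conn)
	}
}

func handle(src net.Conn) {
	defer src.Close()
	// Toy protocol: the client's first write names the target service.
	buf := make([]byte, 64)
	n, err := src.Read(buf)
	if err != nil {
		return
	}
	backend, ok := routes[strings.TrimSpace(string(buf[:n]))]
	if !ok {
		return // unknown service: nothing leaves the node unvetted
	}
	dst, err := net.Dial("tcp", backend)
	if err != nil {
		return
	}
	defer dst.Close()
	go io.Copy(dst, src) // client -> backend
	io.Copy(src, dst)    // backend -> client
}
```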
Definitely an advancement over the library methodology: there were fewer of them, and less resource involved in supporting them. But the reality is that there were some downsides too, and those had to do with the fact that each node agent is a single point of failure on its worker node. If it goes away, all of the traffic from that node stops, and you have to fix that agent to make it flow again. They've improved that in some ways, setting up virtual IPs over the top of the node agent and trying to make it fail over, but then you're spending twice the resources to fix it. So anyway, that's the node agent.

Now, the next generation: the sidecar technology. This is my personal favorite; I like working with this one. The idea is that each of the sidecars talks to the others, both internal to the worker node and externally to the other nodes in the cluster. That way, if a sidecar fails, it only affects the one container, and typically that gets caught by the circuit breaker. The circuit breaker says, start up a new one, create a new sidecar, and I'm off and running. So there are a bunch of benefits to this particular methodology. It's not perfect, and we'll get into how it's not perfect in a little bit. Because the sidecar is so closely bound to the originating container, there's very minimal chance of it being accessed or attacked from the outside. An attacker has to come through the gateway that fronts all of the traffic, and that gateway is some kind of load-balanced entry point that we'll talk about in depth in a second. Examples of sidecar meshes are Istio and Aspen Mesh, for those who follow that kind of thing.

Okay, so now we're going to do some comparisons of those three architectures, which I affectionately call "Smesh Comp." There are four and four here that I picked out to give you some basis for comparison. On the left-hand side are the ones supplied purely by the open source community; in other words, there's no corporate entity you license things from. On the right-hand side are licensed versions of those products. They typically call them community and enterprise editions, just like everybody else with a community version of an operating system, and it's fairly consistent in that way. The only one that steps outside that boundary, AWS App Mesh, is really intended to lock people into AWS. I'll go into that when I get there.

Okay, the open source types. Let me give you some idea of what each one of these does. This is Envoy, the archetypal proxy. Envoy was one of the first out there. It's lightning fast, it's written in C++, and it was originally developed by the people at Lyft. It does very little other than proxying and load balancing, but it is really, really fast. It's got three parts: the main thread is a single thread, which is a little problematic if that thread dies; the workers are multi-threaded; and there's a file flusher whose job is to say, I don't need those file handles anymore, close them so I have more resources.

Istio, which I think is becoming one of the more prominent players in the service mesh market, took the Envoy methodology and put it on steroids.
For each container that's created, Istio creates a sidecar; that's the Envoy proxy, and it talks to the Mixer component, which does the governance as to the flow of traffic and all of those kinds of things. It also takes care of authentication for you, so that no request ever gets to your container without going through it first. And it's really lightweight from a data plane standpoint. I'll give you some performance statistics later to back that up.

Okay, and then we've got Linkerd. Linkerd is kind of a weird one; it's had two iterations already in its life. A company called Buoyant started it off, and it was written in Java, which I feel really odd about. I worked for Sun for about 15 years, back when James Gosling was writing Oak, and, you know, it was never good then; God knows why they'd still be writing in it now. It was very heavyweight and took a significant amount of resources, but it did the job as a node agent. And the beauty of it was that it was both the manager and the client: I could issue administrative commands on any node and they would get passed on to whichever node client needed them. Linkerd 2.0 is the sidecar version, and this is why they did it: they rewrote the whole thing, Go for the control plane and Rust for the proxy, and broke apart the data plane and the control plane. It functions similarly to Istio, with one major exception: where Envoy is used inside Istio as the proxy service, Linkerd 2 has a dedicated proxy written just to carry Linkerd traffic. It has no other purpose in life.

Okay, let's take a look at some of the commercial offerings, so we'll have a basis for comparison here in a minute. And of course I call this one "Comm Smesh." Consul, which is one of the originators of this space, was developed by HashiCorp, around the same time they developed Vault as the key management service. The whole idea was that it could run over any kind of pipe; you don't have to build intelligence into your physical network. So sorry, Cisco ACI; sorry, Arista; you don't need any of that stuff laying around. You can do most of what you need, from shaping traffic to setting encryption policies to circuit breaker patterns, through Consul, without building it into your network.

Aspen Mesh is really another Istio lookalike. It's built on the same components and takes advantage of Envoy and all of that. But whereas Istio stops at the layer below, the control plane API, Aspen Mesh built an additional API layer on top to abstract it even further so you can manage it. And it's fairly agnostic in terms of operating system and those kinds of things, so it fits in more places and can be propagated pretty quickly.

These guys, Kong: I just sat through a presentation from them. They've engineered this product themselves rather than assembling it from the usual open source mesh components, and it's really good, really sharp. You may end up paying a lot for it, because it's a licensed service; they do have a community version, but the enterprise version is much more robust and capable. It has its own type of Ingress controller, so you don't need to use something like the GKE Ingress controller or what's embedded in the Kubernetes world, kube-proxy and all that.
And it has plug-in capabilities, so you're literally able to build in an NGINX load balancer or an HAProxy load balancer instead of the Envoy environment most people might use. When you do that, it's native to the product; it becomes a part of it, not something you've glommed on. They use Lua as the plug-in language to make that work, so it's pretty easy to do, almost as easy as working in Go. It has a control plane/data plane mode just like most of the others, and it can be deployed as either a sidecar or a node agent. But the main way they provide it is through serverless technology. That's a new thing, and it's coming; folks who don't use serverless technology should start looking at it, because it will be here faster than we can keep up with it. What that lets you do is avoid dedicating a server to host this: it shows up when it's needed and goes away when it's not required. Kind of a neat thing.

Okay, AWS. The reason I threw this in is because, I've got to tell you, I don't know how these people are getting away with this. We're using Envoy, that's the foundation for it, like every other Envoy implementation out there. But, oh, by the way, you have to use Fargate while you do it, or you have to use ECS; they won't let you use it outside of that environment. So it locks you into working with AWS, which is their goal in life: to make everything live there.

Okay, we're almost there. Let's do a comparison list, a "Comp Smesh." Here's a little foundational information for all of the ones we've looked at. Envoy: very low latency, very high traffic handling on gRPC, all HTTP oriented. Very few of these tools do UDP or anything like that; it's almost all TCP/IP traffic over HTTP. That's just the way they're built in general. It's API driven, so you can get in and out of it and extend it without anything special, and you can integrate it into your own environments. And it gathers a ton of metrics; I'll get into how people visualize those in a minute.

Istio, on the other hand, is more focused on security: how to manage and maintain secured traffic, and how to shape the traffic from one place to another, both between containers and to the outside world. It has the ability to do fault injection, which lets you deliberately break something so you can verify that your rules, things like the circuit breaker, actually kick in the way you expect when something fails. Once again, HTTP and gRPC. But it additionally has quota management and rate limiting on the traffic itself, which makes it a more robust methodology than some of the others, because you can build some sophisticated plans around it. And it runs on a lot of platforms.

Linkerd and Linkerd 2: the first iteration ran pretty well for Java, but it takes a lot of resources and everything else. Linkerd 2 is a better methodology, with its control plane written in Go, a thinner resource demand, all that kind of stuff. And the idea that you could write the proxy in Rust and hook it up to a Go-oriented control plane works nicely, since Rust is a similar enough language to Go to deal with. Linkerd 2 is a lot faster than Linkerd 1. And it's all open source; but if you need support on a node-by-node basis, you can go to Buoyant and get it, which is what they hope you will do, because Buoyant funded the whole project.
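Before we get to the side-by-side comparison, one note on that fault-injection capability I mentioned for Istio. In a real mesh you declare this in routing configuration rather than writing code, but here's a minimal Go sketch, with made-up type and parameter names, of what the proxy is effectively doing for a configured fraction of requests:

```go
// A sketch of the mechanics behind a mesh's fault-injection rule:
// for some percentage of requests, the proxy injects a delay or an
// error instead of (or before) forwarding. Standalone illustration
// only; FaultPolicy and its fields are hypothetical names.
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"
)

type FaultPolicy struct {
	DelayPercent float64       // fraction of requests to slow down
	Delay        time.Duration // how long to stall them
	AbortPercent float64       // fraction of requests to fail outright
	AbortStatus  int           // status code to fail them with
}

// InjectFaults wraps a handler the way a proxy wraps an upstream.
func InjectFaults(p FaultPolicy, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if rand.Float64() < p.AbortPercent {
			http.Error(w, "injected fault", p.AbortStatus)
			return
		}
		if rand.Float64() < p.DelayPercent {
			time.Sleep(p.Delay)
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	// Abort 10% of requests with a 503 and delay another 10% by two
	// seconds, to verify that callers' retries and circuit breakers
	// actually behave the way you expect.
	policy := FaultPolicy{
		DelayPercent: 0.1, Delay: 2 * time.Second,
		AbortPercent: 0.1, AbortStatus: http.StatusServiceUnavailable,
	}
	log.Fatal(http.ListenAndServe(":8080", InjectFaults(policy, backend)))
}
```

The value is that you get to rehearse failure on purpose, in the mesh layer, instead of waiting for production to do it for you.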
If you compare those two side by side, you'd find that Linkerd 1 has a lot more features pre-built right now. They only just rewrote it in Rust and Go, so Linkerd 2 doesn't have as many of those things available yet, and I've listed some of them here so you can get a handle on what's different.

Okay, some service mesh comparisons for the commercial side. Consul ships as a single binary. You can run administrative commands from any node that hosts it, because it has both the control plane layer and the data plane layer mixed into that one binary. Aspen Mesh provides an intuitive UI that lets you drill down and do all kinds of hunting around. And I'm wondering how much time I have left; 3:43, okay, I'm going to get there real quick. Kong: a well-engineered product with its own plug-in development kit that lets you plug in different types of things, like load balancers, written in Lua, pretty cool. I've actually run the Cassandra-backed version of this, and I think it's superior to the Postgres-backed version, both in terms of deployment and performance, so I'd recommend that. AWS App Mesh, or "App Smesh": I wouldn't recommend this in any way, shape, or form, unless you're only using AWS for everything, and who does that? I hope nobody.

Okay, visualization. They all use Grafana, and underneath the covers they've got Kiali for visualizing the mesh topology and Jaeger for tracing, open source projects that let you capture who's talking to whom and where the ingress and egress is going within the environment. All of that gets displayed in Grafana, so you can pick things out and see how you can improve performance by changing the configuration and all of those kinds of things. Underlying that, they take advantage of Prometheus for metrics collection and trend analysis, which is a good data management source.

Okay, comparing service mesh performance. If you're going to compare these yourselves, look at these factors. On the control plane: the rates of both deployment and configuration changes you're going to make, and the number of proxies it has to support, because that tells you where you can and can't expect the control plane to keep performing. On the data plane: request size and response size are big ones, plus CPU cores, because how many cores are taken up doing other things determines whether the proxy is resource constrained. And to do that, you want to use benchmarking tools that make sense. I've listed three of them here: Fortio, BluePerf, and a third from Kinvolk. Those first two results were done in May; the Kinvolk one I added recently because I tested it myself, it works, and I think it's a good tool.

Okay, load balancing throughput and latency. Those were the five I looked at, and here's what I found: Envoy's the winner. It's way faster than the others in terms of lower latency and requests per second, and it holds up through the 99th percentile far better than some of the others. But to be fair, the gentleman from Linkerd said, I've got one that shows ours is faster, so I thought I'd provide that to you as well. I even followed up on it, since it was published less than a week ago: these are the latency stats for Linkerd 2 versus Istio, and the latency, memory utilization, and CPU utilization are all lower, supposedly, on the Linkerd side. But to be fair, they said, here's the untuned Istio. Oh, well, okay, we tuned it a little bit, so we sort of massaged it.
But if you notice, the stats are all the same. So it's sort of like, well, you didn't tune it that well. I would run the tests again myself using the same tools, because I think they had a bias.

Okay, pros and cons. With a library architecture, you get hit with these particular advantages and disadvantages. From a node agent standpoint, you've got a set of advantages and an equal amount of disadvantages. And with the sidecar architecture, there are far more advantages than disadvantages. I won't read through all of this for you now; I've already put these slides on the internet, sitting under the profile for this talk.

I just want to mention a couple of things in closing, my own thoughts here. Istio is becoming a member of this lovely body, the CNCF, and they've almost got that wrapped up, so they'll be a full member with a technical oversight committee and all of those kinds of things. If you work in that kind of environment, please contribute to it, take a look at it, and send feedback to those people. Both Linkerd and Istio have open source and commercial implementations, so you're not stuck with just the one type.

This is really important: why is the service mesh important to you as an application developer, to you as an administrator, to you as an operator? That's something that needs to be answered, and the only way you can answer it is to try it. I suggest you do try it, because a service mesh is going to be fundamental, practically mandatory, in the 2020 time frame. Every Kubernetes implementation is going to have one, if not multiples, just to figure out which one they want to use. If you're currently using one, you already know what benefits you've gotten out of it; continue using it, and continue documenting the things you find useful and not useful in a service mesh, performance, capabilities, features, all of those kinds of things, because the technology is improving constantly, which means you have to keep your ear to the ground and listen for changes in order to take advantage of them. And finally, if you've not decided, and you're working in Kubernetes at this moment and hitting stumbling blocks, and you're as ticked off as some people get about working with just straight Kubernetes, it's not too late to start using this technology to solve those problems. Try it, get into the game, and let's see if we can make this work for everybody.

Okay, once again, I'm Bruce Basil Matthews, bmatthews@mirantis.com. That's the last time I'll have to say the name, so now you know where to get hold of me. Please feel free to write me if you have additional questions I can't answer now. Does anybody have any questions they'd like to ask? Yes, sir. I'm sorry? Yeah. Kong, yeah; I think it's just a more complex application service than some of the other stuff, and as a result only a few companies have dedicated the time to develop the code to support it. But I'll bet you, with some of those open source guys, if you said, I'd like to put in a thousand copies of this: oh, okay, we'll make it for you. Anybody else? Yeah. Yeah, you can see how many implementations rely on it, right? And its adoption, right? So, yes. It's almost over. I always hate to do that, because I don't want to diss anybody.
They've all got positives and negatives, and if you develop a good working relationship with people like the folks from Kong and the folks from Linkerd, who sat down with me for an afternoon so I could give the proper information out to you folks, then maybe for you they're better than Istio, where you kind of have to figure it out for yourself. But there's Aspen Mesh and all these others that can help you too, right? So yes, I agree with you. Anybody else?

Right, the circuit breaker. Yes. So it really comes down to the types of rules that surround it. First, on the load balancing side: kube-proxy isn't quite as robust as people may need, and they'll typically want to use something else. I think NGINX is the most popular one, but HAProxy works great too. That's the primary reason for not wanting to use kube-proxy. The second one, on the circuit breaker side, is what you have to do after it breaks: defining in your application code where to go, how to reroute, and all of those things. Native Kubernetes doesn't have the rule set capability that would make that really meaningful, so you end up having to write more code to support the version that's native inside Kubernetes than with the versions outside it, which can say, I already know I've got autoscaling, so I'm just going to hit the trigger to autoscale and take care of it, unless you tell me to do something else, which you can. So, those two things. Anybody else?

Folks, this has been a real pleasure. I really appreciate your time and attention. I hope I've given you enough ammunition to go out there and hit the service mesh trail, and if you do have any questions, please don't hesitate to reach out. It's part of my existence in the open source community to try to help figure out what to do. Okay. All right. Thanks, guys. Thank you.