So thanks, everyone, for coming. I'm Jason McGee. This is the Ask Me Anything panel on service meshes and microservices. The way Ask Me Anything is supposed to work is we're not prepared at all. We have no materials. You're supposed to ask us whatever random questions come to your mind. So hopefully that'll work out. If not, I will break the ice. Why don't we start with introductions? I'll ask the panel to describe what they do, and then we'll launch into it. So Lin, do you want to start? We'll go that way.

Sorry, can you guys hear me okay? Hi everyone. My name is Lin Sun. I work for IBM. I'm a senior technical staff member and master inventor there, and I just recently joined the Istio project, in August this year. I've been speaking about monitoring and metrics for your microservices, and I also have a session on reliably deploying your application with Istio on Friday.

Hi everyone, I'm William Morgan. I'm the CEO of a company called Buoyant, which makes a service mesh called Linkerd, and another one we just announced yesterday called Conduit. I'm not good at keeping track of what time it is, but I am very fast when I run.

Hey everyone, I'm Sven. I'm from Google. I'm one of the founders of the Istio project. I've been working on mostly API-management-related stuff for the last decade.

My name is Christian Posta. I'm a chief architect of cloud application development at Red Hat. I wrote a book called Microservices for Java Developers, and I work with our customers to help them build out competencies in microservices and DevOps.

Hey, I'm Matt from Lyft. I work on our networking team, where we do all types of load balancing, service discovery, and proxies. I'm mostly known for Envoy, the proxy. Very happy to be here. Thank you.

And then Jason, obviously. My role at IBM is I run our containers and microservices work, so our Kubernetes service and the work we're doing around Istio kind of fall into that. With that, maybe I'll open it up with a single, simple, basic question: what is a service mesh? Since this is kind of a new idea, why don't we start with what you guys think a service mesh is, and then you can launch into it. Anyone want to take a pass? This is the fun part of panels: you have to, you know, volunteer to talk.

Can you guys hear me? Okay. It's interesting that they picked me, as a new person coming into service mesh. I thought service mesh was really interesting. I always think of service mesh as a "free" tag. I'm a bargain shopper; I always like to shop for things at a cheaper price. And service mesh really gives developers visibility into their microservices. It lets developers securely connect their microservices. It also gives developers traffic management and policy enforcement for their microservices. And what's amazing is that you actually get all of that for free, without changing your application. So that's what I really love about service mesh.

Yeah, I think the answer we usually give is that the service mesh is a network for services instead of bytes. That's, I think, our tagline from some of our talks. It's a way to think about services instead of thinking about bytes and connections and IP addresses talking to IP addresses. You shouldn't have to think about that; you should be able to think about services talking to services, and how do you make that all work?
And I have a background in integration and messaging, building message brokers and that kind of stuff. To me, this is a nice evolution of some of the problems we've solved in the past with those types of technologies, like ESBs and message brokers. And it works awesome in a container environment. It's sort of reimagined in this new container environment, and it allows us to do really interesting things that we couldn't do before: observability and resiliency baked into the system, not just into individual applications, and so on.

Yeah. I mean, I would say the biggest thing we're seeing right now is that most companies are moving away from architectures where they might use one language. Years ago, people might have been using Java or C++ or something like that, and it's a lot more common now that people are using multiple languages. You'll see companies that have five, seven, ten different languages. And what you find in that type of architecture is that there's a lot of common functionality that people have to use, and it's mostly around networking: load balancing, service discovery, observability. What we've found over the past 10 or 15 years is that you either have to be a giant company that can pick three or four languages and invest a team of 30 to 40 people to maintain libraries in all those languages, or you can implement something essentially once and then use it across all of your different languages. So from a proxy perspective, the benefit we get is having one piece of code that is highly performant and can have a huge set of functionality, and we can use it across all the different languages. It doesn't matter if it's Haskell or Go or something else. It just works anywhere. It's a super powerful paradigm.

Yeah, I agree. All right. Oh, William, sorry. We actually struggled with this term a fair amount with Linkerd. When we released Linkerd 0.1, or whatever it was, in early 2015, we didn't really know what to call it, and we tried a couple of things. We called it an application router, and then we called it an RPC proxy, because that was kind of what was in our head from the Twitter days. And of course, people were like, "Well, I already have HAProxy." And we'd be like, "Well, no, it's different from that; different set of features." "Well, we're not using RPC, we're using HTTP." And we'd say, "No, well, actually, HTTP is a type of RPC." And by the time you went through this explanation, they were bored and had wandered off. So we kept trying different terms, and eventually we called it a service mesh, which didn't really have a lot of meaning at the time, but it was at least a blank space that people didn't immediately associate with something else. And, I don't know, there's a service mesh company, I guess, where the CEO is in legal trouble in Australia, and that was the number one SEO competition. But yeah, I think once that term started to gain popularity, it became a lot easier to describe what exactly this is. It's kind of a shift: not new functionality per se, but a shift in where that functionality is located.

Yeah, great. Talk loudly. So, do you guys want to comment on how you see things in the CNCF being sorted out between the controllers, where Istio plays, and the agents in the data plane?
Because I think right now there are two projects listed, some in the agent space, and some, I think, covering both agent and controller. So between Istio, Linkerd, and basically Envoy, and the other possible agents, how do you see things sorting out, overall?

What do you mean by sorted out?

I mean, in the end, I see value in having a controller space, things like Pilot, Mixer, and the authentication part (maybe other things will be added), and then agents could be differentiated, could do some things better than each other. So how do you see things coming together, to have a service mesh with options in the controller and in the agent? Makes sense?

Yeah. All right, well, I guess I'll take a pass at the answer. The first thing I want to stress is that the CNCF is not around to pick winners of which one should be the thing. And I think Istio itself is intended to be nicely built, to be pluggable for a lot of these different components. And I think the communities, as they go, and as the CNCF helps foster those communities, will ideally work together nicely; we'll see. Their trajectory after that is up to the communities.

I want to add one thing. As part of the Istio project, one of the things we're trying to do is come up with well-defined APIs for these different interactions between the components. Then you get a lot of innovation in components, and different people can plug in different components. One example, admittedly self-serving from our side, is that we've worked on gRPC and Istio, and we want to make sure that over time gRPC can plug in directly, without having to use the proxy. You have to have those well-defined abstractions, or you just can't do that. We want to make sure that all makes sense. And obviously all of this is evolving independently, everything's changing at the same time, so it's really interesting to watch the control plane APIs come together. On the data plane side, Envoy has actually done a lot of the work of standardizing those APIs, of trying to have a standard API. Matt should probably talk about this, because he's thought about it a lot.

Yeah, so it's a super interesting topic, right? Because you have the actual proxy, the thing that's basically pumping packets and doing L7 proxying, and then you have some controlling entities. You have the data plane, and then you have the control plane. And I think there's a lot of confusion right now around these two concepts. People look at Istio, and I've seen a lot of people ask, is Istio a direct competitor to something like Linkerd? Or is Istio the same thing as Envoy? And they're not. Essentially, Istio is a control plane. It's a thing that can take a bunch of proxies and build them together into a larger system. And as it's designed, you can take Istio and use it with Linkerd. You can use it with NGINX. You can use it with HAProxy. You can use it with Envoy. I do think there's going to be a lot of competition in this space, on both the data plane side and the control plane side, and ultimately I do think there are going to be winners and losers. I don't think we're there yet; this is a super, super active space. But I think the most important thing to take away is that we are attempting to build these APIs so that both sides are essentially pluggable.
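For a concrete picture of that split, here is a rough sketch of an Envoy bootstrap where the proxy is only told where its control plane lives, and everything else is streamed over the discovery APIs. This is illustrative, not exact: the cluster name, address, and port are placeholders, and field names vary across Envoy config versions.

```yaml
# Sketch: the data plane only knows its management server; listeners,
# clusters, and routes all arrive over the discovery (xDS) APIs.
dynamic_resources:
  ads_config:
    api_type: GRPC
    grpc_services:
    - envoy_grpc: {cluster_name: xds_cluster}    # placeholder name
  cds_config: {ads: {}}    # clusters come from the control plane
  lds_config: {ads: {}}    # listeners come from the control plane
static_resources:
  clusters:
  - name: xds_cluster
    type: STRICT_DNS
    connect_timeout: 1s
    http2_protocol_options: {}                   # the discovery APIs are gRPC
    hosts:
    - socket_address: {address: istio-pilot, port_value: 15010}  # placeholder address/port
```

Swapping control planes then means pointing that one cluster at a different management server that speaks the same APIs.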
And if you want to take Envoy, we're trying very hard there. For example, today at Lyft we don't actually use Istio, because we had Envoy well before there was Istio, right? So we have a huge Envoy deployment at Lyft that uses our own homegrown control plane, which is honestly pretty janky, but that's just the way things work. Ultimately we will move to a more complex system, but because we have these APIs, we can literally swap in Istio, or some product that we buy from some vendor, and that will be our control plane, and we will use Envoy. Or vice versa: someone who's unhappy with Envoy in Istio, if there's a better proxy out there, can swap that proxy in.

I think Matt had a really good blog post about data plane versus control plane, which, if you haven't read it, I'd recommend taking a look at. That was actually really helpful for us, because with Linkerd we've always kind of talked about it in data plane terminology, but there actually is a control plane called Namerd, which we never really talked about as such, but that's what you use. With Conduit, which we released yesterday, we're very explicit: these are the parts of the control plane, these are the parts of the data plane. And similar to Istio, I think separating those two out helps: there are different requirements for those two parts of the system, so we were able to make different technical choices in the data plane from what we made in the control plane.

Yeah, I just want to add, I agree with what many of you guys said. One thing I found super interesting recently: I actually did an experiment on how the Envoy sidecar, as the data plane participating in Istio, talks to the Istio control plane. It was super interesting to learn that in the Istio environment, the Envoy sidecar is actually a dumb router without Pilot. So it's funny that we have a pilot-agent talking to Pilot, which actually communicates from the data plane to the control plane and programmatically enables Envoy to be smart in the mesh. That was an interesting learning for me.

I have a question. The Istio docs say that Istio uses an extended or enhanced version of Envoy. Can you help me understand what that means, what the enhancements are, and why they were done? To me, it seems like Envoy's been forked and the enhancements have been added. Why not just add these enhancements to Envoy directly?

I'll take a quick pass at that. This is not a fork. Envoy has a pluggable API for adding filters, and we add some filters. We have a filter for calling into our control plane, and we're actively working on trying to unify that, standardize it, and upstream it. We also have things like a filter for doing JWT token decoding; we're going to try to upstream that too. We're sort of experimenting with different filters and then bringing them upstream as it makes sense to do so.

Yeah, I would add to that. There's this filter mechanism, and folks can basically build their own filters, and they can compile Envoy to use those filters. We are essentially attempting to make it so that people who use Envoy never have to change the core Envoy code; that's a design goal of Envoy itself. So when people come to us and say, "We would like to do this thing in a filter and it can't be done," we are going to change core Envoy to make it so they can do that thing.
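To make that filter mechanism concrete: in an Envoy HTTP connection manager config, the filters are just an ordered list, so an Istio build, or a private build, splices its own filters in ahead of the stock router. A minimal sketch follows; the non-router filter names here are illustrative approximations, not exact identifiers.

```yaml
# Sketch of an HTTP filter chain; custom filters run before the router.
http_filters:
- name: jwt-auth       # e.g. a JWT-decoding filter compiled in separately
- name: mixer          # e.g. Istio's filter that calls out to its control plane
- name: envoy.router   # stock Envoy router; runs last and forwards the request
```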
So at Lyft, for example, we have private filters that we can't open-source, because they do auth and they talk about drivers and cars, and it's just not something we can open-source. But the way we compile Envoy at Lyft is we literally include the public Envoy as a submodule, we compile and link in our filters, and that's it. So it will always be a goal that we keep that very, very clean separation. And in fact, what we're going to be doing, hopefully in the 1.6 release cycle, is reorganizing the repo to look a little more like the Linux kernel, where basically all of the filters (they're kind of like drivers) will be in their own directory, and it'll be easier to mark them as experimental or production-ready and have owners. So, for example, if someone wants to come in and do Kafka, but it's not production-ready, we can put it in the experimental section and people can go and try it out.

Great. Yeah, I just wanted to quickly add something that took me by surprise as a new contributor to Istio: the Istio proxy image is actually built in the pilot directory, not in the proxy directory, and it took me a while to figure that out. Essentially, out of Istio we build two proxy images. One is the regular proxy image, which has Envoy; we consume Envoy, along with the additional features we added on top of it. We also have a proxy debug image that has the different utilities we need, so it really helps us debug proxy issues when we have them, and we ship it by default in Istio.

Okay, other questions? Can you speak about the largest deployments of service meshes you've seen? If I'm running on hundreds or thousands or tens of thousands of nodes, what challenges should I expect, and how do I solve them?

Lots of challenges. So we're not using Istio or Linkerd or any of these other service meshes, but Google has been doing this internally for a decade, so: hundreds of millions of tasks, basically.

Yeah, I mean, we run Envoy at this point on somewhere between probably about 20,000 machines, and we do about five million requests per second through the whole mesh, so we're probably the largest public service mesh at this point. There are a lot of customers coming up, and I'm sure Buoyant has some too. What I find is that people are initially very skeptical of this, and they're skeptical really because they say, "How can you have this thing that you're putting in the data path? It must be slow, or it must add all this latency. That is so stupid; I would never actually do that." And what we find is that it's true that a proxy like Envoy does add latency. It's going to add probably around one millisecond per hop, depending on what it's configured to do. And it's true that there are some people who really care about microseconds, and they're going to whine about 200 microseconds. What I would quite honestly say is that most people don't give a shit about one millisecond. They won't even notice; it's just gone. And if you look at the benefits that come, just the unification of observability and of the features, it is so incredible that once developers experience this, they honestly will never go back, and they don't care about the latency.

Slightly slower, but always up.

Well, they don't even know that it's slower. We run into this a fair amount with Linkerd, and actually our answer is that we can actually speed things up.
Even though you're adding extra hops at each point, by being intelligent about how you route the traffic, by load balancing based on observed latencies and all sorts of fancy stuff like that, you can actually improve end-to-end times, even if you're introducing this kind of constant cost on a per-hop basis. And that's with Linkerd, where I think our advertised performance is a p99 of five milliseconds or something. With Conduit, where the data plane is written in Rust, the p99 is less than a millisecond, so it's even harder to make that argument.

Here, I'd like to ask a quick question. I realize this is a panel where you ask us questions, but let me ask you one. I'm just curious: could you raise your hand if you're using any kind of service mesh technology today? Awesome. That's more than I was expecting. How many people are exploring it or plan to explore it? Awesome. What are the rest of you doing here? Get out.

All right, question. Let's go to the back. So the thing that's not very clear to me is: if I have an application that's running in my Kubernetes cluster right now, and I'm using some network plugin, and I want to move it to a service mesh, do I need to change my container image, or do I just take my existing image and replace that plugin with a service mesh? It could be Envoy or Linkerd. And if there are two services that talk over some protocol, or they just use REST over HTTP, do they keep using REST over HTTP, or does some mechanism change underneath in the data path?

Just a clarifying follow-up: was the question about whether, if you have applications that use certain types of mesh technology in the image, like Netflix OSS or something, you have to change that? Or what was the original question?

My application has multiple services, so do I change my images, or can I just replace that plugin with a service mesh?

I got you. All right. So we've actually tested that Calico, at least, and Istio work together just fine, so you don't need to change anything there. The model for Istio is actually that you can deploy it without any changes to your application code whatsoever. You do have to redeploy, because we have to add the proxies, and Kubernetes, for example, cannot dynamically add new containers to your pods, and dynamically getting the iptables rules hooked up and so on is pretty difficult. So you do have to do a rolling restart of your jobs to get the proxy in place, but you don't have to change the code at all, and the application still communicates using whatever it used to communicate with. So, for example, you can do just plain-text REST, and that's what it looks like from the application: from the client and the server, it looks like plain-text REST. But over the network, or across your multi-cluster setup, it's all encrypted; it's all mutual TLS. So it's dynamically upgraded to be more secure, adding identity and all this stuff, and the applications don't need to know about it, or care about it, or make any changes.

Yeah, if you were at the keynote this morning, you saw Oliver do a live reload of a deployment and add it to the mesh without the application really knowing anything about it.

Yeah, one thing I want to quickly add, and it took us by surprise also, is that we actually don't intend to support applications that use host networking.
So in Istio, if your application actually requires host networking, we are not going to inject the sidecar for you, because we really don't want to impact the host and break your Kubernetes cluster. So definitely watch out for that too.

Yeah, there's one small point, actually. We love to sit up here and talk about how the service mesh is the most amazing thing, and it's fantastic, and it's really awesome. But there is one point about deployment, and what you'll realize is that when you want to do tracing, you actually have to propagate state, and that's a very important point. What that means is that though the service mesh can do all these things, if you want to look at your traces and look at your logs and have them actually be joined, there's no way to avoid the application propagating some small piece of state. Now, the code in the application is like 100 lines of code versus 100,000 lines of code, so it's very trivial. But I just wanted to point out that when we talk about this magic of injecting with iptables and a bunch of stuff, that's not quite true. And because of that, I think when people deploy a service mesh, ultimately, to get full value, you'll have to have some client library, maybe a very thin one. You don't necessarily have to use iptables. For example, at Lyft we've made a conscious decision that, because we make people use this very thin library for networking, which we call the Envoy client, we just have them talk to Envoy on a known port, and then we don't need any iptables magic. From an operations standpoint, that just makes things much, much simpler. But whether you use iptables or not, I just wanted to point out that to get full value, you're going to have to have some code in your application that is actually aware. And I think going forward we're going to see a little more of that. I don't know how it's going to play out, but I've talked with customers who feel they would get value out of knowing a little more context about what the proxy is doing. Why is it failing? Why are we getting circuit-broken? And so on.

Yeah, I think we're just scratching the surface now of the communication between the proxy and the actual application. One of my biggest frustrations right now when I look around the industry: it's not that we're going to get rid of networking libraries, right? That's impossible. We still have to have them. We have to propagate state. We have to do all of these things. But since we have to have this code, we can have proxies return special status codes around things like: we circuit-broke, or we were overloaded. So instead of returning a 503, which means nothing, tell me why, right? Was there no upstream host? Was it circuit breaking? Was there some other problem? Then we can do much more intelligent things. So I'm hoping there will be unification not only of proxy technology but also of the libraries that people use in their applications.

So I did a lot of work with OpenShift for a couple of years, and one of the pitches there, I think, is that the SDN it lays down is very Linux-y, right? There's really nothing special: there's HAProxy for ingress, and then everything is Open vSwitch and VXLAN all the way down. That has its benefits, right?
Because it's very well documented, and there's generally a big community around it, and we don't have to rely on updates to specialized software for that. One of the biggest downsides is that, as a metrics geek, I can't see much of what's happening in there. Observability is a nightmare in that environment. So this is kind of an elaboration of what you've already started talking about, but can you elaborate on what observability features come with the various mesh technologies out there? And if there's a paradigm shift in how to think about monitoring these things, what might that be?

All right, great question. So yesterday I actually did a talk on observability with Istio at the Istio Summit. There's a growing, rich ecosystem out there. Out of the box, through the Istio add-ons, there's Prometheus, there's Grafana. Istio actually provides a native Istio dashboard, so you have all the visibility into your service mesh: you have statistics on what's going on globally in the mesh, and statistics broken down to each individual service, and even to versions of your services, if you have multiple versions. So you know what's going on: how much time you spend in each of the services, what the error return codes are, the success rate. All that information is available through Istio. The other thing Istio provides is distributed tracing, so you can just kubectl-apply the Zipkin YAML file and you get Zipkin distributed tracing. What that does is capture every single request, as users visit your application, in Zipkin. And I was joking yesterday: I'm a human; I don't want to analyze all these individual requests. Every single second there could be hundreds and hundreds of requests coming through Zipkin, and we don't necessarily want to analyze that by hand. So what we did at IBM Research is a project called Istio Analytics, where we run analytics on the data pumped through Zipkin and analyze the aggregate data for you. I expect to see more of this type of data-analytics tooling around distributed tracing come to market, tooling that really gives people visibility into what's going on within a period of time through aggregate tracing, and lets you compare your baseline against your canary deployment.

I think the power of the service mesh is that the visibility it gives you is the top-line service metrics; these are the things you want to be woken up at 3 a.m. for, right? Success rate, latency, and request volume. Kubernetes, if you have Heapster enabled, will give you CPU usage or whatever, but, you know, CPU is going up, memory is going up; I don't know, maybe you don't want to wake up for that. But if success rate is going down, then yeah, you definitely want to wake up. And the service mesh gives you that layer of visibility, in most cases without you having to do anything. Like Matt said, with distributed tracing you then have to do some work on the application side.
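That application-side work is small: it amounts to copying the trace headers from each inbound request onto every outbound call, so the sidecars' spans join into one trace. A minimal sketch in Python (Flask plus requests), assuming the Zipkin/B3 header names that Istio's Envoy sidecars use; the route and downstream service name are made up for illustration.

```python
from flask import Flask, request
import requests

app = Flask(__name__)

# Zipkin/B3 trace headers as documented for Istio's sidecars.
TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
]

def incoming_trace_headers():
    """Collect whichever trace headers arrived on the current request."""
    return {h: request.headers[h] for h in TRACE_HEADERS if h in request.headers}

@app.route("/checkout")
def checkout():
    # The sidecar emits the spans; the app's only job is to propagate
    # the trace context so inbound and outbound spans get joined.
    resp = requests.get("http://inventory:8080/stock",
                        headers=incoming_trace_headers())
    return resp.text
```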
But even that layer of visibility, I think, can be super powerful. I'll refer again to the keynote this morning: Oliver kind of skipped over it, but there was a command line in there where, after injecting the application into the Conduit service mesh, he just listed per-method success rates in the CLI. And all of a sudden you have access to this thing that you've never had access to before. Forget about reliability, forget about security, forget about any of that stuff: just that layer of visibility is super powerful, because you haven't had it before, especially in a polyglot environment where your only option is to do this the hard way.

Yeah, actually, I just want to echo that; I was going to say basically the same thing, so I'll try to be quick. The thing I wanted to point out is that you don't actually want to be monitoring your infrastructure, right? You want to be monitoring your services. And the service mesh gives you that visibility, as he's saying, that you just didn't have before. You used to have to try to use monitoring of infrastructure to reason about whether the services were working or not, and you're trying to make these leaps, and that's just not the right way to do it. You should be monitoring the services; then you know the service is working or it's not. I don't care what's going on underneath, and I shouldn't have to worry about it.

I was just going to add one last thing. The Istio layer, especially for an OpenShift user or if you've used it in the past, in my mind sort of just blends into the platform, right? The platform should have the ability to do logging and metrics collection, and now all of these other capabilities too: you're able to do distributed tracing and that kind of stuff. At Red Hat we're working on adding Jaeger support, which is OpenTracing support, to Istio, and so on. I'm really excited that now that we have these control points, now that these proxies are there and live with our applications, we're going to be able to keep building a lot more observability on top of that.

Yeah, I mean, people look at these systems and they look at all the cool features: oh, you know, it does Redis, or it does load balancing and shadowing and this and that. And when I talk to people, the honest reality is that it is all about observability. That is the reason we built Envoy. The other thing I would add is the move toward a more DevOps culture for most companies (actually, I don't like that word; in fact, I hate that word), and by DevOps I just mean that I am on call for Envoy. Me, right? So it is extremely important to me that I can operate it well. Extremely important. And operating it well means having incredible stats output, incredible logging output, actual tracing that we can look into. And I 100% agree that the future is to take this output and, in real time, basically do machine learning and have a feedback loop. My personal opinion is that most of the data plane stuff is going to become commodity; it's going to shake itself out in the next two or three years, and there's going to be a commodity data plane that everyone's going to use. The actual competition is going to be in the data analytics: feeding the data in, telling you what's going on, what's bad, how do you fix it, visualization, right?
All of those things. So, I mean, that's a super long-winded answer to your question, but this is by far the most important thing. Envoy, to me, is all about observability; everything else is totally secondary.

Okay, we have one more over here. The way I see it, Envoy can become a distributed message bus. As such, is it a competitive technology to Kafka? Is it complementary to Kafka? How do you see that path?

Oh, that's a good one. You know, I think right now there are two discrete architectures, and most companies do both, right? You have real-time systems that are basically synchronous, and then you have asynchronous message-passing systems, and most architectures that I see at this point use a combination of both; you pick the one you need. At Lyft we do both: we have a whole bunch of real-time systems, and we have a whole bunch of stream-processing systems. Whether something like Envoy could be extended to work with Kafka, or replace Kafka, or something like that: I would say no, there's no reason to replace Kafka. Could Envoy augment Kafka, with a built-in L7 filter, to do all kinds of cool things? I think the answer is yes. So I think we will see more convergence of service mesh and message bus systems, but I don't think one is going to take over the other.

Yeah, I totally agree with that; there are totally different use cases for messaging. But one thing I do want to point out is that because control planes like Istio have APIs that let you plug in different proxies, we absolutely could see, and we have customers we're working with on this, how to build messaging capabilities into the infrastructure, similar to the way we see service mesh capabilities being baked into the infrastructure: resiliency, observability, that kind of stuff. We have a project in the Apache Qpid community called Qpid Dispatch Router, which is a small, lightweight AMQP router that knows how to distribute messages across a messaging network (more traditional AMQP-type messaging), and plugging that into Istio is something we're exploring. But yeah, Kafka and Envoy, I think, are two kind of different things right now.

I just want to quickly add: the other really interesting thing coming up that service mesh is going to have to deal with is serverless and functions, and how we handle those. So messaging and serverless, those are the next frontiers.

Right, and with that, we're out of time. Thank you guys for doing a great job, and thank you all for coming to this panel. Thanks a lot.