All right, hopefully everybody can hear me. I know we're kind of tight between talks at the conference today, so everything's going to be moving quickly. Hi, I'm Louis Ryan. I work for Google, and I've been working on Istio since the project started. Say hi, Sriram. — Hi, I'm Sriram. I'm from IBM, and I've been working on Istio with Louis since the project started, and also well before that internally at IBM.

Okay, so we're going to do a very quick run through a bunch of stuff about Istio. The content is pretty dense and the time frame is short. We'll try to leave some time for Q&A at the end, but if that doesn't work out, you can find us outside and I'm happy to answer as many questions as you have.

Okay, so we're going to power through this a little bit. What is a service mesh? The mental model that we like to use is that it's a network for services, not for bytes. Today when you write applications, there are a lot of things your applications have to worry about that are really features of the networking layer: how do I get my service onto the network, how do I expose ports, how do I talk to other services on the network? There's just a lot of networking you have to take account of when you build your application. Maybe you have RPCs going between two different services, or REST API calls, and the network doesn't help you with the problems you face at layer 7. If one of those API calls fails, your application has to deal with the problem: maybe it has to retry, maybe it wants to take a cached response and serve that, or feed that information back upstream, instead of trying to go and do something else in the network.
So there are a lot of networking features that end up in your application code, and what you'll find in larger IT organizations is that application code doesn't just happen once, it happens hundreds of times. You write the same set of features, like retries, like circuit breaking, like exponential backoff, and you write them in N languages, or you use N different frameworks to do these things. Then those behaviors start to work differently, they don't really cooperate with each other, and you get this kind of inconsistency. Or your teams are developing these features differently, or they can't agree on what those features are supposed to be. So there are a lot of networking features being baked into your application code that really shouldn't be there; there should be something doing this stuff for you.

And that's what Istio is trying to do: it's trying to help you with a number of these key problems. So one, I want to know how my applications talk to each other, what they're talking about, and whether there are problems with how they're talking to each other. And I want to see that not just localized to one specific application or service; I want this kind of global view of everything as it talks to each other, and to be able to work from that global view down to particular trouble points or hotspots. Two, you want resiliency. I talked a little bit about load balancing, and there are circuit breakers, failover, retries, and there's just a lot of complexity in that stuff; if you end up implementing it yourself, there will be mistakes.
I've certainly implemented resiliency features in client libraries before and regretted some of the horrible things that I've done, and I've seen these things fail at massive scale. So be very cognizant of the amount of actual work that goes into making this stuff fly. Three, you want control over your traffic. Maybe I have a service, I'm trying to roll out a new version of it, and I have to move traffic from the clients of that service to the new version. But I don't want to do that all at once; that's a great way to have outages. So you want to be able to move traffic incrementally, and sometimes traffic means something more than just sockets. It means I want to move a very small amount of HTTP requests; I want to move 0.1 percent of HTTP requests. This is something we do all the time at Google when we do production rollouts: we very carefully siphon off percentages of traffic, or subsets of traffic, to new versions of services as part of a rollout. Four, you want to be able to do policy enforcement on the traffic. You have all these services; you need to know who's talking to them, you need to know why people are talking to them, and you need to be able to prevent people who shouldn't be talking to them from talking to them. This is a very important property, and we'll talk about the policy enforcement bits a little later on.

The other thing is, you want all this stuff for free. I mean, this is an open-source conference; you like free stuff. I don't mean free in the sense that you don't pay any money for it. I mean free in the sense that it's not part of your total cost of ownership: it's not in your application code, and you don't have to maintain it. If you're writing 10 load balancing libraries in 10 different languages, that actually costs an awful lot of engineering effort.
And I know, because I worked on the gRPC team for several years and we did exactly that; it was a lot of effort, and I would say we're still working on it. So these types of things shouldn't be multiplicative development costs for you. They should be something in the infrastructure, something you just get to use, and you should care a lot more about that than the dollar price tag you could attribute to buying a particular feature.

And last of all, you want to be able to do this without really having to change your application code. Again, it's a total cost of ownership thing. The truth is, having worked on gRPC, one of the main pieces of feedback we got, since gRPC does a lot of this stuff, is: that's great, but I have 40 applications that were written by contractor teams two years ago, nobody knows how to change the code, and so I can't adopt it. So we need a solution for the long tail of services, and by long tail I mean almost everything, that lets them get some of these behaviors out of the system without having to make massive development investments. That's really important, and it was that observation that led me and the team to thinking about building something like Istio. Then I met Sriram and a bunch of his IBM colleagues at KubeCon last year, we sat down and started brainstorming some ideas, and that's how the project got born. It was those last two things, the zero code change and the low TCO, for people who have large suites of applications and services.
Those are the things that really motivated us to start the project.

Okay, so we talked a little bit about a network for services and not for bytes. I'm going to hand over to Sriram at this point, and he's going to go through some of the implementation detail of how Istio makes this magic happen. Then I'll come back in at the end and talk a little bit about security and policy and some of the roadmap stuff.

Thanks, Louis. So under the hood, let's say you deploy your applications on Kubernetes. The way we actually create a service mesh is through sidecars; it's not rocket science at the end of the day. The idea is that we implement all the functionality that almost every language, library, or application needs, your resiliency, traffic management, tracing, metrics, policy enforcement, and so on, and we put it inside a sidecar proxy that is well optimized and well engineered. These sidecars are going to be present in each and every pod in your Kubernetes deployment, and they take care of everything you want: mTLS communication, doing HTTP/2 the proper way, taking care of all the gRPC quirks, and so on and so forth.

So if you imagine that you take your entire Kubernetes deployment and sprinkle sidecars around each and every pod, and you add a sidecar, or a proxy to be precise, at the ingress that takes care of traffic routing as it enters the mesh, then you now have this nice property where all of these sidecars together form a service mesh, in some sense a layer-7 service mesh, where you have full control over how traffic transits from one application to the other: which versions it touches, what kind of timeouts and retry policies it has, the access control policies, the security mechanisms, and so on and so forth. But one of the things that we
actually did, and it was a conscious decision, was to have the sidecar act on both the inbound and the outbound traffic. The idea is that when you have the sidecar on the outbound path, you get a standard set of features like timeouts, retries, tracing, fault injection, and so on, whereas when you put the sidecar on the inbound path, you can impose policy control mechanisms, like who can access the service, or rate limiting, in addition to other features like terminating request traces to compute the spans and the causality relationships. You can do fault injection and a whole bunch of other things there as well.

So with this as the foundation, if you were to simply look at this picture as it is, you would quickly find that an armada of sidecars in your mesh is a nightmare to manage. What naturally flows out of this is a control plane. Once again, it's not exactly rocket science; that's a bad joke. With our control plane, we have a standard set of three components that manage the entire Istio mesh. The first one is Pilot, which is responsible for programming all of these sidecars with whatever configuration they need in terms of routing to different services: the load balancing configuration, timeouts, retries, the traffic routing configuration, and so on. Then we have the Mixer component, which acts as a policy enforcement agent. And then we have the Istio Auth security component, which sets up and manages the TLS certificates across services and takes care of certificate rotation, provisioning, identity management, and so on and so forth.

One of the other things we focused on, as Louis said earlier:
We wanted to have zero code change. So we transparently intercept all traffic that enters the mesh or enters a pod and pass it through the sidecar. If there are any transformations to be made or any policies to be imposed, that happens within the sidecar, and then the traffic enters the application. The application writer has absolutely no idea that the sidecar exists; you do not have to be cognizant of the fact that you are talking to a sidecar, it happens transparently under the hood. You can choose to talk to the sidecar directly if you want to, but for all intents and purposes this is completely decoupled from your development process. You can just have a bunch of scripts that inject the sidecar as part of your Kubernetes deployment scripts, as part of your CI/CD pipeline, and your application will work exactly the same way with and without these sidecars.

One other thing I want to note: if you look at this picture and follow the blue arrows, it would look as though each and every request hits the Istio Mixer component for the policy checks. That is not really a scalable model, and that's not the message we're trying to convey here; this is just the logical data path. What happens is that after you evaluate a policy once, the result gets cached, and there's a sophisticated caching mechanism within the proxies such that for the next few hours, or however you configure it, you are not going to be touching the Mixer at all. So for all intents and purposes the Mixer is not in the data path; it's touched maybe once in every thousand requests.

So with this, we chose Envoy as our preferred sidecar, mostly because it's well written, it has very low overhead, and it's written in native code, which means the performance gaps you see with garbage collectors and so on don't exist. And you
know, it has been battle-tested and run at large scale; the folks at Lyft run it across 20,000 VMs handling about five million requests per second within their service mesh. In addition to that, some of the desirable properties that we wanted in a large mesh existed in Envoy out of the box: the ability to load pieces of configuration dynamically without having to do a hot restart, which is a nightmare if you have to do it in production, and the ability to do fault injection and traffic routing in a native fashion, as core features within Envoy, without having to match cookies and match parts of requests and do all sorts of circus tricks within the sidecar. Plus, these guys in Seattle are just nice; it happens to be a very nice team that we work with, and it's a great community.

That said, you're not tied to using Envoy within the Istio mesh. If you have your own bespoke sidecars within your organization, because you've already spent time and effort incorporating your internal protocols into them, you're welcome to swap out Envoy and use that sidecar as it is. There's nothing that prevents you from doing so, and there have been folks who have done that, for certain internal deployments or proofs of concept.

So what do you get out of the box from Istio?
The first and foremost, as Louis talked about, is visibility: observability and metrics. You don't have to rely on each and every team to emit one or more metrics, and you don't have to worry about the fact that they had a typo in the metric names and suddenly the graphs on your dashboards are all screwed up because somebody missed a few metrics. Each and every Envoy sidecar in the system emits a consistent set of metrics, along with some Istio-layer metrics, that every service in the system gets out of the box. This has a tremendous impact on ops overhead. You don't have to wonder whether a given service is working properly or not; each and every service is emitting the same heartbeat metrics that you would want to look at, like response times, latencies, error codes, histograms for certain specific properties, tail latencies, and so on and so forth. That just makes life so much easier.

In addition to metrics, we also have support for request tracing. This is especially useful if you want to narrow down whose share of latency is highest in a request. When your tail latencies are exceeding your SLO thresholds, you really want to drill down on those specific requests, and you want to see where the latency is originating: is it the network, or is it the caller, or the receiver? This is where tracing comes in helpful. Envoy has support for OpenTracing by default, which means you can use your favorite tracing provider: Zipkin, Jaeger, LightStep, and so on and so forth.
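As a practical aside to make the next point concrete: "propagating trace context" usually comes down to copying a small, fixed set of HTTP headers from each inbound request onto the outbound requests your service makes. With Zipkin-style providers that set is typically the B3 headers plus Envoy's request ID. The exact list depends on the tracing provider you configure, so treat this as an illustrative sketch, not an exhaustive specification:

```yaml
# Headers an application typically forwards from inbound to outbound
# requests so the sidecars can stitch spans into a single trace.
# The exact set depends on the configured tracing provider.
trace-propagation-headers:
  - x-request-id
  - x-b3-traceid
  - x-b3-spanid
  - x-b3-parentspanid
  - x-b3-sampled
  - x-b3-flags
  - x-ot-span-context
```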
We configure tracing across the services, and as an application writer the only responsibility you have is to propagate the trace headers as they enter the application and as they leave the application. Everything else is taken care of by Envoy. In addition, you can also decide to vertically trace the application. Let's say you have a big Java application and you want to trace across different modules within it: you can definitely do that, and you can emit the same sort of traces and the same sort of metrics that Istio emits to the Mixer. All of these metrics flow to the Mixer, and at the Mixer you can write adapters that take these metrics and traces and send them to any internal monitoring system that you want, or to any of the cloud provider systems. This, once again, has a big benefit, because you're no longer required to add agents specific to each and every monitoring system. Since these metrics are all normalized and flow into the Mixer in a common format, you can write different plugins at the Mixer to swap in different monitoring systems of your choice.

The second feature you get is traffic splitting, traffic steering, and so on. As Louis mentioned, if you're deploying multiple versions of a service, you don't want to have a different service for each and every version.
You're probably going to be labeling the pods based on the different versions, and you want to be able to split traffic based on various parameters, various header parameters, or just version splits, five percent and ninety-five percent and so on, and you can do this in a very nice way in Istio. This is an example of a routing rule that you would write, and you don't have to do any rolling deployments or anything of that sort within Kubernetes. The key principle, the key takeaway, is that the traffic control here is explicitly decoupled from the infrastructure scaling. Unlike Kubernetes and Mesos, where the amount of traffic a pod receives is proportional to the number of pods of that particular type, you don't have that requirement here. You could have two pods and split traffic 99 percent to 1 percent without having to worry about rolling deployments. The benefit of this model is that you can now apply autoscalers to each and every version and dynamically scale these sets of pods based on the actual load they get. In other words, not all of us receive five million requests per second at peak load day in, day out; not all of us have a million pods running at the same time. We have 10 pods, 20 pods, 100 pods, and being able to do these kinds of traffic splits is immensely helpful in such scenarios.

The other thing we keep being asked about is: how do I expand the service mesh?
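(Before we get to mesh expansion: to make the traffic-splitting discussion concrete, here is roughly the shape of the routing rule being described, in the pre-1.0 RouteRule syntax. Kind and field names changed across early releases, and the service name and version labels here are hypothetical, so treat this as a sketch rather than a copy-paste config.)

```yaml
# Sketch of a pre-1.0 Istio route rule sending 95% of traffic to v1
# and 5% to v2 of a hypothetical "reviews" service. Field names varied
# across early releases; consult the docs for your version.
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: reviews-95-5
spec:
  destination:
    name: reviews        # the Kubernetes service whose traffic is split
  route:
  - labels:
      version: v1        # pods labeled version=v1 get 95% of traffic
    weight: 95
  - labels:
      version: v2        # canary pods get the remaining 5%
    weight: 5
```

Note that the weights are independent of how many pods back each version, which is exactly the decoupling from infrastructure scaling described above.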
I have my own service registry system that I built with ZooKeeper; my whole infrastructure is running on bare metal, or it's running on Cloud Foundry, with Consul, and so on. We've spent some time and effort adding integrations with other platforms, like Consul with Nomad, or Eureka, as first-class abstractions within Istio. So you can actually deploy Istio without Kubernetes, on other platforms, and if you're migrating to Kubernetes it actually helps your whole system. In addition, let's say most of your deployment is in VMs and you're slowly exploring Kubernetes, and you just want to shift traffic over. We've added support to integrate VMs outside Kubernetes into the Kubernetes system itself, such that you see the entire thing as one single mesh, without having to go through a whole bunch of hoops to get the integration going. And if you have your own service registry in-house, it's very simple to write a small adapter within Pilot that maps your service registry into our internal model. That's it; once you do that, everything else within Istio flows.

(Audience question.) Yes, so this is independent of what traffic management system it is.
It's just for the service registries. (Audience question.) Yes, you can.

Yeah, and now we'll hand over to Louis to talk about the security subsystem within Istio.

Okay, so we talk a lot about security at Google. There are a lot of things you want to implement to make sure that your systems are secure, and there are a lot of different threats against your network. The bigger the company, perhaps the more you have to worry about these threats; these are things we worry a lot about at Google. Probably the first thing is that big bold piece of text at the bottom: Google treats its internal networks as insecure. It's not that they are insecure; it's just that this is the mental model we bring to security. We want defense in depth. We don't blindly trust the network to protect us from all the horrible things that could happen. And if you don't trust the network, what do you do to make sure your services are secure? You want to be able to strongly assert an identity and associate it with a workload. You want to know the provenance of that workload as it got scheduled onto your production system: what was the chain of trust between a developer writing code, builds getting created, images getting put in a repository, and those images being started on Kubernetes? How are all those identities tied together, and when those services start talking to each other, how do they know who they are? These are the types of hard security problems you want to address if you want this kind of defense-in-depth posture.

And there's a bunch of stuff you probably want out of the system, and I certainly know that Google does. We want workload mobility, and we want workload mobility to work with security.
Network perimeter security works against workload mobility in some sense: if I want to make dynamic choices about where I put my workload, and I'm constrained by the network, or by network security models, then I have some tension. I want to be able to do remote administration and deployment: if I have to be on the physical network to administer services, that's going to make your ops team not very responsive. Everybody gets to go home for the evening, gets to work from the beach, hopefully, and so you can't always be on the network per se. So how do you get on the network? You also want a security model that works across vendors and third parties, and you want to be able to share your notions of identity with them so that your services can interoperate with their services. These are important things, and you'd like to be able to do all this, again, for not very much cost. You'd like this to work with your existing infrastructure, and that means working with your existing identity models, or, in the case of Kubernetes, working with the Kubernetes identity model, if that's what you're using for deployment. You want to rely on commodity technologies to plug all this stuff together.

So there's a whole host of issues in this space that we want to solve, and Istio is trying to put together a system that helps you get a more flexible network model while also having a stronger notion of security. So how do we do this? Istio uses mutual TLS for service-to-service communication, and to do that we provision an identity to the workload, represented as an X.509 certificate. We do that using the SPIFFE specification.
It's kind of a mouthful, but it's this emerging specification for how workloads should receive identities and represent those identities so they can communicate with each other in a secure way. So we have the Istio certificate authority, which issues certificates to Kubernetes pods and rotates them. The reason why we rotate them, and we rotate them quite quickly, is that one of the potential attacks is that those certificates could get stolen, or the workload could get compromised. If you rotate certificates quickly, and one of your security responses is to stop issuing certificates to a workload, then that workload very quickly loses the ability to talk to other services on the network. Now you've limited the blast radius of your security exploit in time. When you think about security, and about attacks against systems, even penetration attacks, you want to be able to respond both in space and in time. This helps you respond in time. The fact that we strongly assert identities between services, and have ACL policies that say service A, with a strong notion of what service A is, can or cannot talk to service B, lets you control the system in space. So you get both of those properties out of this model.

So that's Istio security. There's a whole bunch of detail here, and there are a bunch of Istio folks around who will be more than happy to go into the details. We should probably keep moving, because it's going to be tight for time. Hopefully the network's now working. You want me to talk about caching now? I don't know what's going on with the Wi-Fi.
That's great. Okay, so given that we have no slides, and yesterday we had some logistical issues too, that's kind of funny. So, the Istio security model. Sriram talked a little bit about the service mesh, and we talked about policy. You want to be able to enforce ACLs or constraints on how services talk to each other, based on these strong notions of identity. When service A calls service B, the policy may not be as simple as: do I give access or not? You might want to enforce a constraint on how many resources it consumes. You might want quotas, maybe simple rate-limiting quotas that say service A can't call service B more than a thousand times a second, or something a bit more refined that says service A is allowed to call service B between the hours of 8 a.m. and 5 p.m. Eastern Standard Time, and it can't make more than a thousand calls a day.

So we have a component, which you saw in the earlier diagrams, called Mixer, and Mixer is really the integration point in Istio that lets you customize policy. We ship, or will ship, integrations with common policy engines out of the box, and we build in some of our own policy capabilities too. But the real value of Mixer is that it's this convenient integration point for injecting policy into the mesh. Istio helps you get all this stuff running and brings this topology so that you can then enforce policies in a consistent way: you know that if you roll a policy out into Mixer, every single thing in the mesh will adhere to that policy. It's designed to help you get control over your services, whether you have lots of bespoke stuff and need to write code to define those policies, or you use out-of-the-box systems, whether that's
Active Directory or some other access management system. The Mixer is the means by which you get that behavior.

And we're still blue-screening, or not? Well, the beach ball is going on over here. So let me talk a little bit about the roadmap. Or are we back? Yeah. So one of the key things for us is to expand the mesh to as many places as possible. We started out with Kubernetes because we like Kubernetes a lot; it's a great way to run workloads, and it's also a very convenient integration point for us because of the properties it has as an orchestration system. But most of you probably have other workloads. I can say this: if you have something on Kubernetes, I guarantee you also have something not on Kubernetes. That may be developers doing work on their local laptops, it may be a massive investment in a proprietary or private data center, it may be a rack sitting in a closet in your office. You probably have VMs; you have a whole bunch of different workloads, and we need to bring more and more of that stuff into the mesh. Sriram talked a bit earlier about how we're expanding the mesh to include service registries, which is a convenient way to get more information about workloads, and we're obviously going to continue adding support for service registries and integrating with other orchestration frameworks.

We're also going to provide some tools to help people with migration. Say you have workloads on VMs and you're trying to move them into Kubernetes. You're not going to move all of them at once, because unless you like turning off all your systems and then turning them all back on some other way, that seems kind of crazy. So you'll be in states where some of your services are in VMs and some are on Kubernetes, and you'll also be in a state where some of a service is on a VM and some of it is on a pod.
So we have these two kinds of deployment models, and if you're migrating a large service over, you'll almost certainly be in a situation where you're starting to run some of it in Kubernetes but still keeping other parts in another environment, and all the other services that want to talk to it shouldn't have to care about where that workload is running. It's a basic premise of the service mesh that services, when they talk to each other, shouldn't care how the other service is being run.

As part of that, we also have to solve some network reachability problems. When service A wants to talk to service B, it shouldn't really care about the underlying network, but that means the Istio team has to make sure you don't have to care about the underlying network. That means being able to span VPNs; it means dealing with some nitty-gritty details around IP. And we're going to do some things so that, for instance, you can have a service mesh span multiple Kubernetes clusters and multiple environments without having to use VPNs. We can actually use ingress to span environments, and that should work well with the ingress capacity of the large cloud providers, as a way to get end-to-end security between services while still using ingress as the basic construct.

The last thing I want to talk about before we finish up is the install problem. One of the things I hear is: Istio sounds great, but it also sounds like there's a lot of stuff
I just have to onboard. But Istio is actually a suite of tools, and you can install pieces of it independently to meet your requirements; you don't have to eat the whole meal at once. We could probably do a better job with messaging and tutorials around this, so you should expect to see some of that coming over the next couple of quarters. But as an example, maybe you just want a better ingress experience. Then you could install just Pilot and use Istio's ingress controller. You don't have to use Mixer; you don't even have to use Istio Auth, and you would get a better ingress experience than you get out of the box with Kubernetes today. Then you could start adding features incrementally on top of that, for policy and other things.

The last slide is just a statement of what we're hoping to target for our 1.0 release. I'm not going to throw dates out there, because this is software. I think now is probably a good time to transition over to Q&A, and I think we have no time for that. So if you want to talk, Sriram and I will be outside, and you can also find people wearing nice hoodies with boats on them, at a very nautically oriented conference, and they'll be more than happy to answer your Istio questions. I can see Zack and Sven, Andrew and Costin. Thanks, everyone.