All right, our last talk of the day in the container room is Jason. Jason comes from Datadog, he's a technical evangelist, and in his free time he's a travel hacker. Jason, let's go.

Thanks. Quick question before we get started: how many people are using Kubernetes, or just playing with it? All right, so you've all kind of played with it. How many of you are using it in production? Okay, a fair number of people. Cool.

So, talking about canary deploys... I mean deployments: what could possibly go wrong, right? Yeah, a lot. Everything. We've all experienced it. The issue when it comes to deployments: we've all seen the "most interesting man in the world" meme, "when I do [blank], I do it in production." Fill in the blank. YOLO ops. But the problem is, when we deploy to production, we do it in production. As much as we want to dry-run things and test, as much as we try to set up staging environments that exactly match production environments, at the end of the day you still deploy to production, and there's no safety net for that. So that's why I suspect we're all here and interested in canary deploys: what are ways that we can deploy and start to get information without just having everything go to shit?

As mentioned, I'm Jason, a technical evangelist at Datadog. I like to boil that down to saying I do docs and talks. I help write our technical documentation, which means reading through the code of what Datadog does and then writing useful documentation (hopefully useful) so that developers, engineers, anybody who wants to use the product can do so easily. And then I come to events and I talk. I'm also, as mentioned, a travel hacker. I'm currently a nomad.
I live nowhere; I've been changing cities around the world every week. If you live in some place interesting that you love, come talk to me later, because I'm interested in exploring it. Aside from that, I love whiskey and I'm still addicted to Pokémon Go, so if you see me wandering around Pasadena later, I'm not lost.

That said, if you have any questions, we'll have some Q&A after this talk, but if you think of something later, feel free to hit me up on Twitter, or email me at jg@datadoghq.com.

A little bit about Datadog. Datadog is a SaaS-based monitoring, logging, and tracing platform. We essentially try to make it easier for you to gather all the information about your system into a central place where your developers and the rest of your teams can get together and understand what's going on in your systems. To give you an idea of the scale we handle: we process trillions of data points every day, and just from the start of this talk until now, we've probably handled several hundred million. We are hiring, so if you're looking for interesting new opportunities around large-scale systems and helping other people monitor their systems, we're hiring across the board. One caveat: we're not really running Kubernetes in production yet. We're moving things in that direction, so, just like many of you, we're trying things out.

So how did we used to do deployments, or how do most of us do deployments? How many people are doing blue-green deployments? Wait, that's not very many hands. Hmm. What about canary deploys, is everybody else using those? Okay, also very few hands. Are people just not deploying? I mean, that is the safest way, right? The most reliable system is one you never deploy to again. All right.
Yeah, when things go down you just deploy after that; that's when you roll out the new code.

So, blue-green deploys, for those who aren't familiar (I think most of you probably are): they're sometimes called red-black; the colors don't really matter. The notion is essentially that you've got a load balancer out front and you just flip a switch from version 1 to version 2, or whatever. Sometimes this is done in DNS. If you're thinking of code, people do this too; it's the whole basis of Capistrano, right? You have one folder of code with a symlink pointing at it, and when you have another folder of new code, you just repoint the symlink and make the switch. The nice thing about blue-green deploys, and why most of us use them, is that you get these almost-zero-downtime deploys. You're changing a DNS entry, or pointing a load balancer somewhere else; that doesn't take much time, and instantly you're flipping all of your traffic over. The other reason we use them is easy rollbacks: if I'm pointing everything at one location and I switch it and, oh my god, that was horrible, things blew up, I can (hopefully) just switch it back.

We do this because outages cost money, and we want zero downtime. There are all these statistics about how much money outages cost us. In 2013, Amazon.com went down for about 40 minutes, and people say they lost about five million dollars. A month later, Google.com went down for five minutes and lost about half a million dollars. As mentioned, I'm a travel hacker, so I love the airline industry and the news around it. Do people remember the BA outage from last year? BA had an outage on a Friday.
It lasted the entire weekend; 800 flights were canceled, and they lost a ton of money from it. But one thing I always like to remind people of when you're having outages: we often think of the time it takes and the money we're losing in sales, but there are other amplifications. How many people work in highly regulated industries, banking, things like that? Yeah. There are laws that go beyond just lost sales. In the travel industry, one of the fun things is that if you operate in the EU, you are forced to compensate your customers for things like outages. BA estimated they had to pay out 61 million euros to their customers for this outage. On top of that, they're a public company, so their stock dropped about 5%.

But what does this mean for you? According to the Aberdeen Research Group, your last outage probably cost you about $100,000 per hour; that's their average across most businesses. But here's another interesting thing: we talk about deployments and we talk about outages, but as good as our deploys are, they don't protect us against the bad code that our developers wrote and that we're pushing out. Slowness costs us money. If we think about just the applications: Walmart.com has correlated a hundred milliseconds of latency in their application with about 1% of revenue. Mozilla shaved 2.2 seconds off their website, and that equated to about 10.25 million additional downloads per year. And interestingly enough, if your developers write some slow code and you deploy it, you see a 28% abandonment rate per second of extra slowness. Compare that to an actual outage and outages are actually much better: with an outage, 9% of people will never come back.
They'll permanently abandon your product over an outage. But if you make it slow, if you as an ops person did your job and deployed the code and the code was bad and slow, that's much worse.

So back to blue-green deploys: the whole point was that you're flipping switches, you can easily roll back. But how many of you who use blue-green deployments or something similar have actually had that easy rollback? "Easy rollbacks" is in quotes. I mean, occasionally you do, but rollbacks aren't nearly as easy as we would like them to be, or as they seem. So that brings us to canary deployments.

For those in here who aren't familiar with canary deployments, the notion is simply this: if I am going to do something ridiculously stupid up here on stage, would I rather have a room like this filled with people, or would I rather have just one person, just me and Josh here hanging out? Canary deployments are really simple: you have a load balancer out front, you start diverting part of your traffic to another version, and if that works out okay, you direct a little bit more traffic to the new version, and eventually you cut over completely. You're dipping your toe in the water, doing things slowly. The advantages are that you have a small scope, and that means limited ramifications: if things go horribly wrong, great, only a few, or a subset, of your customers have experienced it. There are easier rollbacks.
Because you're supporting two versions of the application at once, you don't have the problem of "oh my god, we updated the database and now it's incompatible with the last version of the code, so we can't roll back." It's load tolerant: one of the nice things about canary deploys is that if you deploy new code and that code happens to use more resources, well, great, you've only done it for a small subset. You can monitor that, and it gives you time to scale up your infrastructure to match, if it is acceptable for it to use more resources. And the last reason canary deploys are really interesting is that you can do concurrency: you can try out a bunch of new features, deploy them to different canary groups, and try to speed things up that way. That said, concurrency is often not recommended, because running multiple experiments simultaneously can lead to problems.

So what does all of this have to do with Kubernetes, you're wondering? Well, Kubernetes is a container orchestrator, right? Wrong, actually. Lucy, who was up here for the last talk, asked what Kubernetes was, and that was the answer: a container orchestrator. But that's wrong; it's actually a service orchestrator. If I built a car out of Legos and gave it to a kid, he'd say "that's a toy car." He wouldn't say "this is Legos." Similarly with Kubernetes: yes, it uses containers, but the base unit, the thing that it actually orchestrates, is services. (By the way, I have Pokémon stickers if you feel like stickers afterward. I did say I was addicted to Pokémon Go.)

So here's the interesting thing: if we built our system out of microservices and we have this network of all these different services, doing canary or blue-green deploys with that just doesn't make sense. You don't want to maintain an entire secondary network of microservices and have some load balancer out front redirecting traffic. That's where Kubernetes comes in: as a service orchestrator, it handles all of the networking and communication between those services, what goes to what, so if you need to change out a service, Kubernetes makes that really easy.

So then why service meshes? Istio is a service mesh. What do service meshes like Istio provide us, or why do we want them on Kubernetes? First, they make service discovery easier. They also handle routing and load balancing, so you can set up routing and load balancing on your services individually. They handle timeouts and retries: we can set arbitrary timeouts per service, and how many times we actually want to retry. And they can handle policy enforcement, so your developers don't have to worry about how they're communicating with other services. For example, they might just be calling some HTTP API using curl, and that may not be encrypted; the nice thing about service meshes is they can intercept that and put it over a TLS connection. All four of these things are really nice because they abstract the network away from the application: your application shouldn't have to care about how it communicates with other services; it should just make those calls and get its answers. Those are four of the big things service meshes get us, and the fifth is monitoring and tracing: if all network traffic is now going through your mesh, your mesh can see which services are calling which other services, and how long that is taking.
So you can get a lot of interesting metrics from that.

So how does it all work? We start with Kubernetes; you've got all the network traffic for your services going through it. First you add the data plane. With Istio, you sidecar Istio, which means it's now going to proxy all of the network communications; they all go through Istio. But that's actually a lie: it's not actually Istio, it's Envoy. Envoy is what Istio uses; Istio packages it up automatically for you, but what you are actually deploying as sidecars is Envoy. The nice thing about this is that it means the whole Istio ecosystem is modular: if there's another data plane you would like to deploy, you can use that instead. After you have the data plane, you add a control plane. This is essentially the guts of Istio; it's where all the logic happens, and all the configuration goes through it. And similarly, if you don't like Istio but you want some other service mesh, like Linkerd, you can use that.

So let's talk a little bit about Istio. What is Istio actually comprised of? The main part that I think most people are interested in is the Mixer. The Mixer is at the core, and it handles three main things. The first is precondition checking, precondition checking essentially being access control. The second is quota management, quota management being rate limiting; again, this relates to what we talked about before, things like circuit breaking and retries. For example, with quota management you could say that a particular service can only receive 5,000 requests per second. And the final one is telemetry reporting: adding in metrics, traces, logs, all of that data, because the mesh can see what's going on over the network. So you have the Mixer, which is the core of Istio, and then you have adapters.
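To make that quota-management idea concrete, here is a rough sketch of what a 5,000-requests-per-second rule might look like with the default memquota adapter, using the Mixer config format Istio had at the time of this talk. The handler and quota-instance names (handler, requestcount) are illustrative assumptions, not something from the talk:

```yaml
# Sketch: a Mixer memquota handler capping a quota at 5,000 requests
# per one-second window. Names are placeholders.
apiVersion: config.istio.io/v1alpha2
kind: memquota
metadata:
  name: handler
  namespace: istio-system
spec:
  quotas:
  - name: requestcount.quota.istio-system  # quota instance this handler enforces
    maxAmount: 5000                        # allow at most 5,000 requests...
    validDuration: 1s                      # ...per one-second window
```

Because memquota tracks these counts in memory, they're lost if the Mixer restarts; the Redis-backed quota adapter keeps the same general shape but persists the counters.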
Adapters are how you integrate anything with Istio, or extend its capabilities. For precondition checking there are a bunch, but as mentioned it's mostly access control. Right now the four main ones in Istio are: the denier adapter, which is essentially just the angry boss that says no to everything; the list adapter, which lets you do whitelists and blacklists; Open Policy Agent (OPA); and role-based access control (RBAC). Those are the precondition adapters currently in Istio. For quota management there are only a couple: memquota, the default, which sets up all of your quota rules and tracks them in memory (the downside being that if things go down, it's in memory, so it's gone), and redisquota, which uses Redis as a backend for all of that. And then telemetry reporting. There are a ton of adapters; most of the adapters for Istio are around telemetry. It started out with Prometheus and statsd, but as we all know there are a ton of monitoring companies out there (there are a ton of monitoring companies right here in the sponsor hall), so a lot has been added recently: DogStatsD for Datadog, Fluentd, Circonus, SolarWinds; there's just a ton of different monitoring adapters, so whatever you're using, there's probably something there. All that said, go check them out; they're all within the mixer/adapter folder of the Istio project. There's a bunch of stuff there.

So now that we've covered Istio, what should our canary strategy be? How should we think about doing canary deploys?
Well, random is probably the default, and it's the default in Istio: if you set up a service with multiple versions and tell it to split between them, it essentially does so at random. Random is pretty decent if you're working in a service-based configuration, particularly for a non-customer-facing service. At Datadog, when we canary deploy, we typically start by diverting five to ten percent of traffic for a service to the canary, and then we scale up from there. But random is a bad canary strategy when you're doing customer-facing canaries, because random is uncontrolled (real randomness is really hard), so you get weird samples that you can't control. What you really want, particularly for customer-facing canary deploys, is representative sets, and this requires knowing your user base: you have to understand your mix of users.

A few things to consider about your users. First, geography. For example, Grubhub does canary deploys; they will canary deploy to small cities like Detroit, then scale up to larger cities like New York, before rolling things out globally. But keep in mind that different geographies have different cultures; you have to understand your users and how people in various locations use your application. Similarly, think of time and time zones. When you're doing canary deploys, you have to keep in mind: how long do you actually want to monitor this canary? When are you deploying it, and is it actually a time when people will be using your application? And finally, usage patterns. Similar to geography or time: if you canary deploy targeting a subset of users who all have a proclivity for a certain feature of your application, how does that affect your stats, your monitoring of what's going on?

The other thing you have to consider is granularity; you have to ensure that you have the right granularity. As mentioned, when we do backend services at Datadog, we start with five to ten percent, and if you're doing essentially random canaries on a percentage, that's easy to scale up. But if you're doing representative sets, how do you scale up? How do you go from a small set of users to a larger one? Again, think of Grubhub: they deploy to Detroit first, a small city, and then they can deploy to larger cities. Think about your users and how they actually use your application. And the final thing that I think a lot of people miss when doing canary deploys is the resource mapping. If you're trying to test a particular feature and you're rolling out a canary to users who are heavy on that feature, how does that map to the resources behind it? What are the data stores running behind it? What are the applications that serve it? Are you going to start hammering those? Think about how the groups you're presenting canaries to will actually use the application, and what that means for the resources behind it.

So then, monitoring strategies: moving from what we're deploying to people, to what we need to monitor. The first thing in our monitoring strategy is tags. You absolutely have to be tagging everything. When we think of a metric, we often think of it as a thing, a number, and a time; but if you're canary deploying, you're mixing this one experiment in with the rest of your stuff.
You absolutely have to have a tag to know which is the canary deployment. Other things to do when you're monitoring: look at the extremes. Looking at your 90th percentile is common, I think, for a lot of people, but when you're thinking of canaries, you're deploying to a very small subset, so you have to look further out at the extremes, at your 95th and 99th percentiles. Then, when you're monitoring, think about the outliers, the things that don't actually match, so you can compare your canary deployment to prod. (I had to throw in more memes.) The outliers, the things that don't fit, should be obvious: either they're performing in line with the rest of your services, or they're actually outliers. The fourth thing to look at when you're monitoring your canaries is anomalies. Anomalies, I like to say, are "it wasn't like this before." If you've got tools, particularly a monitoring system that uses machine learning, this is a fantastic use for it: letting your systems analyze what things were like before, what we should expect, and how this is behaving differently.

And when it actually comes to the things you want to watch, I like the four golden signals. The four golden signals are how Google does their monitoring, the things that they look at; they're part of the SRE book that they published. The first is latency: take a look at changes in latency; are things taking longer? The next is errors: not only the error rate, how many errors you're getting, but when you canary deploy, take a look at the types of errors. Have the types of errors actually changed?
The third of the four golden signals is traffic. Has traffic changed? This can be a big one, particularly if traffic goes down: if traffic goes down, people aren't doing the same things they were. So take a look at your canary, figure out where traffic ought to be, and judge against that. The final one is saturation: how are your resources doing? Is your resource usage the same as it was before, the same as prod's, or is your canary doing something different? And as I said before, one of the nice things about canaries is that you can scale up. If you take a look and say, yeah, we're using more resources, but that's expected, then great: start scaling up.

At this point I was going to do a live demo and have you all pray to the demo gods, but the demo gods were very unhappy with me. It did bring to mind that two weeks ago Kelsey Hightower gave this sage advice: Istio is an early project; please don't run out of here and go deploy it in prod, because you'll be on the news, and not in a good way. Istio is a fantastic project; they're doing a lot of cool things with it, and it's changing really, really rapidly. But that said, it's not even alpha, really. I definitely encourage you to play with it, but don't put it in prod; you probably won't like that.

If you are going to play with it, installing it is really easy: you download it, unpack it, add it to your path (because you want istioctl to autocomplete when you just type "ist" and hit tab), and then you kubectl apply it. That really is it; that's how you install it. It's pretty easy. So what happens when you kubectl apply the Istio YAML? Well, it sets up a few services, all within Istio's own namespace. The first is Istio Pilot. Pilot handles the Envoy API.
Going back to that data plane again: Envoy is what's actually sidecarred onto all of your pods, and since all your pods are talking through Envoy, Istio needs to understand Envoy; so Pilot handles the Envoy API, and that's how it gets the data. Pilot is also the platform interface. I know we're talking about Kubernetes, but Istio actually has integrations now for Nomad, Eureka, Cloud Foundry, and Mesos, and Istio Pilot is what would talk to each of those if you wanted to roll Istio out on, for example, Cloud Foundry. The next piece is the Mixer, which we talked about before; it handles the core logic. And the final service it sets up is Istio Ingress, which handles all the rules for inbound traffic.

So what do you do next? You've installed Istio and it's set up all these services; now we have to let Istio actually do the sidecarring. This is traditionally how you do it: you run istioctl kube-inject, which modifies your YAML files, your manifests, and then you apply those in Kubernetes. What does kube-inject actually do? Well, it adds a bunch of stuff, but the main part is this: it adds an extra container, setting up that Envoy proxy and getting it all running. It also adds a bunch of annotations and arguments that it needs in order to redirect all the traffic through the proxy.

Brand new, though, in Kubernetes 1.9, in beta, are mutating webhook admission controllers, which I think is a really awesome name, because it's super confusing. Anybody familiar with admission controllers? One person. Admission controllers in Kubernetes intercept requests: when Kubernetes is going to make an object like a pod, it does some authorization and authentication, and then there's a window of time in which it allows controllers to go and modify what's going to be done before it actually persists the object. The cool thing is that in Kubernetes 1.9 you have these mutating webhooks; "mutating webhook" doesn't mean the webhook mutates itself, it means the webhook mutates other things. The nice thing, and this is brand new to Istio as well, is that you no longer have to manually update your YAML files: Istio can simply register this webhook, and whenever you go to deploy a pod or make a deployment, after it gets authorized but before it actually gets persisted, the webhook can go in there and say "oh yeah, Istio should be on this," and automatically modify it to add the sidecar. This generally works based on labels: you label your default namespace with istio-injection enabled, and then pretty much anything that gets deployed in that namespace will get a sidecar.

So, canary deploying with Istio: how would you do it? Well, you'd first set up a standard service. Istio doesn't actually touch any of your services, so you would have your basic service manifest, pretty straightforward Kubernetes. Then you would make your deployments, and when it comes to canary deploys, with that one service you would have two deployments if you're moving from one version to another. The main important part here is that you label them with the version, version one and version two, and obviously the second deployment has a different container image as well. So we've got v1 and v2 labels on those. Then for Istio you set up a route rule. The route rule just tells Istio where we want to direct traffic, so the main part we're looking at is the route: we specify the labels of where we want to send traffic, v1, and how much weight, the weight here being one hundred percent; we want to send all of our traffic there. Then, in order to canary deploy, if we're doing it at random, we just add another label entry saying that we now also want to send traffic to v2, and we change the weights; in this case we're now sending 20% of our traffic to the second version of our application.

Now, as I mentioned, this is all random: it's just going to bounce between those versions, and you can't control which users land where. But thankfully there are a ton of route rules out there; you can also do things based on request headers. For example, you can use some IP lookup service to determine where a user is, set a cookie, and then route based on that cookie. What else can you do? A ton of stuff; there's a link to the Istio documentation. Circuit breakers are also part of routing; destination policies, load balancing, all of that is in there. It is a really, really long page of configuration things you can do with routing.
So, a quick recap. What do service meshes get us? They get us more control, and I think that's why they're so popular; that's why I think most of you are here, trying to figure out how you can get more control over what's going on in Kubernetes. When we think about canary deployments: try not to do random, particularly if it's customer-facing. Think about your users, because at the end of the day that's the high-level work metric you should be driving on; think about how you can get representative sets based on your particular users and how they do things, and think of them in granular ways, so you can scale up, because canary deploying works best when you have multiple levels of deploys and can constantly ramp up. And when it comes to monitoring, ensure that you've got tags, take a look at outliers and anomalies, and finally watch the golden signals: the latency on your canaries, the errors (not only the rates, but the types of errors you're getting), traffic, and saturation.

That's about it. Any questions?