 All right. Just going to give another minute for folks to show up. Looks like there's a few stragglers. All right then. I'm going to get started. Thank you all for taking your Sunday morning to join us here at Scale and your contributions to making this conference a possibility for 17 years now. This is my first time speaking at Scale and it's an honor because I've seen so many speakers ahead of me do the same thing. Today it's my opportunity to teach you or share with you what we've learned in implementing GitOps for our customers and practicing or dog-fooding it on our own. So I'm going to show you how you're going to enable your teams to operate more autonomously and efficiently by practicing operations by pull request. So here's what you're getting yourself into today. First I'm going to share what is GitOps and it's not rocket science. I bet many of you already practice some form of this today if you're continuously delivering your infrastructure as code. I'm going to convince you of why it's so awesome that you're going to want to go back tomorrow and talk to your manager to get permission to do it or if you are the manager tell your engineers to go look into this thing that I'm going to share today. I'm going to give you a hint on how to get started with this. Now the presentation today is going to be pretty focused on Terraform. I'm going to give a little quick intro on what Terraform is in a few sentences. So you can take that and go and get started tomorrow. I'm going to follow this up with a live demo because I really believe in transferring practical knowledge to show you what it looks and feels like so you can see that it's truly awesome. You can go to the URL at the top and start submitting questions along the way. I'll get to those at the end and we'll have a Q&A also at the very end so we can ask some questions. I think the whole presentation is not going to be more than 30, 40 minutes so we'll have a lot of time over for questions. So my name is Eric Osterman and I'm the founder of Cloud Posse we're a DevOps professional services company based here in Los Angeles, California. We help companies typically with migrations. They approach us when they want to move from one platform to the other like they're on Heroku today and they want to go over to Amazon and use Kubernetes. Or perhaps they're already on Amazon but for many years and it's time for them to re-platform, improve their automation and that's when they contact us. So the benefit by being a DevOps professional services company with one specialization is that we get to go really deep in one area and get to know it really well. We're not going to build your mobile site. We're not going to help you with any SEO. Occasionally when mom calls we'll help her out but otherwise there's even pushback on that. So DevOps today is an amazing ecosystem. The tools available today are like none other. My company was founded to help bring the knowledge of how to integrate all these tools together to make it easier for you and that's why we have over 130 Terraform modules that are all public and open source. We see over 10,000 unique visitors every single day hitting one of our GitHub repositories and we have over 300 projects in total that we actively maintain. We see over 100,000 forks of all our work out there. So I feel like we're really contributing back and achieving our mission. 
So this approach to DevOps that we take, we call sweet ops and it's a collaborative process towards DevOps that translates really well across organizations. We live in an awesome time to be doing what we do. I liken it to a renaissance. When I got started working with the public cloud it was back in 2006. Amazon had just announced a public beta or private beta for their cloud, EC2. So our startup it was called Socialverse. We had just gotten off the ground and we decided to give it a shot. We got into the beta and all we had available to us at that time were M1 smalls and a little bit of courage because there were no persistent volumes, no load balancers, no RDS instances, no elastic cache, none of that stuff. We were just excited we could automate like Nginx with a configuration and some scripts to point it to some instances. Anyways, the tools at our disposal today are phenomenal. We got functions as a service, serverless and lambdas, absolutely everything defined as code, software defined networks, container management platforms that blow my mind. This was not even remotely on my radar 13 years ago when I got started on all this. We use CI CD to continuously integrate and deliver our software and we've started taking that to the next level where we practice chat ops, basically interacting with bots that control our infrastructure and our systems. And the last piece of this that I'm showing with you today is GitOps and I'm going to explain why that addresses some of the problems. So for all the amazing things that we do today, I still see a status quo in how some things are done. That is that we have these complicated manual rollouts that are done via the terminal. As a result of that, we lack audit trails. We don't know what happened when that was done because it was done on some guy's workstation. It's not clear if that was deployed. So we lay these Easter eggs all over the place where something goes into the master branch and nobody actually applied it. So the next guy who gets to deploy the infrastructure gets to find hundreds of changes that need to be rolled out. And this leads to configuration drift, which is a really bad practice. We write a lot of documentation at Cloud Posse and I confess that documentation is also out of date and that's why the same thing happens in a lot of organizations. You have these documents that describe how to do your rollouts but they only tell part of the picture and if somebody else is to follow them, they can't do it. So now as a business, you lack continuity and it's a liability because if somebody goes on vacation or moves on and gets hit by a bus, nobody knows how to roll out the changes. Now if you use Terraform, there's some more problems waiting for you. Many of you probably practice CICD today with your web applications and the ideal web application is totally stateless. So you don't care how or what happens when a rollout fails. You have these health checks that detect a failure and then you just go back to the previous release. I only wish we had the same privilege when we deal with infrastructure because when we deal with infrastructure, it's a lot more like doing a database migration and a tool like Terraform, it is a database of the desired state of your infrastructure. So when you do a plan or an apply in Terraform, what we're doing is we're taking the state of your infrastructure from one state to the next. 
And now I got a pause here because I realized I kind of forgot to introduce what Terraform is for those of you who maybe aren't familiar with it. So these days Terraform is the leading tool for helping companies provision all of their infrastructure as code. Terraform is a simple DSL, domain specific language that makes it easy to describe, hey, I want a load balancer here, I want a couple instances behind that, I need an RDS or a relational database over here. So all of that is defined as code and Terraform is the tool to make that happen. So this tool Terraform that is used by millions of companies lacks some pretty fundamental things you'd kind of want from it. This is not the fault of Terraform, but Terraform won't automatically roll back in the face of errors and that's because that's a difficult thing to solve. You need an operator to do that and Terraform doesn't provide operators. That's where something like Kubernetes comes in. Anyways, the Terraform plan is just a best guess of what's going to happen and if we just rely on that and we merge everything to master and we do our rollout, there's a good chance we're going to get a failure. And if you have a very busy organization where developers are branching off of master all the time, now that little change that you just did that bug you introduced into master proliferates across branches and we have a problem. So my goal is to convince you that we can still do continuous delivery of Terraform or similar apps but the strategy is what we're going to want to tweak a little bit. To make this a little bit more real, I'm just going to bring up a very typical example and I'm guilty of this. Look, this GitHub stuff is a pretty recent phenomenon. As a developer, I'm iterating and working on the code, I'm making it all work and it always works perfectly on my laptop. Scout's Honor. I mean, it's my honor on the line. I don't want to commit broken stuff. It always works on my laptop. Yet the day comes when a developer or another team now takes my change and deploys it to production and we get some random error like this one. And what I'm going to show you today isn't a way to prevent those kind of errors necessarily from happening but to make the process of iterating on these kinds of things easier so that we aren't polluting the master branch and that we can move faster. And the math is pretty simple when we think about it, why this happens. So if we think about the number of tools that we leverage as part of our DevOps tool chain or our workflow, combined with the dependencies of those tools and the configurations and they're all pinned at different versions, across all of our Amazon accounts and in those accounts we have different projects and each one of those projects have their own software development life cycles and then all the different developers working on it who have historically been doing this on their workstations and in our case across all of our different customers, the number of combinations in ways things can go wrong is astronomical. 
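Going back to that quick Terraform intro for a second, here is a minimal sketch of the kind of thing the DSL describes — a load balancer, an instance behind it, and a relational database. All names, AMIs and sizes here are illustrative placeholders, not anything from the talk:

```hcl
provider "aws" {
  region = "us-west-2"
}

# An instance to sit behind the load balancer (placeholder AMI and size)
resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0"
  instance_type = "t3.small"
}

# An application load balancer (placeholder subnets)
resource "aws_lb" "app" {
  name               = "app-lb"
  load_balancer_type = "application"
  subnets            = ["subnet-aaaa", "subnet-bbbb"]
}

# A relational database
resource "aws_db_instance" "app" {
  identifier          = "app-db"
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = "app"
  password            = "change-me" # don't hard-code this for real; see the secrets discussion later
  skip_final_snapshot = true
}
```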
So we want to get away from this, and the way we do that is we eliminate the final frontier: provisioning infrastructure from our laptops. I'm going to show you one way to go about this so that we make it really easy for teams to Terraform stuff, and I truly believe this can be made so easy in your organization that different teams and different developers — not just backend developers, not just operations engineers — can do this stuff: anybody who has the competence to open up a pull request and see how other people have done it. And you can do this without giving up security, controls and approvals; I'm going to show you all that too. So we're going to solve this by practicing GitOps, and the concept is simple. We have Git as our system of record: since everything is software defined, infrastructure as code, it can live in the same place all our other code lives, which is something like GitHub. Then, to effect change, we open up a pull request — a request that contains a log of all the changes we want to introduce into the master branch — and we assign that to somebody to review, basically a second set of eyes to make sure I didn't miss something. Then we integrate that with our CI/CD pipeline. If you've used Jenkins in the past, it's the same concept: we check that code out, we run a plan, and we see what's going to happen. But we're going to change this a little bit, because in the traditional CI/CD workflow we apply changes when we merge to master, and we don't know if that even works because we can't test it without actually deploying it. So we're going to introduce ChatOps into this equation. ChatOps is basically interacting with a bot or the CI/CD system: "hey bot, can you go run a plan and tell me what's going to happen?" — and we want that to post back to the pull request so we can see exactly what's going to happen. How many of you who are responsible for infrastructure right now are doing a code review process where you actually check out the code one of the developers wrote and test it locally? That's tremendous discipline; I don't have that much discipline, I confess. That is good, but even if we do that, the problem is that my workstation isn't going to be the same as his workstation, and we have problems. All right, so then we get an approval on that pull request, we take the next step and actually apply those changes, and if anything goes wrong, we'll see that we have recourse in this situation. By practicing this, what you're going to achieve is basically a repeatable system where we apply changes the same way every single time, and the benefit is that if there are problems that are not caught during the development process, we can fix those as part of this pipeline, because it's a single source — a single place where this should happen. As a result of having it be very repeatable, we can make it very predictable: we can see what the expected changes are. And it's auditable: unlike the developer doing this on his workstation, where you have no insight into what actually transpired on the screen, this time we see everything in the pull request history, so we can go back and look at it, because there are sometimes things we miss and we want to go back and see: whoa, did this
actually happen at another time? Or you lack the context or the conversation around that pull request — and that's what we're going to be able to see. And lastly, we make it accessible, since anyone who can open up a pull request can effect the change. So there's this small little tool called Atlantis, and I'm going to show and share how it solves the problem. Atlantis was built for Terraform, but under the hood it's just a simple service that runs commands from webhook requests that come in from your GitHub or your Bitbucket. We use it with Terraform; we've used it with CloudFormation via the AWS CLI. For those of you using Kubernetes, we've also used it for Helm and Helmfile, which is a tool for deploying Helm charts or Helm releases. Many of you doing maybe more advanced Terraform stuff might be using Terragrunt, and Terragrunt is a great tool for large Terraform infrastructures. Or you might not be using GitHub — you'll be using GitLab or Bitbucket, on-prem or hosted — and it still works with that. Docker? That's great, because you can just deploy this as a simple microservice anywhere you can run your Docker containers. This product came out of Hootsuite; they use it in development, staging and production — they use it everywhere, and have for many, many years. The core maintainer ended up leaving Hootsuite and started this as an independent organization called Run Atlantis, and at HashiConf this year (or last year) it was announced that it's now been picked up by HashiCorp itself. So Luke is now working full time on this project at HashiCorp, which is a great sign that HashiCorp is investing in this open source project. And the flow is really simple: a developer opens up his branch, makes all of his changes there, makes sure it works, and once he feels pretty good about it he opens up that pull request. The very act of opening up that pull request kicks off a pipeline, and that pipeline is going to use the credentials of the Atlantis server running in your infrastructure. Since it's running in AWS, we can take advantage of instance profiles and roles, so we don't hard-code any credentials anywhere. Then, when Atlantis runs the Terraform plan and gets all that output, it takes that and posts it back to your pull request as a comment. What I've just described is the concept of an interactive pull request, which hasn't really existed before — it's like the combination of pull requests with ChatOps. If you're maybe not a developer, the concept of the Git workflow might not be totally familiar, and the workflow is something like this: the developer opens up a feature branch in Git; he does all of his work in that branch without destabilizing the master branch; when he's ready for others to take a look at it, he opens up a pull request; that pull request shows the change log of everything as a diff, and you can comment on it — on any line you can add a comment like "hey, did you mean to specify this instance size?" — so reviewers can request changes or reject that pull request. That's the workflow, and this is what it actually looks like with Atlantis. In this example we're going to be adding a user to an AWS account using strictly pull requests, and this is kind of a powerful thing, because imagine you hire a new guy and, through the onboarding process, the first thing he does is open a pull request against your infrastructure to add himself to it — but it goes through all your standard code review process.
If you're using a CODEOWNERS file, you can require that certain teams approve this pull request before it's applied. Anyway, here in the pull request we see the expected changes. I want to point out this thing with Keybase there — this is the secret to automatically provisioning users with Terraform on AWS. What happens is Amazon generates a random password, and you give it a public key that came from Keybase, so the password never touches any other person's hands; the only person who could ever decrypt that password is the one who knows the corresponding private key. Anyway, the act of opening that up triggered a plan, so Atlantis wakes up, sees there's a change, calculates what's going to change, and posts that back as a comment — this is actually a real comment from the demo that I'm going to do today. Using the code review process, somebody approves it, and because it's approved we can now apply it: we add a comment to the GitHub pull request saying "atlantis apply", and we see the outcome of that. Now, if there was an error here, we can go back and fix it really easily without having destabilized the master branch — and I'm going to show you that workflow. Then, when it's all said and done, that's when we merge. I think that was a pretty easy workflow, and what's pretty amazing is that a developer can just log into GitHub, look at other pull requests and see how things were done, so it's a great form of knowledge transfer for how to do operations. Now, you might wonder what kinds of companies are using this today; these are some of the ones that have come out publicly via Twitter and said how they're using Atlantis. This one particular comment by Shopify I think really captures the essence of what makes Atlantis so cool: it empowers any developer to make that change, but we still have all the power to approve or reject it, and we don't need to mess around with complicated IAM policies and permissions — in fact we can almost eliminate everyone who has access to production and staging, because we enforce a pull request workflow. Kelsey Hightower, who's a phenomenal influencer in our community, mostly in Kubernetes, was at HashiConf, and he tweeted out after the first presentation of this that this is "super dope" — on the Kelsey Hightower scale, I think that's second from the top. So, getting started is pretty easy. What you're going to want to do is use whatever process you have today to deploy a standalone, single-process service that runs somewhere in Amazon; you're going to give it an IAM role; and then you're going to write an Atlantis manifest, which is the pipeline that describes how everything should work. If you're familiar with Circle or Travis or Jenkins, all of those have pipelines as code — that's what this is, but for Atlantis, and it looks something like this. This is the deployment side: we have an open source Terraform module that helps us deploy it with our customers — I can't promise it's going to work for your use case or how you have things set up, but at least it's a great starting-off point to see how we deploy Atlantis. We use ECS Fargate because it's really hard to access that container once you deploy it, which is a good thing, and it has all the bells and whistles of integrating tightly with the Amazon platform for security. The Atlantis manifest looks like this — we give the VPC a friendly name here, and it refers to a VPC that's already provisioned. We define this in atlantis.yaml.
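Before walking through it, here is roughly what such a repo-level Atlantis manifest can look like. This is a hedged sketch assuming a recent Atlantis version; the directory and workflow names are illustrative, not the exact configuration from the demo:

```yaml
version: 3
projects:
  - dir: aws/root/users        # hypothetical directory for the users project
    workflow: default-workflow
workflows:
  default-workflow:
    plan:
      steps:
        - run: terraform init -input=false                  # attach the remote state
        - run: terraform plan -input=false -out=$PLANFILE   # write the plan to a file
    apply:
      steps:
        - run: terraform apply -input=false $PLANFILE       # apply exactly the approved plan
```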
In there we specify a directory, that directory corresponds to this pipeline, and you can associate a workflow with that pipeline. That workflow is fully customizable, and you have two steps that you can customize. (Maybe this isn't so good — it fades in and out — let's do that... there we go.) Alrighty, so this manifest file describes how this pipeline should work. I'm now down to the section here on workflows: each one of those directories can be associated with a workflow, and this is what allows us to handle complex rollouts, so we don't have to assume the process is always going to be the same. I like to be very explicit, and that's what we're doing here: we're defining all of our steps. The first thing is run — we're going to do a terraform init, which is how we attach the remote state in Terraform, and then we're going to do a terraform plan and write the output of that plan to a file, so that after approval, that is exactly what we execute when we run the apply. So there's a promise, a contract, here about what's going to happen. I'm going to now show you what this actually looks like and what it feels like operating it. Going to go over here to my laptop and get this set up... see, I don't have my hands free here anymore. To drive the point home about how console-free this process is, I'm going to do the whole process without ever leaving my browser. I'm not saying this is what you have to do for your development workflow; I just want to emphasize the point that we've eliminated the console. So, root.cloudposse.co — this is one of our public reference architectures for how we manage our infrastructure; it corresponds to our root AWS account, the top-level AWS account. In here we have our configurations for different Terraform projects or modules, and in here we have our users. I'm going to go ahead and create a new file in here, call this demo-for-scale.tf — and, well, okay, I'm going to cheat here. I'm going to go back and show you how anybody who's good at copying and pasting can do infrastructure. So I'm going to copy this, go back here, create a new file, paste that in here: demo-for-scale.tf. All right, now, this user already exists, so I'm going to change this to a demo user — call this "demo" — and I'm going to use my Keybase username, since we don't have a sample one for that. I'm going to change this output to "demo", and I'm going to introduce a problem here, something that might slip past code review — a little bug — and let's see, everything else looks good, yeah. So, create that, and we have our pull request here. All right, so the very act of me opening up this pull request is going to wake up Atlantis — as we see, it kicked off right here — and in a second we're going to see the outcome of this, and I expect something to go wrong. Cool, we got a plan error, folks, and it caught the syntax error in my output. This is the kind of thing that would often be missed in a code review if you're just skimming over what's there, and since we haven't merged anything, I can just go back and fix that mistake. So I commit that change, that wakes up Atlantis again, Atlantis goes and runs another plan — here's that change, this yellow dot indicates it's running in the background — give this another minute to complete... okay, there we go. So now we see the output, and we can see that it's going to want to add that IAM user, it's going to want to add that user to a group, and it's going to create a login profile so they can access the web console.
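For reference, a hedged sketch of what that kind of "user by pull request" Terraform can look like, written with plain AWS provider resources (the actual demo used Cloud Posse's modules, so the names, group, and Keybase username here are purely illustrative):

```hcl
resource "aws_iam_user" "demo" {
  name = "demo"
}

resource "aws_iam_user_group_membership" "demo" {
  user   = aws_iam_user.demo.name
  groups = ["admins"] # hypothetical group name
}

# Amazon generates a random password and encrypts it with the user's
# Keybase public key, so only that user can ever decrypt it.
resource "aws_iam_user_login_profile" "demo" {
  user    = aws_iam_user.demo.name
  pgp_key = "keybase:example-user" # the new user's Keybase username
}

# The "decrypt command" shown in the apply output for the user to paste
# into their terminal.
output "demo_password_decrypt_command" {
  value = "echo ${aws_iam_user_login_profile.demo.encrypted_password} | base64 --decode | keybase pgp decrypt"
}
```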
Now, to prove to you that I can't just go ahead and apply this, I'm going to comment "atlantis apply", and it's probably going to reprimand me and say it failed because the pull request must be approved. So — yeah, I guess that was unclear, with my example being very demo-like: here I'm actually adding a user called "demo", so the user is called demo with my PGP public key. In an ideal world that wouldn't be my key, it would be that user's key — that's the disconnect there. So I'm going to go over to my "yes man" account — well, first I'm going to request a code review: go over here and search for our test user. The test user is kind of like the engineering manager, the gatekeeper for making changes here, and the gatekeeper is going to look at the most recent plan and see that everything looks good. Assuming it does, he's going to go over here, review that code, and approve it — and there we go. So now I'm over here, I've gotten my approval, and I can go ahead and run "atlantis apply". What's happening in the background is it takes the plan that was previously generated and executes exactly that plan, and here's the output of that. We can see it added that user, and here's the decrypt command, so the demo user can copy and paste that into their terminal window to decrypt and get their Amazon login password. The whole process was driven by Git. Now, to drive home why this is so awesome — this is a very rewarding process — say we don't want that user in AWS anymore: they've left the company, or they're no longer involved in that project. Well, let's go ahead and squash this first; since we've made that change, we can delete that branch. So here's that demo we had for this conference, adding the user — I can now remove that user simply by reverting that pull request. This is a process we use in traditional software development, applied to infrastructure management. So I create that pull request and we go through the same process — I'm not going to show you all of that because it's just what you just saw — but that's that. So it looks like my sacrifices this morning worked; the demo went flawlessly. Here are some of our best practices for doing Atlantis in practice. Even if you don't use Atlantis today — even if you're using Drone or Jenkins or some other system — I think you can take advantage of some of these. One of the things we do that's a little bit controversial is we run one Atlantis per AWS account, because we like the idea of sharing nothing. It also forces a workflow where an Atlantis, or a pull request, can only ever modify one environment at a time, and that's pretty nice from a blast-radius perspective. Use IAM service accounts — these are like STS tokens, the Amazon short-lived tokens — and those will give you access to AWS and be automatically rotated on an interval that you define, so, say, every 30 minutes those tokens expire; if you accidentally leak these AWS tokens, the fallout is limited. Use code owners: all your code is in this Git repo now, and you can use a GitHub convention called CODEOWNERS — a file in which you describe the paths and who has to sign off on changes. So if you have DBAs, or network admins, or different teams within your organization, you can ensure that one of the subject matter experts signs off on that change, but you enable anyone to make the change, which reduces the bottlenecks in the equation. If you're using Terraform, use tfvars files for all your settings that are not sensitive — don't put your passwords in there, but do put your instance sizes or types and things like that.
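As a rough illustration of that tfvars advice — and, jumping slightly ahead, the Parameter Store approach mentioned next — here is a hedged sketch with hypothetical names:

```hcl
# variables.tf — non-sensitive settings come in as variables...
variable "instance_type" {
  type    = string
  default = "t3.medium"
}

# terraform.tfvars (a separate file, safe to commit — no secrets):
#   instance_type = "c5.large"

# ...while secrets stay out of Git entirely and are read from SSM Parameter
# Store at plan/apply time (the parameter name is hypothetical; the value is
# KMS-encrypted and access is restricted with IAM).
data "aws_ssm_parameter" "db_password" {
  name = "/prod/app/db_password"
}

# Then reference var.instance_type and data.aws_ssm_parameter.db_password.value
# wherever the instance size or database password is needed.
```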
Many of you probably use Vault or Consul today; we use Parameter Store, which is just Amazon's managed equivalent type of service — not feature-wise necessarily, but we still get encrypted secrets, we still have the ability to restrict access with IAM, and Terraform natively supports it. Now, if you're using Jenkins or Drone for Terraform automation, that's totally fine, but one of the things we find really annoying with Terraform is that when you see the plan for IAM policy document changes, it's going to be one line with 5,000 characters of JSON that you can't unpack or unfurl visually. Scenery is a small binary you drop in — it's a Go binary — and it'll unpack those IAM policy documents so you can see exactly what's being changed. The other thing is that Terraform is leaky, so the output sometimes reveals information you might not want in your GitHub comments; to mask that output we have a small utility we call tfmask, and you can find that on our GitHub as well. So, in the end, why do you care about all of this? I argue that the reason it matters so much for a business is that you want to enable teamwork and effective collaboration, you want to eliminate the bottlenecks, and you want to enable more people to be more productive. This is what allows us to stop living dangerously, applying changes on local workstations where there's no record of what happened; to practice total transparency in operations; to enable team collaboration; to reduce the total number of individuals who have to have access to environments altogether, which will increase your security posture; and ultimately to improve your productivity, maintenance and repeatability. All the examples from this talk are available on our GitHub: if you go to Cloud Posse, we have our Terraform module for deploying Atlantis as an ECS task in your ECS clusters — you can find that there. You can join our community — we're over 600 members strong, active people doing awesome stuff with Terraform, Kubernetes and DevOps automation in general — and we have a whole bunch of other projects, like I said, over 300; you can check them out. And at this point, that's it, folks — let's open it up to questions. I'm going to check to see if there was anything submitted via that URL... and there were none, so let's do this the old-fashioned way. Oh, it was not accessible — fail. Well, Atlantis couldn't help me there, I guess. So, any questions that you have in your mind right now? Yes — thank you. Yeah, so the question is, can we rest assured that this product is going to remain around, given that there's Terraform Enterprise? I don't know if you went to the keynote talk yesterday by Mitchell Hashimoto, founder and CTO of HashiCorp — Terraform has two types of users, basically: you have the hardcore practitioners, the end users, and they're not trying to monetize those; that's a very difficult group of users to convert because they typically want free things. I think there's always going to be a demographic to whom this appeals. Terraform Enterprise is solving this at enterprise scale, with greater control over policies and enforcement than you can get simply by using GitHub and code owners. So I think this is well suited for certain kinds of organizations, but maybe if you're a bank you're going to want to consider something like Terraform Enterprise. Yes? Yeah, that would be totally — that would be an acceptable model if you end up using Atlantis; there are going to be a few nuances there that are out of scope for me to
bring up right now but shoot me an email and I'll let you know what those are totally feasible, would recommend it it just comes down to how you organize your projects to achieve that yes yes oh this is a great point yeah yeah yeah so you have this plan that was generated well a very important thing that I kind of glossed over in this demo and didn't emphasize but it's essential for working with Terraform or these kinds of things and why I don't recommend using Jenkins or something like that other than Atlantis is because what you really want to be doing is locking the project so if you think about how Terraform infrastructure is managed it's like a mono repo we have lots of different projects and you surgically target one of those at a time you don't reapply all of those projects every time because that would increase your blast radius in Atlantis it automatically locks that project all pull requests so now if somebody else wants to open up a pull request against that that's fine but they're not going to jack my lock there is a process for forcefully releasing that lock so you're not going to block teams and stuff but at least you can therefore collaborate on this stuff without stepping on each other and basically what it means is you got to rerun a plan and then oh yeah my point there is that you can use you can run Atlantis in Docker but you wouldn't run Docker in Atlantis so to your yeah that was just that thing since we deployed as an ECS task that's just a good example of how it could be run as Docker we've also run it in Kubernetes what we don't like about running Atlantis in Kubernetes is you basically have this container you can kubectl exec into with administrative permissions and that's kind of a deal stopper but Fargate you can't do that so it's good any other questions in the back yeah actually that's a good question that was a popular feature request and that's how you can kind of emulate the same behavior of a traditional CI CD pipeline basically a compensating control I think in the last two weeks that feature has been merged another thing in my slides that was not clear was the Atlantis status on the github pull request just showed one Atlantis line now they've decomposed that to plan and apply so in your github pull request branch permissions you can require that the plan succeeds before anything happens for example or you can require that the apply succeeds before you merge it which is essential yeah any other questions alright I guess that's it thank you guys, you're an awesome audience I appreciate it see you next year good afternoon I had a feeling that I wouldn't have any amplification so I might just have to use my theater voice today because neither of these are actually amplifying anything hello maybe there's a volume switch are they coming? 
let's start master is turned all the way down hello hello hello hello magic all I had to do was put it on my belt and it started working so welcome everyone to what is a service mesh my name is Adrian Otto I'm from Google I work in the office of the CTO I'm happy to be at scale I can say this is I think the third scale that I've attended and the second that I presented at but I'm really happy to be here not because of this presentation but because I'm a California native I was actually born in Santa Monica and I lived here for over four decades before I relocated to go to the bay area so I've been around quite a bit this feels like my stomping ground and so it just feels like I'm home so thank you for making me feel comfortable all right oh there's another California native in the room how many do we have show me like how many are natives like you were born here a lot that's solidly 40% of the room that is awesome and how many of you have been in California more than say five years okay and so you consider yourself from California if you've been here for five years yeah okay that's good I was wondering like how long because when I lived here it seemed like there were no natives it seems like I was the only one and I would talk to people and they'd be like yeah I transplanted here 30 years ago and I'm like are you from California? I'm like how long is it that you can be here before you identify as a so-caller but so anyway wake this thing back up so today we're here to talk about service mesh and I want to start by taking us back into history a little bit the last six years I think have been particularly interesting so if you kind of transport yourself back six years and you're in 2012 at that point in time public cloud is just starting to become a thing alright people are starting to like trust it and starting to put like CICD workloads there and you know backup workloads and things not super serious stuff but people are starting to really do public cloud for real VMware is everywhere like every single corporation has VMware and it's just virtualization is the thing in 2012 around 2013 Docker announces their first release a year later Kubernetes comes out a year after that in 2015 Google donated Kubernetes to the cloud native computing foundation which was pretty special kind of trying to build the open-source community aspect and getting not just Google's point of view represented but those of everybody else shortly thereafter Envoy as an open-source project was announced and it has been categorized by journalists as a service mesh when it came out it was kind of characterized in that way a year later Linkerd became part of the CNCF and Linkerd is about building service mesh as well and then Istio comes out a year after that so this is if you're just paying attention during this frame of time and you care at all about cloud native applications and about how they are set up this is pretty confusing you're like which of these is the right thing and what does it really mean and why is it separate from my container runtime for example to make things even more confusing Google comes along in 2018 and they announce Istio and Knative at the same time and they call Istio a service mesh and Knative is all about running applications that are designed to run an event-driven architecture serverless style in a way that is open and that you can run in your own infrastructure the same way that you would run it in a cloud environment and about 2018 2019 Istio starts building a following in open source to 
the extent where it becomes a real thing — it reaches a critical mass where there are enough big companies using it and enough work happening in the open source community that you're like, okay, this thing is for real and it's going to be around. So this is kind of the perfect time to fully understand what it is and why you care, and to start getting familiar with it, because before now, I admit, that's a confusing history; from this point I think things are going to start to become much more crisp. Okay, now this begs the question: is a service mesh just frosting on top of Kubernetes, or is it actually something different — something complementary? If I do my job in this talk, you should walk out with a crystal-clear understanding of which of those realities is true, or whether it's both. Now, how many of you, just by show of hands — at this point, before I've said much — think that a service mesh is just frosting on top of Kubernetes? Just show me if that's what you're thinking. Okay, nobody's raising their hand. Okay, something entirely different that complements Kubernetes? Okay, I'm getting 10-15% of the room. And how many of you are here and you're just happy that you're going to find out, and you're going to walk out of this room knowing? Okay, the rest of us. Cool, so I promise to make this clear. Okay, so I don't like boring talks; I like to tell stories, I like to do funny things, and today we're going to try something different: an audience participation exercise. Before all of you came in, I went out and found a couple of volunteers in the audience to help me. Now, because this is Los Angeles and we have the entertainment industry, I'm sure you've all seen the set of a movie or a TV show before, and you know that the actors don't actually know all of their lines all of the time, so what they do is hold up cue cards to signal the actors and actresses to say the right thing at the right time — they even have a person whose whole job is to hold the cue card. And that's what our volunteers are going to be doing today. So we have two cue cards, held by two volunteers, and when it's time I'm going to be signaling. So let's try it, let's just practice: get your cue card ready, you're going to hold up your cue card, and all of us are going to be the actors and we're all going to say the word. Okay, that's the line — that's the only line you need to know, I promise. (Yeah, yeah — there we go.) Okay, he's going to be sitting during the talk, so it's not going to be quite as high, but when you see that, you're going to say the word "running". All right, and we have a second volunteer who has the other cue card with the other line, and the other line is "connecting". Okay, and this is important to my talk, that we're all going to be able to say our lines — to make this interesting. Okay, are you with me? Let's start. Now, before we start the whole cue-card thing and get that going, I want to take you back in time a second time. This slide is something that Google uses all of the time to talk about how wonderful containers are, and the narrative goes something like this: we use containers, we start four billion of them a week, it's central to the way we do our system design, and the coolest thing is that since 2004 our jobs have been growing at, you know, a breakneck, skyrocketing pace, but our ops teams have only been growing by a tenth of that. Fantastic. And it's absolutely true that containers have an awful lot to do with that story, but it's
not the whole story there's more okay and service mesh is part of that additional story and I'm going to explain how that contributes to this kind of a result so we'll see this slide again I'll come back to this okay so as any of you in the audience read a research paper on either big table or map reduce okay you guys are probably in the ACM or something you're reading all the you're reading all the papers or these are particularly interesting papers for whatever this was 15 years ago that Google started publishing about the computer science innovations it needed in order to make its search engine possible okay and it's about Google being truthful to its mission okay if you read the Google mission statement it's to organize the world's information and make it universally accessible and useful so we thought that part of making that information accessible and useful is about sharing with the computer science academic community what it was that we had actually done in order to figure this out okay what we did maybe anticipate at the time that these were published was that there's an awful lot of other smart people in the world who've come along and implemented open source software that actually do those innovations okay the trouble with this approach is that it takes time from when you publish research until the time you've got software developed and that it's actually working for everybody to benefit from what you truly want is to fuel innovation then you need a better way and so we started doing things in all truth we still publish research papers all the time but what we also do is we take what we learn from operating systems at massive scale with billions of users and we take those lessons and we codify them into open source so that communities can benefit from them as well and in addition to the lessons we're learning right community is going and contributing as well to make it even better and so it's a better approach than just hey I had a great idea and I wrote it down but here's something you can actually use we did this with Kubernetes and I think it's probably safe to say in 2019 that most of us feel comfortable with what Kubernetes is all about are you ready your key card ready Kubernetes is about running with me please is about running applications thank you you're awesome you're all actors to any of you like starring a school play or your actors on the side and you're just doing tech as a thing to pay the bills you're awesome great okay so it allows you to run applications on your own equipment or in the cloud of your choice what about the running of those applications okay so Kubernetes is only one of a number of open source projects that's following the pattern that I'm talking about okay all of these different open source projects all fuel innovation in their own way okay and they're all kind of related and working together so Kubernetes is one of them but there's a lot of others here now I hinted before that there was something special about what Google was doing with how they actually run containers that contributed to the lift between these two lines okay and I have a very strong belief about what that major contributor is above and beyond what you get when you run containers in Kubernetes okay it's this it's the idea of service mesh now I first kind of became aware of the service mesh concept before I ever worked at Google it was around you know maybe the time the you know 2015 time frame when linkerd was joining in the cloud native computing foundation you know the term 
"service mesh" started to be bandied about a little bit, and I saw definitions that looked an awful lot like this. In all honesty, the first time I saw this I didn't quite get it — I get the words, I know what the words mean, but I didn't really get it — and over time I've learned to get it more and more. What I'm hoping I can do, if you're looking at this for the first time and it's not truly resonating with you — you're like, what's an application network function, or why does it matter that it's transparent and language-independent, because that doesn't make a whole lot of sense — is to shine some light on that and get you past the point I was at when I first saw this concept. So the engineers, they pulled this out and they're like, "service mesh is this," and I'm like, yeah, but why? Who cares? Now, don't worry — I've been talking about Istio and service mesh as synonyms, so if you want a simple definition of what Istio is, the short answer is that. Okay, but as Billy Mays would say, wait, there's more. This is where I hope I can blow your mind a little bit. We've talked about the applications running in Kubernetes, and then I introduce Istio, so it would be natural to think that these things are designed to work together — and they are designed to work together — but Istio is also designed to work with or without containers, and it's also designed to be used with or without Kubernetes. That's what this "service interactions across container and VM" bit is; that's why this caveat is thrown in here. This little extra nugget is in here because it's not just frosting on top of Kubernetes — it is a thing that allows you to do hybrid communications. I'm going to get into all the cool hotness about it in a minute, but take this away: it is not just a thing that relies on Kubernetes. It gives you some additional capability; it is something that is designed to stand alone and be combined with a wider environment that is not necessarily cloud native. It'll help cloud native applications, but it will also help ones that are not. Okay, I'll come back to this. Now, we've already covered that Kubernetes is about — hold up your cue card — running applications, but Istio is about connecting applications. So the thing about the concept of a service mesh like Istio is that once you have an extensible, configurable, transparent network connecting everything together, suddenly all sorts of amazing things start to become possible. A service mesh is like finding a key that opens up a power-up in your cloud native world — that's how exciting it is. So, time for audience participation, got your cue cards ready: Kubernetes is about running applications, and Istio is about connecting applications. Now, if you're happy that you've understood the difference, everybody can leave now — the good stuff is coming, but I've done my job, that's it. Okay, now you really want to hear more, don't you?
Okay, so when I was first kind of fathoming the concept of service mesh, something I did not realize — it was not obvious to me — was: who cares? Now, if you're an infrastructure person, you're responsible for servers and networks and ports and machines and power and all this kind of stuff; you get how things work at a cloud management layer, and Kubernetes makes sense to you — you get this. But if your job is service operator, your job is to make sure that this thing is always running, that the right version of that thing is always running, that the right kinds of clients are connected, and that the people in the right regions are connecting to the right regions — and that's not necessarily the same person as the one responsible for the infrastructure; you care about the actual service being delivered. So for service operators and SREs, Istio is really, really useful; for infrastructure managers, Kubernetes is more interesting. Depending on who you are, you might gravitate to one or the other based on this. Be careful not to think about this as a layer cake. Now, my colleagues will describe this as building abstractions up: they'll talk about Kubernetes as the infrastructure, which allows you to run a service mesh, which allows you to do Knative — and look, that claim is true — but these things are for different audiences. There are infrastructure people who care about this, there are service operators, and developers care about the interfaces to do their next-generation system architecture, and to do that in a way that could be a serverless design if they'd like. All right, so I keep saying it's so awesome — what is it, what is the awesomeness? There are three areas of value that Istio is providing. The first is what we call uniform observability. Now, each of these areas could be a different tool — there could be a different solution for each of these three areas — but Google kind of has this opinion, based on that experience I was talking about, that these three areas should actually be solved together as a set, not individually. Usually the right answer is "do one thing and do it well," but sometimes it's not, and this is one of those cases where these three things are all enabled by the same innovation. They are all very different benefits, and this is one of the problems with service mesh — there are so many things that are compelling about it, it's hard to put it all in one idea. I like to talk about this all the time: we can remember like six, seven things no problem, but once you start listing more than seven things we start to get really bad at remembering what's on the list. So when you go to a service mesh talk and they're talking about all these wonderful things service mesh can do — boom, boom, mind-blowing, mind-blowing, great, awesome, cool hotness — you're like, well, all of that doesn't make sense to me. So break it down into these three things. There's uniform observability: that's about looking at how my service is actually running — what is actually happening, what services are talking to what services, at what query rates, at what health, at what error rates, what things depend on what other things, where is my application busted right now. In a complicated distributed application, if you're using microservices at any scale, that is actually a rather hard problem — a hard question to answer. And service mesh gives you a view, a point of view, that starts to make that
much easier to determine alright the second is operational agility this is about doing things like if and I'll talk about this more more in a subsequent slide but it's about what things are going to connect to what other things and being able to control that using a policy how much traffic is going to go to this part of the application versus that part of the application this version versus that version this client has this behavior this client has this other behavior right and if you're a system operator a service operator that's really important to you whereas the developer is less interested in that the developers just making the new features right and fixing the bugs and the things that are there they're less concerned about the experience of all the different users operational agility and then the third is my favorite policy-driven security now one of the things people love about Kubernetes is that it gives them the ability to run their applications on different clouds in a deterministic way right you're going to get the same runtime behavior in your own environment as you get in cloud environments you know the cloud environment now it's not 100% true but it's it's close okay so if we can agree that you get a similar behavior between all of your runtime environments wouldn't it be great if you wanted to express security policy in a way that would also work regardless of what your environment is and that is not true today if you're using the advanced features of your your local environment and your cloud providers and you're doing it in different places the way you express security is different in every environment and that is annoying so this gives you a way to do it in a deterministic way right using a simple expression of what the policy is it can be applied in all the different places you run that application okay now I talked about SRE a moment ago SREs care about something called an SLO does anybody know what an SLO is holler it out awesome we've got an SRE in the room service level objective okay service level objective is about measuring the performance or the reliability of a service that you're providing it's a way of verifying its quality and it turns out that there's this set of SLOs that you start seeing again and again and again and it turns out that google has these things when if you read the SRE book they're called the golden signals and these are four signals they are request rate, error rate latency and saturation it turns out that request rate and error rate and latency can be measured the same way for every single application so why should you implement SLO monitoring individually for every single service if there's a way to do it exactly the same way for everyone okay this is one of the justifications for solving these problems as a set right if you can measure these golden signals universally in a way that's first of all transparent to the developer and second of all this happens automatically all of your services start to have these golden signals measured assuming you're running them through service mesh and if you do then you can start building tools that give you a sense of the relative health of all of these different services no matter where they are in your environment that starts to become pretty powerful okay so Istio provides you the ability to measure these at least the top three of those four golden signals right the request rate the error rate and the latency it just does it for you you don't have to take any effort in order to get that now there's a 
fourth golden signal called saturation, and the reason your service mesh is not measuring saturation is because that depends: it depends on how much capacity you have, on the way your hardware behaves, and on the way your software is actually designed. So implementing a saturation SLO will be different from service to service, but the other three are all the same. Okay, so the second area I talked about is operational agility. Suppose you've got a new version of a service — it's got new capabilities, or it's more efficient, or something about it is better — and you want to start introducing this new version into your production environment. A lot of people, what they're doing is they have a test environment where they try stuff out and simulate, they put it into a staging environment and try it out there, maybe they put some traffic into the staging environment — more simulation, more real — and then they gulp and put it into a fraction of their production environment, which is usually like half of their production environment, or sometimes all of it. The trouble is you take a risk when you go from staging into production, and it might be very nice not to take that much risk — maybe you want to take less risk. So wouldn't it be nice if you could make a policy that says: I want 5% of my traffic to go to service B′, while 95% of my traffic keeps going to the previous version of the service? Now you might be thinking, well, doesn't Kubernetes already do that? Kind of, yes: you can route between node pools, but it only evenly distributes traffic between them, so in order to actually implement this you would need 100 node pools and set 5 of them to take the new traffic — which is really a clumsy way to accomplish this, and the configuration required to achieve it is actually not very simple. If you could just have a couple of lines in a configuration file that affect how your service mesh behaves, all of a sudden this becomes really easy. Let me give you another example of operational agility. Suppose I put a feature in my software that only users of Apple phones care about. If I do that, I don't want every single user to route to this new little bit of my infrastructure that's running this new code — I just want the ones who have that device type to route to that new thing. And if it turns out it's a colossal failure, great, I'm going to roll that back, but I haven't exposed the entire world to it; I've only exposed the users of the platform for which this feature is designed. So you can do all kinds of really cool stuff like this. Another cool thing, besides just changing versions: when you have a complicated distributed system and you've got services connecting to other services over the network, and something goes wrong with your network, you might have code in your microservice that is designed to retry on a timeout, or retry on an error after some delay. The unintended side effect of that is that when the network is malfunctioning, all of a sudden you're generating more traffic and exacerbating your problem, and it gets harder to solve the problem you originally had because of the retry logic built into the microservice. So what if you didn't do that? What if you took the retry out of the microservice and just said "try once," and you configured the service mesh to do the retry transparently for you, in a way that is completely invisible to the application?
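As a rough sketch of both of those ideas in Istio terms — a 5/95 traffic split plus a mesh-level retry policy — here is what a VirtualService can look like, assuming the v1/v2 subsets are already defined in a DestinationRule; all names and values are illustrative. (A similar `match` stanza on request headers, such as User-Agent, is how you would target only certain client types.)

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: v1        # previous version keeps 95% of traffic
          weight: 95
        - destination:
            host: my-service
            subset: v2        # new version gets a 5% canary
          weight: 5
      retries:
        attempts: 3           # dial this down when the network is misbehaving
        perTryTimeout: 2s
        retryOn: 5xx,connect-failure
```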
The third area, which I said was my favorite, is policy-driven security. If you simplify your microservices so that they don't do encryption themselves, and you rely on the service mesh to do all of the encryption, your life gets a whole lot easier and you still get secured communication from one service to another. I'll show you a diagram of exactly how this works in a moment; if you just trust me for now: you get secure communication from one node to another, but you don't need to do any certificate management in your applications, you don't need to handle revocation lists, and you don't need to keep updating those things. How many of you had to change your code when Heartbleed came out? My bet is all of you — anyone who had a production application doing SSL or TLS when Heartbleed came out had to recompile everything and redeploy it. Now, if you can do this in the mesh instead, and you express a policy that says "I require mutual TLS between all of my services," then you have assurance that the only way service A can reach service B is if it has valid credentials and it's doing it over a secure channel. It's impossible for the two things to interact except in accordance with that policy. And, as I said before, this is portable across your different environments: you implement it in your own environment, you go use it in a public cloud, you go use a different public cloud, and you still get the benefit of it.
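As a sketch of what that "require mutual TLS everywhere" policy looks like: in current Istio versions it's a PeerAuthentication resource (older releases expressed the same idea with a MeshPolicy). This is illustrative, not a configuration quoted from the talk.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # applying it in the root namespace makes it mesh-wide
spec:
  mtls:
    mode: STRICT             # plaintext connections between workloads are rejected
```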
Now let's take a look at what services make up Istio. In all of those early talks I went to to learn about Istio, this was always basically the only slide, and if you only look at the architecture slide you don't get the whole picture. I recognize we're a room full of engineers — how many of you are engineers? 90% of the room — so you have to see an architecture slide, and I'm going to show you one, but I need to say a whole lot more than that, because just knowing what's inside of it doesn't really explain why it matters. Istio is, like I said before, a way of connecting services together: picture service A on the left and service B on the right, connected together over mutual TLS, and that is enabled by a set of four different services that are part of Istio. Istio is a control plane for managing and enforcing all of this.

Here are the four parts. First there's something called Pilot. Pilot is the service that configures and distributes the service communication policies; you can think of it as the configuration master for all of the Envoy proxies that do the communication between the services. Rather than a single copy of the configuration that everybody references centrally, every single participating node in the system has its own copy, and Pilot is the thing that makes sure it's the right configuration and the right version, and that all of these copies stay synchronized — it takes care of all of that for you. The second thing is called Mixer. Mixer is what makes this a platform — remember, in the definition earlier I said it's a platform; this is where the integration comes in, and I'll show you more examples of how Mixer actually works. Remember those signals I was talking about — what's the activity level between these services, how healthy are they? All of that gets collected and pushed through Mixer, and Mixer is extensible: suppose you already have a big investment in something like Datadog and you want to hook that in — there's an Istio adapter for that, so you can plug it in and keep doing things the way you used to. I'll come back to this. Then there's Citadel. You can think of Citadel as enabling both authentication and authorization through mutual TLS, with built-in identity management. Put another way, it's like a certificate authority for all of your TLS that's automatic and that you don't have to worry about: everybody gets valid certificates, all the mutual TLS configuration is set up, and all the enforcement back and forth just happens. I'll get to the non-replayable identity thing in a minute, but Citadel is what makes all of that possible — it's about the secure communications. And finally there's Galley. Galley validates the user configuration for the other control plane services, and over time Galley is going to become the top-level configuration ingestion, processing, and distribution component for Istio. So Galley will become more important over time; right now it's just verifying the configs for the other components. And you need this control plane in order to have all of the magic I'm talking about, because distributed systems are complicated and something needs to keep track of it all.

Does anybody know what kind of motorcycle this is? This is the first one of its kind I have ever seen — that is the coolest sidecar I think I've ever seen — and one of these times I'm going to give this talk, somebody is going to know what that motorcycle is, and I'm going to learn something. Somebody hollered out a guess, but I think the badge has more letters on it than that; I can't quite read it. If I zoom in really, really close, I think it might be a Triumph, but I don't know; one of these days somebody's going to have one of these bad boys and tell me what it is. The purpose of this slide is to talk about the concept of a sidecar. How many of you already use sidecars today? Okay, awesome — less than 25% of you are using the sidecar pattern. The sidecar pattern is a way to tie an extra thing to a pod — a logger, an auditor, or in this case a service mesh proxy — in a way that doesn't require you to change the configuration of your pod beyond adding a couple of lines to say that you want this sidecar stapled onto it. So if you're in a Kubernetes environment and you're using Istio as your service mesh, your pod descriptor — the YAML file that describes how you're going to run your application within Kubernetes — just has a couple of additional lines in it. It's that easy to turn this on in that kind of environment, which is why Istio was first presented as something that works with Kubernetes.
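Those "couple of additional lines" are typically just an injection toggle. As a sketch — workload name and image are placeholders, not anything from the talk — you either label the namespace so the sidecar is injected automatically, or annotate an individual pod template:

```yaml
# Automatic injection for everything in a namespace:
#   kubectl label namespace default istio-injection=enabled
#
# Or opt a single workload in explicitly via the pod-template annotation:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service                     # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
      annotations:
        sidecar.istio.io/inject: "true"   # the Envoy sidecar is added at admission time
    spec:
      containers:
      - name: my-service
        image: registry.example.com/my-service:1.0   # placeholder image
```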
Now, you can also use it without the sidecar: you can always just install the proxy and provide the iptables rules necessary to do the traffic intercept on a VM or in another environment. It's just really easy to do within Kubernetes.

All right, let's go into Pilot a little bit more. Pilot is responsible for configuring all of those Envoy proxies sitting in all of these sidecars. You do this for all of your applications — every application you want to participate in the service mesh gets those lines added to its pod descriptor — and you start having these sidecars popping up and connecting together as a mesh, with Pilot responsible for distributing the configuration to all of those sidecars. Now Mixer, as I told you, is an extensible component responsible for keeping track of all the things that happen between your microservices. There are adapters here for the Google-related things, but there's a whole community of other plugins as well, and if you want to hook into an existing system you can write your own — it's not that complicated. It uses a gRPC interface, it can run out of process, and it can be called by the mesh externally, so you don't have to build it into Istio if you don't want to. If you do build it in so it runs within the proxy, you get really high performance: when a check — an access control decision, or some kind of decision about a quota — happens in process, the performance tax on the proxy handling that connection is very low, whereas when you start calling out to external things, you trade some performance for the capability. In my mind, getting all of this is totally worth a little bit of performance cost, and where you can't afford any performance degradation, you can always put a Lua script into the proxy itself and do it in process. Okay, Citadel is responsible for all of that authentication — I told you I'd describe this more deeply. Suppose you've got service A communicating with service B. From service A's perspective it's just connecting to service B, but what's actually happening is that its proxy is intercepting the call and using a non-replayable service identity that is bound to that TLS channel. So you've got this communication channel between the two services, and you use a non-replayable service identity. Why does that matter?
It matters because if you have a service that becomes compromised and somebody gets hold of that service identity, it is not a reusable bearer token — it's something that only works one time. If you intercept it and then go somewhere else and try to use it, you can't use it again. So even in environments where you don't trust the network, and maybe you don't even trust the services, you can still achieve higher levels of security resiliency by using this. That would be pretty hard to do if you needed to modify all of your different services in all the different languages you write them in — chances are you don't have a single library that will answer this globally for you — but if it's in the mesh, you can.

Now, service-level authorization. Some of what I've described happens down at the network level, and you basically get that stuff for free. But there are also things the application itself passes along that let you get even higher levels of control — things you can enforce as service policy, the actual policy-driven security. For example, if there's a JWT claim asserting that the end user of this application is in fact this individual person, that claim gets passed through the entire mesh, and the service, in addition to whatever it does itself, can validate that it's actually being acted on by this particular individual — and policy can be enforced by the mesh based on the things being passed through by the application. So, remembering service A and service B: if service A is sending a JWT token through to service B, the mesh component responsible for service B can enforce policy against that token. Is this making sense? Good. So imagine you've got a service with personally identifiable information on the right, and you've got a customer service tool — we'll call it the case tool — on the left. The case tool has to be able to collect the personally identifiable information so it can be presented to a particular authorized user; that's what it's designed to do, and we've got mutual TLS trust between the mesh nodes, between these two sidecars. Suppose you've got Joe, an authorized user who can interact with the case tool; his JWT claim flows through the case tool to the service with the PII, and that is allowed. Now take the same authorized user and build another service, call it the BI tool. The BI tool, in accordance with the policy expressed on the left, is not allowed to get PII out of that service on the right. So even though he's an authorized user with a perfectly valid identity, and there is still trust between the BI tool and the PII service, the request is still prohibited. This is a level of control you could absolutely build yourself, but it's really hard to get right across a complicated environment written in a whole bunch of languages at a whole bunch of different versions. This starts to make it uniform.
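As an illustrative sketch of that PII example in newer Istio (1.4 and later), an AuthorizationPolicy can combine the workload identity of the caller with claims from the end-user JWT. All names, namespaces, and claims here are invented, and a RequestAuthentication resource validating the JWT is assumed to already exist; this is not a policy from the talk.

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: pii-allow-case-tool-only
  namespace: pii                     # namespace of the PII service (hypothetical)
spec:
  selector:
    matchLabels:
      app: pii-service
  action: ALLOW                      # anything not matched by an ALLOW rule is denied
  rules:
  - from:
    - source:
        # Only the case tool's mesh identity may call this service;
        # the BI tool's identity simply isn't listed, so it is refused.
        principals: ["cluster.local/ns/support/sa/case-tool"]
    when:
    - key: request.auth.claims[role]
      values: ["support-agent"]      # end-user claim carried in the JWT
```

The deny-for-the-BI-tool behavior falls out of the allow-list: a valid workload identity and a valid user token are not enough unless this policy names them.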
So I think I've made the case for why it matters; now let's talk about how it actually works. We're going to trace the life of a request through the entire system, step by step, and talk about what happens at each step. In this example we've got service A and service B, both with sidecars running Istio, and Istio has a common control plane across all of the nodes in the mesh. When a sidecar comes up, Envoy gets its configuration from Pilot, like I talked about, and it gets its cryptographic identity from Citadel, and that's done in a safe way. So now we've got these things that are allowed to communicate securely — great. Let's see what happens once they're up and running. Suppose service A wants to talk to service B. It just makes a connection to the address of service B as if it were making an ordinary TCP/IP connection; it doesn't need to do it over an encrypted channel, because the mesh is handling that for us. That connection gets intercepted by the proxy on the left, and that's where it gets encrypted with that identity and forwarded across to the second proxy, in accordance with the configuration. The second proxy then hollers down to Mixer and says "this thing is happening right now" — which is a way of collecting logs without having logging in any of your apps — and your policy engine and your quota adapter get an opportunity to accept or deny the request at service B. Service A might only be allowed to interact with service B a certain amount; that's a quota scenario, and you can block it at service B if it's been too busy, for example. After that, the proxy in service B's sidecar forwards the request to service B, so from service B's perspective it's just getting a message from service A. Service B then generates a response and sends it back to its proxy; that proxy sends it back to the proxy on the other side of the mesh, which decrypts it; service B's proxy also reports down to Mixer — okay, that happened, and here was the outcome — and then service A's proxy delivers the response back to service A. Service A thinks it just got a response from service B, even though all of this has been happening in between, and finally service A's proxy reports down to Mixer as well: this happened, and this was the outcome.

The thing you can do once you've got one of these environments up and running is start to rip all kinds of complicated stuff out of your microservices. Suppose you have two versions of service A — version one and version two-something — and both are using a client library, each a different version of it, to do the same thing; suppose it's encryption. In the old world, before service mesh, if you have a security vulnerability in that library, you've got to recompile version one against whatever version of the library and its API it was designed against, and then recompile the other one against a different library version, which you also need to patch separately from the first one, because the different versions aren't guaranteed to have the same patch. So now you're doing this whole mega-patch exercise everywhere, where every single version of every single thing you run needs to be changed. If I can make that go away — you handle it in the mesh, and the services just don't have that library at all — then updating it becomes very simple.

Let me give you another example, a hybrid use case. Suppose you have an old, crusty database running on some system that everybody is terrified to touch — you know what I'm talking about — that service that's super important, that's been around forever, that nobody is brave enough to go in and enable TLS on.
Nobody is going to touch that — no way. But if you could put a mesh in between these things, you could start to secure the communication between the service and that old, crusty database, and you could do it without changing the protocol at all. So now you get the behavior and the security benefits of running mutual TLS between this old, crusty thing and the hot new cloud-native service, but without actually disturbing it in the way you're afraid to — or that I would be afraid to; any of us could honestly name something we would rather not touch. Service mesh to the rescue.

So the more your system handles in a uniform way for all of the services in the entire system, the simpler you can begin to make them, and when they're simpler you can build them more easily, iterate on them more quickly, and make them secure again faster. Everything gets better when things are simple. You can implement security policies and control which services can connect to which other services, at a level of control that's much harder to achieve otherwise. You can do monitoring and logging — for example, suppose you just want to ask: what services does this service use? It seems like a simple question, but if you ask an engineer what services their service depends on, you'll get an answer, and 90% of the time that answer is wrong, because even the person who created it doesn't realize that something has changed since then and it now depends on a whole lot of other things — or he or she says it depends on some thing, but that thing actually depends on a whole bunch of other things. Wouldn't it be nice to be able to visualize what those relationships actually are, so that if you're going to change something way up here, you have a sense of what the impact is going to be down there? That is why Google got this lift; that is the reason services can be created so fast, the reason we can make new versions and new features so fast: we're not doing all the hard stuff, because the hard stuff is done, and service mesh is a huge part of that. It's not just containers, it's not just magic orchestration — it's something more. So, to review: Kubernetes is all about running applications, and Istio is all about connecting applications. You've got it — and now you know why, and why you should care. If you want to try this, I recommend this codelab; it's a Google Cloud hosted thing where you can set up Istio yourself, and all of these claims I'm making about how magical this is will start to become clearer. You'll see how you actually enable Istio in a cluster, how you turn it on on a service-by-service basis, how these things interact, and how you express policy on top of them — and then you'll be able to decide: you know what, I'm going to stop doing all the hard stuff and let Istio make this easy for me. Thank you for giving me your attention for all this time, and thank you to my volunteers — a round of applause for my volunteers, in true Los Angeles form. I'll be around if you have any questions or want to talk. Thank you.

So, I'm going to just use the mic, because otherwise it's going to be a problem — okay, this one's better. This can be vaguely interactive, even. I'm going to talk to you a little bit about managing MySQL as well as MariaDB Server in the hosted cloud. MySQL users? MariaDB users? No? Okay.
So, I work in a consultancy doing community and developer relations, HA and scalability, managing remote teams, as well as market entry, and a lot of MySQL — including this morning's panel — and I can definitely talk to you about VC versus non-VC-backed companies and the engineering decisions that go with that. This is roughly the agenda of what I plan to cover today: MySQL as a database-as-a-service offering, your choices and considerations; the variants of MySQL that are available — if you were here on Friday you might have gone to some of the MySQL 8 sessions, and MySQL 8 is awesome, with GIS functionality, JSON functionality and so forth, so can you actually use all of that functionality here?; the costs, which keep changing — if you give this talk every year you end up with different rates, and it even depends on the time of year; and how you manage, observe, scale, and handle HA and security around it — security tends to be handled by the platform itself, and backups also tend to be managed by said platform.

Database-as-a-service is basically SaaS for databases. The old way: you decided, hey, I'm going to start a new project — say, the Scale 2020 website — then you go get someone to approve it, you order your hardware, you connect it to your network, you install the OS, configure it, and start it; it takes several days, maybe even weeks. Nowadays you just whip out your credit card and get started fairly quickly: on demand, no installation, no configuration, nothing for you to do. It's pay-per-usage, typically per hour, and you can even get free tiers — every cloud provider offers them, and if you're just getting started the free tiers are probably good enough for you. And you don't have to maintain your database, so you don't have to be a database administrator or an expert in MySQL or MariaDB; you can just be a developer saying "I need access to a database, give me one now," and it really is as simple as entering a card number and then using the GUI or calling an API, much like the EC2 scenario. So why use a database service? Typically people want to handle a traffic spike, or they don't have a lot of DBAs, so they want to optimize for operational ease. Indeed.com likes to keep track of salaries in the US, and the average salary last year was $106,000 per annum for a database administrator who knew MySQL, compared to $91,000 per annum the previous year — and it's not uncommon to see these DBAs getting $150k to $350k if you're based in the Valley. So you can totally get rapid deployment and scale-out, but I highly recommend you take a look at the limits your database-as-a-service provider imposes: with RDS from Amazon, for example, you get 40 instances per account, after which you have to apply for more, and sometimes creating a new instance can take like 5 or 10 minutes, sometimes it's fairly instantaneous — again, very much dependent on your cloud provider.
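To give a feel for the "just call an API" point, here is a hedged sketch using the AWS CLI; every value (identifiers, instance class, password) is a placeholder, and the flags shown are the standard RDS ones rather than anything quoted in the talk.

```bash
# Provision a managed MariaDB instance in one call...
aws rds create-db-instance \
  --db-instance-identifier demo-mariadb \
  --engine mariadb \
  --engine-version 10.3 \
  --db-instance-class db.t3.medium \
  --allocated-storage 20 \
  --master-username admin \
  --master-user-password 'change-me-please' \
  --backup-retention-period 7 \
  --multi-az

# ...then fetch the endpoint once it's available and connect with a normal client.
aws rds describe-db-instances \
  --db-instance-identifier demo-mariadb \
  --query 'DBInstances[0].Endpoint.Address' --output text
```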
I'm only going to focus largely on RDS MySQL and MariaDB, RDS Aurora — because I presume some of you may be interested in using it — Google Cloud SQL, Azure Database for MySQL, which is maybe the newest kid on the block, and then Alibaba Cloud, whose products are called ApsaraDB. Oracle Cloud totally has a MySQL service, but I've seen no real usage of it anywhere; in fact, a few months ago there was an article in The Information — a subscription news site — stating that Oracle salespeople were trying very hard to push the cloud service down your throat as part of a maintenance contract, because they were getting better commissions on it, so even if you didn't need the cloud service, you were getting it. There are obviously more providers. Jelastic is a good example of a PaaS offering MySQL as well as MariaDB Server. ClearDB used to be the Microsoft partner of choice — you could use them with Azure — but now Microsoft offers that service themselves, so really ClearDB is mostly useful for Heroku these days, and there are even images offered for things like Percona Server for MySQL. You might be surprised to note that Percona Server, while a very popular branch of MySQL, is not offered in any of the major clouds — you really only get MySQL and MariaDB.

There's also Pivotal Cloud Foundry, which has a MySQL platform-as-a-service solution, and they've got two variants of it. Version one, which they now call Legacy, is based on MariaDB 10.1 and comes with Galera Cluster — that's the idea of using virtually synchronous replication where all nodes are equal, especially at commit time. MySQL for Pivotal Cloud Foundry version two, which is the current version, actually makes use of Percona Server 5.7 and uses standard asynchronous replication — what they refer to as leader/follower, as opposed to master/slave, the MySQL term, which some now deem offensive; so leader/follower is the idea now. Cloud Foundry of course works with pretty much every infrastructure-as-a-service platform out there — AWS, Azure, Google, including OpenStack and vSphere — and I guess the significant difference here is between running asynchronous replication in the version two product and running virtually synchronous replication in version one. There's a catch if somebody sells you "highly available MySQL" without telling you which kind of replication is underneath. Galera makes use of something known as optimistic concurrency control: it will accept a transaction on a node, and when it's time to commit it, it starts applying it across all the nodes — and if you happen to have a schema with a hot row, for example, it may ask you to roll back, so you need to set up retries on autocommit so it can keep retrying, say up to five times. This is different from asynchronous replication, where the leader gets the transaction committed to it but the follower does not necessarily get said transaction — it could actually be lost if the leader happens to go down for whatever reason. Which is why MySQL also has something called semi-synchronous replication, which means that at least one follower will get said transaction. Most cloud platforms do not enable semi-synchronous replication — they all prefer their own variant — with the exception of Google, and we'll get to that later.
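For reference, this is roughly what turning on semi-synchronous replication looks like on a stock, self-managed MySQL 5.6/5.7 pair — exactly the knob the speaker says most DBaaS platforms won't let you touch. A minimal sketch; the timeout value is illustrative.

```sql
-- On the primary: load the semi-sync plugin and require at least one
-- replica acknowledgement before a commit returns.
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 1000;  -- ms before falling back to async

-- On each replica:
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;     -- restart the IO thread to pick it up
```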
Red Hat has been making a good push around OpenShift, and there are variants of the cartridges for the online community version and the enterprise edition. If you want more modern releases, the online community version is naturally better; however, you'll note that only MySQL 5.6 and 5.7 and MariaDB 10.0 and 10.1 are supported. That's missing MySQL 8, which has been released for about a year now, and MariaDB 10.2 and 10.3, which have been out for a year and two years respectively — so these cartridges need updating.

Now, with cloud services, I'd naturally say beware, because sometimes they can disappear with two weeks' notice, and sometimes they disappear overnight, even if you just saw them at a trade show. The ones we're going to cover are fairly large and all backed by publicly listed companies, so they're not going to disappear overnight, which I think is good. HP, for example, also had a cloud service and decided to stop offering it because they realized the competition was pretty rife; however, we did get a bunch of useful things that HP sponsored, including a utility user, which basically allows administrative actions to happen on a database without giving that user access to the schema — good separation of privileges; the ability to enforce the storage engine, to say look, I only want you to use InnoDB and nothing else — if you managed to use another engine, no warnings, it just fails; and the ability to prevent LOAD DATA and SELECT INTO OUTFILE. Lately you must have read in the news that there have been security issues where people can exploit MySQL arbitrary file reads — you could get lots of rooted servers with an insecure version of the Adminer software, basically by making use of LOAD DATA INFILE — and this restriction is used extensively in things like Google Cloud SQL, which disables LOAD DATA INFILE. There's also the ability to restrict the number and size of binary log files: you're launching cloud instances in a shared environment, and you can't be allowed to affect other tenants if your binary logs grow very large, so you can restrict them — if you have a 5 GB instance, you don't want your binlogs to be larger than, say, 4.

Alibaba has been making a great push. It's not very popular in North America, but I'd say it's definitely a serious contender. ApsaraDB for RDS obviously does MySQL and MariaDB, and they also do things like Redis, Postgres, and MongoDB; it's naturally cheaper to use it outside of China, they have data centers in places you would not normally imagine, and if you're going to do any business in China they have China Connect, which you really want access to. AWS was the first service to offer MariaDB Server, several years ago, and now you can find MariaDB pretty much everywhere else with the exception of Google Cloud. And there are some cool things you get with MariaDB Server: if you look at the global status and variables, for example, you get things like access-denied error counters — useful for a cloud-based system — how much memory is being used, rows read, and so forth; things you don't see from MySQL's SHOW GLOBAL VARIABLES.

When it comes to regions and availability zones, the standard understanding is that a region is a data center location containing multiple availability zones, and an availability zone is isolated from failures — so a region is basically a collection of zones, and a zone is an isolated location within a region. Google follows what Amazon does and just calls it a zone. Alibaba Cloud also basically tells you it's got independent power grids as well as networks within a region, and of course the network latency within the same zone is much shorter. Azure actually takes this a little further: each zone is made up of one or more data centers equipped with independent power, cooling, and networking, and you have a minimum of three separate zones in all enabled regions. That is not true with Amazon, for example.
To know whether an Amazon region has a minimum of three separate zones, check whether that region offers Aurora, because Aurora needs to write to a minimum of three zones — and it was only fairly recently, maybe about a year ago or less, that you could get Aurora in, say, Singapore, where data center space is extremely expensive because it's a little island. On Azure, within a region's availability zones you also have fault domains and update domains, so if you create three or more VMs across the zones in a region, your VMs are effectively distributed across three fault domains and three update domains. Azure as a platform is a little smarter here: it recognizes the distribution across update domains and will make sure the VMs in your different zones are not updated at the same time. In theory this is good, because it means that during scheduled maintenance you have higher availability on Azure than on, say, RDS — RDS is still available during said maintenance, but with a much more elevated IO workload.

When it comes to locations, Amazon RDS is in a huge number of them, as is Google Cloud SQL — with Google Cloud SQL you want to make sure you're getting second-generation instances, and similarly with Azure go for Gen 5, not Gen 4 (in terms of cost, Gen 4 and Gen 5 are the same, so definitely get Gen 5). They don't like to tell you where the data centers are located, which is kind of annoying: Southeast Asia basically equates to Singapore, East Asia is Hong Kong, Australia Southeast is Melbourne, Victoria. Alibaba Cloud is extremely heavy in China; many of these services are also in, say, Mumbai; and Alibaba offers locations like Kuala Lumpur and Jakarta, which no one else tends to offer. You need to look at finer grain at how many zones you get per region, but the trick with Amazon, at least, is to see if they offer Aurora — then you know they have a minimum of three — and this keeps expanding further and further.

When it comes to SLAs, everyone promises very little downtime per annum — around 99.95% uptime — and everyone tends to offer you some kind of service credit, with the exception of Alibaba Cloud; and of course the scheduled maintenance windows don't count the increased IO and latency. 99.95% in a calendar month means you should never have more than about 22 minutes of downtime per month (a 30-day month has 43,200 minutes, and 0.05% of that is about 21.6 minutes) — again, not counting the upgrade or maintenance cycle; the maintenance cycle means you could suffer up to 140 minutes of elevated IO latency, but generally speaking no one will be down longer than 22 minutes. Now, if you ask "can I afford to be down 22 minutes a month?", the answer is probably no, and you will then create more replicas — this is how you scale out in the cloud and stay highly available, so the amount you're down at any one go is quite minimal. You will always find plenty of news outlets covering when a cloud region goes down, but this should not stop you, because you should be able to run in multiple regions, and if you need help with that there are tools like Chaos Monkey, so you can do chaos deploys and just kill nodes. They all have fairly cheap ramp-ups — you can start for even $20 and go up — but they give you varying levels of support, so you may only get local business hours, or with Amazon a response within, say, 12 hours,
and of course it goes up from there. Now, management: management tends to cost a lot more money. Amazon, for example, will sell you managed services, but it starts at $15k per year, which is generally quite pricey; people like Google, Microsoft, and Alibaba generally give you self-management. But there are companies like Rackspace Managed Services who will tell you they fully support AWS, Alibaba Cloud, Google Cloud and so forth, so you can totally go and use an outsourced service like Rackspace to do it, and they truly have fanatical support — every time I've had to contact Rackspace I get responses the quickest among the lot; it's clear support is in their DNA. I used to also talk about Rackspace's cloud, but it's based on an older version of OpenStack and not really very popular; I think they're refocusing on managed services nowadays, largely because they're owned by a private equity firm. Also don't forget that there are many, many companies doing remote DBA services — Pythian, Percona, MariaDB, who were in the expo hall earlier — and they will all help you manage this as well. The only disadvantage, I guess, is that you don't get a single bill: you pay Amazon and you pay a provider like MariaDB.

Now, when it comes to MySQL versions, AWS tends to be really quick about updating: they offer 5.5, 5.6, 5.7, and 8.0 — and 8.0, as I said, has been around for about a year now and is very performant, very useful — and from a MariaDB standpoint they support 10.0 through 10.3. With 10.3 you get the ability to do SQL_MODE=ORACLE and migrate Oracle application workloads — including PL/SQL-style code — over to MariaDB. Google offers 5.6 and 5.7, but they seem to be lagging behind MySQL releases; you're getting a fairly old 5.7, and where it used to be updated maybe every four to five months, lately the delta has been a lot longer. Microsoft has been doing a good job with 5.6 and 5.7 as well as MariaDB Server 10.2 — all fairly recent — though again no 8.0 or 10.3. Alibaba Cloud is a bit of an anomaly: they offer 5.6 and 5.7, onto which they heavily add patches — they call them the AliSQL patches, and you're allowed to go look at them on GitHub as well; some of this goes upstream to MySQL, and some goes to MariaDB. Their MariaDB offering is quite interesting, because they don't actually offer MariaDB Community — they offer MariaDB Enterprise, so you're effectively using the enterprise product with a MaxScale router, and you're obviously paying an added premium for said service, but you're getting an enterprise product, not the community version. Everybody else only ships community editions, so what you get on your laptop or your dev environment is basically what you get in the cloud.

Everyone now tends to give you API access as well as standard MySQL client access. It used to be that with Google, for example, they would only allow access from App Engine, and Rackspace basically said you had to access it via a private hostname within the Rackspace network; now everyone lets you access it via a standard MySQL client from, say, your laptop to an IP address — it's not a big deal anymore. Of course, if you're an experienced DBA, you'll find that you cannot configure MySQL as much as you can with self-hosted servers. This was taken from the MySQL Performance Blog: MySQL itself provides 523 options you can modify, while RDS would only give you 283 options via the web interface — a little over half — and 58 of those were immutable, so not all of them were changeable. Of course you can't change things like your base directory, your data directory, and various other variables, since this is a managed service and things have to be locked down a little. It's also important to remember that things like audit logs, the memcached plugin, the binary log settings, performance schema, semi-synchronous replication and so on are not turned on; you can turn on performance schema, memcached, and the audit logs via parameter and option groups, but generally speaking you're getting a very cut-down version of MySQL, really just meant to deploy fairly easily in the cloud.
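Since you don't get SET GLOBAL on most of these knobs, the managed-service way to change them on RDS is through a custom parameter group. A hedged sketch — group name, engine family, and parameter choices are all illustrative, continuing the hypothetical demo-mariadb instance from the earlier example.

```bash
# Create a custom parameter group for the engine family in use
aws rds create-db-parameter-group \
  --db-parameter-group-name demo-mariadb-params \
  --db-parameter-group-family mariadb10.3 \
  --description "custom settings for demo-mariadb"

# Flip a static parameter (applied at reboot) and a dynamic one (applied immediately)
aws rds modify-db-parameter-group \
  --db-parameter-group-name demo-mariadb-params \
  --parameters "ParameterName=performance_schema,ParameterValue=1,ApplyMethod=pending-reboot" \
               "ParameterName=max_connections,ParameterValue=500,ApplyMethod=immediate"

# Attach the group to the instance
aws rds modify-db-instance \
  --db-instance-identifier demo-mariadb \
  --db-parameter-group-name demo-mariadb-params
```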
Now, costs. Cloud unit economics are quite interesting, and they keep changing all the time. Snap, the company behind Snapchat, basically said they were going to spend two billion dollars on Google Cloud over five years — that's a pretty penny. And in the news in the last week was how, in Lyft's IPO prospectus, every ride you take on Lyft means Amazon gets 14 cents, and some people were going "wow, that's crazy, Amazon's making 14 cents per ride." If you've ever actually launched a data center and kept it running, you may realize that 14 cents per ride is probably not as expensive as you think, but naturally that doesn't stop the commentary, and you can expect the same thing when Uber files its IPO prospectus. Naturally, you can also consolidate workloads: sometimes RDS — or any service — gets a new high-memory instance type with, say, 3.6 times the performance of the previous one, and if you have several workloads you can consolidate them into one. Prices always vary between regions: we find the US is the cheapest, the EU is obviously pricier, Singapore tends to match it if not be pricier, again because of location, and South America tends to be the most expensive alongside Australia — it takes a lot of effort to get cables down there. We've seen price drops every year, but the last major price drop across all the clouds was about three years ago; now any price drop is fairly marginal, in the five-or-six-dollar range, if you see it at all. A db.m4.large is about $1,500 per year, and that price has been around since 2017; the same instance in Singapore would set you back about $2,200 as opposed to $1,500, so there's roughly a 48% premium just for hosting in Singapore versus, say, US East or US West. You may be wondering why I've only put a db.m4.large here: it turns out Amazon keeps updating the hardware, as does every cloud provider, but Amazon has managed to go through five generations whereas Google and Azure have only gone through two. Across those generations you're basically getting the same number of vCPUs, more ECUs — a relative measure of the integer processing power of an Amazon EC2 instance — the same amount of RAM, and a lot more bandwidth, at more or less the same price, give or take a few tens of dollars. So upgrading instances can give you more performance for about the same money or a little less, and these change every couple of years — you can go back to 2013, to 2014-17, and now to 2019, and you'll see the prices do drop. It turns out cloud computing gets cheaper as the years go by.
You also tend to get better hardware, so you're getting more bang for your buck, so to speak. When it comes to something like Google, you've got to enable billing, and they've also got several first-generation instance types — I don't recommend first generation anymore, and in fact if you're a new customer you can't get access to the D8 or the D16; the closest thing you can put into production today is a db-n1-standard-8, which will set you back about $6,700 per year. Google has other interesting limitations that don't apply to other vendors: for one, the maximum size of a MySQL DB instance on disk is 500 gigabytes per instance — but storage is also included in the price I list here. You'll find that with Azure, which is probably on the next slide, storage costs are not included, so you've got to start adding those costs in to work out which cloud vendor is going to be cheaper for you. Google also has a maximum on concurrent connections, and if you do packaged billing you can save up to 50% as well. On to Azure, where you can get a good test machine with 4 vCores and 8 GB of RAM for about three grand, but you're paying storage costs of $0.115 per gigabyte — this used to be $0.12, so the costs have actually dropped over the year — and they were actually planning to charge you by IO, but I suspect it was harder than expected to do by IO, so now you just pay these costs as-is, no IO charge. Alibaba Cloud is a similar comparison: about $4,500 for a similar machine, and if you take a subscription you pay only about three grand, and you can also pay for it monthly — monthly payments are almost unheard of in the cloud world when you're booking in advance. Storage again can be about 19 cents per gigabyte per month in the US — versus China, prices of course do vary — and they also add on additional features: audits, backups, monitoring, all of which tend to cost you more.

When you're hosting your application, definitely keep it close to where your database is: if you're keeping your database in Microsoft Azure, you do not keep your application in, say, Amazon's cloud. Most providers have multi-AZ, and Amazon itself implements multi-AZ via DRBD, which is block-level replication across the availability zones — DRBD was also out in the expo hall. Naturally you have good availability because it can do automatic failovers, but if you have a large database, a cold start can be slow, and you don't get an extra read replica — you're basically paying for a passive follower. And if you compare performance with multi-AZ and without, there's a significant difference, largely because of the underlying DRBD structure. External replication, from 5.6 onwards, actually works fairly well throughout all the clouds: you just enable backup retention, which gives you binary log access and allows you to replicate from, say, Amazon to your own laptop, or to get replicated out to another platform. You can also replicate into these relational database services — with Amazon it started with 5.5, with Azure it starts with 5.6, and it's similar for Cloud SQL — and they all usually have some kind of stored procedure that allows you to set up replication into the service. Getting started is fairly easy: you do a mysqldump and then you load it, and upgrades tend to work just by upgrading read replicas nowadays.
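The "dump it and load it" path the speaker mentions is literally a one-liner if the dataset is small enough to stream; the hostnames and database names below are placeholders. (For larger datasets, the parallel mydumper/myloader tooling mentioned later in the talk is a better fit.)

```bash
# Stream an existing database straight into a managed instance.
mysqldump --single-transaction --routines --triggers \
  --host=old-db.internal --user=app --password \
  appdb \
| mysql --host=demo-mariadb.abc123.us-east-1.rds.amazonaws.com \
        --user=admin --password appdb
```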
When it comes to disaster recovery, you need to have a plan: naturally, don't keep everything in one region, and naturally, have good backups. You can't use a tool like Percona XtraBackup or mydumper or mariabackup against these services; you're going to have to rely on Cloud SQL's automatic backups, or Amazon's automatic backups with point-in-time recovery and full daily snapshots, which also have a backup window — Amazon most likely does this via EBS snapshots, because you can also create snapshots and save them to S3. With multi-AZ you can have backups taken from the standby, which is much, much better — obviously you don't put any load on your leader — and you want to increase backup retention from one day to seven. And if you happen to use Aria with MariaDB, which is actually enabled, automatic backups may not necessarily work for it; it's part of the caveats in the documentation. Microsoft also does automatic backups with point-in-time recovery, and backup retention is seven days, up to 35. Now remember, you can have all these wonderful backups, but restores can take an extremely long time. I don't know how many of you use the Pocket-like service called Instapaper, but they had a failure not long ago, maybe six or seven months ago, and they were down for more than 48 hours — not because they didn't have backups, but because restoring all that data took something like 48 hours before you could get at your saved articles again. So it's extremely important to remember that restores take time; if you want to avoid the whole restore-from-backup scenario, the best situation is to have more nodes.

When it comes to monitoring, you have things like CloudWatch and the Azure Portal, which are all very good, and Google has really improved their read/write graphs with the Stackdriver acquisition, so you get very good Stackdriver monitoring. Percona offers Monitoring and Management as well, and you can check that out — they basically package up Prometheus and Grafana and let you do monitoring, which is quite handy — and of course there's Datadog, VividCortex, and a whole slew of other monitoring solutions out there, so your options are pretty much limitless. Now, MySQL and MariaDB have many storage engines — MariaDB especially has things like MyRocks and TokuDB, Spider, Connect — and it turns out all these cloud vendors will disable every storage engine you think is cool; they'll only allow you to use InnoDB and MyISAM. I used to also say they'd allow you to use XtraDB, but MariaDB has obviously ditched XtraDB for InnoDB, so realistically you really only have InnoDB and MyISAM. When it comes to HA, you really want to make sure you plan for node failures: nodes fail, and more often than you might think, and node provisioning is not always as quick as you think — it's not guaranteed to be a two-second operation. Always back up; you may get bad nodes, at which point you can just kill said bad node, it's not a big deal. Google, as I said, is the only one that does semi-synchronous replication, so you actually get a usable replica when you pay Google for their high-availability instance — and this means you should set up alerts for replication lag as well. I think Google's semi-sync is actually pretty awesome.
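If you're relying on a replica — semi-sync or otherwise — for availability, the lag alert the speaker recommends boils down to watching a couple of fields on the replica. A minimal sketch using the classic command (newer MySQL versions also accept SHOW REPLICA STATUS):

```sql
-- Run on the replica; \G gives vertical output in the mysql client.
SHOW SLAVE STATUS\G

-- The fields worth alerting on:
--   Seconds_Behind_Master : replication lag in seconds (NULL means stopped/broken)
--   Slave_IO_Running      : should be 'Yes'
--   Slave_SQL_Running     : should be 'Yes'
```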
These platforms do not provide you with everything, and storage engines are the least of your worries in terms of what gets disabled. There are many other things that are disabled, including replication filters, semi-synchronous replication, data-at-rest encryption options, Galera Cluster, the ability to use things like HandlerSocket, parts of performance schema, installing your own plugins, and the latest cool stuff in MySQL 8 — the X Protocol and the new shell, which let you query MySQL using JavaScript or Python instead of just regular SQL. None of this will work with the existing cloud vendors. These are all things many people talk about and think are cool, but if you're using MySQL or MariaDB in the cloud, all this stuff just doesn't work — and this is just the tip of the iceberg; there's plenty more that gets disabled. Now, if you ever think of using things like memcached or the audit plugins in MySQL and MariaDB, Amazon does provide both via option groups: the memcached plugin works only with MySQL, and the MariaDB server audit plugin works with both MySQL and MariaDB.

You definitely want provisioned IOPS if you care about your database and you're going to use it in production, because the IOPS given to you by default can be very limited — you may only be guaranteed a trickle, which is nothing for a database — whereas with provisioned IOPS on, say, Amazon (and also Google and Azure), you can go out to something like 3 terabytes of storage and guarantee 30,000 IOPS per database instance. If you're going to run in production, you want provisioned IOPS. Just remember that the MySQL page size is 16 kilobytes, and while you can generally tune around the page size, it may not be exposed to you by all of these cloud providers either. Provisioned IOPS is also way better than RAIDing volumes yourself, because you can take provisioned IOPS up and down dynamically. Of course, if you're doing log forensics — the slow log, the general log — you have to download and parse the logs, or use APIs to pull down the log files; you don't get the SUPER privilege, so you can't, for example, simply skip over replication errors, which can occur if you're using async replication; you can't just turn off sync_binlog; and you don't get access to the underlying operating system, even though it's Linux, so you can't just go and look at sar or tcpdump. With automatic upgrades, regressions can happen in point releases of MySQL — this is true of the MySQL of yesteryear and of the MySQL of today. I've given you a bunch of 5.5 examples, but we've also seen MySQL in the 8.0 range ship a regression fix in a point release. Cloud vendors usually don't upgrade instantly, but you've got to make sure these bugs are reported and don't actually affect you, otherwise you may find you're getting nasty regressions — if you relied on the query cache, for example, and suddenly it's gone, your query performance can drop tremendously. If you're going to benchmark things, sysbench is probably really good for you — you can do OLTP tests and so forth — as is the Yahoo Cloud Serving Benchmark (YCSB), or Google's PerfKit Benchmarker, which allows you to compare performance across Google Compute Engine, AWS, and Azure.
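A minimal sysbench sketch of the kind of OLTP comparison the speaker has in mind — the endpoint, credentials, table counts, and durations are all made up, and the sbtest schema is assumed to already exist on the target.

```bash
# Seed the test tables on the target instance...
sysbench oltp_read_write \
  --mysql-host=demo-mariadb.abc123.us-east-1.rds.amazonaws.com \
  --mysql-user=admin --mysql-password='change-me-please' \
  --mysql-db=sbtest --tables=10 --table-size=1000000 \
  prepare

# ...then run the same workload against each cloud you want to compare.
sysbench oltp_read_write \
  --mysql-host=demo-mariadb.abc123.us-east-1.rds.amazonaws.com \
  --mysql-user=admin --mysql-password='change-me-please' \
  --mysql-db=sbtest --tables=10 --table-size=1000000 \
  --threads=16 --time=300 --report-interval=10 \
  run
```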
There tend to be no roadmaps, so you have to look out for mailing list posts, and typically watch the cloud vendors' large events: Google has its large Next event, Amazon has its large re:Invent, and Microsoft only recently started offering its Azure MySQL service, so start watching out for Microsoft's events too. Now, here's an example of poor usability in the cloud. I won't label which cloud vendor this is, but the interface allows me to create a test database with a root user and a password — and it also allows me to create an instance without a password, which then prevents me from ever logging into my instance, because I did not initialize it with a password. That is a usability bug, because the only way to fix it is to go and kill that instance and start another one. So your mileage may vary, and I guess the maturity of these clouds plays a huge role in terms of what rocks and what doesn't.

Now, you can also run MySQL or MariaDB yourself inside something like EC2 or a compute instance. You can do multiple geographic regions via, say, semi-synchronous replication; you can even run a single Percona Server instance — since you can't get Percona Server in any of these cloud services — or a MariaDB Server instance if you want to try out the latest 10.4 release candidate. Some good advice here: use an additional EBS volume for the InnoDB tablespace; you can use RAID on your EBS volumes; mount partitions with noatime as well as nodiratime, which typically yields about 10% better IO performance; monitor with tools like Icinga, Nagios, or Percona Monitoring and Management; do your own snapshot backups and save them to S3 — snapshot backups can be done via something like mydumper, which also allows for parallel dump and load; if you want to save money, you can use spot instances; MHA can help you with automatic failover; and of course you can set up your own virtually synchronous replication via Galera Cluster or Percona XtraDB Cluster — stuff that you cannot get in the cloud, since you can't easily get MariaDB Galera Cluster or Percona XtraDB Cluster as a managed service.
Aurora. Aurora is what Amazon tells you is a good idea after you've outgrown MySQL. I have also seen customers outgrow Aurora and then want to move back to compute instances, so it's an interesting dichotomy that you can also outgrow what they consider a very large, available database. Basically, each 10-gigabyte chunk of your database volume is replicated six ways across three availability zones, and they make use of quorum writes — a write has to land on four of the six copies — to increase write performance. It has self-healing and near-instant recovery; it's got its own proprietary parallel, distributed, asynchronous redo log; the cache is always warm through database restarts because the cache is decoupled from the database process; you can have up to 15 read replicas per leader for scaling your reads horizontally, and any read replica can be promoted to being the new leader — instantly, actually instantly this time, with Aurora; and of course you can have continuous, incremental, off-volume snapshots to S3, with backups placing no load on the database at all. There are two variants of Aurora: the 5.6.10 fork, which drops some of the stock internals because it has all this other machinery and uses Aurora's own replication, and Aurora 5.7.12, which launched about a year ago and comes with things like JSON support, spatial indexes, generated columns, and virtual columns — but no one is offering an 8.0 variant yet. This is honestly fairly cool: Aurora also has a lab mode, which will give you even more optimizer features, and in the MySQL space Aurora is the first to offer parallel query, though again only for the 5.6 variant — they're working on the 5.7 one. Basically, each node in the storage layer has plenty of processing power, so Aurora can make use of all of it by taking the analytical queries you may run — things with window functions, for example — and running them in parallel across hundreds or thousands of storage nodes, with a lot of speed benefit, several orders of magnitude. This model reduces network, CPU, and buffer pool contention, so you can run analytical as well as transactional queries simultaneously on the same table while keeping high throughput on all of those queries. Parallel query is something you've seen in other databases like Postgres; we don't see it in any open source variant of MySQL, so parallel query is new and Aurora-only for now, and I guess this is kind of exciting, because we may start seeing more interest from the MariaDB and MySQL side so we can get this inside a regular server.

Kind of looking ahead, we expect you'll see things like OtterTune crop up more, which tunes database deployments: it reuses training data from previous tuning sessions, so it doesn't need to generate an initial data set for training its machine learning models, which drastically reduces tuning time, and it generates a configuration that's almost as good as one chosen by a DBA. There was actually a post on the Amazon blog saying that OtterTune-generated configurations on EC2 basically gave a 60% reduction in latency and 20 to 35% better throughput compared to what a DBA would configure — sort of a self-driving database. I put a little Oracle banner here because it's an ad you see very commonly in the Wall Street Journal — they love to tell you about their self-driving, autonomous database — but that's for Oracle the database, not Oracle's MySQL. In the open source world we see OtterTune as potentially useful for this sort of thing, and then Peloton is the self-driving DBMS from the same people — it can predict future workload trends before they occur, it speaks the Postgres protocol, and it has integrated artificial intelligence and machine learning; it's from CMU. I don't know where Peloton will go, but I think you'll start seeing more MySQL use cases for this kind of tuning. If you use MySQL 8 today, it actually has default tuning options that make it more usable from the DBA standpoint, and I expect this will only get better over time. MariaDB is also working on tuning — they made some announcements last week at OpenWorks, their conference — but there's no code, it's still a lot of slide decks at the moment.

A quick set of closing thoughts, because we've got maybe less than 20 minutes. Hardware obviously varies per region — you can get older hardware in some of the Amazon regions, for example, and it becomes quite apparent when one region tells you "we've got version X" and another already has version X plus one. Software manageability can also vary per region; again, this depends largely on what kind of hardware is there and what kind of software they've deployed on top of it — a real problem in the OpenStack world, a lot less of a problem in these current closed environments. And credit cards: not using them carefully can make the bill go through the roof.
Famously — I don't know how many of you have used SmugMug before, but they recently acquired Flickr — the guy behind SmugMug, Don MacAskill, used to walk around with an Amex black card, largely because he was such an early adopter of the cloud and they spent so much money on Amazon that it earned him one. With regards to credit cards, make sure you have a backup card as well — Google, Amazon and so forth all allow this — because if your card fails to bill for some reason, like the limit was hit or you had to close the account due to fraud, you could be fairly miserable: these systems are all run by machines — "oh, I couldn't charge this person, I'll send an email; oh, I still couldn't charge them, maybe I should suspend their service" — not very pleasant. Definitely don't upgrade immediately to the latest new releases. We see lots of this especially with Aurora: people go from Aurora 5.6 to Aurora 5.7 immediately, and shortly afterwards they say it doesn't work the same — which is true, because Aurora 5.7 is kind of a shell of MySQL. So don't upgrade immediately; wait for others to upgrade, upgrade in test environments and so forth, and always read the release notes — some release notes are better than others, but always read them. If you're going the whole EC2 route, look at other managed services: MongoDB, for example, has Atlas for cloud management; we don't see much of this in the MySQL world yet, though MariaDB has been talking about doing it since last year, so hopefully we'll see more of that soon. And definitely, if you're looking at sharding, check out Vitess, and if you're looking at proxies, things like ProxySQL kind of rock. With that, thank you very much for listening — clearly the Scale die-hards are here till the end. Do you have any questions? The slides will be up online, as will the recording, so thank you. This one or this one? So — you are clearly a Scale die-hard, you're also a Scale die-hard, you're here till the very end.