WP Engine. So at WP Engine today we're undergoing a shift in our infrastructure, working to migrate from VMs to Kubernetes. We currently run over 6,000 VMs, which we use to power over 500,000 domains across 300,000 WordPress installations. We estimate that 5% of the online world visits at least one site we host every single day.

For those of you who aren't familiar with WordPress, it was created and open sourced in 2003. It traditionally runs on a LAMP stack, so you'd normally run it on a VM or with a VPS provider. The classic use case is generally a pretty low-utilization website: a lot of customers run their personal blogs, or websites for law firms, engineering firms, apartment complexes, and so forth on WordPress. But you also have customers who run extremely high-performance sites. Some large WordPress instances out there today include the CMAs, the Country Music Awards; the TechCrunch website runs on WordPress, as does The New Yorker.

With WP Engine, we allow our customers to install any plugins they want in their WordPress installation, and they can add any custom code they have, which creates some of our difficulties. And with WordPress, all of your workers require access to the same file system, so that if something changes, they all see the exact same files.

Today we have our virtual machines, which contain Nginx, a caching layer, PHP, and then our storage and our databases, and they all live in that one VM. One of the problems we face with this is that the domains we host live on that single VM, so if that instance goes offline, the entire site is offline. So the first thing we're trying to do is move the PHP piece out of the VM and into the Kubernetes cluster. This will allow us to eventually move all of these pieces off of the single VM and provide a more reliable infrastructure for our customers. But we're doing it one piece at a time to keep the process as simple as we can.

So we first started looking into how we were going to do this. Our first thought was: we'll just give every installation we host its own PHP-FPM instance in Kubernetes. We'll deploy those, we'll let Kubernetes manage all of this for us, and everything will go great. We'll just Kubernetes all the things. It will do everything for us. We can even do multiples, we can do replica sets, and everything will be great. Yeah, it turns out not so much. Remember how I said we have 300,000 installations to power? Kubernetes has some limits. Those limits in 1.8 are 150,000 pods per cluster. We have 300,000 installations, and we want to run multiple instances of each so that if one fails, we still have reliability. That's way more than 150,000. Not to mention, you can only have 100 pods per node. Websites, especially WordPress sites, have extremely volatile utilization: one minute a site will be doing nothing, then for the next 10 minutes it might use all the CPU and memory it can get, and then do nothing for the rest of the day. That creates a problem when you're trying to schedule and you can only put 100 pods per node. We needed something that let us go denser than that, which led us back to the drawing board. So with these limits in mind, we took our domain expertise, which is around serving WordPress, and built our own PHP application server.
So the solution we found was to develop a custom application server written in Go, which runs PHP-FPM workers. We then mount the site's content into an isolated worker jail for every request as it comes in. This is very similar to what you get with the VM or any other PHP-FPM process, but we do it in real time for each request. This allows us to limit each PHP-FPM worker to only the resources it needs, which brings the total number of deployments we need from 300,000 down to only 6,000, one for each of the VMs we run.

So this is what our setup looks like in this first stage. We've got our VM, and the request comes into that VM just like it always has with our infrastructure. From there, the request gets sent to one of the application servers in Kubernetes. That application server looks at its available free workers, which all run inside their own namespace for isolation. It mounts the files for that specific request and sends the request to the worker. Once the worker finishes processing the request, it sends the response back out, we unmount the files, and the response goes back to the VM and back out to the customer that made the request. Then, once everything has been cleaned up out of that worker, we add it back to the pool of available workers so we can reuse that same worker for another request for a different installation. So this has brought us down from 300,000 deployments to only 6,000.

So now, how are we going to manage 6,000 deployments? It sounds a lot more manageable, at least. We like to use Helm for all of our internal deployments, and Helm requires three pieces of information to do a deploy. First you have the chart, which is all of the templates you're going to deploy for your application. Then you have the values, and these are the things that differ between the different deployments of this application: database connections, passwords, anything like that which happens to vary between your deployments. And then you have what Helm calls Tiller. Tiller is something that runs inside your Kubernetes cluster, and it's responsible for actually deploying and creating the Kubernetes objects you've asked Helm to create. The local Helm command passes everything to Tiller, which then processes it and creates those objects for you inside your cluster. We'll see some examples of how that works in a little bit.

So now we only have 6,000 worker deployments to manage. We like to use Helm, but that means 6,000 unique values files we have to manage for Helm (one of those per-VM values files might look something like the sketch below). This gets to be really hard to maintain. We also have to remember which of our seven regional clusters each of those values files has to be deployed to: 6,000 spread across seven clusters. What if we have to add a new field to that file? It sounds like a lot of work if we're actually having to manage these files by hand. So our thought was: what if we didn't manage these files? What if something could automatically do this for us? Something we've seen throughout the community is the pattern of using an operator, where the operator is built with the logic you have for deploying your working unit of software.
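To make the values-file problem concrete, here is a minimal sketch of what one of those per-VM files might look like. The field names are illustrative only, not our actual schema:

```yaml
# Hypothetical values file for one VM's PHP worker deployment (field names illustrative).
vm:
  name: vm-1234               # the VM this worker pool serves
  region: us-central1
replicas: 3                   # how many application-server pods to run for this VM
php:
  fpmWorkers: 8               # PHP-FPM workers per application server
database:
  host: db.vm-1234.internal   # connection details that vary per deployment
  name: wordpress
# secrets such as passwords would come from elsewhere, not a values file checked into git
```

Multiply a file like this by 6,000 VMs across seven clusters and hand-maintenance stops being realistic, which is what pushed us toward the operator pattern.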
Following this pattern allows us to manage our infrastructure with code, which greatly reduces the likelihood of an error during our deployments. So we decided we would deploy our new PHP workers using the operator pattern. But for the deployment operator to actually be useful for us, we needed a way to make all of the existing data we had about our VMs available to that operator in the Kubernetes cluster. To solve this, we built a tool that watches our cloud providers' APIs and creates a custom resource inside our Kubernetes cluster representing each VM that's in the same region as that cluster.

For those of you who aren't familiar with custom resources, they basically allow you to extend the Kubernetes API. I personally like to think of them as a way to put whatever data I care about into the Kubernetes cluster and work with it the same way I would any other object in Kubernetes. You can store whatever data you want in the fields of a custom resource. They were released in Kubernetes 1.7, and they replace the now-deprecated third-party resources. So if you were familiar with third-party resources, these are very similar, just the next step in the process.

Here we have an example of a custom resource definition. To be able to create custom resources, we first have to create a definition. If you're familiar with Kubernetes objects, this looks very familiar. We've got our apiVersion and our kind defined. Then for our name, this is a very specific name that's required: it has to be the plural name of the objects you're creating, plus the group this is going to go in. Down in our spec, we have our scope, and this one is currently scoped to a namespace. There are two different scopes, namespaced and cluster. Most objects in Kubernetes are namespaced, so they're only available in the namespace you're in, but you can also create cluster-scoped custom resources. Our VM custom resources are actually available for the entire cluster, no matter what namespace you're in. Then you define your group; in this case, we have stable.nicolerenee.io, and that's just the DNS name you're going to use. It's no different from what's at the top, apiextensions.k8s.io. Then we define our version, v1. And for our names, we define our kind, which is Character, and the plural and singular of that, which are characters and character.

Then we're able to create an actual custom resource. Because we deployed that custom resource definition, we can now use apiVersion stable.nicolerenee.io/v1 and kind Character. Just like any object in Kubernetes, it's required to have a metadata name, so we called it thing1. Then in spec, we're able to add any fields we want. So we created thing1, from Cat in the Hat, created by Dr. Seuss.

We have a little demo here showing the use of custom resources and how you can interact with them. This is the custom resource definition I was just showing a minute ago, characters.stable.nicolerenee.io. We can just kubectl apply this like we could any other Kubernetes file, and we can see a custom resource definition for characters was created. Then we'll take a look at a CR file for the things, and we've got two characters listed in here, roughly like the sketches below.
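Reconstructed from the description above (field values as spoken in the demo; the exact files on screen may differ slightly), the definition and the characters file look roughly like this. apiextensions.k8s.io/v1beta1 was the CRD API version in the Kubernetes 1.7/1.8 era this talk covers:

```yaml
# Custom resource definition: metadata.name must be <plural>.<group>.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: characters.stable.nicolerenee.io
spec:
  group: stable.nicolerenee.io
  version: v1
  scope: Namespaced        # could also be Cluster for cluster-wide resources
  names:
    kind: Character
    plural: characters
    singular: character
---
# The CR file with the two things.
apiVersion: stable.nicolerenee.io/v1
kind: Character
metadata:
  name: thing1
spec:
  name: thing1
  from: Cat in the Hat
  by: Dr. Seuss
---
apiVersion: stable.nicolerenee.io/v1
kind: Character
metadata:
  name: thing2
spec:
  name: thing2
  from: Cat in the Hat
  by: Dr. Seuss
```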
So we've got thing1 and thing2, because those two troublemakers always come together. We'll go ahead and create these just like we would anything else, and we can see that without us doing anything special, kubectl automatically knew how to tell us that it created a character. It didn't say a custom resource was created; it said character thing1 and character thing2 were created. Then we'll go ahead and create one more that we'll call Nemo, and character nemo is created.

We can work with these objects just like anything else. We can do a kubectl get characters, and this works for everyone who has access to the cluster without any client-side configuration, and you can see all three characters we created. You can also describe these just like all other objects. So we can take a look here at Nemo, and we can see all the fields we specified: it's by Disney, from Finding Nemo, and the name is Nemo. But those of you familiar with Disney characters will know that Nemo is technically a Pixar character, so let's go ahead and fix that. We'll take a look at this file, switch it over to Pixar, and then we can just apply the file and it will update the custom resource to use the new fields. We can take a look at it again, and we're able to see that it's now by Pixar.

One of the interesting cases with custom resources is that if you name your custom resource the same thing as a built-in Kubernetes object, you'll have to access it using the full name, including the DNS piece. So in this case we'd kubectl get characters.stable.nicolerenee.io, and it will print those for us. So if you decide to name yours something like deployment, or along those lines, you'll have that problem.

So now that we've got custom resources for all of our VMs, we built an operator that we've made available open source on GitHub called Lostromos. What Lostromos does is watch custom resources in Kubernetes and then deploy a Helm chart for you, passing in the values that are defined in the custom resource. We run a version internally that monitors our VMs and deploys our PHP worker for every VM we have. When a new VM gets created with Google or Amazon, the custom resource gets provisioned automatically and then the PHP worker gets automatically deployed; no manual intervention is required.

We've got a little demo here showing some of this. We create a Nemo, and that creates the character. Lostromos gets the notification that it was created, and then calls out to Tiller to create everything that's in our Helm chart. The same thing works for deletions. In that case, we already have our character and our deployments, and if we go through and delete Nemo, we'll see the character get removed, and Lostromos will go out to Tiller and remove all of the resources it deployed.

So now let's see this actually work. What we have first is a really simple demo application written in Go. All it does is print out the character's name, where they first appeared, and who created them. It uses environment variables to get those, so we're going to use resource name, resource from, and resource by so that we can pass those in. Then let's take a look at our sample chart that will actually create our deployment. Most of this is pretty stock, but down at the very bottom we pass in the environment variables, roughly like the sketch below.
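The tail of that deployment template would look something like this. The exact environment variable names are my guess from the description (the talk just says they're named after the resource's name, from, and by), and the surrounding indentation is abbreviated:

```yaml
# Sketch of the bottom of the demo chart's deployment template: the custom
# resource's spec fields are passed to the Go app as environment variables.
          env:
            - name: RESOURCE_NAME
              value: {{ .Values.resource.spec.name | quote }}
            - name: RESOURCE_FROM
              value: {{ .Values.resource.spec.from | quote }}
            - name: RESOURCE_BY
              value: {{ .Values.resource.spec.by | quote }}
```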
And right here, the part that matters: if you're not familiar with Helm, you have values, which holds all of your values, but we've created values.resource by passing in all the fields of the custom resource. So then we can reference spec.name, spec.from, spec.by. You can nest these; they can be arrays, they can be maps; we can go all the way down.

So now let's take a look at our config file for Lostromos (a rough sketch of it follows this demo walkthrough). We can see that we're telling it the CRD we're going to monitor, with the group stable.nicolerenee.io and the name characters. We're giving it our demo chart, we're telling it to deploy into the namespace demo, we're telling it where to find Tiller, and then we're configuring the logging. Generally, if you're deploying this in a Kubernetes cluster, your Tiller would be at tiller in kube-system, but in this case we're doing some work so we can run it locally instead of in the cluster, so we port-forwarded Tiller to localhost.

As you can see, we still have our thing1 and our thing2 from earlier. So we can go ahead and start Lostromos here, and we can see that as soon as it started, it immediately noticed that the resources for thing1 and thing2 were there and created those. At startup, it finds all the existing custom resources and considers those added. Then we can come down here and look at Helm, and we can see that Helm has two releases, one for thing1 and one for thing2, deployed with our KubeCon demo chart. Next up we'll take a look at the services, and we can see those were created for thing1 and thing2, along with the actual deployments that are running for those applications; those are there to serve our basic website. So now we can curl one of those services, and we can see that thing1 first appeared in Cat in the Hat, created by Dr. Seuss.

The nice thing is that Lostromos is still running up there at the top. So if you watch, we can go down to the bottom and apply the file to create Nemo again, and as soon as we submit this, Lostromos will notice. (It helps if you type the file name right.) Lostromos immediately noticed and created that, so the Helm deployment for it was created as well, and we have all of that available. Let's go ahead and get the service for that one. We can curl the Nemo service, and we can see that Nemo first appeared in Finding Nemo, created by Pixar.

For the next piece, we'll show updating. We're going to edit thing1 and adjust it so the name is thingA, change by to Nicole, and set from to KubeCon. If you watch as we save this file, up at the top you'll see Lostromos immediately got the update and redeployed the Helm chart. Now if we come down to the bottom and curl the thing1 service again, we can see the new response: thingA first appeared in KubeCon, created by Nicole.

This also allows us to handle deletions of custom resources. When we remove a custom resource, it will also remove the Helm release for it. We can see that Lostromos received notice that the resource Nemo was deleted, and we can see in Helm that Nemo is now gone.
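For reference, here is a rough sketch of what that Lostromos config might look like. The key names are approximations reconstructed from the description, not the exact schema; the Lostromos repository documents the real format:

```yaml
# Approximate Lostromos config for the demo (key names illustrative).
crd:
  group: stable.nicolerenee.io   # group of the custom resources to watch
  name: characters               # plural name of the custom resources to watch
chart: ./demo-chart              # Helm chart deployed once per custom resource
namespace: demo                  # namespace the releases are deployed into
tiller: "localhost:44134"        # port-forwarded Tiller for a local run; in-cluster this
                                 # would point at the Tiller service in kube-system
```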
As we were building this, we had some challenges to overcome. The first was how fast our deploys should be. Currently in our environment, when we have to deploy to the 6,000-plus VMs we have, it generally takes over 24 hours to complete a deploy. We knew it couldn't be that slow; it had to be faster. So we shot for a target of being able to deploy to all of these within an hour. We found a few bugs that I'm working through with the Helm community around concurrent deployments: when you give Tiller concurrent deployments, it will sometimes mix the values up between deployments. We're working on figuring out how to resolve that, and I actually had some good discussions about it this week. Once that's resolved, Lostromos can deploy almost as fast as your cluster can actually spin these resources up.

Another thing: we wanted the ability to do phased deploys, so we could do canaries and different stages of a deploy instead of hitting all of these at once. So we've added the ability to create Lostromos instances that filter and only watch certain custom resources, and you can apply an annotation to your custom resource that can exclude it from a given Lostromos instance.

Another piece was monitoring: how do we monitor all of this? Lostromos itself has a Prometheus metrics endpoint that exposes data about what it's doing. It includes the number of added resources, the number of deleted resources, updated resources, and the number that have failed. It also includes the total number of resources being managed by Lostromos, so that if you know you have 300 custom resources and Lostromos says it's only managing 298, you know you have a problem.

And then there was reconciliation. What happens if we miss an event because Lostromos is offline, say because we're deploying a new version, and a custom resource gets created, changed, or recreated in the meantime? And what happens if someone manually changes something? Lostromos has an option to do a resync: you can specify any duration of time, and at that interval it will resync every custom resource and redeploy the Helm chart to ensure it's the correct version, undoing any changes that have been made manually in your cluster. We also, at startup, read in every custom resource that exists and update the state in Tiller to match the state it should be.

Some other use cases besides our simple VM use case: you could create databases for your applications. Create a custom resource called database, and every time you needed a database for your application you could just create a database custom resource, and that would deploy your database for you using the Helm chart. You could also use it to deploy a monitoring agent. Say you're using RDS or Cloud SQL for your databases and you wanted to run your own monitoring agent in the cluster for each of those instances; you could represent those instances in your cluster and, when they change or get created, automatically deploy a monitoring agent.

One of the pieces of Lostromos is the CR watcher, a library within it that makes watching custom resources extremely easy. It provides a nice interface around that, so you could import it into another package if you wanted and use it.
Some examples there: you could do DNS updates, where you just manage your DNS through custom resources, or you could create cloud resources. If you wanted to create the Cloud SQL databases or RDS instances, you could create those using Kubernetes objects and have them created with your cloud provider.

Some ideas we have for where we want to go with Lostromos include supporting watching resources beyond custom resources: for every deployment we have, also deploy this other thing, or for every service in our cluster, anything along those lines. And then support for additional deployment mechanisms. In addition to Helm, we also support plain Go templates, so if you don't want to use Helm and you just want Go templates for your YAML files, you can do that. We also welcome any other ideas or pull requests.

Any questions?

Yeah, so that's a good question. The question was: if we allow our customers to create any files they want, what stops them from creating a PHP info file and dumping the environment variables, which could include database passwords? In our case, we limit the environment variables we pass in to our customers to remove any that we use on our side, which aren't very many. And in that scenario the customer would only be leaking their own database password, but luckily we don't actually pass in passwords for things such as the database via environment variables; we put those in the WordPress configuration file.

Anything else? Awesome. Oh yes, I honestly don't know; I haven't looked at the dashboard to see if custom resources are represented in it, so I'm not sure. Was there another question?

Yeah, so for our clusters, we actually run on top of GKE, so our Kubernetes masters are managed by Google for us, and they scale with the size of our cluster, so we haven't had any problems there.

So, you can pass in the chart. If you looked at that config file for Lostromos, that was actually a path; I was running it locally on my machine. Normally, in our case, we deploy the chart as a ConfigMap and mount that as a volume in our Lostromos deployment (a sketch of that is at the end). If you were using a public chart, unfortunately that doesn't work, so you'd have to pull it down and push it up yourself.

Yeah, so if we look here at our entire stack for a VM (wow, that was all the way back here), all of those things, potentially. We haven't done all the research on moving the database and the storage in, but we know that after we successfully complete moving all of the PHP in, we want to start on the Nginx and caching layers, and once that's complete, we want to start researching what we can do for storage and database.
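Going back to the question about where the chart comes from, a minimal sketch of mounting a chart stored in a ConfigMap into the Lostromos pod might look like this. The names, image, and mount path are hypothetical, not taken from our actual deployment:

```yaml
# Illustrative only: a Lostromos Deployment that gets its Helm chart from a ConfigMap.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lostromos
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lostromos
  template:
    metadata:
      labels:
        app: lostromos
    spec:
      containers:
        - name: lostromos
          image: example/lostromos:latest      # hypothetical image reference
          volumeMounts:
            - name: chart
              mountPath: /etc/lostromos/chart  # hypothetical path given to Lostromos as the chart location
      volumes:
        - name: chart
          configMap:
            name: lostromos-chart              # ConfigMap containing the chart's templates and values
```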