Okay. I think I'll start on time just to get through all the material. I'm David Strauss. I'm here with Pantheon, and we run a lot of containers, all for PHP, because we work exclusively with Drupal and WordPress. There are a lot of challenges running infrastructure at the density we have, and I want to tell the story of why these previously exotic approaches are necessary, what we're delivering with them today, and how we're delivering it. It all starts with the growth of the web, the exponential growth of websites over time. It's gone from a few major companies and user groups having static HTML sites up on the web to almost every single organization, division, product, event, and conference having a website, and the numbers here show that. But sites have also gotten much more focused in their audience, much more personal in the niche each website is targeting and the specific goals of each group and project, which is part of what fuels this: websites are expected to cater to every part of an organization's functions, not just post press releases anymore. And that has shown up as demands on these websites in the form of the stacks they use and what's necessary to deliver them. So we're not just looking at exponential growth in the count of websites; we're looking at exponential growth in the complexity and resources necessary to support the development, testing, and deployment of those websites. In the 90s you could have a single production website: push up HTML files, and if you made a mistake you'd mess up one page. The stakes were low enough that you didn't necessarily need any kind of staging environment or even a development environment. You could make changes, push them up, fix mistakes, and no one would care. Then demands grew in the early aughts with dynamic websites, as PHP rose to prominence and projects like WordPress and Drupal got founded. People started building personal sites, and corporations increasingly adopted websites with dynamic features. That marked an important change in how people built sites, because starting then you could make a change that would break an entire site. A syntax error in, say, the bootstrap part of the website or in a module would crash everything, so you needed somewhere to actually test your development before pushing it out to a live environment. That started out mostly as desktop development, where you could install a product like WAMP or MAMP and have everything you needed to develop and test your website on your local machine, then push it up to production once things were working well enough on your local box. That got us past making silly mistakes with syntax, calling functions that weren't defined, and the other basic things you go through as you're iterating on development. The devices people used to access these sites were also different then: viewing a site on your desktop in maybe a couple of browsers was sufficient to perform quality testing of the deployments you were making. So you could handle all of that on your local machine.
But we've evolved considerably past that now, to really advanced stacks for delivering websites. Almost every major site being delivered by people at this conference is running behind something like Varnish, running with a CDN, and using some type of full-text index — whether an external service that spiders the site or a backend like Elasticsearch or Solr — so you don't have to do crazy queries on the database. We're using external object caches like Redis and Memcached. And that's just table stakes at this point for professional websites. You can't install WAMP or MAMP on your desktop and get that kind of environment. You're also dealing with massively increased mobile use of the internet, which means you need to be able to view the websites on mobile devices, and while you could probably hack together some Wi-Fi arrangement for viewing a site running off your desktop, by and large the answer for testing a site on mobile devices is deploying some sort of pre-production environment, viewing it on those devices, and making sure things work right. So we've moved past the idea that you can simply do development on your desktop, and that's another multiplying effect: not only are there that many millions of websites, but any of them with any sophistication requires additional deployed resources, often in cloud environments, to support development and maintenance. What we end up with is a massive, massive number of stacks that need supporting. So let's look at some of the ways we've deployed these things in the past. The three traditional architectures are shared hosting, single VMs, and clusters. Single VMs I don't know if I'd quite call traditional, because they rose with Amazon Web Services, but by and large this is the breakdown. You'll notice the thickness of the lines denotes how much isolation there is: physical machines have really thick borders, virtual machines have thinner ones, and local UNIX users — the kind used for traditional, DreamHost-style shared hosting — have permeable borders, because that isolation is really, really lightweight. Let's step through these. With old-style shared hosting, one of the benefits was that it was dirt cheap, and it continues to be dirt cheap: it's really efficient, but it provides almost no isolation. A notorious number of security vulnerabilities have been caused by bad configurations on this type of infrastructure. It's often a single point of failure: if the machine you're on goes down, or even if another website on that machine gets really popular, it takes your site down. And there was usually limited ability to scale up with web traffic; the resources you got on the machine you were on were pretty much it. So, not great, and there are very few major professional projects that I see being done on this kind of infrastructure today. More commonly I see medium websites targeting single VMs. The first version of Pantheon actually did this, and it was nice because it was familiar.
It was the same tools you would use on a single physical machine, deployed at a place like Rackspace. You got good isolation: no other customers were going to step on you from a security or resource perspective. But it was hella expensive to run, because every box that could credibly pack all the elements of the stack into one machine would be at least 100 bucks a month in terms of the memory necessary. It couldn't go downscale, because 100 bucks a month is pretty expensive on a per-site basis, especially if you spin up additional instances for development and testing. And it had no real ability to scale up, because you couldn't spread across a cluster if you were consolidating all the resources onto a single machine; the approaches inherent in this limited that. Then we have the traditional choice for big websites, clusters, where it was even more expensive to deploy one per site. The biggest problem, as I talked about before, is that you really need great development resources to maintain and develop these sites, because you need a representative stack. You can't debug Varnish if you don't have Varnish available. You can't debug how you're indexing into something like Solr if you don't have Solr available. And more importantly with clusters, you can't debug latency issues between your database and application, or latency and performance issues with the file system, unless you have representative latency in your test or development environment. So that left three choices. You could clone the whole stack and have a separate development or testing cluster, which is really expensive — you're basically applying a multiplier, maybe less than a 100% multiplier, but still a big one, to the stack. You could set up something unrepresentative elsewhere, say doing development on single virtual machines, which means you're not actually able to test things like latency or the performance of the architecture outside the production environment. Or you could do co-deployment, where you have the production cluster and you also do QA and development on that same cluster, which is perfectly representative but puts your production environment in danger: if you make mistakes in development, you could suck up a bunch of resources and undermine the stability of the production site. None of these are great choices for cost or accuracy. So we have to think about what we're doing with all these sites. As we have more development and testing environments, as we have more and more websites, are we going to rubber-stamp this out a million times over? I think I can make the case for why we shouldn't, because it's ridiculously inefficient — not just from a cost perspective, but ridiculously environmentally damaging as well. A little more background on me: I have an interesting combination of things I work on, because my roots go all the way back in Drupal. I think I deployed my first Drupal site around 2006, and around 2007 I started doing infrastructure work for Drupal. As time went on I joined the security team, and I've done a bunch of scalability and performance work.
More recently, I've gotten involved in the core tooling of how we deploy systems with my work on systemd, which is the first thing that starts after the kernel in almost every major distribution now. It starts all the services; it runs as process one. I've mostly focused that work on the needs of deploying things at density: resource management, building resource management tools, handling the aggregation of events in the infrastructure, and making sure everything can launch on demand, as I'll talk about. All of this fed into why I started Pantheon with my co-founders, all of whom are here at the conference. We've dedicated our company to solving this problem of how developers build sites and how we can deploy all this infrastructure efficiently, and we do that for over 100,000 websites. My back-of-the-envelope calculation shows that about one out of every 1,000 pages on the Internet loads from Pantheon servers in some way. I don't know if that's exactly right — there's no really comprehensive measure of all web traffic — but it's probably the right order of magnitude. And we don't just run 100,000 websites: for every one of those we provide a development and testing environment, if not another stack. To do that, we could have deployed 400,000 VMs, which would have made us very popular with Rackspace or Amazon, but very unpopular with anyone using the service, because we would have had to pass that cost on. So instead we developed some novel approaches, and we did containers before they were really called containers. If you look at the source code of Pantheon, we call them bindings, because they associate a service with a website or an environment on our infrastructure, but I'll call them containers now that everyone calls them that. And we think this can be a choice for all websites. There are various elements of how you'd deploy it differently for different sites, but we think it captures almost all of the advantages without most of the problems: it provides the most reliable resource isolation you can get outside of separate machines or separate virtual machines, it provides a representative environment that can actually be clustered, with representative latencies, for every single stack, and it's really low cost in terms of the resources you have to deploy for each one, because you're not deploying an operating system, you're not deploying a hypervisor, you're just deploying the runtime. The cons I think we can engineer our way out of: how familiar it is, how complex it can be to configure and orchestrate containers, and dealing with the different types of isolation — whether you use hard isolation with specifically assigned resources, or proportional isolation that works more like a power grid, where you turn on more power plants as demand goes up and then control the distribution. That's another thing we've done a lot of engineering work on that I'll talk about here.
It also gets down to the fact that, as Nick Stielau, our director of engineering, likes to say: mo' servers, mo' problems. There's a cost to just having enough people to deal with the failure rates of physical hardware or virtual machines. For every machine you deploy, a certain percentage will fail over a certain amount of time, and you have to have a certain amount of human staff, whether contracted through a vendor or employed directly, to deal with that. Amazon pretty much has people constantly going through the aisles of their data centers, unracking damaged servers and racking new ones. There's a lot of human overhead beyond just the power for this stuff. And it ultimately goes back to the goals of doing any kind of computing. I don't think delivering Drupal websites is a particularly unique problem, because the goals we're all trying to accomplish are making something work and making it efficient — efficient for the developers, the human element, and efficient for the underlying infrastructure. That's pretty much it. If you accomplish your goal and you do it efficiently, there's really no other factor, and I think that's the core of why we build the infrastructure we build. To give an idea of the magnitude of the problem: in 2012, which is quite a while ago at this point — and it's only gotten worse since — data centers took 2% of all power in the US. That's a lot. When I was looking up the numbers, I was expecting some fraction of a percent, but we run a lot of data centers and they're really power hungry, because every watt a computer uses is basically multiplied by three: there's the watt going into the computer, almost all of which is released as heat, and then roughly two more watts going to cooling that heat out of the data center, because air conditioning is at best 50% efficient. So most of data center engineering revolves around power management and cooling management, because managing that at density is tough. Part of the reason we've created such a problem is that we're really bad at doing this efficiently. There was a neat study that took advantage of scheduling anomalies you can monitor in Intel processors to identify how much utilization was happening on machines in Amazon's EC2 infrastructure, based on how the CPU behaved with the other co-resident virtual machines. They basically spun up a ton of tiny virtual machines and then asked the hypervisor certain things that go back to the physical CPU, so they could determine the overall load of the box running the hypervisor. And even on an infrastructure like EC2, where you can literally provision boxes on demand and tear them down when you're done — massively easier than it ever was with physical hardware — they only found about a 7.3% average utilization of the CPUs on those machines. That isn't super shocking, and it's probably far worse for most physical hardware, where the bureaucratic overhead for provisioning and tearing down is high. And this is pretty awful.
So, 2% of the power in the US, and we're making about 7.3% efficient use of it — which is, I think, better than an incandescent bulb, but not massively. And this is not a new story: time-sharing computers to make efficient use of computational resources has been around for a long time, as long as this guy on the screen has been around. Ultimately it comes back to the same problem: how do you take the CPU and other resources we have and efficiently break them into units so we can use them — not at 7.3%, but as close to 100% as possible without compromising the quality and reliability of the result? This has always been a problem, because something is always scarce about doing computation, whether it's the hardware, the power, or the space to deploy the systems. It's always been something we focus on, and it's always on somebody's books. So it's had a long history of people trying to confront it: starting in the 50s with batch processing, going into the 70s with terminals and even virtual machines on mainframes, then developing client-server architectures so that not everything had to be consolidated into a single cluster or mainframe just because of how densely that computation could be deployed, and then a kind of reversal, where we re-consolidate the hardware so those systems can run more efficiently. All of these things culminate in the same idea: we're pushing the computational burden around in various ways, and slicing it in various ways, so we can do it efficiently for whatever goals we have. Because computation is not an end in itself — it always serves some other end, and it always has to be on the books of something else, to accomplish something else. But why do people use these inefficient architectures, at least in terms of virtual machines and things like EC2, and why do they end up utilizing them at only about 7.3 percent? It's because people like virtual machines. Going back to efficiency: I mentioned two kinds, computational and human. One thing virtual machines have been great at is human efficiency. The amount of extra stuff you need to learn to work on a virtual machine, or to redeploy an application to a virtual machine, is really low, because it looks just like physical hardware — and that's the point. It's very human-efficient, even if we pay the price in utilization of the power grid. They've also been great at reducing the human element of running these systems: you used to need a physical machine for each level of isolation you wanted, and by consolidating onto virtual machines, there are fewer people pulling carts through the data center, racking and unracking systems because they're having issues or need to be provisioned. So that's another area of efficiency — not even the developer, just the person with the cart in the data center. But this is almost a kind of computational skeuomorphic design: we've taken an element that's familiar and reapplied it onto another context, emulating the original.
It's like the earlier generations of iOS design, where the calendar application looked like physical pieces of paper or a notepad. It was designed to evoke the idea that you already know how to write on a notepad, you already know how to write things into a paper scheduler, so an otherwise unfamiliar interface feels familiar. Virtual machines are a much more technical interface being replicated, but it's ultimately for the same purpose: increased similarity. But that constraint of similarity also means we can't improve the design very much. You can't make virtual machines stop looking like physical machines, or you lose the point of virtual machines. And I think we need to make this leap. There are all sorts of great things here — sorry, just tracking time — and what we need to do is actually make the leap, abandon some of the old ways, and look at containers as the efficiency paragons they actually can be. This was just a funny thing in San Francisco, where VMware had an ad up along the lines of: you know what virtualization did for computing — why stop there? It was funny because VMware has published a lot of FUD pieces on containers. They're focused as a company on virtualization, but they know the market for virtualization as a product has reached the point of maturity where there's not much further to go, so they're looking at other opportunities. This was several years ago, and they're now actually introducing a container infrastructure for data centers as part of VMware's cloud product. And containers really are amazing at achieving efficiency, not just for the underlying infrastructure: if developers start working with them the right way, we can achieve efficiency for developers and the data centers jointly. This is a neat comparison of what happened in shipping when containers got introduced. The idea is that instead of working with non-uniform environments — deploying applications to custom server clusters, custom virtual machine clusters, custom isolation layers for managing the resources of applications — you deal with a uniform interface. I'm not suggesting we'll gain exactly the same efficiency by moving to containers on our infrastructure, but dealing with uniform environments is actually important, and having developers use those uniform environments can provide them the same flexibility and familiarity they get from virtual machines today, just in a different stack. Containers actually have their own long history as building blocks for infrastructure, starting in about 1986.
Now, I've had a couple of engineers from IBM dispute that these are quite the same thing, but what I'm defining a container as is a mechanism for managing resources and security partitions on a system where you're not only running one thing at a time. Batch processing has been around forever — you stop doing one thing and start doing another, for different units of time — but can you actually juggle them? Can you give one thing 30% of the system and another thing 70%, persistently, so both applications get to run? The closest I could find was workload partitions in IBM's AIX environment, where you could define slices of the system in terms of resources devoted to one application or another. So this goes back a while. And if you see my original post in Linux Journal, there's an angry comment from someone who works on the FreeBSD kernel about how long FreeBSD has been doing this and why it's only so popular now that Linux has gotten it. That's the venerable author of our prized Varnish reverse-proxy cache, Poul-Henning Kamp — he's a great guy, actually — and BSD has admittedly been doing this as long as or longer than anyone else, at least in the open source world. Then Solaris introduced it a little later, back into the enterprise, with Zones in 2005: the idea of splitting out the resources machines were using without emulating full-on virtual machines. The rest is more recent history. Google has built their entire infrastructure on containerization, and they pushed their work on cgroups, which is designed to split out resources, back into the Linux kernel. In 2010 systemd got introduced, which provides a direct, straightforward way of configuring cgroups for user applications. Then we started to see the word "container" really rising in the Linux world in 2013 with Docker and CoreOS: the idea of a unit of software you could put into a digital box, drop onto a system, and have it get wired into the kernel and the network layer as the box requires, for the purpose of deployment. It's almost analogous to the refrigerated containers you see on ships, which have a standardized way of plugging in power and monitoring, so the ship knows the container is holding the right temperature and running its refrigeration, and it fits in a place that's been defined. And then it achieved even more mainstream use, at least for production environments, with LXC 1.0 and Kubernetes in 2014.
Kubernetes is Google getting into the open source orchestration game. They've long used an analogous system internally to schedule how their containers get distributed, but they only released this relatively recently. And now we're starting to see the App Container spec from the folks at CoreOS, and even public container clouds — the idea that you could hand this unit, which needs to get plugged in a certain way, up to a provider, and they would attach it and run it. These are all signs of increasing maturity in the space. And it provides all of this without having to duplicate the OS, without having to run a hypervisor, without wasting a bunch of resources that way. And containers are useful in small units. It's a remarkable thing: it used to be that you could get 256-megabyte virtual machines on a lot of the public clouds, but providers basically decided those were useless, because by the time you've started the operating system and are running basic services, the 256 megabytes get gobbled up and you have almost no room left for a real application load. That's the overhead of each slice for a virtual machine: a 256-megabyte slice is a useless slice — basically all of your cake goes to overhead before you even get to take a bite. That's fortunately not the case with containers; they're lightweight, which also lets you distribute the computing load much more densely. One of the fears of deploying machines and running them at, say, 90 percent utilization is what happens when they go beyond that. What happens at 99 percent, at 100 percent, when they start swapping — how do you respond to those events? The answer for physical hardware is: with great difficulty. Upgrading physical hardware typically means taking down the box and installing more RAM, and you can only do that to a point. Upgrading a virtual machine usually requires at least rebooting the instance, because you're probably provisioning it onto new hardware; you might be migrating a bunch of data, and at a minimum you're having to pull the whole OS, the application runtime, and the actual persistent application data along with that instance in order to rebalance load. Containers, by contrast, are really lightweight: you might have the application runtime, depending on which container system you use, and you just have the persistent data for the application. So you can run boxes hotter, because you're able to rebalance more efficiently. You're also able to rebalance more efficiently because the provisioning time is massively smaller. With the best traditional physical data center you could ever work with, you could probably get a machine eight to 24 hours after putting in a request — and that's really optimistic; in my experience with a provider like Rackspace it typically takes a week or two. With a cloud virtual machine, like the last few I've deployed on EC2, it takes five to ten minutes — still not very agile for responding to application load. With a container, even if you're really pessimistic about the deployment, it takes five to 15 seconds, because it's a relatively lightweight set of data and it requires basically no boot time and no stand-up of an elaborate userland.
In some cases, where you're just launching the container on the local OS and not migrating any data at all, you're looking at a tiny fraction of a second. You're able to start applications — this is actually booting Debian on Debian here — in less time than it takes Drupal to bootstrap. So, enough about the theory of why this is important; let's get into what we're doing with containers at Pantheon. We run them really hot, and we're proud of that. We have servers with about 30 gigs of memory, and we put, conservatively, 150 containers each on them — we're actually closer to 500 on some of our densest deployments — and that means we're using every bit of those machines. It looks like the databases here are a bit less dense than the PHP runtimes we're able to deploy, but the fact is we're able to deploy them at just the overhead of PHP-FPM and just the overhead of MariaDB, in terms of the memory and CPU those things take. Our average amount of actual memory per container is 205 megabytes, which is smaller than the instances that got discontinued on Rackspace because they'd get gobbled up by the OS before you had any resources left — and that makes us at least twice as efficient as running virtual machines, just looking at that. On average on Pantheon, we deploy containers — all the way from requesting them in the API, to finding the box they should run on, to running the jobs on the box to provision the container, to running Chef to configure the things inside the container, to downloading the data that should be in the container — in about 20 seconds, and that includes quite a lot of stateful data moving around. We're able to achieve this by having that featherweight design, even beyond what the regular public container projects use. One of the things we do that I think is unique is that we run containers only when they're actually needed. We use a technique called socket activation, which waits for a network request to come in before we actually start the container. That takes care of starting Nginx, PHP-FPM, and MariaDB in the background as the request comes in, and we take care of shutting things down in a way that only adds a few seconds to the first request. This means that when we deploy all of those development and testing stacks — which constitute more than two-thirds of the stacks on our infrastructure — if no one's using them, they're not running. There's no CPU going to them, no memory going to them, and that allows us to provide those stacks at really good efficiency. The other thing we do that's atypical is making mutual use of the base resources on the system. If you look at a lot of container projects like Docker, they put the whole runtime, and often a lot of the shared libraries, into the container itself. What we do instead is deploy every flavor of runtime to the base OS and base file system, and then allow the container to access those resources, so they only get mapped into memory once. One really neat thing on Linux systems — and probably pretty common off Linux too — is that if 100 things are using the same shared library or runtime, it only goes into memory once.
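A minimal sketch of that idea in shell — the paths are hypothetical, and this is the general mechanism rather than Pantheon's actual tooling: make the host's runtime visible read-only inside a container's root, so every container maps the same on-disk files and the kernel keeps a single copy of those pages in memory.

    # Expose the host's PHP runtime read-only inside a container root.
    mkdir -p /srv/bindings/site-1234/usr/lib/php5
    mount --bind /usr/lib/php5 /srv/bindings/site-1234/usr/lib/php5
    mount -o remount,ro,bind /srv/bindings/site-1234/usr/lib/php5

    # systemd-nspawn can do the same thing with a single flag:
    #   systemd-nspawn -D /srv/bindings/site-1234 --bind-ro=/usr/lib/php5 ...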
Not loading the PHP runtime and all the necessary libraries separately for every single site running a container on that infrastructure saves us about 10 gigabytes of RAM per server. So we get the 90% savings there, we get the mutual use of base system resources so we're not even deploying the runtimes multiple times, and we're not deploying the base OS inside those containers at all — all multiplying out to make our infrastructure super efficient. We also keep agile with this design by using a Lego-like architecture, where all these blocks are designed to fit together and to be interchangeable. When you deploy a large site to the platform, we don't deploy huge containers on huge machines; we'd rather deploy 30 containers that are all relatively small — comparable in size to the ones you'd get even for a free site — and load balance among them, and that works great for things like PHP. There's a neat analogy of pets versus livestock: what happens when one gets sick? We treat our containers like livestock, in the sense that there's this huge fleet of resources, some of them may get sick in various ways, and we deal with that as a systemic problem rather than fretting over healing one animal and taking it to the doctor over and over, which is the traditional way people work with servers. We also achieve this through scheduling. We run our servers at about 90 percent utilization, constantly rebalancing them with something we call the migration dragon. The migration dragon runs on some of the hottest boxes, based on our scoring, queues their containers for migration, and those containers automatically get rescheduled onto the least-loaded boxes. So all we have to do to rebalance load, or take the heat off a heavy server, is deploy an empty server, and load naturally migrates onto it. We also avoid resource saturation by running several representative workloads on the machines that inform that scoring, so if memory, CPU, disk I/O, or network I/O is getting saturated, we score those boxes down. In open source projects this is often handled by something like Kubernetes, where it informs the scheduler's decisions for distributing the containers. All of this manifests as an average container age of about 50 days on our infrastructure, which lets us do really neat things. When Rackspace introduced SSDs, we just deployed a bunch of SSD machines, waited for containers to migrate onto them, then marked the other servers as deprecated — which migrated the rest of the containers off of them, beyond the normal rebalancing — and then retired them. That allowed us to migrate our entire customer base to an all-SSD fleet in about two months without contacting anyone or introducing any downtime for the migration. And now — this will please Zack, our venerable CEO, who's in the audience; he wanted me to do a real-time demo, and honestly I give him a hard time about it, but it is a really neat demonstration of how some of our stuff works. We created these test sites that have something called the platform demo in them. Ooh, lower resolution — let me just do that. So we have the platform demo here, and what this is is an instance:
a web regular website on pantheon and it's running this thing that's just index dot html with some javascript um and accessing a php resource on the back end that does introspection theoretically you could just deploy this code uh to any project on pantheon and what this does is it queries our core api to ask what resources are deployed to this environment uh for the stack and uh the neat thing is as i scale it uh we can see it scale real time uh so whoops if i go back over here and i go into the settings for the site and i bump it up to a like a business plan level which means that we start deploying active active application servers instead of active passive which is the case for the professional level and i hop over here um what's going to happen is uh we get a second container that's actually displaying at real time it was literally configured in that time and even these requests that are going through of refreshing this thing are getting routed through that same container infrastructure that is in flight right now on the base baseline um and then i uh just for fun um i brought up a thing here where uh i can actually set the application servers here let's hope that went through there we go so i can set it to a goal number and then as things what happens is the core api realizes that there's a disparity between the goal of configuration for the system and what is actually available for the system and then it fills that in so if we were it gets a little unstable with the graphing thing when you start having a lot and so what this does is uh i can basically say that's the goal and it doesn't just work for stateless resources like application servers i can actually go in um to hear and create database replicas as well and with whatever replication topology that i want i'm not sure if that's going to come up too quick because it actually has to do this the transactional snapshot loaded into the replica server and then bring it online with the replica but we'll see if it comes up in a moment while i'm talking but the idea here is that instead of treating infrastructure as something where we treat individual instances with lots of care and dedication we treat it as a declarative thing where we just want this much and when there's not that much we provision that much and that also means that we can do things like retire one of these servers it'll realize that the infrastructure of this site is out of alignment with the goals for it like the ways that we handle databases and migration is we'll actually keep a master server and a replica and the way that we retire this instance is we basically just say this box is going offline which promotes this to the master and then it realizes wait i'm supposed to have at least one replica and then it says okay well i'll create one off the current master and it'll create it there and what this means is that we can set that goal to anything we want like we could set it to have three replica instances and whenever one of them goes away or one of them gets promoted it recreates the hierarchy with enough active provisioned instances to meet the the goal there and then to hop back to kind of like a little bit more gritty shell stuff i wanted to show off some of the kind of socket activation stuff that we do so here so here i have the the development environment for this site that i was working on and let me just do this more real time so the state of oops no it's been reconfigured but the that's okay so here's the engine x instance for the actual environment 
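The underlying model is just a convergence loop: compare desired state to actual state and provision until they match. A toy sketch of that idea in shell — the pctl commands here are invented for illustration and are not Pantheon's API:

    # Desired state: three application-server containers for the live environment.
    GOAL=3
    while true; do
        ACTUAL=$(pctl list --env=live --type=appserver | wc -l)
        if [ "$ACTUAL" -lt "$GOAL" ]; then
            pctl provision --env=live --type=appserver   # converge toward the goal
        fi
        sleep 10
    done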
Now, to hop back to slightly grittier shell stuff, I wanted to show off some of the socket activation we do. Here I have the development environment for the site I was working on — let me do this more in real time. The state of — oops, no, it's been reconfigured, but that's okay. Here's the Nginx instance for the environment I'm talking about, and that Nginx instance is listed as dependent on the PHP-FPM instance, which is another service that runs within that container. This is all listening on a socket for socket activation: it's listening on port 26607 on this machine, on IPv4 and IPv6, and as soon as a request comes in, systemd asks, what is the service associated with this socket? Oh, it's Nginx — I'll start that and pass in the socket, which is a neat thing you can do on UNIX systems with file descriptors. It hands the socket off to Nginx, and Nginx takes it over. Actually, before Nginx comes online, systemd realizes that PHP-FPM is a dependency of Nginx, so it starts PHP-FPM, waits until PHP-FPM is ready, then starts Nginx; Nginx gets the socket and can service the request and stay online. We monitor this with a process called the reaper, which looks for currently running containers that aren't actively servicing requests — they've been idle for hours at a time — and spins them down. And that's a neat, almost transactionally atomic thing, because if we queue a spin-down and a request comes in, systemd realizes a request has arrived for a service that's in the process of shutting down and immediately schedules a follow-up start of the service, so there's no point at which this approach should interrupt the availability of instances. Now, if I go over here and hop to my development site, you'll notice it takes a few moments, because it's actually spinning up those resources on the back end. And if I go here — whoops, that's the unit file — you can see it's running now: it started the master process and the worker process for Nginx, and it also started my PHP-FPM pool, which is actually quite small because we use dynamic ratios for that. Now it's come up, and in that case it was taking long enough that I can tell, as someone who's worked on this, that it was also pulling up the MariaDB instance in the background — we have less aggressive shutdown goals for those, because they take a little longer to come online. But the really cool thing here is that we're able to deploy all these stacks, make them liberally available and democratized so everyone has access, without actually consuming very many resources.
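As a rough sketch of the general systemd pattern being described — the unit names, paths, and port are illustrative rather than Pantheon's actual unit files, and stock Nginx doesn't speak systemd's socket-passing protocol out of the box, so treat this as the shape of the configuration, not a working drop-in:

    # site-1234.socket — systemd owns the listening socket, so nothing runs
    # in the container until the first connection arrives.
    [Socket]
    ListenStream=26607

    [Install]
    WantedBy=sockets.target

    # site-1234.service — started on the first connection; by default a
    # matching foo.socket activates foo.service. PHP-FPM is pulled up first.
    [Unit]
    Requires=site-1234-php-fpm.service
    After=site-1234-php-fpm.service

    [Service]
    # The web server has to be able to take over the passed-in file descriptor.
    ExecStart=/usr/sbin/nginx -c /srv/bindings/site-1234/nginx.conf -g 'daemon off;'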
Hopping back to the slides, let's talk about some of the bones that make this stuff up. When people talk about containers: there's no concept of a container in the Linux kernel at all — that's why I have the kernels in the background here — because unlike the Solaris and BSD kernels, containers on Linux are a synthesis of many control systems present in the kernel. When we say there's a container, it basically means we've flipped on all the isolation switches, and the two key ones are cgroups and namespaces, which are actual things in the Linux kernel. A cgroup is ultimately a hierarchy of processes where you can divide out resources either rigidly, with specific limits, or proportionally, saying one thing gets 75 percent of the pie and another gets 25 percent. And this happens hierarchically: you could give 75 percent to, say, Drush and production PHP-FPM, and if you're also running rsync as a development process to pull files around or freshen a development copy, you could give that considerably fewer resources. This could be block I/O, CPU, memory, or other resources on the system. The important thing is that it puts no limits on activity unless there's contention for those resources: as long as rsync's disk load isn't actually affecting Drush and PHP-FPM, it can have all it wants to eat, and only when it starts undermining the performance of the production system does the kernel get involved and start saying, you get this much and you get this much. You can take it to extremes — we don't quite go this far; ours is more like this, for enterprise versus development on Pantheon — but there are realistic implications of using something like this. If you're running a backup, you might be perfectly comfortable giving it two percent, because the majority of the time it won't actually be confined to that — there's no resource contention — but when there is, you can ensure it isn't undermining your production deployments. There are a ton of controllers for cgroups, all of which affect the hierarchy of processes under them; this is just a few of them, and the ones we use at Pantheon are mostly CPU, block I/O, and memory. Let me skip a few boring things here. This is a neat demonstration: you can set hard limits, giving something — even your current shell — a hundred bytes of memory. What this does is create a cgroup named "teensy" just for yourself and set a 100-byte limit on it. Then you make it take effect: echo $$ gives your current process ID, and if you assign your current process ID to the tasks (the PID list) of that cgroup and then try to run something even as lightweight as ls, it'll just get killed. So these really do work, and they're pretty lightweight to implement. Here's another case: I created a Python process designed to just eat CPU — you could argue every Python process eats CPU, but this one was specifically designed to, and Python, with the global interpreter lock, is an excellent language for that. I ran one copy in cgroup A and one in cgroup B, and I gave one of them 100 CPU shares and the other 10. What the system does is add those up, so we have a pie of 110 CPU shares: one process gets one eleventh of it and the other gets ten elevenths. And you can see it manifest — not quite precisely that, but pretty close; you're seeing an order of magnitude difference in the resources available — because when I run them concurrently on the box, one gets 6 percent of the CPU and the other gets 60 percent, and they're the exact same program.
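Cleaned up, those two slide demos look roughly like this (cgroup v1 paths, run as root; PID_A and PID_B stand in for the two Python processes):

    # Demo 1: a hard memory limit. Make a cgroup named "teensy", give it 100
    # bytes, put the current shell ($$) into it, and even `ls` gets killed.
    mkdir /sys/fs/cgroup/memory/teensy
    echo 100 > /sys/fs/cgroup/memory/teensy/memory.limit_in_bytes
    echo $$  > /sys/fs/cgroup/memory/teensy/tasks
    ls   # killed

    # Demo 2: proportional CPU shares. Two copies of the same CPU-burning
    # program, one with 100 shares and one with 10, settle near a 10:1 split
    # only when they actually contend for the CPU.
    mkdir /sys/fs/cgroup/cpu/a /sys/fs/cgroup/cpu/b
    echo 100 > /sys/fs/cgroup/cpu/a/cpu.shares
    echo 10  > /sys/fs/cgroup/cpu/b/cpu.shares
    echo $PID_A > /sys/fs/cgroup/cpu/a/tasks
    echo $PID_B > /sys/fs/cgroup/cpu/b/tasks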
There are also namespaces, the other half of containers. These are not designed strictly for security isolation — you really wouldn't want to rely on them for that — but they can provide the illusion of an isolated system, which is very useful for developer productivity. I say that because one of the key things developers love about VMs, and continue to love, is that it looks like you've got your own box, and everyone knows how to work with a single box for installing services and configuring things. Namespaces are a really effective way of providing that illusion for containers. It's not quite to the level of what virtual machines provide with their own kernel, but it's getting really close, because now you can have your own PID numbers, your own user IDs, your own mounts, your own network devices — all things that used to require running a virtual machine. The interface for namespaces is called unshare, whether you're using the shell tool or calling the kernel syscalls. The way everything starts on Linux is that you begin in the namespaces of your parent — on the base system that's basically no namespace at all — and Linux makes you turn on each of these isolation switches individually. When you use a tool like Docker, all it's really doing is going through those switches and turning them all on for you, but you can flip them individually. For instance, this would create a new network namespace for a shell you start, and inside it, if you run something like ifconfig, you'll just see a loopback adapter and nothing else. You can then pull other devices into the new namespace, but you have to have a handle to the old one.
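For example, with the util-linux unshare tool:

    # Start a shell in its own network namespace.
    sudo unshare --net bash
    # Inside it, only a (downed) loopback device exists; there's no route to
    # anything until an interface is moved or created here from the parent
    # namespace (for example, one end of a veth pair).
    ip addr
    ip link set lo up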
I also wanted to provide a little survey of what's out there if you want to work with containers, because we've rolled a lot of our own custom stuff, but there are increasingly good options in the community now. The old standby, which is still not a bad tool, is LXC. It's fairly lightweight in the sense that it doesn't have any image format, but it lets you turn all those switches on really fast, and it has a lot of language bindings if you want to invoke it from something like Python, Lua, Ruby, or Go. Before a lot of these other container frameworks existed, this was the standard tool. Google launched a project called Let Me Contain That For You (lmctfy), which I think was mostly launched because they wanted to say: me too — we've been doing this for five years while you young'uns were just starting to think about containers. It has all the signs of that kind of project: it was thrown over the wall onto GitHub, and I don't think it's had a single commit since 2014. But it definitely demonstrated to the world that Google had been doing these things for years. I wouldn't use it for anything, but it's interesting to browse if you're curious how Google does it. Then there's systemd-nspawn, part of the systemd project, whose virtue is being on everyone's distribution whether you like it or not. It's the closest to what Pantheon uses, because we actually use systemd services, and with services in systemd you still flip the isolation switches on individually; systemd's own nspawn tool turns them all on for you, like going up to the wall and flipping every switch at once. It started off as a toy that Lennart wrote just to test systemd, but it's now part of production tools like CoreOS's Rocket (rkt) runtime, so we're actually going to remove the note about it not being usable for production purposes. And it's super easy to use: you can do something like debootstrap Debian unstable into a directory and then launch Debian on Fedora just by running systemd-nspawn and pointing it at that directory. It's inherently designed not to have any kind of daemon back end; it's just a foreground tool. Rocket has been built on systemd-nspawn, and it's my favorite new container tool that we have yet to deploy at Pantheon. I'm really excited about it because it uses the App Container spec, an independently maintained specification for what should go in a container, and it uses my favorite systemd tools, so I'm going to advertise that project a bit. Google is actually behind the image format as well, along with CoreOS and several other companies, so I think it's going to take off, and we'll probably pull a few more elements of it into systemd. And then there's the venerable Docker, which everyone's heard of. It's definitely more focused on developer integration and developer tools than on production deployments, but no discussion of containers would be complete without mentioning it. It has its own de facto image format and a very wide variety of available images, so you can assemble Lego stacks of every kind of software you can imagine, including everything for Drupal and beyond. These tools form a spectrum of complexity and completeness, all the way from lmctfy, which barely does anything, up to Docker, with its own defined infrastructure, de facto definition of how containers should work, and the most complete set of features. LXC has more comprehensive control over resources than lmctfy; nspawn adds the ability to start working with image formats but doesn't have a daemon; Rocket adds the idea of a daemon available for configuring it, in addition to everything to the right of it; and Docker adds orchestration features, where you're not just dealing with images but with an infrastructure for running them. All of these tools are legitimate to use — except possibly lmctfy — but it's important to look at containers as a spectrum; they shouldn't be synonymous with any single one of these technologies. And orchestration is the new hotness, which our friends at CoreOS are pushing a lot of neat things on. By orchestration I mean: there are these great tools for managing containers on individual machines, but once you're dealing with a data center, or at least multiple machines, how do you control where containers get deployed, how the images get distributed, and the security relationships between those machines? I'd say they're at the forefront of some of the most innovative designs for dealing with that problem. And with that, I'll open the floor for questions.

[Question asked off-microphone] Yes, at that point during the deploy process, and we want to move away from that. It's not
super slow; it's more about repeatability. It would be really nice to be able to bake the images, QA them, and then push them out, rather than configuring in place. That's why I'm pretty excited about the App Container spec: it provides a uniform way to bake a container that isn't necessarily tied to any one deployment environment. We don't actually run Chef server — we run plugins to Chef. (Actually, if you could go to the microphone, that would be helpful.) Okay, yeah: we don't run Chef server; we run plugins to Chef that use our public infrastructure to talk to our core API, and that tells them what should be going in each place. Every container has a UUID, it knows its UUID, and it talks to our API and says, what do I deploy in here for this UUID? It knows the flavor of container, and that, combined with the metadata and access to some of the resources, is what allows it to configure things in place.

Thank you for your talk, that was really enlightening. I noticed in your diagram that you're load balancing among a large number of containers per site, and you host a lot of sites. I was wondering how you deal with load on the load balancers themselves.

We have several clusters of load balancers. Our in-house load balancer we call Styx, because it takes requests to their final destination. It's written in Go, and what it does is query the container back ends for a given hostname dynamically — a kind of lazy lookup. As soon as a request comes into our edge, it checks its local cache of routes to see if it knows which containers should be serving that environment or hostname. If it doesn't have any data on that, it talks to our API and asks which containers are servicing this environment, and then, in the completely naive case, it just randomly chooses a container to send the load to. But it also learns passively about the response times of those containers and their availability, and it will add demerit points to containers that are performing poorly or not responding. So it doesn't turn out to be perfectly even load distribution: we have small queues in Nginx that refuse requests if too many are queued on a container, and that effectively causes the balancer to make maximum use of as many containers as possible.

Thank you, that's really interesting. — Hi, great talk, by the way. I work at a media company and we're really looking into possibly using container technologies, and I saw in your talk that you would deploy 150 or so containers. I'm looking at this from a physical perspective — let's say one EC2 instance, an m3 or something. Are you actually deploying multiple containers on that one EC2 instance?

Well, we're using Rackspace Cloud, about 30-gig instances, and we deploy about 5,000 containers per box and then run somewhere between 150 and 500 of them at once.

So let me ask this: say we were a company with one website, looking into being able to scale out on demand. Would we have one large EC2 instance with only a single container launched, and then fire up more as demand grows?

Oh, I wouldn't recommend that — I think it's important to get high availability for your systems. Our
Hi, great talk, by the way. Thanks. So I work at a media company, and we're looking into possibly using container technologies. When I saw in your talk that you would deploy 150 or so containers, I was looking at it from a physical perspective. Let's just say one EC2 instance, an m3 or something: are you actually deploying multiple containers in that one EC2 instance?

Well, we're using Rackspace Cloud, about 30 GB instances, and we deploy about 5,000 containers per box and run somewhere between 150 and 500 of them at once.

So let me ask this: say we were a company that had one website and we're looking at being able to scale out on demand. Would we have one large EC2 instance with only a single container launched, and then fire up more as demand grows?

Oh, I wouldn't recommend that. I think it's important to get high availability for systems. Our scheduler looks at two things, aside from some hard-and-fast constraints. The first priority is HA for sites, so it will almost never schedule a container onto the same host as an existing container for the site unless it has absolutely no other option, which is typically only the case in our development environments. What happens there is that this typically blacklists a number of servers, more than a handful if you have, say, 30 containers, but it certainly spreads things out. The second priority is distributing the container to the machine with the lowest load. So what I would recommend is: don't deploy one machine, deploy at least two, and then you can use a tool like Kubernetes to maintain additional machines and incorporate them into your infrastructure. Kubernetes, by the way, is built into the CoreOS stack too. What will happen is that it has similar constraints around trying to distribute things, and what you should do is say, "I need two application servers for this one site," and that will ensure it has a footprint on both machines. Thanks. Does that make sense? Yes. Okay.

Hi, very informative. How finely do you slice the services on a machine into different containers, and what's the general rule?

How finely we slice them?

Yeah, like nginx versus nginx and PHP, or are those all running in...

So everything is containerized in its own unique container. We do pile some things together: the most co-located thing is our application server, which has nginx, PHP-FPM, and the client for our distributed file system all piled into the same security context in a container. Most of our other containers are single-process: our Redis containers only contain Redis, our MariaDB containers only contain MariaDB.

Hey David, thanks for your talk. I just have a quick question. It was impressive to see how quickly the containers come up, and I was curious how the Drupal application layer, the code itself, makes it inside the container. Is the code running inside that container when you bring one up to add more capacity, or is that container just connecting to a distributed file system?

The actual application code is not running on the distributed file system; we pull that through git. If you're deploying the test or live environment, it's associated with a git tag, and if you're looking at any other environment, it's associated with the tip of a git branch. Every container gets a certificate deployed to it from our public infrastructure, and that gives it authorization to access the resources for that environment. So when I deploy a new application server for the live environment, it knows it's an application server for the live environment, and it has a certificate it can use to talk to the git server and pull down the code that is tagged for the live environment, for the latest release. It then reports back up to our API with the actual hash that it's on, and we make sure those are consistent across the containers deployed to an environment.

Okay, so the initial start of a container is going to do a git clone in order to get that code, correct?

Correct. And we do take advantage of git shallow clone to avoid pulling the history in.

Oh, okay. All right, thanks.
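As a rough sketch of what that start-up step might look like, here is a minimal Go example: it shallow-clones only the tip of the ref for the environment and returns the commit hash so it can be reported back. The function name and error handling are invented for illustration, and the per-container certificate is assumed to be handled by the git transport (SSH or TLS), so it isn't shown.

    // Sketch only: shallow-clone the code for this environment on container start
    // and return the deployed commit hash for reporting back to the platform API.
    package bootstrap

    import (
        "bytes"
        "fmt"
        "os/exec"
        "strings"
    )

    // deployCode clones only the latest commit of ref (a tag for test/live
    // environments, a branch tip otherwise) into dir and returns the hash.
    func deployCode(repoURL, ref, dir string) (string, error) {
        // --depth 1 is the shallow clone mentioned in the talk: no history is pulled.
        clone := exec.Command("git", "clone", "--depth", "1", "--branch", ref, repoURL, dir)
        if out, err := clone.CombinedOutput(); err != nil {
            return "", fmt.Errorf("clone failed: %v: %s", err, bytes.TrimSpace(out))
        }
        rev := exec.Command("git", "-C", dir, "rev-parse", "HEAD")
        out, err := rev.Output()
        if err != nil {
            return "", fmt.Errorf("rev-parse failed: %v", err)
        }
        return strings.TrimSpace(string(out)), nil
    }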
Thanks for the talk. My question is related: what backend do you use to store the MariaDB databases that your MariaDB containers are using?

Oh, they're using local storage. All of the Rackspace VMs that we deploy to have local RAID 10, on SSD at a minimum, and at a maximum they'll have Fusion-io.

Would you mind sharing a little bit about how you migrate the databases? That's really neat.

What we do for database migrations is basically treat it as a case where we temporarily create a replica and then promote the replica. There's actually no difference in our infrastructure between migrating a database and failing over; we just treat them as different windows of time. If you're migrating a database from machine A to machine B, it's just a very rapid progression from a transactional snapshot, to replication being set up, to replication catching up with the master server, to failover. There's just no delay between the replica catching up and us doing failover, whereas if we maintain a replica persistently, we simply hold it at that point before doing failover. That allows us to do the migration with minimal downtime, and we retarget all of the application servers to the new master as well as populating updated information about the replicas.

Hi David. Same question as last time: do you guys have plans to release your distributed file system?

Not at the moment. To give a little background, we wrote our own distributed file system for Pantheon because, if you look at a lot of competing architectures, you have this kind of split over scale: you either deal with single machines and a local file system, or you're dealing with a setup like GlusterFS, NFS, or some other distributed file system back end. When you're dealing with media content for Drupal, compiled CSS, and JavaScript, you need an at least very-close-to-consistent file system for it, and we didn't want to switch technologies in the middle of the stack in order to scale up. One of our mantras is smooth scaling, where the performance of the application doesn't change from the free environment up to the large enterprise ones. So what we did is build our own distributed file system that is content addressed, meaning whenever you put something into it, it hashes the content and then references that content using the hash of it; it uses SHA-512. Every single application server has a client to our distributed file system that accesses the metadata clusters and accesses the longer-tail storage on S3, and that allows us to deploy the exact same file system for every project on the platform and massively scale it out. There are only six production servers for the file system supporting all of Pantheon right now, in two separate clusters.

Thanks.
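Going back to the database question: the "migration is just a fast failover" progression can be summarized as an ordered sequence of steps. This is only a sketch of that idea, with every name invented for illustration.

    // Sketch only: the same steps run for a migration and for a managed failover;
    // a standing replica simply pauses before the final promotion.
    package dbmove

    type step int

    const (
        transactionalSnapshot step = iota // take a consistent snapshot of the master
        replicationSetup                  // restore it on the target and start replicating
        replicationCaughtUp               // wait for the replica to reach the master's position
        failover                          // promote the replica and retarget the app servers
    )

    // migrate runs every step back to back, which is why downtime stays minimal.
    func migrate(run func(step) error) error {
        for s := transactionalSnapshot; s <= failover; s++ {
            if err := run(s); err != nil {
                return err
            }
        }
        return nil
    }

    // standbyReplica does the same work but holds just before promotion; that
    // hold is the only difference between a migration and a persistent replica.
    func standbyReplica(run func(step) error, promote <-chan struct{}) error {
        for s := transactionalSnapshot; s < failover; s++ {
            if err := run(s); err != nil {
                return err
            }
        }
        <-promote // held here until a failover is actually needed
        return run(failover)
    }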
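And to illustrate content addressing as described for the file system: blobs are keyed by the SHA-512 of their contents, so identical content is stored once and a reference never points at stale data. This toy Go sketch uses in-memory maps where the talk describes metadata clusters plus long-tail storage on S3; all names are invented.

    // Sketch only: a toy content-addressed store keyed by SHA-512.
    package cas

    import (
        "crypto/sha512"
        "encoding/hex"
    )

    // key returns the content address of a blob: the hex-encoded SHA-512 digest.
    func key(blob []byte) string {
        sum := sha512.Sum512(blob)
        return hex.EncodeToString(sum[:])
    }

    type store struct {
        blobs map[string][]byte // content address -> bytes (stands in for long-tail storage)
        names map[string]string // file path -> content address (stands in for metadata)
    }

    func newStore() *store {
        return &store{blobs: map[string][]byte{}, names: map[string]string{}}
    }

    // put stores a blob under its content address; writing identical content
    // twice is a no-op, and a path can be repointed without touching old blobs.
    func (s *store) put(path string, blob []byte) string {
        k := key(blob)
        s.blobs[k] = blob
        s.names[path] = k
        return k
    }

    func (s *store) get(path string) ([]byte, bool) {
        b, ok := s.blobs[s.names[path]]
        return b, ok
    }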