All right, I think I'm good to go and running on time, so I'm going to get started. Thanks, everyone, for joining me today. My name's Stephen Gordon. I'm a principal product manager at Red Hat in the OpenStack group. Today I'm going to be talking about some work that our scale and performance team has been doing around deploying containers at scale on OpenStack. Before getting into that, though, I need to talk a little bit about the component parts that make up the stack we were doing the performance testing on, and give a little context on how and why we put them together.

So obviously we're here today at OpenStack Summit, primarily talking about OpenStack, the open source infrastructure platform. For the purposes of this testing, when we talk about containers on OpenStack, and obviously there's a lot of buzz around this, we're also using Kubernetes, an open source system for automating the deployment, orchestration, and management of containers. That's what we use as part of our OpenShift platform, which is our enterprise container orchestration distribution. When we re-architected OpenShift around Kubernetes, we were making a bet on community-powered innovation: that this was the open source community with the kind of velocity around it to build the solution that would ultimately be successful in the market. In a lot of ways, the way we've approached that application platform is very similar to the reasons we chose OpenStack for infrastructure in the first place.

So first of all, why run containers on OpenStack at all? Fundamentally, the way I look at that is around the exposition and consumption of resources. Traditionally, an operating system really did both jobs on a single box: the kernel exposed CPU, RAM, and disk resources, and user space processes consumed them on top. In a distributed system, things get a little more complicated. First of all, our resources are not necessarily physical in nature; there's been a move to software-defined infrastructure that we have to factor in. And part of that, obviously, is that the system is distributed by nature.

So when we think about a container, it's primarily three things. At rest, it's really just a file. When running, it could be considered effectively a fancy process, or a fancy user space process at least. In its rawest form we're dealing with code, say my MySQL database server; configuration for that database server application, coming in either via a file mount or via sharing through a secrets mechanism; and the actual data, which again is usually mounted from a volume across the network. But there are other, more complex resources we have to manage as well. Thinking about networking, we need our load balancers, we need DNS, and we may want to access not just block volumes but also file storage via something like Manila. These are all resources, and that's really where OpenStack comes in, in terms of exposing those resources.

So perhaps in college or somewhere else, people may have had the job of provisioning a new machine on request: get a physical box, plug it in, give someone SSH access, done. But as we repeat that process over and over again, and with more complex resources, it becomes unscalable. And that's when we look at something like OpenStack to help us provision resources on demand, effectively.
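To make that code, configuration, and data split from a moment ago concrete, here's a minimal Kubernetes pod definition along those lines. This is purely illustrative, not from the deck; the image, secret, and claim names are all made up:

```yaml
# Hypothetical pod showing the three parts of a running container:
# code (the image), configuration (a mounted secret), data (a network volume).
apiVersion: v1
kind: Pod
metadata:
  name: mysql
spec:
  containers:
  - name: mysql
    image: mysql:5.7                    # the code
    volumeMounts:
    - name: db-config
      mountPath: /etc/mysql/conf.d      # configuration, shared via a secret
    - name: db-data
      mountPath: /var/lib/mysql         # data, mounted from across the network
  volumes:
  - name: db-config
    secret:
      secretName: mysql-config          # illustrative secret name
  - name: db-data
    persistentVolumeClaim:
      claimName: mysql-data             # e.g. backed by a Cinder volume on OpenStack
```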
Moving up from there: when we talk about combining containers and OpenStack, why are we at Red Hat using OpenShift, and why do we want to combine the two? This is really the consumption side of things. OpenShift, by virtue of including Kubernetes, consumes resources in a way that's transparent to the application, in that you as the application programmer shouldn't even really have to think about it. It comes down to the integration of the two systems. Historically, we were consuming resources via a process, for example the ps output on the right here. In a modern distributed system, consuming resources really means firing up some number of containers, distributed across my cluster, and getting access to the shared services underneath that are provided by the infrastructure platform. Fundamentally, at the end of the day, what we're trying to get out of this is the mindset of loading applications at the factory, not at the dock: the factory, in effect, being the developer laptop, and the dock effectively being our production environment. So OpenShift and Kubernetes allow us to iterate across not just the developer laptop, but to have the same platform exposed when we're running on our production clouds as well.

In terms of OpenShift and the way we build it, it's what we call community-powered innovation. Looking at the cloud of projects on the left, things like Project Atomic, Kubernetes, Docker, and even middleware projects like WildFly all feed into what we refer to as a midstream project, OpenShift Origin, which is how we actually integrate all of these pieces in the open and ultimately build the product that becomes OpenShift Container Platform. Red Hat is, of course, a leading contributor not just to OpenStack, but to Kubernetes and Docker and many of the other projects that make up this stack and combine to provide that platform.

So looking at that all put together in what becomes OpenShift: across the bottom are our footprints, physical, virtual, private, and public. When we think about OpenStack, we're really talking primarily about that private/public divide there on the two cloud systems. For virtual, we might think about something like Red Hat Enterprise Virtualization or a more traditional virtualization platform. Moving up from that, there's RHEL Atomic Host and what it provides in terms of a trusted container operating system: effectively a small, fast, and secure footprint for running containers on. It's configured with cloud-init, updated using OSTree, and includes host management using Cockpit, and that's effectively the base on top of which the container platform runs. Moving up again, the set of blue boxes in the middle is really where container orchestration is added to this picture via Kubernetes, including things like the networking, storage, and registry functionality that we build around Kubernetes, logging and metrics, security, and so on. On the operating system side, we also rely on software collections to provide alternate runtimes, and containers for those runtimes: things like different versions of Python, Ruby, Node.js, and so on.
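As a quick aside on the Atomic Host piece above, the OSTree-based update model it mentions looks roughly like this from the host's command line. These are the standard Atomic Host commands, shown as a sketch rather than anything specific to this deployment:

```sh
$ sudo atomic host status    # show the booted and available OSTree deployments
$ sudo atomic host upgrade   # stage the new tree as a single atomic update
$ sudo systemctl reboot      # boot into it; `atomic host rollback` reverts
```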
Another important part of this is lifecycle management and CI/CD. OpenShift also includes a source-to-image (S2I) capability: the idea that I, as a developer, can push effectively from my Git tree directly to OpenShift, via either the CLI or the web UI, and OpenShift will take care of rebuilding the containers involved in that application and pushing them wherever they need to go across my dev, test, and production environments. So OpenShift is handling the automation of that all the way through the stack.
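As a sketch of that flow from the command line, assuming a made-up project name, the Python builder image, and an illustrative repository URL:

```sh
$ oc new-project demo
$ oc new-app python~https://github.com/example/myapp.git --name=myapp  # S2I build from source
$ oc logs -f bc/myapp        # follow the source-to-image build
$ oc expose svc/myapp        # route traffic to the resulting service
```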
So how do these actually work together? What I want to talk about briefly is some of the work that has been going on, both in the upstream community and at Red Hat, to combine OpenShift and OpenStack, and again, by virtue of including Kubernetes, that's where a lot of the rubber hits the road in terms of those two pieces working together. If we look at our sandwich, we have our application at the top layer, we have OpenShift, or the container platform at least, in the middle, and then we have our cloud operating system, OpenStack, at the infrastructure layer.

When we combine these, there are a couple of architectural tenets we want to make sure we maintain. First, technical independence of the application: my application developer at the top level shouldn't, in this situation, be writing directly against the cloud APIs. They should be able to write against the Kubernetes constructs, knowing that those get translated to whatever cloud happens to be on the back end, and that my operators are going to take care of making sure that linkage lines up. Second, we want to avoid redundancy, so fewer layers, effectively, between the application, the cloud operating system, and ultimately the bare metal hardware. Some of that is still a bit of a gap in the integration here; things like Project Kuryr are aimed at networking integration, for example, eliminating the need for double encapsulation of the networking at both the Kubernetes layer and the OpenStack layer. And ultimately we want to offer simplified management for this entire stack: giving you the APIs you need to manage it, and also that contextual awareness. Kubernetes itself, or the container orchestration engine, has to have the contextual awareness to know that when I'm on OpenStack and I want a volume, I go to Cinder; when I want a load balancer, where do I go for that; and so on, in terms of cloud API endpoints.

I mentioned Kuryr as one of the things we're interested in adding to this in the future. Some other things, and we were just upstairs actually at the OpenStack special interest group workshop, that people are talking about: file share access via Manila; moving to LBaaS v2, which is obviously becoming more urgent from an OpenStack point of view; and secret sharing, whether there's potential for Barbican integration with Kubernetes, and so on. These are the kinds of things we see, looking forward, as becoming important for people who want to run Kubernetes applications on top of OpenStack with this kind of integrated experience.

So how does this actually work in practice, if I want to use the cloud provider framework with the OpenStack implementation? Kubernetes has this concept of a cloud provider framework, which is how it effectively abstracts itself from the underlying cloud. There are obviously implementations not just for OpenStack, but also for GCE, AWS, VMware, and other platforms. But at the end of the day, it all comes back to this cloud config, in terms of how I tell Kubernetes where the endpoints are for my specific underlying infrastructure. In the OpenStack case, it looks like this: there's an authentication URL, which is obviously our Keystone endpoint; we give it a username, a password, and a tenant ID, in terms of what this cluster is going to authenticate as; and also a region. I don't think it made 1.4, but I believe in Kubernetes 1.5 we're also going to have the ability to use Keystone trusts. And then finally, we also specify a subnet for the load balancers that are going to sit in front of the application traffic. The last step there, editing the /etc/origin files, is specific to OpenShift, but there's a similar step if you're using Kubernetes directly, configuring the masters and the minion nodes with the correct cloud provider value, which in this case is obviously OpenStack.
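To make that concrete, here's a hedged sketch of what that cloud config looks like for the OpenStack provider. The section and key names follow the Kubernetes OpenStack cloud provider's INI format, but every value below is a placeholder, not from the actual test environment:

```ini
# cloud.conf -- illustrative values only
[Global]
auth-url = https://keystone.example.com:5000/v2.0   # Keystone endpoint
username = openshift
password = changeme
tenant-id = 0c4e9a2f3d1b4c8e9f0a1b2c3d4e5f6a         # project to authenticate as
region = RegionOne

[LoadBalancer]
subnet-id = 9d3c2b1a0f9e8d7c6b5a4f3e2d1c0b9a         # subnet the LB VIPs live on
```

And on the OpenShift side, as an assumption of how that wiring looks in the /etc/origin configs (check the docs for your exact version), the node and master configuration then point at both the provider and that file:

```yaml
# Fragment of /etc/origin/node/node-config.yaml -- a sketch, not verbatim product config
kubeletArguments:
  cloud-provider:
    - "openstack"
  cloud-config:
    - "/etc/cloud.conf"
```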
All right, so with that context-setting out of the way, we can move into our scalability testing: what we've actually done over the last couple of months, and what we're going to be doing going forward as well.

Just in terms of Kubernetes community organization, there's a concept in Kubernetes of special interest groups, or SIGs, which I would equate loosely to the concept of a working group in the OpenStack community. I mentioned, for example, that there is an OpenStack SIG in the Kubernetes community; there are also others listed here, scheduling, cluster ops, and so on. One thing about the OpenStack SIG: it's a coordination point for OpenStack-related changes to Kubernetes, but it's certainly not the only SIG working on OpenStack-related things. In this particular case, for example, we're actually talking more about the work of the scaling SIG, or the scalability SIG, sorry.

The scalability SIG sets a number of SLAs for itself in terms of its expectations of what it means to scale as we ramp up a Kubernetes cluster. In particular: API responsiveness, where I expect 99% of calls to return in less than one second. As I add nodes to my cluster, do I keep maintaining that? At what point do I lose that SLA? And pod startup time: 99% of pods starting within five seconds as well. They also define a number of other primary and derived metrics; primary metrics include the maximum number of cores per cluster and max pods per core, and examples of derived metrics are things like max cores per node and max pods per machine. I should add the disclaimer, down the bottom here, that pod startup time is measured with pre-pulled images. What they're saying there is that when we talk about maxing out the scale of a Kubernetes cluster, they're really taking the performance of the registry, and of the network between the registry and the cluster, out of the picture. All of the images are pre-cached to remove that variability, because we're really trying to stress the performance of Kubernetes itself as a scheduler and management piece.

In terms of the goals moving into this exercise: people may be familiar with the testing done upstream, around the Kubernetes 1.2 timeframe, from the scalability SIG's point of view, to prove out a 1,000-node Kubernetes cluster. We obviously want to validate that when we talk about OpenShift Container Platform, we haven't introduced any additional scalability bottlenecks on top of that. So it's partly revalidating what's already been done and having a reference design around it, but it's also about pushing to that limit and beyond, trying to identify what issues we hit, which of them are configuration related, which are actually in the code base, and how we track and address them going forward so that we can push beyond that to 2,000, 5,000, and so on, and then obviously documenting those things via issues and patches in the upstream community.

Obviously, to do any serious scalability testing, you need hardware. The Cloud Native Computing Foundation, via Intel, has a cluster with 1,000 nodes of bare metal hardware available for community projects to use for testing. For those familiar with the OSIC lab for OpenStack, it's basically the same concept, and in fact it's 1,000 boxes in the same data center. The idea is similar: if you're doing work on a CNCF-related project, be it one that's already under CNCF governance or related to those areas around cloud native computing, you can go to the GitHub link here and file a PR or an issue for access to the system. Depending on how good a case you make for the type of work you're doing, you'll get some limited-time access to run scale testing or whatever other type of testing you need. It's not a one-time thing, but it is short bursts, effectively, so it's different from something like OpenStack Infra, which runs jobs continuously on an ongoing basis.

The node specs are listed here; they aren't particularly important in and of themselves. The only thing I would highlight is that out of that 1,000-node cluster, what we actually got was 300 physical nodes, which is what we deployed OpenStack on. On top of that, we ran however many OpenShift VMs we needed to scale up the Kubernetes cluster. The other thing probably worth noting is the VM image disk: we're actually using the Intel NVMe disks for that as well. So the total pool of hardware is 300 nodes, 14,000-plus CPUs, and plenty of RAM and storage to go with that.

In terms of the software we ran on this: from a hypervisor point of view we used Red Hat OpenStack Platform 8, Liberty at the time, as this testing was done in August, I think. There is currently an effort underway, on some different hardware, to spin up a retest with newer versions of the software. OpenStack Platform 8 meant RHEL 7.2 hosts, and we used OpenShift Container Platform 3.3, which was alpha code at the time, using Kubernetes 1.3. That version of OpenShift is now actually generally available and productized.

In terms of the architecture diagram, which I hope I haven't missed anything from, I'm going to move from left to right. We have the OpenStack undercloud box, which is labeled director; that's really the product name for TripleO, so the upstream TripleO project is what we use as the deployment tool. We also use that single box, effectively, as a jump box, so it also has some Ansible tooling on it and Grafana. We use that to deploy our OpenStack cloud, so we have the three controller nodes.
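For flavor, a director-driven deployment of a cloud shaped like this ultimately boils down to an `openstack overcloud deploy` invocation from that undercloud box. This is a sketch using typical TripleO flags, not the exact command or environment files from these runs:

```sh
# Run on the undercloud/director node; counts and file names are illustrative
$ openstack overcloud deploy --templates \
    --control-scale 3 \
    --compute-scale 297 \
    -e ~/templates/network-environment.yaml   # site-specific network settings
```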
We actually divided the compute nodes up into two host aggregates, one for the highly available infra part and one as a high-availability catch-all. The reason for that is we did some concurrent testing using different deployment mechanisms. There are actually two different approaches you can use to deploy OpenShift on OpenStack at the moment: we have a set of OpenShift Ansible playbooks, but we also have some Heat templates that do some of the pre-configuration for you. So we wanted to divide up the infrastructure to let us test both mechanisms side by side, effectively. Into those HA infra host aggregates we deploy a number of master nodes and a number of etcd nodes, which will become important in a moment. In the support availability zone we have the routing and registry capabilities, plus metrics and logging: effectively the supporting capabilities for the actual application cluster. And in the catch-all host aggregate we have the nodes themselves. These are the Kubernetes minions where the workloads are actually going to end up running, plus the infra JMeter nodes on the far right, which is where we actually generate the load from.

Just briefly breaking down the architecture here: the masters, in OpenShift parlance, are effectively the same thing as a Kubernetes master, with some additional management capabilities on top. Then we have our actual nodes, the minions running the pods, the actual application cluster. Around that, as I mentioned, we have the SCM integration, the source-to-image pipeline directly from your application source, and the CI/CD workflow.

Moving on. In terms of actually generating the load, and some of the pre-configuration of the environment, there's a set of tools called the system verification test suite, or SVT. These are on GitHub; the performance and scale team put these tools together. They're not actually supported in the product, but they're how the team exercises the product to perform the performance testing. The main pieces are a cluster loader, which is how the load is actually generated, a networking workload generator, and reliability testing. I'm also going to talk a little bit about the way the images are built; there's tooling for that in that repository as well.

When I talk about the image provisioner: one of the things I mentioned earlier was that they wanted to take the variability introduced by pulling from the registry, and by network throughput, out of the equation and test purely the Kubernetes scheduling and orchestration. To do that, we used some pre-configured RHEL images, using the image provisioner to preload additional RPMs and also the container images, and then we used those to actually provision each of the OpenShift nodes. Some of the things we use Ansible to do on that image: file system setup, RHEL OS setup, pulling in the OpenShift RPMs, pulling in those container images, and so on.

As for the cluster loader architecture: the cluster loader, again, is another part of the SVT, and effectively what it does is take some type of object, here a quota, template, service, user, pod, or replication controller, and there are a couple of others covered on the next slide, along with the number of those we want to create, and it just loops and creates them using the APIs. That runs from the JMeter boxes on the side of the architecture diagram.
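To give a feel for what the cluster loader is doing conceptually, here's a minimal shell sketch of the same idea: create some number of projects, stamp out pods in each, and then check how many went Running. This is not the actual SVT tool, just an illustration of the loop it performs via the API; the names and counts are made up:

```sh
#!/bin/bash
# Conceptual sketch of the cluster-loader loop (not the SVT implementation).
NUM_PROJECTS=${1:-10}
PODS_PER_PROJECT=${2:-4}

for i in $(seq 1 "$NUM_PROJECTS"); do
  oc new-project "svt-proj-$i" >/dev/null
  for j in $(seq 1 "$PODS_PER_PROJECT"); do
    # Stamp out a trivial pod; in the scale runs the images were pre-pulled
    # onto the nodes, so no registry traffic is involved at schedule time.
    oc create -n "svt-proj-$i" -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pause-$j
spec:
  containers:
  - name: pause
    image: gcr.io/google_containers/pause:2.0
EOF
  done
done

# Rough health check: how many pods made it to Running?
oc get pods --all-namespaces --no-headers | grep -c Running
```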
So what did we actually learn from the testing that was done? Ultimately, we did get to our 1,000-node goal, although we did hit some issues along the way, which I'll mention in a moment. Associated with those runs, 13,000 projects and 52,000 pods were created. Those are the numbers we have currently; obviously the goal is to stretch those further in coming Kubernetes and OpenStack releases.

In terms of issues encountered: the first one that came up was that etcd disk utilization at scale was actually quite a bit higher than expected, and the RAM and CPU utilization for that matter. The reason was that for every object we create, any of those different types of objects the cluster loader generates, something gets put into etcd, and as a result the outcome was that we actually increased our guidance for large environments to 20 GB of RAM. I think the max we got to here was about 12.5 GB, but obviously as we push beyond this, we're currently thinking you need a lot more memory backing the etcd nodes, or on the masters rather.

The other issue encountered in the test runs was API server CPU. You can see a little way in here, we start getting spikes across the graph in CPU utilization. We found that this was occurring because the Kubernetes master was effectively panicking and restarting on a fairly predictable basis once it got going, and that started around 13,000 cluster loader objects and 52,000 pods, which I think we saw as the high watermark on the previous slide. That's actually been rectified upstream and pulled into the final version of OpenShift 3.3, so I think that's in Kubernetes 1.3 as well. I have a PR in the notes here, which I can share later as well.

The other issue, which was a little bit of a surprise: I mentioned that to install OpenShift we use OpenShift Ansible to drive the actual installation and configuration of the software on both the master nodes and the minions, plus some of those supporting resources like the routers and the registry. We'd originally done some of this testing using Ansible 1.9, and when we moved up to Ansible 2.2 we found that the Ansible playbook process started really thrashing the CPU at some point in the run. The reason, we eventually found out, was that recursive includes in Ansible 2.2 were somewhat broken, in that they were not cleaning up properly, so as you recursively loaded more and more playbooks, you would get more and more memory and CPU usage, ultimately thrashing the memory and moving into cache. We did some research on that, found it had been reported independently in the Ansible issue tracker, and we were able to get that fixed and updated in Ansible and in the playbooks themselves. So now, with OpenShift 3.3, we were able to get the runtime back to normal, around 22 minutes with expected memory usage, to deploy the cluster from an OpenShift Ansible point of view.

In terms of other bugs filed and encountered, those were probably just the highest-level, most critical ones we found. We did file a number of other issues across Kubernetes, the OpenShift installer, the Docker images, and so on. These are variously broken down by their component categories from an OpenShift point of view, but each relates to an individual upstream issue as well.
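For reference on the installation side, the openshift-ansible playbooks mentioned above are driven by an Ansible inventory describing the cluster layout. A minimal sketch of one for a cluster shaped like ours might look like this; the hostnames, counts, and labels are illustrative rather than the test cluster's actual inventory:

```ini
# Illustrative openshift-ansible inventory (hostnames made up)
[OSEv3:children]
masters
etcd
nodes

[OSEv3:vars]
ansible_ssh_user=cloud-user
deployment_type=openshift-enterprise

[masters]
master-[1:3].example.com

[etcd]
etcd-[1:3].example.com

[nodes]
master-[1:3].example.com
infra-[1:2].example.com openshift_node_labels="{'region': 'infra'}"
node-[1:100].example.com openshift_node_labels="{'region': 'primary'}"
```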
In terms of what's next, the most obvious item is that we need to get to 2,000 nodes as soon as possible. Ideally, we'd like to do that for OpenShift 3.4, and that's what we're attempting to validate in the coming months. The other thing we want to do is identify, through that process, the upper limits and any additional issues we encounter getting to 2,000, so that we can track them from an upstream perspective for Kubernetes 1.5 and 1.6. And then the long-term yardstick we have is to get to 5,000 nodes sometime in the 2017 calendar year. That's what we're currently focused on in terms of moving through the scale points discussed here.

The other thing is that the total number of nodes is not really the only scale point you're interested in. Another one that's increasingly seeing some interest is persistent volumes: what's the total number of persistent volumes we can configure through the Kubernetes-provided interfaces, particularly relevant here to Kubernetes on OpenStack but also for other providers, and what's the rate of allocation as we do that? If I'm using dynamic provisioning, for example, from a Kubernetes standpoint, how many volumes can I provision on a constant basis? What does that rate look like? What's the maximum? At what point do I hit bottlenecks in that process?
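To illustrate the dynamic provisioning path being measured there, this is roughly what it looks like against Cinder with the in-tree `kubernetes.io/cinder` provisioner. The class name is made up, and the annotation shown is the beta form from the Kubernetes 1.4/1.5 era; later versions use a `storageClassName` field instead:

```yaml
# Illustrative StorageClass backed by OpenStack Cinder
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: cinder-standard
provisioner: kubernetes.io/cinder
---
# Each claim like this triggers creation of a new Cinder volume;
# the allocation-rate question is how fast you can stamp these out.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scale-test-claim
  annotations:
    volume.beta.kubernetes.io/storage-class: cinder-standard
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```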
In terms of some final notes here: a lot of this content is from a blog post, "Deploying 1,000 nodes of OpenShift on the CNCF cluster," which has more information, including links to the PRs and so on. I also want to call out the work of the Red Hat Performance and Scale Engineering team; all of these folks are available on Twitter at the handles I have here. It's actually their Trello board I've been drawing on, so I've talked a little here about their future goals for scaling Kubernetes on OpenStack and in general; the Trello board for the scalability team is open, so you can track what they're doing there. It's all publicly available if you want to see what challenges they're setting themselves, effectively, and how their work is progressing. The other thing I wanted to call out is that there's a Kubernetes container expert lounge upstairs, or actually out here to the right somewhere. Some folks from Google and various other people are going to be staffing that throughout the week, so if you have Kubernetes-related questions, that's where you can find them and ask those.

And that's it from me, but we do have a microphone on the side here if people want to ask questions, and we can try to drill into those. Any takers? Alrighty, no questions. Not even one? Ah, here's one. Let's grab the mic.

[Audience] A somewhat basic question about the structure of the contents of OpenShift. So Kubernetes is part of it, Docker and, well, Magnum, and so you have the proprietary, maybe, Red Hat code of the product, and inside of it you have Kubernetes, maybe Docker, or you can select any container runtime. What else is inside the product?

There's nothing in there that's actually proprietary. As I mentioned, with that OpenShift Origin upstream, or midstream, project, all of the code is either available there or in a further upstream. Kubernetes and Docker are very clear examples where there's another upstream, but there are also some utilities, and some of the image pipeline, for example the source-to-image pipeline, that are open source as well; that's on github.com/openshift, I believe. So all of that's available there, and there are a number of different categories across that block diagram I showed, all of which have their own individual upstreams, in effect. It's similar to how you build an operating system distribution: there's not really one place you go where all of the projects in the operating system come from. It's really pulling together the aggregate of that cloud of relevant upstreams and integrating them. The integration piece is really what you start to see in OpenShift Origin, where some of those things are being glued together a little more tightly than they might be in the upstream. Does that make sense? Okay.

[Audience] Did you measure any data in terms of external systems watching the Kubernetes API, and what impact we might expect from that? Because one of the beautiful things about Kubernetes is that you can watch the APIs, so it attracts people to code that way.

My understanding is that the testing of that 1,000-node limit included monitoring for those API metrics I mentioned: the less-than-five-seconds to spin up a new pod and the less-than-one-second response time. Part of the testing is watching for those things and making sure they're still valid. What we're effectively saying is that at some point beyond 1,000, at least for this set of tests, those metrics started failing, and that's where we start to see bottlenecks or report issues, because the goal for the project is really to maintain those SLAs no matter the scale. But obviously, at this point in time, there are limits to that, and we have to identify the blockers, what's effectively causing the delay when we reach certain points. So what I'm saying is, when we say we scaled to 1,000 nodes, we're really talking about with those metrics intact, and part of the scale testing is monitoring for those things.

All right, cool. Thanks, everyone, for your time. Appreciate your attendance.