All righty, welcome back everyone. Thanks to Eric for getting us pretty much back on schedule this morning, and for helping me continue our streak of both being based in Toronto and never seeing each other there. My name's Stephen Gordon. I'm a product manager at Red Hat in the OpenStack group, focused on user workloads: compute, but also how we integrate Kubernetes and OpenShift with OpenStack, which runs into some of the topics we've already heard about this morning. Today I'm speaking on behalf of the Red Hat performance and scale team, going through some of the results they've had when testing Kubernetes on OpenStack clusters.

For those who were at Barcelona, you may have caught the outcome of part one of this testing, or seen it on the CNCF blog. As a quick recap, in that testing we ran up a 1,000 node Kubernetes cluster on Kubernetes 1.3; to translate to the OpenShift versioning, the OpenShift Container Platform 3.3 we used at that point in time is based around Kubernetes 1.3. Similarly, the part two I'll be talking about today uses OpenShift Container Platform 3.5, which translates to Kubernetes 1.5, and many of the improvements coming out of this work can already be seen in Kubernetes 1.6.

At the time, the identified goals of this exercise were, first of all, to validate that we can get the same results we got when working with the upstream community in the scalability special interest group of the Kubernetes community; we want to push the system to its limits while ensuring we reproduce that work. We also want to identify best practices, as well as any configuration changes we can bake into how we deploy Kubernetes or OpenShift. And finally, we want to document and file any issues we find in the relevant upstream communities. A very important thing to highlight here is that we're not just talking about the Kubernetes and OpenStack communities. In this work we found various bugs and issues across things like the kernel, Open vSwitch, Ceph, Ansible, and so on. There's a whole gamut of open source projects involved in making a Kubernetes and OpenStack cluster work, and especially work together, so we're talking about improving all of those, and then ultimately, of course, trying to fix these issues in those upstreams.

For part two, the first thing we wanted to focus on was raising the number of Kubernetes nodes: we're talking about running up to 2,048 Kubernetes nodes on top of OpenStack. We also did some side-by-side testing, as a control group, with a smaller cluster on bare metal. That gives us some interesting insight into where the differences are or aren't, particularly in the networking layers, depending on whether we're working on bare metal or virtual machines. We also wanted to do some more specific testing this time around: saturation testing of the HAProxy-based network ingress tier, and testing of the overlay2 graph driver with SELinux. Being able to use that driver with SELinux at all is relatively new, and I'll talk a little bit about how we're combining pieces of software to actually get that at the moment.
And then finally, persistent volume scalability and performance in the context of container-native storage, which is a GlusterFS-based solution. Implicit in all of this is that we're also saturation testing the various auxiliary systems in OpenShift and Kubernetes, things like the registry and the CI/CD pipelines we use to deploy applications.

In terms of the upstream landscape, for folks who aren't familiar, the Kubernetes community is organized somewhat loosely around special interest groups. To phrase that in OpenStack terms, they sit somewhere in between the OpenStack projects and the working groups; they're a mix of both. They often take ownership of various areas of the code, and they also take ownership of some of the process around those areas.

In the context of scalability, the scalability SIG has some SLAs defining what it means for the system to be responsive, because of course it doesn't make a lot of sense for me to come out and say how many thousand containers I can run if those containers aren't actually responsive, or if the rest of the interface is going so slowly that I don't have a responsive or useful system. For API responsiveness, 99% of calls to the Kubernetes API must return in less than one second. For pod startup time, 99% of pods must be up within five seconds. I should note the asterisk there is important: when we talk about pod startup, we're talking about pre-pulled images, because we're trying to isolate and test Kubernetes itself, not the variability of our storage network in particular.
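To make that second SLA concrete, here is a minimal sketch of what checking the pod startup number could look like, assuming the official kubernetes Python client, an existing "slo-test" namespace, and a pre-pulled pause image; the real testing used the cluster-loader tooling from the SVT suite I'll mention later, not this script.

```python
# Minimal sketch: create N pods from a pre-pulled image, time how long
# each takes to reach Running, and report the 99th percentile.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
NAMESPACE = "slo-test"   # hypothetical namespace
N = 100

durations = []
for i in range(N):
    name = f"slo-pod-{i}"
    start = time.time()
    v1.create_namespaced_pod(NAMESPACE, client.V1Pod(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PodSpec(containers=[client.V1Container(
            name="pause",
            image="k8s.gcr.io/pause:3.1",      # assumed to be pre-pulled
            image_pull_policy="IfNotPresent",
        )]),
    ))
    while True:
        pod = v1.read_namespaced_pod_status(name, NAMESPACE)
        if pod.status.phase == "Running":
            break
        time.sleep(0.1)
    durations.append(time.time() - start)

durations.sort()
print(f"p99 pod startup: {durations[int(0.99 * N) - 1]:.2f}s (SLA: < 5s)")
```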
So one of the questions that's coming up a lot of late, and we had a little bit of talk about this in the keynote today and in Eric's presentation just before mine, is why combine infrastructure as a service with an application tier like Kubernetes. The way I think of it is in terms of exposing and consuming resources. If I think about a single host, traditionally the Linux kernel has been responsible for taking the hardware, the CPU, disk, and memory, and exposing it for consumption by user space processes. When I scale that out to a distributed system, I still need something to provision systems and expose their resources, and those resources may be hardware or, increasingly, virtual, as when we think about software-defined networking, for example. Kubernetes is then what allows me to have a translation layer that effectively communicates between the application and the underlying infrastructure, without my application itself having to be tied to that infrastructure.

The other thing, as you'll have noted when I opened, is that we're talking about running OpenShift and Kubernetes in virtual machines for most of this testing, although there was some bare metal testing as well. I think it's a reflection of where we are currently that this is the reality of most of the production Kubernetes deployments we're seeing at the moment, but it's not necessarily where things are going. So what's going to be interesting this week, I think, particularly in the keynote demos tomorrow, is that you're going to see people breaking down the monolith that is OpenStack into the projects that are still useful in an environment where I'm running Kubernetes on OpenStack-managed bare metal. Among the examples mentioned, Neutron and Cinder are potentially interesting, and there are others.

That's a model that's a little bit in conflict with the sandwich Eric was talking about, a different way of thinking: maybe I'm not building a sandwich of OpenStack and Kubernetes; maybe I'm building a bare metal compute pool, which might be managed using something like Ironic, some of which runs OpenStack for the purposes of running VMs, and some of which runs Kubernetes directly on that metal, but potentially using some of those shared services to communicate in a complex application.

The other factor in why people are mostly running it in VMs at the moment, I think, is people and culture. As was mentioned today, we had Red Hat Summit last week, with a lot of good customer conversations at that event. Just to give you a sense of the breadth of where people are coming from: at one end, some people are all in, building greenfield applications on OpenShift; at the other end of the scale, you have people whose IT organization currently wants them to run one container per VM because of concerns about isolation. We can argue and discuss whether those concerns are real, because we use SELinux in a similar fashion to isolate against the unlikely event of a hypervisor breakout, for example, when we talk about virtual machines. But regardless, there are people and culture challenges in getting acceptance to the point where you can run Kubernetes on bare metal in some organizations. It's also just a factor of the way IT has developed over the preceding 10 to 15 years, where in a lot of organizations it's now easier to consume a virtual machine than it is to actually get your hands on bare metal hardware. These are all things OpenStack can help with, and Kubernetes as well.

In terms of Kubernetes itself, Red Hat is obviously a big contributor. We build OpenShift v3 around it, and that is what we used to test in this exercise. It's an integrated platform built around Docker and Kubernetes, building on top of them and providing workflows for CI/CD and for building and pushing code to production. The way this typically fits together at the moment is that we have our application or software layer at the top, with OpenShift providing the application platform, based on Kubernetes, in the middle; and via the cloud provider implementation in Kubernetes, it has the ability to talk to the underlying cloud. When my application requests a persistent volume claim, for example, that layer knows how to translate it into a call to Cinder in the OpenStack case, a call to EBS in the Amazon case, and so on. So we're maintaining the technical independence of our application while also providing the contextual awareness to make the best use of the available infrastructure. There is a published reference architecture around this, available at that link, and I should mention I'll send these slides out afterwards as well.

But now I want to get into the meat of the actual performance and scale testing we did. The first thing that comes up when you're doing any kind of perf or scale testing is where you want to test, and we have a couple of different approaches available to us. Red Hat does have its own scale lab.
We also have the opportunity to work with partners in some of their labs from time to time, and in this particular case we worked with the Cloud Native Computing Foundation, which has a 1,000 node cluster, provided by Intel, for use by the CNCF community. Here we focused primarily on the OpenShift and Kubernetes testing, so the OpenStack cluster itself is around 300 nodes, running the 2,048 VMs we put on top of it. In terms of the compute node and storage node specs displayed here, there's nothing I'd say is too important to focus in on, except that these compute nodes did have an NVMe PCIe SSD available, and just as we did in the previous round of testing, we made direct use of those, which I'll come back to in the container-native storage piece later in this presentation.

In terms of how to test, the Red Hat performance and scale team has a set of tools in what we call the System Verification Test suite. These include tools for testing and metering application performance, performance and scalability via the OpenShift web UI, cluster scalability, networking performance, and reliability and longevity for things we need to run over a longer period of time, say weeks. It also includes tools like the image provisioner, a set of Ansible playbooks for preloading the image with the OpenShift pieces in particular, so that we're using an image pre-baked with everything we need, and focusing on what we want to measure rather than on the performance of our storage network.

In terms of what we actually deployed: as I mentioned, we had the 300 nodes of OpenStack with the 2,048 VMs on top, and we also had the other cluster, with 100 nodes of OpenShift directly on bare metal. We deployed Red Hat OpenStack Platform 10, which is based on Newton; the previous testing I reported in Barcelona was, if I recall correctly, based on 8. We used OpenShift Container Platform 3.5 early access builds, which are built around Kubernetes 1.5, and mostly Red Hat Enterprise Linux 7.3. I say mostly because for the overlay2 plus SELinux testing we actually used a RHEL 7.4 preview kernel, and I'll get into that in more detail in a second. For the deployment, OpenStack and Ceph were deployed using TripleO, which we call Director in the context of the product, and OpenShift Container Platform was deployed using the playbooks in the openshift-ansible project.

We also applied some previous learnings in how we went about doing this. In terms of storage, each of the storage nodes included two SSDs and ten SAS disks. We know that Ceph performs significantly better when deployed with its write journals on SSD, so we used the SSDs to create two write journals and allocated five of the spinning disks to each of those. In all we had 90 Ceph OSDs, which gave us 158 terabytes of available disk space. As for the NVMe devices for the container-native storage testing: those appear as a PCI device on the box, so we were able to use the PCI passthrough functionality, which has been in OpenStack since something like Havana and is commonly available, to pass them directly to the VMs where we wanted to run the container-native storage nodes.
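As a hedged sketch of what that wiring can look like from the Nova side, here is one way to define a flavor whose instances each receive one passed-through NVMe device, assuming python-novaclient, an already-established Keystone session, and a PCI alias named "nvme" configured on the compute nodes; the names, IDs, and sizings are illustrative rather than what was used in this lab, and the exact nova.conf option names vary by release.

```python
# Sketch: flavor requesting one passed-through NVMe device per instance.
# Assumes nova.conf on the compute nodes whitelists the device and gives
# it an alias, along the lines of (option names vary by release):
#   pci_passthrough_whitelist = {"vendor_id": "8086", "product_id": "0953"}
#   pci_alias = {"vendor_id": "8086", "product_id": "0953", "name": "nvme"}
from novaclient import client

nova = client.Client("2", session=keystone_session)  # session setup omitted

flavor = nova.flavors.create(name="cns.node", ram=65536, vcpus=16, disk=40)
# "pci_passthrough:alias" is the extra-spec key Nova's PCI request
# handling honours; "nvme:1" asks for one device matching the alias.
flavor.set_keys({"pci_passthrough:alias": "nvme:1"})
```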
We also, through management of the image upload process and ensuring that we convert to raw before we upload, were able to consume a lot less disk space than we did last time around: only 1.5 terabytes, versus 22 terabytes previously, for twice as many VMs. That also brought our boot times down quite a bit, so at 2,048 VMs this time out, we managed to boot in 15 minutes, using a snapshot and boot-from-volume process to get some efficiency there.

All right, on to network ingress and routing. The routing tier in OpenShift consists largely of nodes running HAProxy for ingress into the cluster. We identified that, on average, we get a large number of low-throughput connections; typically the applications, or at least the greenfield applications we see, are more web-based or transactional in nature, so we weren't really looking at a small number of high-throughput connections in this testing. We've already made some improvements in this space based on previous iterations of testing. For example, the default connection limit of 2,000 was leaving plenty of available room on the CPUs of these boxes, so that's been bumped to 20,000 out of the box in OpenShift 3.5; we're already applying some of the learnings we've had in this space. The other thing is that as of OpenShift 3.4, I think, it's now much easier to customize the routing layer because you can feed it a ConfigMap. The load generator itself is also configured using a ConfigMap, an example of which is on the right here. It queries the Kubernetes API for a list of routes, then builds its list of targets from that dynamically. We zoomed in on a particular workload mix, a combination of HTTP with keepalive and TLS workloads, again because that's representative of what we see with our field-deployed customers, who have a mix of applications serving both internal and external users, and therefore a variety of security contexts in which they want to run them.

In terms of graphing this and walking through the scenarios: the graph shows a throughput test, with requests per second on the y-axis, so higher is better. In the scenarios listed on the left, nbproc refers to the number of HAProxy processes. Further down the list is sched_migration_cost; that's a kernel scheduler tunable that allows us to tell the kernel how and when it should load balance tasks among the available cores, and we use it in one of these scenarios as well. As we go through the graphs, some interesting things stand out. CPU affinity did matter. Our first scenario runs on any CPU; the following bars run pinned to core zero, core one, core two, and so on. What you notice when looking at these is that we get a significant boost from pinning to core zero and core two, but not necessarily on core one, and the reason for that is locality to the PCI device that's handling the network traffic. When we have that locality, we get a significant performance boost.
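That locality is discoverable from sysfs, so as a small illustration, here is a sketch of finding which CPUs sit on the same NUMA node as a network device; the interface name is a placeholder, and the sysctl value in the comment is illustrative rather than the one used in this testing.

```python
# Sketch: find the CPUs local to the NIC's PCI device, so HAProxy can
# be pinned near the hardware doing the network traffic.
from pathlib import Path

def local_cpus(interface: str) -> str:
    # Returns something like "0-3,8-11"; suitable for `taskset -c`
    path = Path(f"/sys/class/net/{interface}/device/local_cpulist")
    return path.read_text().strip()

print(local_cpus("ens1f0"))   # "ens1f0" is a placeholder interface name

# The sched_migration_cost tunable mentioned above is a sysctl; on a
# RHEL 7 era kernel it is exposed as kernel.sched_migration_cost_ns:
#   sysctl -w kernel.sched_migration_cost_ns=5000000   # illustrative value
```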
Back to the graph: with the nbproc setting, there was an impact when we increased it to two, and as you would expect, we got roughly double out of that. Interestingly, though, going to four didn't scale the same way, and the reason is that the guest in question only had four vCPUs. In effect, if you have four HAProxy processes busy, you're not leaving any room on that particular guest for anything else to happen, including the host processes that need to schedule as well. But the biggest thing of all of these was actually sched_migration_cost: simply by changing that, we were able to get a 20% improvement from the baseline. It's also a common technique with low-latency networking, and it's included in the guide I'm going to mention at the end, which came out of some of this work. The reason it makes a difference is that by keeping the HAProxy process on the core for longer, we're basically increasing the chances that we're going to hit the host CPU cache for that core.

All right, in terms of general networking, a bit of an overview first. OpenShift includes and uses OpenShift SDN by default. This is a solution built around Open vSwitch and VXLAN, and it provides full multi-tenancy across any of the footprints where we support OpenShift, so that includes physical and virtual infrastructure, private cloud, and public cloud. The downside, in the context of OpenStack, is that you're often already running some kind of overlay networking on the infrastructure, which means we're running with double encapsulation, and I'll go into the nuances of that a little when we get to the results. The other thing I should mention is that that layer is fully pluggable, as is the network ingress layer. One of the things we're working on for the future is Project Kuryr, as a potential way to have the Kubernetes networking layer talk directly to the OpenStack networking, and that's something we're excited about.

Again, we're talking about web-based workloads that are mainly transactional, based on what we see in the field, so we focused on a microbenchmark: a ping-pong test with varying payload sizes. For the purposes of making this readable, I've tried to pair these up a little, and each of the groupings is notated by the arrows. The first pair is tested with 64-byte packets, the second pair with 1,024-byte packets, and the last one with 16,384-byte packets. The difference within each pair, between the first and second group of bars, is that in the first case we have just one stream, and in the second case we increase that to four streams.
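For anyone unfamiliar with the shape of that kind of test, here is a toy ping-pong in Python against a plain TCP echo server; the endpoint is a placeholder, and the actual testing used the SVT tooling rather than anything this simple.

```python
# Toy ping-pong: send a fixed-size payload, wait for it to be echoed
# back, repeat, and report round trips (transactions) per second.
import socket
import time

def pingpong(host: str, port: int, payload: int, rounds: int = 1000) -> float:
    data = b"x" * payload
    with socket.create_connection((host, port)) as s:
        start = time.time()
        for _ in range(rounds):
            s.sendall(data)
            received = 0
            while received < payload:        # the echo may arrive in pieces
                chunk = s.recv(65536)
                if not chunk:
                    raise ConnectionError("echo server closed the connection")
                received += len(chunk)
        return rounds / (time.time() - start)

for size in (64, 1024, 16384):              # the three payload sizes tested
    print(size, pingpong("192.0.2.10", 5001, size))  # placeholder endpoint
```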
In terms of the colors of the bars, we tested from a variety of points: bare metal itself, bare metal plus pod, VM plus pod, and so on. What's interesting, looking at the first pair, is the difference when we add streams: going from one stream to four streams we obviously get a big boost, as we would expect, but there's not a lot of difference across bare metal versus bare metal plus pod versus VM and so on; those bars are relatively aligned within their group. What we see, though, as we start increasing the payload size in the later bars, to a minimal extent with the 1,024-byte packets but certainly once we get up to 16,000-plus, is quite a lot of degradation in the VM and VM plus pod cases versus straight bare metal. That's somewhat to be expected, but still something we needed to validate. One bonus tuning note: when you have over 1,000 routes on the node, we needed to increase the kernel ARP cache size, so we've actually increased that by a factor of eight in the out-of-the-box tuning for OpenShift 3.5.

All right, the last category of tests I want to talk about is storage. First of all, for those who aren't familiar: in RHEL until recently, and actually it's still the default at the moment, we used Device Mapper as the Docker storage graph driver. Overlay support was added as an option in 7.2, and overlay2 in 7.3. The main reasons we've stuck with Device Mapper thus far are maturity, supportability, security (in particular, until very recently we were unable to run the overlay drivers with SELinux enabled), and POSIX compliance. What you get when you trade those things off and use the overlay graph drivers is density improvements from sharing page caches, which is particularly valuable if you have a common base image, or a large percentage of your base image is common. The changes that allow us to support overlay2 with SELinux enabled have actually landed in the upstream Linux kernel as of 4.9, which makes Dan Walsh very happy, which makes all of us very happy. Device Mapper is going to remain the default in RHEL, but in 7.4 you will be able to use overlay2 with SELinux, and we used a preview of that kernel for the subsequent testing of this graph driver. In Fedora 26 the plan is to make overlay2 the default, and for folks familiar with that development process, it will eventually funnel down to RHEL as the default as well.

For testing this out, we again used cluster-loader, one of the tools from the SVT repository, which allows us to bulk load a heap of Kubernetes objects at once, or, as in this case, staggered. We used a single base image, which as I mentioned is nominally the best case scenario for the overlay graph drivers. Earlier, Eric mentioned an upper limit of 110 pods per node; I think that was actually a previous limit, because it's now 250 in the more recent versions of Kubernetes. So we ran up 240 pods on the node, and we rate limited the creation: the bumps you see in the line are batches of 40, if I remember correctly, so we create 40 pods, check that they're all working, and then create another 40. As you'd expect, you get a reasonable memory saving in the overlay2 case (overlay2 is the lower line), and as you create more and more pods, you get more of a saving in memory. More importantly, as you would expect based on the caching, there's a little blip at the start in the overlay case, in red, when we load the image for the first time, but after that you're saving significantly on IOPS versus Device Mapper, which obviously continues to need to make those reads. In general, we found the overlay2 work pretty stable, and of course we were able to use it with SELinux, which makes it more compelling and reduces some of the trade-offs in using it.
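As a rough illustration of that rate-limited loading, here is a minimal sketch of the same pattern, again assuming the kubernetes Python client; the namespace, node, and image names are placeholders, and cluster-loader is what actually drove the test.

```python
# Sketch: create pods in batches of 40, waiting for each batch to reach
# Running before starting the next (the "bumps" in the graph).
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
NS = "overlay-test"                         # hypothetical namespace

def wait_running(names):
    while True:
        pods = v1.list_namespaced_pod(NS).items
        running = {p.metadata.name for p in pods if p.status.phase == "Running"}
        if set(names) <= running:
            return
        time.sleep(2)

for batch in range(6):                      # 6 batches of 40 = 240 pods
    names = []
    for i in range(40):
        name = f"pause-{batch}-{i}"
        v1.create_namespaced_pod(NS, client.V1Pod(
            metadata=client.V1ObjectMeta(name=name),
            spec=client.V1PodSpec(
                node_name="node-under-test",  # pin to the node being measured
                containers=[client.V1Container(
                    name="pause", image="k8s.gcr.io/pause:3.1")]),
        ))
        names.append(name)
    wait_running(names)
```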
The other thing I wanted to talk about a little is the container-native storage testing we did. OpenShift Container Platform, and Kubernetes itself, supports a wide variety of volume providers via the volume plugin mechanism. Container-native storage is a GlusterFS-based solution that can plug into that, but the idea is that it itself runs on OpenShift. Why would you want to do that? Imagine you have your cloud provider with some number of physical volumes attached, from EBS or, in the OpenStack case, Cinder, and sometimes there are limitations on how many of those you can actually attach to a guest. Container-native storage is aimed at taking those and plumbing from them a larger number of smaller volumes for applications, ideally co-locating the storage with the application. In this particular case, we separated them out for the purposes of isolating what we're measuring: the nodes where we deployed the container-native storage piece were marked unschedulable for other workloads, because we wanted to ensure we were testing just the performance of the storage. These nodes, again, are the ones where we had plumbed the NVMe disks directly through to the instance, and we exposed one gigabyte volumes from those as persistent volumes for Kubernetes applications to use.

From that, we ran through throughput numbers for create and delete operations, as well as checking on API parallelism. What we're really checking for here is that it behaves the same way we see other volume providers for Kubernetes behave, which is to say we wanted to ensure we're allocating volumes in constant time. What we found is that it takes roughly six seconds from submitting the persistent volume claim to having it bound to our container, and that's constant over time. The number didn't change between bare metal and virtualized, and we did run the same tests against other providers, not shown here, to validate that we got the same results. As you can see, we gradually scale up on the right towards 700-plus persistent volumes bound.
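If you want to sanity-check that bind time on a cluster of your own, here is a minimal sketch, once more assuming the kubernetes Python client; the namespace and StorageClass names are hypothetical.

```python
# Sketch: time a 1Gi persistent volume claim from submission to Bound,
# the roughly-six-second figure quoted above.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
NS = "cns-test"                              # hypothetical namespace

claim = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="bench-claim"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="glusterfs-cns",  # hypothetical StorageClass
        resources=client.V1ResourceRequirements(requests={"storage": "1Gi"}),
    ),
)

start = time.time()
v1.create_namespaced_persistent_volume_claim(NS, claim)
while True:
    pvc = v1.read_namespaced_persistent_volume_claim("bench-claim", NS)
    if pvc.status.phase == "Bound":
        break
    time.sleep(0.5)
print(f"claim bound in {time.time() - start:.1f}s")
```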
I want to talk a little bit about next steps. Out of this round of the exercise, the OpenShift 3.5 testing, we filed 40-plus bugs across a number of projects and components: again, not just Kubernetes and OpenStack, but also Docker, Ceph, the kernel, Open vSwitch, even Golang. Obviously, our aim is to fix as many of those as possible in the relevant upstream communities and get the fixes into the products. For many of them that's already happened; others you'll see in future iterations. There is also now, with OpenShift Container Platform 3.5, a scaling and performance guide, which is publicly accessible, relevant to Kubernetes users as well, and available on the Red Hat website at access.redhat.com.

In terms of getting involved: if there are folks here operating Kubernetes on OpenStack, or interested in doing so, there is a forum session on Wednesday that I'll be facilitating along with a couple of other people, aimed at gathering feedback about how that's working and what you would like to see in future from that combination. I mentioned the scalability SIG, which is where the parts of the upstream Kubernetes community interested in scale and perf coalesce. We also have a SIG specifically dedicated to OpenStack and the integration we have there. Currently, that's largely focused on the cloud provider implementation for Kubernetes talking to OpenStack, which is somewhat monolithic in nature; you'll be hearing later this week about various ideas for breaking some of that apart and using the bits and pieces separately. It's also a place for people who are active in the projects for deploying OpenStack using Kubernetes to come together and share their conversations. If you're interested in seeing OpenShift running on Red Hat OpenStack Platform with some real applications, or some real example applications at least, we do have that running down at the booth, so you can go and see it live there. And just to finish up with a couple of references: I will tweet out these slides shortly, my Twitter handle is xsgordon, so you can catch them there. I should mention in particular that the Trello board at the bottom is public, so you can follow what's coming next. And that's it.