I'm connected to it, and I can actually do live demos here, and I will. The other part is, I've got everything on GitHub. That's my GitHub account, and there's a repo in there called OpenShift Performance. You can clone that and you'll get the slides, a couple of labs, some helper scripts, and so on. On the home page, in the README, is how to set up that VM that you've got. I'm going to pretend that you've used VirtualBox before, which I hadn't until yesterday. These are VirtualBox images; you can also unpack them and use them with qemu if you want, the command's on the page there. So, first person to get their VM up and running gets something, okay? But you have to prove it's up and running. I didn't have all of this done until yesterday, so I just don't have a lot of confidence in that part.

But frankly, way more important than the hands-on demo is that I want to convey to you how we do our work; maybe you can reuse some of those concepts in your own job. Who here tests stuff? Only a couple of testers? I mean, developers are supposed to test too, right? Not as many hands as I expected, but that's cool. So maybe I can tell you how we in the performance engineering group approach our job, how that trickles down into the product, and how we work with engineering to make our products better. I'll go through some of that, and I'll give you some of the performance results that we have most recently with OpenShift.

Our current effort is to get OpenShift.com upgraded to version 3 of OpenShift. Who here has used OpenShift before? Version 2 or version 3? A couple of people have used three. Okay. So it's vastly different; it's basically a different product, quite frankly. It's got the same goal, in that you launch applications based on your source code, but there are almost no other parallels in terms of how to operate it, how to install it, what the requirements are, or how it scales. It's extremely different. I've been working with that team for over four years now, testing v2 and now v3.

Okay, so there's no little clicker, right? Okay. Yeah, so that's the repo; I want you to clone that. I don't know how the wireless is going to work for you, but there are instructions on how to set up the VM, and again, all of those are on GitHub, so I hope you can reach GitHub right now and get them.

I've sort of covered most of this already. I'll show you some of the tunings that we've got. Our team is also responsible for the tuned product that's in RHEL, which does workload-specific kernel and user-space tuning to optimize for certain workloads. We've got profiles for pretty much every product that Red Hat sells from the platform BU, and most of them from the cloud BU at this point, and Atomic and OpenShift are no exception.

As I mentioned, we're converting OpenShift.com. We have to make sure it scales vertically and horizontally. We're working closely with Samsung and Google on those efforts because we all have the same end goals in mind. And we're pretty much at the forefront of R&D into Kubernetes scale and performance at this point, including Docker. Quite frankly, I think we're further ahead than both of those companies. I think so.
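If you want to follow along, the setup is roughly this. The repo path is a placeholder (use the one shown on my GitHub page), and the KVM import is just one way to do it if you'd rather skip VirtualBox, not necessarily the exact command from the README:

    # Clone the repo with the slides, labs, and helper scripts
    # (replace <github-user> with the account on the slide).
    git clone https://github.com/<github-user>/openshift-performance.git
    cd openshift-performance && less README.md   # VM setup steps live here

    # One illustrative way to run the unpacked disk image under KVM instead
    # of VirtualBox (file name here is hypothetical):
    qemu-img convert -O qcow2 openshift-lab.vmdk openshift-lab.qcow2
    virt-install --import --name openshift-lab --ram 4096 --vcpus 2 \
        --disk path=openshift-lab.qcow2,format=qcow2 --noautoconsole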
Okay, so as a group, we get our inputs from a bunch of different stakeholders: engineering, Bugzilla, customers, our own research. The marketing people are always reaching out and asking us to create data that helps the field sell the product, so we write white papers and blog articles and things like that around performance to help us differentiate from the free bits. That all ends up in a tracking system like Trello or whatever else you use to manage your work. And then at the bottom is our workflow, our pipeline. It all revolves around automation, so everything ultimately gets automated through Jenkins and whatever supporting scripts we need in Beaker and EC2. There are really only two boxes on here that I'm going to talk about today: the data collector and the cluster loader. As a group, we have a need to profile Linux and user space throughout all of our testing, so we built a data collection harness that's plug-in extendable. It's called pbench, and Elko over here has written plug-ins for OpenShift that let us gather cluster state as well as Golang profiling data, which has already helped us quite a bit.

Okay, so the fun thing about working on the emerging products is that there's a lot of low-hanging fruit. Having experience from previous generations of software, I've seen pretty much all of this stuff already. They're not really handing out patents for any of the new container-based stuff, because it's pretty much all been done before, if not 20 or 30 years ago. From a productivity standpoint, we're able to jump into the new stuff pretty easily because the fundamentals don't change. There are four food groups, CPU, memory, disk, and network, and those need to be in a healthy balance. And there's bottleneck analysis: it doesn't matter what the software is, the techniques are the same.

So, I mentioned we're trying to get thousands of these nodes running, did I mention that? The first thing we ran into was that OpenShift uses Ansible as its installer, and we had to tune Ansible itself so the install wouldn't take six hours. A thousand really cheap nodes on Amazon cost about 200 bucks US an hour, so a thousand-node cluster that takes six hours to install costs over a thousand US dollars just to install the product. We can't do that, of course, so we had to optimize the installer. I think on a subsequent slide I've got a recipe, and I think we submitted it to the documentation people. Going through that process is going to help the product, because now we help customers install the product faster. And it's actually a lot faster: it went down from six hours to two for a thousand-machine cluster, which is pretty good.

Part of the process of getting there was that we had to learn how to build AMIs. Who here has used EC2 before? Okay, a small percentage. Neither had we, frankly; our group had worked on bare metal and VMs pretty much exclusively, and since OpenShift is hosted on Amazon, we had to get up to speed on EC2. A lot of the things we learned were, I guess, what you'd consider basics, but they end up being kind of arcane knowledge and not very well documented. So anyway, we had to build our own AMI that was pre-seeded with the configuration, the tuning, and the best practices.
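To give you a flavor of the installer recipe, these are the kinds of ansible.cfg knobs involved; the values here are illustrative, not the exact numbers we submitted to the docs team:

    # Illustrative ansible.cfg tweaks for driving a very large inventory.
    cat > ansible.cfg <<'EOF'
    [defaults]
    forks = 50                 # push to many more hosts in parallel
    gathering = smart          # don't re-gather facts on every play
    fact_caching = jsonfile
    fact_caching_connection = /tmp/ansible_facts

    [ssh_connection]
    pipelining = True          # fewer SSH round trips per task
    ssh_args = -o ControlMaster=auto -o ControlPersist=600s
    EOF

The big wins on large clusters come from parallelism, fact caching, and reusing SSH connections rather than setting one up per task.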
The only way to have this thousand-machine cluster usable in a very short period of time was to do a lot of work up front. Part of that was seeding the AMI with Docker images, which take forever to pull; even on a fast link like EC2's, a couple of seconds per machine is not tolerable. The fastest way ends up being to have everything already in the image, so that's what we did. Everyone on my team uses the same AMI for our testing, which is RHEL 7.2, the latest Docker, the latest OpenShift, whatever we're testing. Yeah, so there's the Ansible recipe; that should be in the documentation by now.

Okay, I mentioned the tuned product. It rolls these settings in. The upstream defaults are kind of for laptops, quite frankly, and with RHEL as a data center operating system, we decided to change some of those defaults to perform better. A lot of the hardware has changed out underneath Linux too, so in order to gain back the performance we used to have, we had to make some of these changes as well. Basically, we couldn't ship RHEL 7 regressing 20 or 30% versus RHEL 6. We just couldn't do that, even though it had nothing to do with the software; it's all about the newer chips. A lot of that is undoing what Intel has done over the last couple of years in terms of creating processors with deeper C-states and frequency scaling. The CPUs basically go to sleep when you're not using them, and that's cool for power savings, but it hampered performance quite a bit. So what we did was find a happy medium between the low end and the high end for RHEL: it will still save power, and it doesn't run flat out unless you actually put load on it. Anyway, that's some of the stuff in the top-left box that we did for RHEL 7. Also, to help tasks stay on the CPU longer, we increased the scheduler quantum and increased the block read-ahead; those are things that don't regress workloads, they only help. We also reduced the swappiness value, although I very rarely see swap used at this point unless there's an application bug.

So anyway, tuned has a concept of inheritance: you've got profiles that build on their parents. throughput-performance is the default profile in RHEL 7, and for virtual guests we do a couple of extra things. If you install KVM, you automatically get that profile; if you install RHEL 7, you automatically get the default one. So for virtual guests you have this plus that, and for OpenShift, let's say OpenShift inside a VM, you've got this plus this plus this. When you install OpenShift, this is the tuning that you get by default. And in the future we've got a couple of other things we're thinking about for scalability; actually, the fast open one is for performance too. All these tunables we added don't really come into play until you get to, what was it, a couple hundred containers. Our original guesses were that people wanted to run many hundreds, but everyone I've ever talked to has said, you know, if we get to 50, we'll be happy. We're pushing it way beyond that for OpenShift.com, it'll be in the hundreds for sure, and all the products start showing their lack of scaling capacity, I guess I'd say, when you get to that point.
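To make the inheritance idea concrete, a custom tuned profile that builds on the default looks roughly like this; the profile name and the sysctl values are illustrative, not the exact profile we ship with OpenShift or Atomic:

    # Illustrative child profile; the real atomic/openshift profiles ship
    # with those products, this is just the shape of one.
    mkdir -p /etc/tuned/my-openshift-node
    cat > /etc/tuned/my-openshift-node/tuned.conf <<'EOF'
    [main]
    include=throughput-performance    # inherit the RHEL 7 default profile

    [sysctl]
    # example settings a container host might layer on top
    vm.swappiness=10
    net.ipv4.tcp_fastopen=3
    EOF

    tuned-adm profile my-openshift-node   # activate it
    tuned-adm active                      # confirm which profile is applied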
I've got a question. Yeah? Do you have any sort of automation for the fine tuning of these parameters? This is automatic. What happens is that tuned, as a service, is started by default. It looks at /etc/system-release-cpe, which is a machine-parsable file that has RHEL, its version, and its variant, the variant being workstation, server, atomic. So if it's installed on Atomic, it sees that and applies the whole thing. So yes, it is automated. And you can unwind all of these, and you can create your own profiles. This isn't a tuning presentation, but I just want to let you know that our group contributed these because they help in all of our tests, and so they should help customers too.

So I mentioned there are two boxes I'm going to go over today. One is the data collector, and the other one is the cluster loader. I mentioned pbench earlier; it's on GitHub there, and I'll show you a demo of it real quickly. What it does is go out and run all of the tools like sar, mpstat, iostat, perf, then collect them, normalize the data, and produce graphs. The reason we do this is because we want to put very tight calipers around the test itself. We want to be able to control the interval of different tools at different rates, and this puts a nice, easy-to-use wrapper around all of the system tools. And it's kind of funny, because RHEL supposedly has a stable set of packages, but we tried to use sar from RHEL and there's enough churn between versions that we actually have to carry our own sar. So when you install pbench, you might see sysstat getting pulled in from a different repository that we maintain and compile on our own. Did the different versions of sar change their performance? No, they don't change the performance, they just change the output. It's not standard in any way, the arguments change, and the data file format is not backwards compatible. So if we need to machine-parse it, we either need to keep copies of every sar binary that Red Hat's ever shipped, or use our own and have a little bit more control. That's what we decided to do. We didn't want it to be that way; we tried for a long time not to, and it kept breaking. At this point we're not really about... I take that back: we file bugs wherever we can, but we're not going to fix everything we find. We sort of have to move on and do the rest of our job, which is to get performance data, not to stabilize sar. Quite frankly, what we're going to do is move all of this to PCP eventually anyway. PCP's binary format has been backwards compatible for the last 25 years, so at least we won't have that problem. We'll have something else, probably.

Anyway, so this tool creates the graphs that I'll show you, and we've got benchmarks built into it as well. We use uperf for network testing instead of netperf, and also fio for storage testing. So if you just type pbench_fio, it will do a full sweep of your storage, small block sizes to large, throughput to latency. It'll run several samples and calculate the standard deviation; if a result is outside the standard deviation window, it throws those results out and continues to run samples. So it does its best to get reliable data out. We ship a bunch of benchmarks, netperf is in there too, so you can take a look at those if you want.

So that's one tool. The second tool: when we started with containers, like two years ago-ish, there were no tools. There was zero visibility into any of this, so we had to build a couple of things.
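A minimal pbench session looks something like this; command names are as I remember them from the agent, and your version may differ slightly:

    # Register the default tool set (sar, mpstat, iostat, pidstat, perf, ...)
    pbench-register-tool-set

    # Run the built-in fio wrapper: sweeps block sizes and I/O patterns,
    # runs multiple samples, and throws out results outside the stddev window.
    pbench_fio

    # Or wrap any ad-hoc test so the tools start and stop around it.
    pbench-user-benchmark -- ./my-cluster-test.sh

    # Raw tool data plus generated graphs typically land under the agent's
    # results directory:
    ls /var/lib/pbench-agent/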
Oh, and also at the same time, Colin invented OSTree for Atomic, meaning we couldn't just add software to the host anymore. That's a real problem for people trying to debug. Everything was new all at the same time: Atomic, where we can't install things the same way; containers, which nobody knew what the hell they were; RHEL 7, which we had just barely shipped. And now we have to do debugging in an environment that we're not comfortable with. So what we did was build a tools container. Well, we had a container that we were using internally, and it became so useful and popular that we ship it as a product now. So if you do a docker pull of the rhel-tools image, I think that works; I've got the CentOS example on the bottom here. Anyway, you'll get an image pulled down that's got all the debugging stuff that we would normally use. There's a list in here. So on an Atomic system, RHEL Atomic or Fedora or CentOS Atomic, if you actually need to troubleshoot what's going on, you'll need a container like this; you can build your own that looks a lot like it, or you can use the one that we ship. So: docker pull centos/tools or fedora/tools. Those are all up on Docker Hub, and the sources are on GitHub.

I'll show you that one in a second, and it's actually the use case for super privileged containers. Has anyone heard that term before? A couple of people. A super privileged container is a container without the container: it's a container that disables every namespace and every bit of security except the mount namespace, and we've even been able to deal with the mount namespace. So all you're left with is the PID namespace, and you can turn even that off. When you launch this container, it does not get its own network stack; it uses the host's network stack. It uses the host's process table. It bind mounts the host's / partition into the container, which lets you copy files from inside the container to the host. That's actually how sosreport works on Atomic: if you're on an Atomic system and you type sosreport, it will spin up one of these debug containers and copy the tarball of the sosreport out to the host, because when you stop the container, the container goes away. So to debug issues from our customers, we use this container. It's grown arms and legs, far beyond our original intentions, and it's used by the support guys quite a bit.

That defeats isolation a bit, right? There's as little isolation as possible, yeah. It's basically just a packaging format, quite honestly. It's a little fat container, though, because it's got a lot of stuff; I think it might even be close to a gig after you compress it. Part of the reason we were able to shrink the install size of Atomic is that we pulled all of the useful stuff, at least from my perspective, out of the OS and pushed it into a container. So you don't need this container for the OS to function, but if anything ever goes wrong, you're going to need it for tcpdump or whatever else. This guy helps us with our day-to-day and helps the field as well.

I could have sworn I had this container pulled already; maybe that was a different machine. Okay, it'll update, that's why. Okay. So the container image is on the system now, and I'll start a container. Let me do this first: can you guys see the bottom of the screen? I'll move it up. So I just type docker run and then the container name, right? And I just get a shell. There's no /host directory. And PID 1 is bash.
I just want you to keep that in mind. I think this patch is actually in upstream Docker now, but for a while it wasn't. So what we've got is metadata, labels, associated with the container image itself that Docker doesn't use, but Red Hat has a tool called atomic that reads those labels and runs Docker in a certain way. For the tools container, as I mentioned, it tries to disable as much security as possible: it runs in privileged mode, it uses the host network, it drops the network namespace, it drops the PID namespace, and it does the bind mount on /host. Basically, this is the command you'd have to run to get a super privileged container, and we didn't want our customers to have to type that; I mean, no one would ever get that right. So what we did was this label support.

I must have done this on another machine. So now I'm in the container, right? I typed atomic run this time instead of docker run. It spits out the label, just for reference purposes, to show how it executed the container. And if you notice here, my prompt is different: it's the host's IP. That's because it doesn't get its own UTS namespace in the kernel. So now if I type ps... Sorry, are you inside the container now? Yeah, but I'm inside the super privileged container, which is mostly not a container. The reason the prompt is different is that if you don't use --net=host, if you join a new network namespace, you get a new hostname. But since I didn't create a new network namespace, I'm now in the host's network stack. I thought the command line was the same for Docker and for atomic? It wasn't. The first time I typed docker run manually; you saw it was just those three things, docker run centos/tools. But in this case, it ran that huge mess of a command. I had the VM running already. Okay, there you go.

So anyway, now I type ps inside this container, and look, I can see all the processes that are on the host. I can also see the host's storage mounted at /host inside my container. So at this point, you can use yum to install whatever you want, or I could use tcpdump or whatever it is. That's the whole idea. So it basically becomes a regular user process in the operating system? Pretty much. It still runs under Docker, though; it's still a child of Docker in terms of the process tree, so it's not technically a standalone process. So what is PID 1 in that container? Still bash, because that's what I typed. You don't actually have to give it bash; I can just give it a command to run, and it'll run the sosreport inside the container. We also patched sos so that it knows to look in /host/sys for all of that debug information that sos usually collects. It basically chroots itself into /host: it's running inside the container, but it has to look outside the container for certain virtual file system data that it wants to collect and pack up on behalf of the customer to send to us. This is digging deep back into my memory, because we did this a couple of years ago, but I figured it was useful to show you the super privileged container and the tools container at the same time. So it's running. Yeah, I'll show you. I forget where we put it on the host, though; /var/tmp, I think. I ran it ahead of time to make this a little less time-consuming.
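For reference, the "huge mess of a command" that atomic run expands to, based on the image's RUN label, is roughly this; I'm reconstructing it from memory, so treat the exact flags and mounts as approximate:

    # Approximate equivalent of 'atomic run centos/tools' for a super
    # privileged container -- not the verbatim RUN label from the image:
    #   --privileged   drop the security confinement (SELinux, caps, etc.)
    #   --net=host     share the host's network stack and hostname
    #   --pid=host     share the host's process table
    #   -v /:/host     bind mount the host's / at /host
    docker run -it --name tools --privileged --net=host --pid=host --ipc=host \
        -e HOST=/host -e NAME=tools -e IMAGE=centos/tools \
        -v /run:/run -v /var/log:/var/log -v /:/host \
        centos/tools /bin/bash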
So this is what the customer does when they submit a support ticket: they run sosreport and attach it to the case so the engineers can unpack it. Does anyone work in support? You know this already? Have you used sos on Atomic? Yeah, that's where it all came from. Okay, so it packed up the tarball, and if you notice, I'm back on the host now. You can tell because of the customized bash prompt; it doesn't look like a real bash prompt, it looks like Jeremy's, I think, or Bartam's. Where the hell did we put it? Okay. I forgot to run it with one additional flag, so I copied it into the container. atomic run has a --spc flag, which is what would have moved that file onto the host file system for me. Okay. So that's the super privileged container.

Okay, so now let's get into some of the Kubernetes-specific stuff around performance, fairness, and resource management that I wanted to talk to you about; this slide is out of order. Kubernetes has this thing called a quota. There are several different objects in Kubernetes, and one of them is a namespace. It's called a namespace upstream; OpenShift calls it a project; it's the same thing. These quota files are per project. We're going to increase the granularity to make it per pod as well, but it's not yet. Inside the quota can be things like how much memory the project can use, how many pods it can run, how many replication controllers, and so on. Does that make sense?

And millicores are a way of equalizing different CPU generations, boiling them down to a single number. That was invented by the Google guys, who have that problem across their infrastructure, so they call it millicores. It's a Google thing, but Kubernetes uses it. So basically this is 20% of a core. It's not quite that simple, though: 20 microseconds of CPU time on one CPU generation performs differently than 20 microseconds on another, and in order to equalize that and make it fair across their infrastructure, because they're selling CPU time, they came up with a formula that attempts to even the playing field amongst different chip generations. Anyway, that's how you express it. If you're offering a service to your customers, you might offer them something like that. For OpenShift.com it won't be exactly these numbers; maybe we won't let them have 10 pods, I don't know what the rules are eventually going to be, but something like this is going to go onto OpenShift.com, and every namespace will have a quota JSON like that applied to it. So that's how you handle resource management.

If you notice, there's no network or block I/O in there yet. Dan Winship gave a talk today about networking, and what they're going to do is use tc to rate limit the virtual veth devices, the virtual interfaces. If you weren't at Dan's talk: there's a veth device on the inside of the container and a veth device on the host OS, and they're connected like a tunnel. What we're going to do is insert tc traffic classifier rules at a certain point that allow us to throttle traffic in and out of the pods. In and out of the containers, actually, I think, not the pods.
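As an illustration, a project quota along those lines could look something like this; the names and numbers are made up for the example, not what OpenShift.com will actually apply:

    # Hypothetical quota for one project/namespace; apply it with the oc client.
    cat > quota.json <<'EOF'
    {
      "apiVersion": "v1",
      "kind": "ResourceQuota",
      "metadata": { "name": "example-quota" },
      "spec": {
        "hard": {
          "cpu": "200m",
          "memory": "2Gi",
          "pods": "10",
          "replicationcontrollers": "5",
          "services": "5"
        }
      }
    }
    EOF
    oc create -f quota.json -n myproject      # myproject is a placeholder name
    oc describe quota example-quota -n myproject

The "200m" is the millicores notation I was just describing: 200 millicores, roughly 20% of a core.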
So the difference between a container and a pod is that a pod is a single network namespace; it can have multiple containers running inside it. They share the same subnet, and they can talk to each other without firewalling. So a pod is a logical grouping of containers. Pods are actually the smallest schedulable unit in Kubernetes: when you create a pod, and I'll do it in a second, if you want to run one container, you're actually creating a pod that just has one container in it. Pods can have all kinds of other attributes that make no sense in the Docker world, like which node you want to get scheduled onto, or this quota thing. That actually toggles cgroup settings under the covers, as you can imagine, for CPU and memory anyway. And pods and services are really logical distinctions inside etcd, which is what Kubernetes uses as its database for storing this sort of thing. So that's that.

And then I figured I would go through... Did anyone else manage to get the virtual machine up? The virtual machine on the USB? I don't have space. Yeah, I don't have space. Two out of, like, 30. On the VM is a copy of OpenShift Origin from three or four nights ago with everything statically linked, so you just type openshift start master and it starts a master, openshift start node and it starts a node, they talk to each other, and then you have your own environment inside that VM. Alternatively, you can download Vagrant VMs from the OpenShift site if you have Vagrant. So, getting a demo environment has proven to be the hardest thing, so I really apologize.

Okay, so for CPU and memory, it kind of depends on where you're running. We're assuming most people are either going to run on a public cloud or on virt of some kind. Whenever I do a survey, or go see customers, like 90% of them are using virtualization of some kind. There's very little bare metal out there amongst the enterprise customers. Bare metal might be a couple of machines that they still run SAP on in the back corner, or Oracle, because of a licensing thing they can't get out from under. There's very little excuse to use bare metal by and large, though there are still a couple of cases where you have to have it; stock exchanges will never use any kind of virtualization, but even they're interested in containers. My point being that when you're in a VM, you have very little control over CPU and memory placement. You can pin yourself to a virtual CPU with taskset, if you're familiar with that, or numactl yourself to a particular virtual core, but that doesn't mean anything about what the host kernel is going to do with you. In Amazon's case it's Xen, or whatever they're using; Google uses KVM. The host kernel can schedule you wherever it wants. So this sort of tuning makes no sense inside the public cloud. And frankly, we want to get away from the hand-tuning anyway and improve the kernel scheduler and automation instead. So what we did for RHEL 7 was build an automatic NUMA balancing feature into the kernel, and what that does is keep threads and their memory on the same NUMA node. Who here knows what NUMA is? Okay, I'm not going to go through that. Basically, in the RHEL 7 kernel, threads and their memory are more likely, not guaranteed, but more likely, to be on the same node. That's the type of thing we're trying to add to the kernel so people don't have to hand-tune as much.

Okay, storage optimization. This is the thing that's burned Red Hat the worst in the world of containers: when Docker came out, it shipped on something called AUFS, which is a union file system, but it's not upstream.
And we weren't able to do that. So we had to invent something on our own, and that's the thin-provisioned, LVM-based driver for Docker in RHEL. What's really bitten us is that the default setup uses a loopback-mounted sparse file, and the performance is shit. On that VM is a loopback-mounted Docker installation, so it's not you; it's the way the storage comes by default on Fedora, RHEL, and CentOS. So Vivek Goyal, who used to work elsewhere in the kernel, came over to work on Docker's storage specifically, and he wrote a helper utility called docker-storage-setup. If you have an Atomic VM, we've done this for you: there's a separate partition just for Docker, and instead of a loopback-mounted sparse file it uses a real block device, and it's a lot faster. So for production workloads, or even for your workstation if you don't want it to be slow as shit, if you have an extra device, that's the trick. You have to have an extra device, which is why we can't ship it by default: not every computer is guaranteed to have two block devices. For servers it's a lot easier, but the laptop case, the developer use case, is what our default installation is optimized for right now. So if you just do yum install docker, service start docker, whatever, it will do this loopback thing, and it just won't work at production level.

So are you using ephemeral storage on EC2? Well, no, it's actually an EBS device. We have ephemeral storage for the OS, but we don't use it for Docker; we have a second device. That's another part of our AMI: every AMI, like our team's AMI, has a second disk wired into it, and that's just for Docker. So that's what we're using there. We have really great documentation, believe it or not, on how to switch from loop to this thin LVM setup, and I definitely recommend you look at that.

In terms of tuning, it's the same as bare metal storage tuning: you've got to worry about the I/O scheduler, you've got to worry about writeback. The trick with these containers is that the kernel really has no clue what a container is, because they were bolted on. There are many critical places in the kernel that really need to be taught about namespaces and aren't, and one of them is all these sysctls. The memory management sysctls and some of the networking sysctls are not namespace-aware, and that becomes a problem. Those are things we're trying to improve, but it's really difficult to retrofit the Linux kernel to do what we're asking containers to do at this point. So anyway. Yeah, this is the docker-storage-setup screen.
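If you have that spare block device, the switch away from loopback looks roughly like this; the device name is an example, and you should check the docker-storage-setup documentation for your release before running it:

    # Point docker-storage-setup at a spare block device so Docker gets
    # thin-provisioned LVM on a real device instead of a loopback sparse file.
    # Do this on a fresh install, or after cleaning out /var/lib/docker.
    cat > /etc/sysconfig/docker-storage-setup <<'EOF'
    DEVS=/dev/xvdb        # example: the second EBS/virtio disk baked into the AMI
    VG=docker-vg
    EOF
    docker-storage-setup
    systemctl restart docker
    docker info | grep -A2 'Storage Driver'   # should show devicemapper on a thin pool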
What we thought was going to happen was that we'd move to a union file system, which doesn't have this requirement for a second device, and it's a lot faster. It's got a lot of upsides; the main one is memory efficiency. It actually allows you to share page cache between containers, which makes container density, memory usage, and startup time a lot more efficient. The problem with overlay is that it's not POSIX compliant and it breaks things randomly. And the really funny part is that the CoreOS guys decided to make it their default, and I don't think it's anywhere near ready; it only went upstream like six months ago. As a file system it was out of tree for a long time, and I think SUSE was actually shipping it for a while. We've got guys that work on it now, and we support it with 7.2, but it's not our default. So for 7.2, in the case of Atomic or OpenShift, you can use overlay with the understanding that there are like 15 caveats in the RHEL 7.2 tech notes around overlayfs. One of them is actually kind of funny: yum wouldn't even work in an overlay-backed container. I forget the exact bug, but it had to do with the RPM database being stale, something between the copy-ups and the way the union file system works. I don't have the details, but yum was broken. So Pavel, I forget how to pronounce his last name, wrote a little plug-in for yum that triggers a copy-up on the lower layer inside a container. So when you use the RHEL 7 base container and you use yum, you'll notice there's a new plug-in there called ovl. Our base containers have this overlay plug-in; it doesn't matter whether you're actually using overlay or not, it still gets loaded. That's to work around a non-POSIX-compliance issue in overlay.

Okay, networking. If you weren't at Dan's talk, I'll go through the basics. Kubernetes has no network setup by default. Who here has set up Kubernetes upstream? One guy. If you set it up and install it, you can't do multi-host anything, because there's no network setup. They recommend you use Flannel, which is fine, except it doesn't provide any multi-tenancy guarantees. If you're using Flannel, containers from different customers can actually talk to each other, and of course that's not going to fly. So we had to build something of our own. Flannel uses VXLAN, and VXLAN is a way to encapsulate packets that basically allows overlapping networks as well; I'm not doing a good job of explaining it. But the point is that we now have to carry our own SDN plugin, which we're trying to get away from, because Flannel doesn't support Open vSwitch yet. The way we've implemented multi-tenancy is with OpenFlow rules in Open vSwitch and what are called virtual network identifiers: every tenant inside OpenShift gets their own VNID, the packets are tagged with it, and there are OpenFlow rules that enforce the segregation of containers.

Now, from a performance standpoint, Dan mentioned VXLAN offload NICs; I don't know if we'll ever see them widely available. VXLAN traffic, as I mentioned, takes a packet and wraps it in an outer UDP packet. There are a couple of problems with that. One, it takes additional CPU time on the sender side, and the receiver has to unpack it and then hand it up to user space. All of those extra CPU cycles add up, particularly when you're using fast links. So right now, even with high-end CPUs, we can only do about 5 gigabits over a VXLAN tunnel. You mean on the same host? No, different hosts, across a 10 gig link. Like different hardware hosts? Yeah. And do you run it on Amazon? Yeah, we run it on Amazon. What's that? Also with 10 gigabit? Yeah, we did it with SR-IOV on Amazon, because those instances are 10 gigabit; I don't know what they run underneath. Performance analysis in the public cloud is a mess. It's very difficult.
That's one of the reasons we use pbench: it'll run five samples, and if one of them is wildly different, it gets thrown out and run again. Though in reality that wildly different one was a valid metric; it was just so far out of bounds because someone else came in and took your CPU time, or whatever. It manifests itself in all kinds of weird ways. There are ways around it. Amazon is almost a pay-to-play game at this point; public cloud in general is. They entice you in, like a drug dealer, with really cheap compute, but when you really start running production loads on it, you realize the cheap stuff is not going to work. So what we've had to do is find ways to cheat. Not necessarily cheat, but to be as efficient as possible, and part of that has involved software changes inside Kubernetes.

This one sucked. We had a 100-node Kubernetes cluster on EC2, and it turns out that every 10 seconds every node was checking in with the master and saying, here's everything I'm doing right now. Which we want to happen, we need the nodes to talk to the master, but we have to stagger it, and we have to compress it. So there were a lot of changes that went in here, and the before is this blue line. To run a 100-node cluster with Kubernetes six months ago, you had to have 16 CPUs; now you need one. There were a lot of efficiency changes that went into that, and this was one of them. And to give you a sense of how much we've improved: this is CPU at max, 95%, blah, blah, blah; all I'm trying to convey is the relative difference. This is OpenShift a year ago, this is our new release, and this is upstream Kubernetes. Sorry, this one is internal, but it will be in there before we release our next version. And a huge amount of Red Hat work went into making this happen.

So that's CPU efficiency. We used to take, and this is the number of cores, four cores in the idle state, four CPUs just burning. And it was actually something called cAdvisor, which is a piece of software that Google wrote to get per-container metrics, and it was crushing the CPU by polling sysfs at too frequent an interval. One of the things we learned from Elko's Golang profiling was that cAdvisor was to blame for this. So we've done some tuning in cAdvisor. It's actually statically compiled inside Kubernetes right now, so there's only so much we can do, but it's going to get factored out into a container at some point so we can at least put it on a different node. So that's some of the stuff we've been working on in Kubernetes.

So I guess I don't really need to cover too much more of this. Jumbo frames are still important for storage traffic. One of the things we're doing with Kubernetes is creating a way to have different traffic on different network interfaces: storage traffic over one interface, VXLAN traffic over another, and then a management network, just like a real cluster. Quite frankly, it hasn't been that way; the software is new, so we're still building all these features. At the end of the slides are a couple of blogs that we wrote about kernel bypass in containers using Solarflare OpenOnload and Intel DPDK.
I think at the time it was sort of, let's try everything we possibly can in a container, and this is one of those things where we take a piece of hardware, dedicate it to the container, and bypass the kernel entirely from a networking standpoint for performance reasons. Okay, I'm going to skip this one. Oh, actually, we can't do this in a public cloud. This is kind of the problem: public cloud is the lowest common denominator. They don't support it. We have a hard time prioritizing it on the OpenShift side because folks aren't running it on bare metal; there are going to be a few that do, but not everybody. And if it's not available on a public cloud, we have a tough time justifying developing it. And VXLAN offload, since it requires specific hardware and it's kind of an environment-specific thing, Amazon has no incentive to do it, and neither does Google. So what we actually need is a way to use the existing offloads and make VXLAN offload NIC-agnostic, so it'll work on every NIC. And that's just a ton of work. It's a known problem amongst every company that uses VXLAN, and there's a lot of work going on around it, but there's no immediate solution. The performance difference is pretty dramatic, though: 20 gigabits a second, and then if you enable the offload, you get line rate. Well, not line rate, but like 36 gigabits a second on a 40 gig NIC, just with the hardware offloads.

Sorry, how did you activate the offload? I thought the standard kernel doesn't support offloading to network cards. You're thinking of TCP offload engines, maybe, and that's not supported, correct. This is different. This is like TSO or GRO; there are offloads available in the network cards right now, and this is just another one of those. It's supported in RHEL 7 and enabled by default, so if you have these NICs, it will just magically work; you don't have to do anything else. From a latency standpoint, there's really no difference.

Okay, then we did it with a shitload of pods. This is like a matrix test: we've got 96 pods all talking to each other at the same time, and what do we see? The reason we wanted to do this was to find out what the optimal number of pods to run is. You can see here, and this goes back to my point earlier about none of this stuff being new, it's the same concepts over and over again: if you can see the colors and the top performing line, guess how many cores were on this system? The green line is the most performant. There were 24 cores on the system, and that's the most performant configuration, which means everything beyond the green is over-committed. When you over-commit, you pay a price for it in terms of performance. But in this case it didn't drop off that dramatically, so I might actually tell people, yeah, you can over-commit safely a little bit: make more money on the same gear, or don't buy new servers or spin up new instances. That's kind of a high number of megabits, isn't it? No, it's actually additive. It's that many megabits across however many pods are running, so you'd have to divide by the number of pods to get the single flow rate. It's actually a decent over-commit return; it's not that bad. Yeah, 4x over-commit and you only lose, like, another 2%.
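If you want to check whether your NIC exposes that VXLAN offload, ethtool is the place to look; something like this, with the caveat that feature names vary a bit by driver:

    # List the offloads the NIC/driver advertises; the UDP tunnel segmentation
    # one is the piece that matters for VXLAN encapsulation.
    ethtool -k eth0 | egrep 'tx-udp_tnl-segmentation|tcp-segmentation-offload|generic-receive-offload'

    # It's normally on by default when the hardware supports it, but it can be
    # toggled explicitly:
    ethtool -K eth0 tx-udp_tnl-segmentation on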
Okay, so then we pushed it further. We went to 192, and the machines crashed. Seriously. So that's something to worry about too. Have you ever pushed a machine to the point where it says "dazed and confused"? There actually is an error message that says that. So that's all that happened. And if you want to rerun this, by the way, we're trying to open-source all our tests right now. They're not open yet, but we will eventually open them up, and you'll be able to duplicate this stuff; if you want to crash your machine, it's pretty easy. Then we did the latency one. I mean, I don't know how many more graphs you want to look at.

So, back to the network segregation thing. While we were running those network tests, there was so much data flooding over those links that the cluster heartbeat, those check-ins, that blue line, where the nodes check in with the master to say, hey, I'm okay, I'm still here, got starved. If the master doesn't hear from the nodes for a while, it pulls them out of the cluster, and when it pulls them out, all those pods go away. Which is what it's supposed to do. So what we need is a way to prioritize the cluster traffic over the rest of the traffic. There are the traditional ways of doing that with QoS on the switch, we can do it with QoS in Linux, or we can dedicate links; it's actually easier to dedicate links than anything else, especially when you have an API to add a network card like you do on any public cloud. We want five NICs in this machine? Done. They're virtual NICs, but they allow for segregation, and they let us keep the cluster heartbeat network separate from the VXLAN network, where we can't control what the customers are going to do. They could do a docker pull: docker pull centos/tools is going to pull a gig over the network, and if enough customers do that at the same time, the link is totally chewed up, and at that point, if you're unlucky, those heartbeats could get interrupted. There's also a way to extend the timeout, and actually that's what we ended up doing for now. I think it's 40 seconds now, so if the master doesn't hear from a node for 40 seconds, it takes it out, and it has to fail three times; there's some weird algorithm. So we basically tripled it, and that doesn't really solve the problem; the tests were just short enough to fit in that window. It's not a fix, it's a workaround.

Sorry, the cluster, that means a bunch of management nodes? Yeah. And there will be a cluster suite or something? Correct, but not Red Hat Cluster Suite, although we do have a place for Red Hat Cluster Suite in the stack. I'm going to show you a network diagram at the end. Sorry, a question about the congestion: was the problem congestion in the network, or CPU processing power, or something in Docker? Well, the network utilization meant the CPUs were busy unpacking it all; the systems were pushed to the breaking point, and at the breaking point... Was a real-time scheduler or something like that used in that case? We didn't use real-time, if that's what you're asking. Yeah, maybe it's not even possible. We can't do Docker containers with real-time yet. There's another blog entry; if you Google "docker real-time", you'll see it. Right now, they have no interest in supporting it.
And because of that, if you sat through Dan Walsh's talk earlier about the issues between systemd and Docker, we're kind of stuck behind that for real-time processes in containers. We need native systemd support in order to do real-time in containers, and until we have that, we're kind of screwed. So, is there another question? Okay, anyway.

There weren't that many people that had actually used EC2 or a public cloud, right? Hands up again if you've used a public cloud for any reason. Did you guys notice anything about them? Was it convenient? Was it fast? Was it cheap? Expensive. Yeah. We noticed that the performance varies significantly, so you have to hedge. We also noticed that you can buy dedicated stuff, which means you don't have this variability. So, do you dedicate the hosts on Amazon? No. The fun part about dedicated hosts is that they're actually not metal; they're dedicated hypervisors. You still get VMs, but they're dedicated to you, so you get rid of the noisy neighbor problem where your tasks get unscheduled. But couldn't you just schedule one VM across the whole machine? Yeah, you can do that. So when people say dedicated, I guess the important distinction is that it's still not bare metal. Have you tried it on Rackspace? I haven't. And also the other one, the cloud that IBM bought; they have a bare metal variant too. So yeah, there are places where we could start purchasing time, but we're not going to learn anything we don't already know, because we already have our own bare metal, our own gear. Yeah, but the OCP stuff is pretty interesting. We did a lot of tests with Corvus, just to determine the consolidation ratios and things like that. Right. So there's overhead on the server side in a lot of cases that you just don't pay with bare metal. It's got its own problems, though. We can't live migrate containers right now, so it's kind of... Someone demoed it, the guy from Parallels; I don't know if it's tomorrow, or was it today? He's got a way to live migrate containers. I assume it's memory and CPU state and not live migrating storage, but anyway.

So, there are some other gotchas. If anyone's used EBS, there's a thing called a quota. If you just flood it, if you just do a lot of I/O, eventually they're going to put the brakes on, and you're going to be down at, like, 40 kilobits a second. We ran into this, Elko did, these are his graphs, where he was running our pbench-fio tool, and it ends up doing something like 30 terabytes of I/O in all of its sweeps. Around 18 terabytes in, we started noticing this problem, and the OpenShift people also brought it to us separately: Docker just died, because it wasn't hardened to handle that. It was expecting the I/O to come back soon, and the I/O never came back because they'd hit their quota. So there are a couple of ways around it. One is to not run pbench-fio, and the other is to buy faster storage. This is another pay-to-play thing: you can buy provisioned IOPS, which has a lower top end, but you won't have that drop-off. Incidentally, this is Amazon's CloudWatch; they have their own thing that lets you monitor your stuff. This is actually queue depth. Okay, so maybe this will make it more clear. If we run pbench-fio once, we're good: 3,000 IOPS, which is what this volume is currently rated at.
If we run it again, we're down at, like, 100 IOPS. That's because they've put the brakes on, and the system is at 100% I/O wait; if you look at iostat, 100%. The disks aren't returning; they're not telling the kernel, okay, I've taken control of these bytes. And then the third one here is with a provisioned-IOPS volume, which costs a lot more than standard EBS, but it's guaranteed throughput. So we did a couple of things: we improved Docker to handle the random storage pauses, and we're also actually going to buy the more expensive disks. So the advantage is to use a big machine instead of these t2.micros? The t2.micros are just for test, just for scale; for real work you'd buy one of the big instance types. I'll show you. I'll just go right to it. Okay, they did the same thing on CPU. Storage, yes, storage performance. You can see here that EBS is significantly slower than local SSD. This is one of those things where, if you run on EC2 and you do enough work on EC2 like we've been doing, and then you go back to bare metal, you realize how slow EC2 really is. And it's not just EC2; I'm using them as an example, it's public cloud in general. You've got this shared environment that scales horizontally really well, and there's no capex involved. It's all opex, so the finance guys are generally happier. But then your opex goes up significantly, because the cheap tier doesn't suffice for you, so you have to pay. And every cloud plays the same game.

So, I've got a couple more slides, but I want to show you what a highly available OpenShift environment looks like. You had mentioned Red Hat Cluster Suite: this uses Pacemaker to manage failover on the masters, and we've got the Amazon load balancers in front of those. The registries talk to S3. We've got routers with software load balancers in front of them. And this is all on public cloud. Then we've got two etcd clusters. One thing we added to Kubernetes in the last couple of days: if you're familiar with database scaling, it often comes with sharding, and Kubernetes had no concept of sharding for etcd, so we added support for that. Now we can put the high-volume keys on a dedicated cluster and the lower-volume keys on another cluster, so we get those heavy hitters away, and we'll pay for some high-end gear for those high-traffic keys. We're hitting a lot of scalability problems there, because what it actually is is the events: Kubernetes logs a shitload of events, and if you've got a lot of containers, those events all multiply and just get shoved into etcd. So we did a couple of things. One, we did the sharding, but two, we also toned down the number of events that Kubernetes logs, to help it scale out. And then there's the number of nodes; this would be up to a couple of thousand, and they're all talking to EBS. You had asked about the type of machine: this is what we're currently building for a very large environment, a thousand-node cluster, and those machines aren't cheap. These masters and etcd nodes have to be really powerful, and etcd uses a shitload of memory, so you need high-end gear, which makes it not cheap. And then there's the network segregation over here: we're going to put the management stuff on a separate network from the VXLAN.
So this is the pod communication, the stuff Dan Winship talked about, and we're also most likely going to put in a dedicated link between the masters and etcd. The traffic flow is nodes to the masters, and then the masters go ahead and talk to etcd. And we can't buy machines large enough on Amazon to let us co-locate etcd and the masters, because I believe they only sell 64 gig machines at this point. They might be offering 128 gig soon, and at that point we'll probably have outgrown that anyway, so basically we have to keep them separate. So that's how we're going to scale OpenShift out to 1,000 nodes.

So, a couple more minutes. In terms of out-of-the-box scaling limits, it's currently 40 pods per node, something like that, and 255 nodes is the default limit. You can tweak all of those. I mentioned this earlier: I call it a limit, but really you're making a choice between stability and speed. So, you mentioned that with the overlay file system it's possible to share page cache, but that's not the case for the devicemapper system? Correct. And a question: if the nodes are virtualized, does it matter whether you run fewer VMs or more VMs? That's not going to matter. From a Kubernetes perspective, the number of pods is all that matters, or the number of objects, let's say: pods, replication controllers, services, and so on. Those are the things the master has to worry about. So at a certain point it matters, but 40 is conservative. Remember I showed you that CPU graph? The 40 number was because it used to be terribly inefficient. 40 is not something we have to worry about anymore, but we haven't yet shipped all the fixes; once we ship them, that 40, and I'm not in charge of that, could potentially go up.

So, we've basically taken every part of this environment, deconstructed it, found the limits for each individual part, and found best practices for tuning each individual part, and this screen talks about etcd. We know it needs extremely fast storage. We know that it needs a ton of memory, because it keeps somewhere between two and three copies of the database in memory at any given time. We know that swap is going to crush it, and we know that the master and etcd are constantly talking to each other. That's another thing we're doing: we're trying to make that communication path more efficient. Right now it's way too chatty, and it scales with how many nodes you have; more nodes, more traffic. So we're reducing a lot of the chatter between the master and etcd, and finding out if we can add some jitter to the callbacks. If every machine's clock is in sync, then every minute all the machines check in at the same time. We want that to be spread out slightly, so we're going to add a one-to-two-second fuzz on both sides of that so there's no more thundering herd problem, where the master all of a sudden goes from 5% CPU to 100%, works through the backlog of everything that just got thrown at it, and drops back down to 5%. We want it to be steadier, and part of that is reducing the amount of chatter.

So, I mentioned the GitHub repo. In there is a subdirectory called SVT, and in there is a subdirectory called content. If you haven't cloned it yet, you can go ahead now.
Inside that content folder are a couple of examples: pod manifests, replication controller manifests, quota manifests. I hope you like writing JSON. You can use those to work through the labs that are on the GitHub repo as well. Inside one of those manifests: there are really basic ones, but some of the additional things we can do are make it a privileged container, increase some of the capabilities it gets from the kernel (these are kernel capabilities, so the container then has SYS_ADMIN), run custom scripts, and mount volumes, all in a pod manifest. There's an ever-growing list of capabilities that these manifests can have. You can also say, land this only on node 5. Or you can label your nodes and say, this node has a very special SAN attached to it, and I only want my pods to run on that server because that SAN is super fast. So these particular pods, let's say they're my MySQL pods and they need super fast storage, only ever land on the nodes that are physically wired to that SAN. You can do that in the pod manifest as well. The concept I just mentioned is called a node selector: you can land a pod on certain nodes, and you either give it a hostname or you give it a label, and the labels can be anything. In OpenShift's parlance we've got regions by default anyway, so you've got the primary region, but they're arbitrary; it's a key-value pair, so it could be anything you want. You can say, I want machines that are in row 5, rack 2, and only those machines get this pod. However you want to carve it up, that's how you do it.

We've also got scheduler fairness and pod resources. Remember the quota JSON I showed you? It says you can use 10 pods or a gig of memory; if the machine doesn't have a gig of memory free, then it will not schedule that pod on that node. Same thing with the rest of these, like the ports; those are ephemeral ports that might be available. This one is extremely complicated, but the point is that you can have persistent volumes attached to pods, and we support a mode with multiple readers, but we don't support multiple writers, because we don't support cluster file systems just yet. We support things like Gluster and Ceph, but not multi-writer mounts, and that setting enforces it; eventually we'll be able to remove it when we figure that all out. And actually, this is a label region for services, so you can schedule a service onto particular pods. A service is a virtual IP with an HAProxy associated with it, and that is what the external world talks to. So a DNS name resolves to a route, that route talks to a service, and the service is actually an HAProxy that talks to however many pods are backing it. And we've got some tunables there: we can put the service in a particular region, let's say it's the west data center, and then we can do some weighting about where those pods might be scheduled. So this, again, is the list as of three or four weeks ago, and I'm sure they've added stuff since. This is all available as JSON, so on your VMs, if you cat that file, you'll see it's actually JSON; I kind of cleaned it up here to make it more readable.
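Just to illustrate the node selector idea, a stripped-down pod manifest might look something like this; the image, label, and node name are made-up examples, not the files shipped in the repo:

    # Hypothetical pod manifest pinning a pod to nodes labeled with fast storage.
    cat > fast-storage-pod.json <<'EOF'
    {
      "apiVersion": "v1",
      "kind": "Pod",
      "metadata": { "name": "mysql-fast", "labels": { "app": "mysql" } },
      "spec": {
        "nodeSelector": { "storage": "san-fast" },
        "containers": [
          {
            "name": "mysql",
            "image": "centos/mysql-56-centos7",
            "ports": [ { "containerPort": 3306 } ]
          }
        ]
      }
    }
    EOF
    oc label node node5.example.com storage=san-fast   # tag the node wired to the SAN
    oc create -f fast-storage-pod.json

The scheduler will only place that pod on nodes carrying the matching label.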
You mentioned Kubernetes supports persistent volumes over iSCSI, NFS, Fibre Channel, and that you can't use multiple write mounts, just mounting from one pod; so it's not possible to write from two pods to the same Gluster volume? Gluster itself can do it, but Kube can't do it yet. Eventually we'll fix that.

Profiling of Go: who here is familiar with perf, the tool? Go was written by Google and has its own profiling utility called pprof, so perf doesn't work all that well with it, and it certainly doesn't have all the features the native Go profiler has. Enabling that is part of what Pbench automates for you, and it's part of our job to look at these profiles and identify what the problem is. If you took Kubernetes from six or eight months ago, scaled it out, and looked at the pprof data files, you'd see things like TLS handshakes at the top of the profile, or things like cAdvisor. Those are all problems we've resolved, so there are new problems at this point, or at least they're orders of magnitude smaller. One example is the issue I showed you earlier with the blue spikes; if you want to read the details, that's the issue where we fixed it.

Okay, so that's about it. On CPU efficiency we're doing a lot better: we're going to be well under one core to manage 120 pods, which is a significant improvement; it used to take three or four cores to do the same work. And memory usage has shrunk by a factor of about two and a half. So a huge amount of improvement from an efficiency standpoint. That's all I have for the slides.

So if you've got the svt repo pulled up, you'll see this cluster-loader thing. Manchu, in the green shirt over there, wrote that, and a lot of our guys are contributing to it at this point. What it does is take one of these YAML files, which looks like that, and deploy an OpenShift environment that looks like that. So you can say: I want two users (in our case thousands, but in this demo two), three services, two replication controllers (I haven't even gone into those yet), each with five replicas, ten pods, 40% of them hello-openshift and the other 60% Python. Some other stuff at the bottom: we had to insert some delays, because there are races in the kernel that we had to fix, so we work around them by inserting delays between container operations, and we also pause every once in a while to gather some state information using Pbench.
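For reference, a config in that spirit might look roughly like the sketch below. The field names here are illustrative, not copied from the repo; the real example configs live under svt in the GitHub repo and their schema differs.

```yaml
# Illustrative cluster-loader-style config; see the svt repo for the real examples.
projects:
  - num: 2                      # projects ("users") to create
    basename: svt-demo
    services: 3                 # services per project
    rcs:
      - num: 2                  # replication controllers per project
        replicas: 5             # pods behind each controller
    pods:
      - image: openshift/hello-openshift    # roughly 40% of the pods
        percent: 40
      - image: centos/python-34-centos7     # roughly 60%; illustrative image name
        percent: 60
tuning:
  stepsize: 5                   # create pods in batches
  pause: 10s                    # pause between batches: works around kernel races
                                # and gives Pbench time to gather state
```

The actual schema has evolved over time, so check the repo's examples before reusing any of these keys.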
To run it, I created a slightly smaller config, and I think it only takes a couple of seconds. So I'm going to run this and quickly switch to another window, because I want to show you what Kubernetes sees on the other side. Let me prep that; it's a little bit smaller, and I don't really have the right resolution here. This is oc get nodes. I've also got pods up: I'm asking the master how many pods are running, -o wide tells me which node they're actually running on, and it's doing a watch, so it doesn't actually loop; it's event-driven instead of polling, which is much fairer on the master in terms of resources. So if I run this guy, it's going to create a new project, create three services, create a user, create five pods, and what it's doing right now is waiting, because I told it to wait until all five pods are running. We've now got five pods running, and after a couple of seconds we'll start seeing more stream out here. There we go: it's scheduled another, I don't know, 15 or 20 pods. They show as pending first, and now they're saying running. This is actually just a three-machine cluster, and there are only two nodes. If I go back to the other screen, it's actually still going. So that's what this software does.

And now what you can do is say: what do my developers need, or what do I expect to host? You can model your test environment and figure out how much budget you need, how many EC2 instances you might need, how many physical servers, what kind of storage, all because of this workload generator, sorry, the cluster loader. The point is that you express what you want in YAML and it goes ahead and does it for you. We've chosen the hello-openshift pod because it's fast for the demo, but it could be any pod, the CentOS pod, any pod you want, and any mix of them as well; like I told you, we ran a mix of hello-openshift and Python.

What you're seeing now toward the bottom here are replication controllers, called testrc0 and testrc1, and you can see they're all running. So I think I've got everything running now; let's see, the script exited, and now we go ahead and run whatever tests we want, whether it's network, storage, whatever. We've got an environment with pods in it; we can run JMeter tests against it, whatever. The other thing I wanted to show you: if you do describe and then the resource name, in this case testrc0, you can see a lot of detail about this replication controller. A replication controller is the horizontally scalable unit in Kubernetes; it says: I want however many pods. If you create a pod by itself, it's stuck at one. If you create a replication controller with one pod in it, you can then scale it horizontally as far as you want; in fact, this is how Kubernetes does horizontal auto-scaling. It'll say: I've got too much load on this one pod, start up another pod in the replication controller. That's how that works. And everything has a describe like this; you can describe every resource. I typed oc get nodes, which tells you the hostname, but more importantly this region over here. These are arbitrary tags; I could name one rack5 and only land pods on the node that's in rack5, something like that.

If I go to the content directory, in here are all those examples. The pod default looks like this: it uses a hello-openshift container, it opens one port, and it runs a Go binary that just sits there. A replication controller, however, says: I want to run this image, the same thing as the plain pod did, but it also has this extra field called replicas, which is however many pods should be running. Let's see: you saw I created a replication controller with five pods in it. I just reduced the number of pods in that replication controller from 5 to 1, and you can see I've got one pod running and one desired. But let's say it's Valentine's Day and I sell flowers: I want to be able to scale out horizontally, on demand, fast, as far as I want. In this case let's make it, I don't know, 20. Yep, now you can see the number is updated to 20, and if I type it again it'll probably have started all 20. Nope, only 10 are running, but after a couple of seconds you'll have 20 of these pods running. Each one of those pods represents a horizontally scalable piece of your application.
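For reference, a minimal replication controller of the kind being scaled here would look roughly like this in YAML (the repo's copies are JSON, and names like testrc0 are just the demo's):

```yaml
# Minimal replication controller sketch for the hello-openshift image.
apiVersion: v1
kind: ReplicationController
metadata:
  name: testrc0
spec:
  replicas: 5                   # desired pod count; scaling just changes this
  selector:
    name: hello-openshift
  template:
    metadata:
      labels:
        name: hello-openshift   # must match the selector above
    spec:
      containers:
      - name: hello-openshift
        image: openshift/hello-openshift
        ports:
        - containerPort: 8080   # the Go binary listens here
```

Scaling it to 20, as in the demo, is then just a matter of changing replicas, for example with oc scale rc testrc0 --replicas=20.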
So that's how easy horizontal scaling is in Kubernetes once all 20 are running. I'm scaling it manually here, but there are heuristics in Kubernetes that can make this all happen dynamically for you, and the fact that you can start and stop these things within a couple of seconds makes it really dynamic. We're going to get to the point where we can automatically bring new nodes online as well, so not only new pods when you need resources but new nodes too, so you basically have this cloudy thing that expands and contracts as your workload does, and you never really pay more than you need to for those resources.

And if I go up to, I don't know, 100... I probably won't wait for all of them. The last bit I wanted to share is some of the data we have internally about node scalability. You can see how much CPU it used to use when creating these pods, all the way down toward the bottom here; the y-axis changed, so it's hard to see, but the point is that it's at least a factor of four more efficient than it used to be to run the same amount of workload. Let's see, do we have 100 running yet? 46. It's getting there, I promise. So I think that's it; it's almost six, right? Questions?

Sorry if you mentioned it already: is there any difference between running a privileged and a non-privileged container from a performance point of view? No. Sometimes we have to run privileged just to get some apps to work in containers at all. An application will expect a PID 1, and when that happens we have to run it privileged because we have to run systemd in the container. We could technically run supervisord as well; are you familiar with supervisord? It's a PID 1 replacement sort of thing. I guess they support it upstream, but we don't actually ship or support it.

I'm interested in how OpenShift compares with Amazon's own service from a performance point of view. Didn't you try to test that? We've tried a lot of stuff, but that's not enough, and I'm not going to give you any competitive numbers. There's even a driver, a provider, for Kubernetes to use ECS. So that's it, guys; you've already cloned the repo, so you have everything. Thank you. If you still have one of these, I need it back. Thank you very much. You work with Kubernetes? Yeah, you use Kubernetes?
No, I mean it's working really well. The only caveats are that if you have really heavy workloads, then just like with bare metal, you have to buy bigger machines. So we run on Amazon... I think we're not at that point yet. That's where we could try to save some money, but for the test environments we don't know what we're going to need. That's why we work to optimize the installer, and work to optimize basically how fast we can run these tests without losing any of the data, to keep the costs down. We can't go to reserved instances, because with reserved instances you have to pay up front, which means going through finance. Maybe you could share those reserved instances with your production side and just use them when there are no customers on them? We actually set up a separate account specifically for us; I don't want it mixed with anything else, we're in a separate account.