Hello everyone, thank you for coming to our talk. My name is Ton, and I'm from the IBM Silicon Valley Lab in San Jose, California. We have our colleague here, Winnie, also from IBM, and Ricardo Rocha and Spyros Trigazis from CERN. So we know that Magnum can deploy the infrastructure for containers on OpenStack, so that's not new. But before a production team is willing to adopt Magnum and run it in production, they want to know how things scale when there are a lot of users on the system, and they want to know how robust the system is when there are errors. So a few months ago we embarked on this study to understand how Magnum and containers behave at large scale. Luckily, we had access to two environments to do this, one at CERN and one at the CNCF lab. In this talk we would like to share what we have found so far. Now, a scalability study is ongoing work, so this is not a complete study by any means; we just want to share what we have found so far.

So this is the outline for our talk. I will begin with a quick review of Magnum and the features in the Newton release, and the kinds of benchmarks we want to look at. Winnie will then go into the details of the benchmarks. Ricardo will talk about the results we found at CERN, Spyros will show you the results from the CNCF lab cloud, and then we have the conclusion.

Now, we didn't do all this by ourselves; we got a lot of help. We want to acknowledge the CERN cloud team, who gave us a lot of help at CERN, and we want to thank the CNCF lab for providing the hardware for this study. We want to thank the IBM team, Dirk Davis and Simon, who gave us a lot of help with networking, and the Rackspace team, Adrian Otto, Chris Houghton and Drago Rosen, who gave a lot of help running OpenStack-Ansible to build the whole environment from scratch at CNCF. And of course, a lot of thanks to the Magnum team for a lot of great development in this cycle.

So, about Magnum. The mission of Magnum is to manage the infrastructure in OpenStack to host your containers. What that means is that it creates all the OpenStack resources needed to build your cluster, the VMs, the bare metal nodes, the networking, the storage and so on, and puts them all together to give you a fully functional cluster. The value that Magnum gives you is the deep integration with all the OpenStack services, and also the lifecycle operations to manage your cluster. One key point here is that Magnum does not create a new API for containers. Instead, the user just uses the native API that comes with the particular platform they choose. Currently, Magnum supports Kubernetes, Swarm and Mesos.

In this chart I show some of the Newton release features. This is not complete; for details you can check the release notes for Magnum. I just want to mention some of the key ones. We refactored the code that is specific to each container platform into drivers, so they are easier to manage. We no longer talk about bays, we talk about clusters. There has been a lot of work on documentation: we have a user guide and an installation guide. There is a lot of new support for bare metal, storage and networking, and we have a new driver for openSUSE. In the internal engine there are a lot of improvements as well: we now have asynchronous cluster operations, the option to use the database for storing the certificates, notifications, rollback and so on. For the Ocata release, the team is looking at a couple of features. We want to support heterogeneous clusters.
We want to have lifecycle operations to upgrade clusters. We will support advanced networking for containers, libnetwork for Docker and CNI for Kubernetes. And we will have additional drivers as well, such as DC/OS, and more support for bare metal.

So when we talk about scalability, you have probably heard about companies that run OpenStack on hundreds of thousands of cores, so you may think, okay, this is a solved problem. But the scalability we look at here is specific to Magnum and containers, so it's a bit different. We want to look at three different aspects. First, when Magnum builds the infrastructure for your containers, there are a lot of things happening at the level of the OpenStack services. Second, once the infrastructure is up and running and you start deploying containers to your container platform, there are things happening at the container platform level. And third, once your containers are up and running your app, there are things happening at the app level. We want to look at all three levels. With that, let me pass it on to Winnie Tsang, who will take you through the benchmarks we built for this study. Winnie.

Hi, thanks Ton. So Rally is an OpenStack benchmarking tool. It already supports many OpenStack projects, and if the project you are working on is not supported yet, you can easily extend it by adding a plug-in to Rally. That's exactly what we did for Magnum: we added a Rally plug-in for Magnum so Rally can call the Magnum API. A Rally plug-in mainly consists of two pieces, the context and the scenario. The context is where Rally creates all the necessary resources that need to exist before you can run your scenario, and the scenario is the real benchmark test. Rally is also recommended as a tool for production services, to verify that the service always behaves as expected. Another great feature Rally provides is exporting the results in HTML format. Here I have a sample of the HTML report you get after you run the test. It gives you a really nice view of your test run: you can easily tell how long it ran, how many failures you had and whether they happened at the beginning or at the end, and also how long each scenario iteration really took, whether it takes longer at the beginning or at the end of your test. It also gives you all the statistics up there, so it's a very nice summary of how your test run went.

So next let's talk more about our plug-in for Magnum. Mainly we have two types of scenarios. The ones at the top are at the infrastructure level and call the Magnum API; these mainly simulate a cloud administrator creating clusters on the cloud, and they help show the performance of creating clusters of different sizes on your cloud. The second type of scenario is at the container level. Those call the native container client for each cluster type, simulating an end user creating containers on different types of clusters.

Next I would like to show you some samples of Rally task files that you can use to call our scenarios. So the one on the... I forgot I can point. This one is to create and list clusters. In this example, it will create 10 clusters using two concurrent threads, and each cluster will have four nodes.
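As a rough illustration of what the scenario side of such a plug-in looks like, here is a minimal Python sketch. The decorator path, the base class, the self.clients("magnum") accessor, the context key and the waiting helper are all assumptions about how the Rally plug-in API is used rather than the actual plug-in code; the scenario assumes the context has already prepared a cluster template, which is what the context described next provides.

    from rally.task import scenario


    @scenario.configure(name="MagnumClusters.create_and_list_clusters")
    class CreateAndListClusters(scenario.Scenario):
        """Create a cluster from the template prepared by the context,
        wait until it is ready, then list all clusters in the tenant."""

        def run(self, node_count=4, **kwargs):
            # The context has already created the user, the tenant and a
            # cluster template; "cluster_template" is the key our context
            # plug-in is assumed to store it under.
            template_uuid = self.context["tenant"]["cluster_template"]

            # self.clients() is the usual Rally accessor for per-tenant
            # OpenStack clients; the "magnum" entry is an assumption here.
            magnum = self.clients("magnum")
            cluster = magnum.clusters.create(
                cluster_template_id=template_uuid,
                node_count=node_count,
                **kwargs)

            self._wait_for_cluster(cluster)   # hypothetical polling helper
            magnum.clusters.list()            # Rally times the whole run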
Before it runs the scenario, the first thing it does is everything inside the context. First it will create the users and the tenants, and then it will create the cluster template; your clusters are going to be built based on this cluster template. The clusters we build for this sample are Kubernetes clusters. Each will have a five-gigabyte Docker volume, all the master and slave nodes will use the Fedora image with the small flavor, and you can access the cluster through the public network of this cloud.

Now let's look at the other sample, which is the task file that calls the pod scenario. In this case, it will create 20 pods using two concurrent threads, and all the pods will use this file as the manifest. Just like before, it will first create the tenants, the users and the cluster template, and this time it will also create the cluster itself in the context. After it creates the cluster, it will also create the CA certificate. We need this certificate later in the scenario when we want to talk to the Kubernetes client, because by default your cluster is TLS-enabled.

Besides the Rally benchmark tests, we also ran the Google benchmark. For this test, the first thing we need to do is create a really large Kubernetes cluster; we need at least 800 CPUs in there. After that, we create a lot of nginx pods in it so it can serve a million HTTP requests per second, and then we compare our test results with the results Google published. Next I will pass it to Ricardo to talk more about the test results.

Cool, thanks. Okay, so I'll continue and build on what Ton and Winnie have explained already. I'll present the CERN cloud results, and I'll just introduce why we did this at CERN. We started looking at Magnum as our solution for providing container services a year ago or a bit more, and the reason we do this is that we have a big cloud and we have big needs. This is a summary of our cloud as it is today: we have just under 200,000 cores, a lot of projects and users, around 22,000 VMs running at any time, and over 7,000 hypervisors. You can see here that we have already been running Magnum as a pilot service for a few months at CERN, so we have a few clusters already created, and we are opening it to production at the end of this month.

The use cases at CERN, and the reason we need all this capacity, is that we have a big machine. That's the Large Hadron Collider in the picture down on the left. We accelerate protons that collide in machines like the one on the right, which are big particle physics detectors that try to track new physics. These collisions generate a lot of data; we store it and then we need to analyze it. So we have a lot of use cases for batch processing, which means going through the data and producing data that is more easily accessible by the physicists. Then we have end-user analysis using Jupyter notebooks. Traditionally this analysis is quite complicated to do, and containers are bringing a lot of ease of use to this domain. What people want is to visualize what happened inside the detector with pictures like the one we see here, which is actually a Higgs event, the particle we found a couple of years ago.
There are also a lot of people looking at machine learning, to try to do different types of physics analysis using things like TensorFlow and Keras, and at the infrastructure services: we are also thinking about what we can move to containers to simplify the deployments. We already use containers for a lot of things, including continuous integration and deployment.

So, a summary of the Magnum deployment. Our initial goals were to integrate containers easily in the CERN cloud. We already have a big OpenStack cloud, so containers are just one more thing, sharing identity, the networking integration, and storage access. We wanted people to be able to choose their engine: some people just use Swarm because it's compatible with the Docker API, so they plug it in and get a cluster for free; some people like Kubernetes; other people have tasks that run better on Mesos. We wanted to be able to do all of this, and Magnum fits these requirements. And then the two main things are that it has to be fast and easy to use. That's why, discussing at the last summit with the Magnum people, Ton and others, we decided to try this kind of exercise.

A bit of a timeline of the Magnum deployment: we started looking at containers in the last couple of years, we deployed a pilot around February this year, and we quickly got it into a state where we are just about to put it in production, thanks to the upstream developments. We also did a lot of integration of internal services with the container clusters, which I list here.

So how does it look now? If you came to the CERN cloud today, you could use Magnum. The way we do it, we use shared public templates. In Magnum you can describe how your cluster should look using templates, and we provide some predefined ones that people can just reuse; right now we provide Swarm, Kubernetes and Mesos, with HA versions. These are default configurations; as a user, you can create your own and deploy it as you wish. Then all users have to do is three steps. They have one command, cluster create, where they choose their template, in this case I'm choosing Swarm, and say I need a cluster with 100 nodes, which will be 100 VMs running containers. Then I do a cluster list while the cluster is being created; once it's done, which takes just a couple of minutes (and I'll show the numbers to prove this is true), I do a cluster config, which is the command that sets up your environment so you can use the native API. As Ton described, one of the big things is that if you're using Swarm, you can just use Docker; if you're using Kubernetes, you use kubectl. So with three simple commands, I get the Docker client talking to a 100-node cluster, and it's very, very easy to use.

For the benchmark setup at CERN: we often get new hardware in bunches, so we had the chance to run this exercise on a batch of new hypervisors we got just before putting them in production. In this case we used 240 hypervisors, each with 32 cores and 64 gigabytes of RAM, with 10-gig links between them. We used the default in Magnum, which is to store the container images in Cinder, and some of the results seem to show that it's a good idea to make this optional; Spyros will talk a bit more about that. Everything is deployed and configured with Puppet at CERN, and we just extended our production setup with a new cell, which runs exactly like the rest of the environment.
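As a concrete illustration of the three-step workflow described above, here is a minimal sketch using the python-magnumclient bindings rather than the CLI. It assumes a Keystone session is already available and that a public template named "swarm" exists; the exact keyword arguments, attribute names and polling interval are illustrative, not the exact CERN setup.

    import time

    from magnumclient.client import Client   # python-magnumclient bindings


    def create_swarm_cluster(keystone_session, name="mycluster", nodes=100):
        """Rough equivalent of 'cluster create' followed by 'cluster list'."""
        magnum = Client("1", session=keystone_session)  # Magnum API v1

        # Step 1, "cluster create": pick the shared "swarm" template by name
        # and ask for 100 nodes, which will be 100 VMs running containers.
        template = next(t for t in magnum.cluster_templates.list()
                        if t.name == "swarm")
        cluster = magnum.clusters.create(
            name=name,
            cluster_template_id=template.uuid,
            node_count=nodes)

        # Step 2, "cluster list": poll until Heat has finished the stack.
        while magnum.clusters.get(cluster.uuid).status == "CREATE_IN_PROGRESS":
            time.sleep(30)

        # Step 3, "cluster config", is the CLI command that writes out the TLS
        # certificates and environment variables so that the native docker or
        # kubectl client can talk to the new cluster.
        return magnum.clusters.get(cluster.uuid)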
The Magnum and Heat setup: at CERN we split the controllers, so Magnum and Heat both have dedicated controllers and RabbitMQ clusters, and we dropped the Neutron resource creation. The reason is that our networking setup and its integration with OpenStack don't allow things like floating IPs, ports and private networks to be created explicitly, so we had to do some tweaks in the Magnum setup.

Now, here are the first results. We ran this a couple of months ago. At the same time, Kubernetes had published their own results, so we wanted to compare. They had reached one million requests per second, so we wanted to do a bit more, and we managed two million. We used 200 two-core nodes for this, so 400 cores and 100 gigs of RAM. In the graph on the left you see the usual rate of VM creation at CERN, which is around 200 per hour; in this case we bumped it to 1,500, and our OpenStack infrastructure just kept going as usual. On the right you see the plot that comes from the test results. On top is the number of requests per second, and you can see it scaling in this animation as it grows up to two million requests per second. We did get some not-so-optimal results in terms of network latency, which is the lower part: our average request latency at two million was 40 milliseconds, which is pretty high, and the 99th percentile was very bad. So there was some work to be done here, but this was very encouraging: the service had been deployed only a couple of weeks before, and we already had very good results.

Some observations on the OpenStack side: the services coped pretty much as usual. We did see a bump of four times in the number of requests to Nova and eight times in Cinder. Keystone requests stayed the same, but we detected a small problem, which is that the way Magnum deploys using Heat underneath creates a lot of trust users. That triggered a bump in the size of the revocation tree in Keystone, which disturbed our memcached nodes, and we had higher latencies in Keystone for a while. We understand the problem and we already have a fix for it. On the right side you can see the plots from Nova, Cinder and Keystone, and it all looks pretty good.

Later, we decided to do a second run. Google republished their results with 10 million requests per second and bigger clusters, and at the same time we wanted to test not only how many requests per second we can do with clusters of 200 nodes, like in the first run, but how far we can scale; they had a cluster of 1,000 nodes. At the same time, Winnie was developing the Rally plugin for Magnum, so we decided to build on that and test the whole range. To get to this second run we did a lot of iterations; there were a lot of tweaks needed to be able to deploy clusters of this size. One of them was scaling the Magnum conductor. At the time, it implied deploying Barbican; right now you can actually use a database backend for the certificate storage. So there were some things to be changed there. An example here is one of the initial iterations we did where Neutron just exploded; we detected where the issue was, but we had several of these, and we learned a lot about our own cloud with this exercise.

So, the results of the second run. Again, on the right you have the plot as described before: on top you have the number of requests, and we got up to 7 million requests per second.
On the bottom you can see that we improved the average network latency quite a bit. It's still not ideal; there's still a big bump at the 99th percentile, so there's some work to be done there. On the left are the results from the Rally plugin tests. The first column is the cluster size; we tried clusters of 2, 16, 32, 128, 512 and 1,000 nodes. The middle column is the concurrency: for smaller clusters we can do a lot in parallel, to really exercise the load on the system. And on the right you have the average deployment time per cluster. The results are very good. For a cluster of two nodes, we can deploy 50 concurrently and get a working container cluster in 2.5 minutes. For 16 and 32 nodes, around 4 minutes with a concurrency of 10, so quite a lot of load on the system at the same time. And for 128 nodes with a concurrency of 5, we got just under 6 minutes. These results are really good. What we did observe is that with all this tuning we managed to deploy clusters of 512 and 1,000 nodes, but the deployment time goes up, and it seems to go up linearly. This was an issue we wanted to understand. It's still okay; if you really need a cluster of 1,000 nodes, waiting 20 minutes is not that bad, but there's clearly something to be fixed around here that we can work on. For this test we used a 1,000-node cluster, which had 4,000 cores and 8,000 gigs of RAM in total.

So, the tuning: what did we work on? A lot of tuning in Heat, especially timeouts when contacting RabbitMQ, and also raising what you can do with Heat: the defaults for the number of stacks per tenant are kind of low, so for this kind of scale we had to pump them up quite a bit. Large stacks sometimes take multiple retries to delete; this is an issue we are also looking at. Then in Magnum we had some minor issues with small bits in the daemons that we had already seen in other daemons, so they were kind of obvious to fix. Some RabbitMQ issues, and then the flannel network configuration: we couldn't use the default one for a large cluster because we didn't get enough subnets, so there is a labels configuration for flannel just to be able to scale it up. This is something you can set in the template; it's not a problem with the service itself. These are all things we are summarizing in a nice page to share with everyone. And then the Keystone revocation tree issue that I mentioned: the solution was actually to disable memcache, because the issue was with the way memcache was being used. It pushed up the average latency, but overall it paid off.

The last bit I have here is some more tuning and issues we saw. Cinder sometimes gets pretty slow at deleting volumes, and this triggers Heat timeouts while it's trying to delete, and then it gives up; we saw some issues with the Heat engine there. So making Cinder optional is a good option in general, not only because we saw these issues, but because in many cases you will just want to store the images locally if you have SSDs, and not in persistent storage. Then there is the Heat stack deployment time scaling linearly, as I mentioned, for larger stacks of more than 128 nodes. Just to give an idea, the way we use Heat today creates 70,000 records in the database when we deploy a 1,000-node cluster.
So clearly there are improvements to be made there, and we have a session with the Heat team to look at options. Then the flannel backend: we used VXLAN, and this was after we discovered that UDP wasn't giving very good results. With VXLAN we got better results, and we have set VXLAN as the default at CERN. So this is the summary of the CERN results, and I'll pass to Spyros, who will describe the results at CNCF.

Thanks. So a couple of months ago we were lucky to get a 100-node cluster from the CNCF lab, and we tried to repeat the same benchmarks that we did at CERN: the benchmarks with Rally, and the Kubernetes benchmark to achieve millions of requests per second. First I'll describe the setup at CNCF. We got 100 nodes with 24 cores and 128 gigabytes of RAM each, and 10-gig links between the nodes. We deployed with OpenStack-Ansible using the Newton release, so this is a pretty standard OpenStack deployment that you can get today with Newton; nothing special about it. The configuration is HAProxy in front of five controllers running all the services and RabbitMQ, a five-node RabbitMQ cluster, and three dedicated controllers for Neutron. All these controllers run in LXC containers, as is done by OpenStack-Ansible. We initially had Cinder available in the system using the LVM driver, but we later hit some problems unrelated to Magnum or Heat, so we disabled it to continue with the benchmark. Also note that Neutron is configured almost the same way as at CERN, using Linux bridge, and a difference between CERN and CNCF is that here we were using the Magnum database to store the certificates.

These are the results of the Kubernetes benchmark. We did two rounds of tests, one with a 35-node cluster and one with an 80-node cluster, using very large VMs with 24 cores and 120 gigabytes of RAM, so we were occupying almost all of the physical hosts. The first run had 840 cores in total, and the second 1,920 cores. In the first test, with the 35-node cluster, we used the UDP flannel backend, and in the second one we used host gateway, and we compared the results with those at CERN and the results that Google published. For the 35-node cluster we achieved 1 million requests per second, but with UDP configured the latency was very bad, 83 milliseconds. For the 80-node cluster, using host gateway, we achieved almost exactly the same performance as published by Google, both for the 99th percentile latency and for the average latency. Then we pushed a little more and achieved 3 million requests per second; the latency increased, but for this number of requests I think it can be considered reasonable.

For the Rally benchmarks, this is, as I said, ongoing work. We didn't have the chance to do exactly the same benchmarks as at CERN, because we need further tuning of RabbitMQ, which we identified as the bottleneck. But for two-node clusters, where we managed to do all the tests, we achieved pretty much the same times as at CERN. The difference from the previous table is that here we also show how many clusters we created in total: when we tried to create 100 clusters in a very short period of time, under 20 seconds, all the clusters succeeded within 3 minutes, and we had 10 clusters creating concurrently at all times. For 1,000 clusters, we managed to create 219, but after that RabbitMQ and Heat gave up because of the load.
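Both the flannel backend choice mentioned above (UDP versus VXLAN or host gateway) and the subnet sizing are passed to Magnum as labels on the cluster template. A minimal sketch with the python-magnumclient bindings might look like the following; the label names come from the Magnum documentation, while the CIDR, image and flavor values are just examples rather than the settings used at CERN or CNCF.

    from magnumclient.client import Client   # python-magnumclient bindings


    def create_k8s_template(keystone_session):
        magnum = Client("1", session=keystone_session)

        # The flannel_* labels control the overlay network. The defaults give
        # too few subnets for very large clusters, so widen the network and
        # pick a faster backend than the default UDP one.
        labels = {
            "flannel_network_cidr": "10.254.0.0/16",   # example value
            "flannel_network_subnetlen": "24",         # one /24 per node
            "flannel_backend": "vxlan",                # or "host-gw"
        }

        return magnum.cluster_templates.create(
            name="k8s-large",
            coe="kubernetes",
            image_id="fedora-atomic",          # example image name
            flavor_id="m1.small",
            master_flavor_id="m1.small",
            external_network_id="public",
            network_driver="flannel",
            labels=labels)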
On the other side of the benchmarks, about container creation: these benchmarks weren't implemented yet when we did the runs at CERN, so we only had the chance to do them at CNCF. We created four, eight containers in total on every COE available right now in Magnum. The times might seem a bit high, but they also include pulling the image from Docker Hub, and in a real application this is also true: when you deploy a new cluster you usually don't have the image stored locally, so these are pretty good numbers. The bump for the Mesos cluster is because Mesos is primarily used to manage many different kinds of resources; in this case we used Marathon on top of Mesos, so there are a few more handshakes between Mesos, Marathon and ZooKeeper, and that explains the higher time compared to Swarm and Kubernetes. In this case Mesos was used just as a container engine.

We tried to use the same configuration at CNCF as we use at CERN, because we already had the feedback on how to deploy large clusters and put a high amount of load on an OpenStack deployment; we already knew what we should do. So we used exactly the same parameters for Heat, and we decoupled Cinder, partly because Cinder had internal problems in our deployment, but also for scale. In the CERN deployment we weren't using floating IPs, but in this one we had floating IPs, so we tried to disable them to reduce the load on Neutron. We discovered a very nasty bug in Magnum: if you disable all floating IPs on the master nodes and the worker nodes, you don't have access to your cluster anymore. But this is very easy to fix. We are still working, as of this week, and we hope in the next weeks to tune RabbitMQ to cope with the high loads that we want to put on the deployment, and we are considering tuning the OpenStack-Ansible playbooks to have dedicated RabbitMQ clusters for each service, like we have at CERN. A very good result that came out of this exercise is that the newly introduced storage backend for certificates works very well performance-wise, so it's a very reasonable alternative, at least for performance.

So, the conclusions after the tests. We spent one long month at CERN doing benchmarks and one long month at CNCF trying to figure out what works and what doesn't work in our deployments. As Ton mentioned earlier, we did this exercise to measure the deployment of clusters, the deployment of containers on the clusters, and the deployment of the apps. I think in all three aspects Magnum copes very well, and at the same time we tested Kubernetes and Swarm, and they seem to cope with the load very well too. Nova and Neutron seem to be very solid at this point; we had almost no problems in either deployment. Magnum also seems to handle the load very well, and as soon as a cluster is created you can achieve very good performance, comparable to very high-end clusters like the ones used by Google. Although Magnum can cope with the load, you must do some tuning to have it working in a very large deployment, as we discovered, but we will publish all these results and improve our documentation so you can get up to speed very fast. Where we still need to work is that this exercise was also a scaling exercise for all the OpenStack services: RabbitMQ was clearly the bottleneck, we must also do some tuning in Heat, and we might consider improving Heat in validating the resources, since Magnum creates many, many resources.
There is also the linear scaling of Heat, and for Keystone we are tracking upstream the problem with the very long validation of tokens when too many trust users are created. And finally, the question that the title poses: did we hit 10,000 containers? Yes, we did. At CERN, when we tried to do the 10 million requests per second, we had 9,500 load containers and 500 server containers, so I think we achieved our goal.

Best practices: you should tune your OpenStack, because not all the default parameters work for all environments, and one important piece of advice is to use only what you need. For example, if you don't have space constraints, you can disable the Cinder volumes used to store the container images, which don't have to be persistent, they can be volatile, and use local storage to also leverage fast hardware like SSDs. Play with your configuration: you may need a very large cluster with many nodes, which will create many resources in Heat and a lot of load on all the services, or you can use bigger VMs if your infrastructure allows it. You can also turn floating IPs on or off to reduce the load on Neutron, once we fix the bug that we have in Magnum at the moment, and you can of course tweak a lot of other parameters, but only if you need to; none of it is mandatory all the time.

So, next steps after this exercise. As I said, this is continuing work. We want to benchmark a few more things in Magnum, like the rolling upgrades that are coming in the next release. We also need to benchmark the deletion of Magnum clusters, because we actually had more problems deleting and cleaning up our infrastructure than creating it. As a next step we will also, as I said, summarize all our findings in our documentation. At the application level we only did benchmarks on Kubernetes; we are going to do application benchmarks on Mesos and Swarm as well. There are public benchmarks for Mesos creating 50,000 containers and for 3,000 containers on Swarm. As an immediate plan, what I personally have is to make Cinder optional in Magnum as storage for containers. We will also track upstream the bugs in floating IP handling in Magnum, and the bug in the Magnum client, so that you can easily disable floating IPs; right now you must run two commands. And we must improve the state synchronization between Magnum and Heat, the way that Magnum polls Heat for the status of the cluster. The final question, for developers, is that I am an upstream developer, so I can't really identify these kinds of bottlenecks because I'm running DevStack. So how can we find bottlenecks and track scaling problems in a systematic way? This is an open question. Thank you for your time. It's time for questions.

Can we have questions? Yes, please.

Question: I might have missed the beginning of the presentation. Did you compare the penalty of running Kubernetes with Magnum on VMs versus running on bare metal? And did you do any work with the Ironic integration?

Answer: We have only worked on Ironic at the development level. We don't have a running production service, so we couldn't benchmark it, but you can probably compare to any other bare metal Kubernetes cluster. You didn't do this benchmark? No, we didn't. So nothing specific for containers, but we did do related tests on VMs versus bare metal at CERN, and with some tuning we got to a 3% loss running on VMs. We have published this in our... 3%? Yeah.

Any other questions? Thank you for coming to our talk, and we'll be here if you want to chat some more. Thank you.