Okay, so good afternoon everyone. My name is Elena Ezhova, I am a software engineer at Mirantis. And I'm Oleg Bondarev, Senior Software Engineer at Mirantis. Today we are going to talk about Neutron performance at scale and find out whether it is ready for large deployments.

So why are we here? For quite a long time there has been a misconception that Neutron is not production ready and has certain performance issues. That's why we aspired to put an end to these rumors, performed Neutron-focused performance and scale testing, and now we'd like to share our results. Here are some key points of our testing. First, we deployed Mirantis OpenStack 9.0 with Mitaka-based Neutron on two hardware labs, the largest one having 378 nodes. Secondly, we were able to achieve line-rate throughput in data plane tests and boot over 24,000 VMs in the density test. And finally, and that's the major spoiler by the way, we can confirm that Neutron works at scale.

But let us not get ahead of ourselves and stick to the agenda. We shall start by describing the clusters we used for testing, their hardware and software configuration, along with the tools we used. Then we'll go on to describe the tests that were performed, the results we got and their analysis. After that, we'll take a look at the issues that were faced during testing, as well as some performance considerations, and finally we'll round out with the conclusions and outcomes.

So, we were testing Mirantis OpenStack 9.0 with Mitaka-based Neutron and the ML2 OVS plugin. We used VXLAN as the segmentation type, as it is a common choice in production, and we used DVR for enhanced data plane performance. As to hardware, we were able to experiment on two different hardware labs. The first one had 200 nodes. Three of them were controllers, one we used to run Prometheus with Grafana for cluster status monitoring, and all the rest were computes. Here, as you can see, controllers were more powerful than computes, all of them having standard Intel 82599 NICs. The second lab had more nodes and way more powerful hardware. It had 378 nodes, again three of them controllers and the rest computes. As I said, these servers were more powerful than the ones in the first lab, as they had more CPU and RAM and, what's important, they had modern Intel X710 NICs.

Now, a quick look at the tools that were used in the testing process. All the tests we were running can be roughly classified into three groups: control plane, data plane and density tests. For control plane testing we used Rally. For data plane testing we used a specially designed tool called Shaker, and for density tests it was mostly our custom-designed scripts, as well as Heat templates for creating stacks. Prometheus with a Grafana dashboard was quite useful for monitoring cluster status. And of course, we were using our eyes, hands and sometimes even a sixth sense for tracking down issues.

So, what exactly were we doing? The very first thing that we wanted to know when we got our deployed cluster was whether it was working correctly, meaning: do we have internal and external connectivity? What's more, we needed to always have a way to check that the data plane was working after massive resource creation and deletion, applying heavy workloads, etc. The solution was to create a so-called integrity test. It's quite simple and straightforward. We create a control group of 20 instances, all of which are located on different compute nodes.
Half of them are situated in one subnet and have floating IPs assigned, while the other half are located in another subnet and have only fixed IPs. Both subnets are plugged into a router with a gateway to an external network. For each of the instances we check that it is possible to SSH into it, ping an external resource (Google, for example) and ping other VMs by fixed or floating IPs correspondingly. The list of IPs to ping is formed in a way that checks all possible combinations with minimum redundancy. Having VMs from different subnets, with and without floating IPs, allows checking that all possible traffic routes are working. For example, this check validates that ping passes from a fixed IP to a fixed IP in the same subnet; from a fixed IP to a fixed IP in different subnets, when packets have to go through the qrouter namespace; from a floating IP to a floating IP, when traffic goes through the fip namespace; and finally from a fixed IP to a floating IP, when packets have to go all the way through the controller. This connectivity check is really super useful for verifying that data plane connectivity isn't lost during the testing process, and it really helped us to track down quite early on when something went wrong with the data plane.
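As a rough illustration, such an integrity check can boil down to SSHing into each control-group VM and pinging the right mix of targets from inside it. The sketch below is not the actual test code: it assumes key-based SSH access and uses placeholder IPs instead of the real lab values.

```python
import subprocess

# Placeholders, not the real lab values: floating IPs of the control-group VMs
# and the targets each VM should be able to reach from inside (an external
# resource, a fixed IP in the other subnet, a floating IP of another VM).
CONTROL_GROUP = ["172.16.0.21", "172.16.0.22"]
PING_TARGETS = ["8.8.8.8", "10.1.0.11", "172.16.0.22"]

def ssh_ping(vm_ip, target, user="ubuntu"):
    """SSH into the VM (key-based auth assumed) and ping the target from inside it."""
    cmd = [
        "ssh", "-o", "StrictHostKeyChecking=no", "-o", "ConnectTimeout=10",
        f"{user}@{vm_ip}", f"ping -c 3 -W 5 {target}",
    ]
    return subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

def integrity_check():
    """Return the list of (vm, target) pairs that failed the connectivity check."""
    return [(vm, target)
            for vm in CONTROL_GROUP
            for target in PING_TARGETS
            if not ssh_ping(vm, target)]

if __name__ == "__main__":
    failed = integrity_check()
    print("data plane OK" if not failed else f"failed checks: {failed}")
```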
And now, I'd like to pass the ball to Oleg, who will tell you about the control plane testing process and results.

Thanks, Lena. For control plane tests, as Lena said, we used Rally, and we ran three types of tests. The first one is the basic Neutron test suite. It's actually a set of API tests, like create and list subnets, networks, routers, etc. This doesn't include booting VMs. This set of tests comes with Rally itself, so we didn't modify the test options much; the main purpose of this suite is to validate cluster operability. Secondly, we ran hardened versions of those tests with increased concurrency and number of iterations, plus we added some tests which actually spawn VMs. And finally, we ran two tests specially targeted at creating many servers and many networks in different proportions, like many networks with one VM in each, or a smaller number of networks with many VMs in each.

Okay, speaking of results, for the basic Neutron test suite there is not much to say, actually. As I said, it's just Neutron API tests, and from the graphs we can see that there is no big difference between average and max response times, which is positive. These tests were run with concurrency from 50 to 100 and from 2 to 5,000 iterations. Create-and-list are additive types of tests where resources are not deleted on each iteration, so the load on the cluster grows from iteration to iteration. Also, we added three tests which actually boot VMs, the most interesting being boot-and-run-command-delete, as it not only creates a server but also verifies that its floating IP is working and that the VM is accessible via the floating IP, even at high scale. As for the results of the highlighted tests: all the tests were successful with no failures, and we see that on the more powerful lab the response times are better, which is expected. For boot-and-delete-server-with-security-groups and boot-and-run-command-delete we faced some issues initially on the 200-node cluster; I will talk about the issues a bit later. For now I can say that after investigating and applying several fixes, on the powerful lab we were able to run the tests without any failures, even with greater concurrency.

Okay, speaking of trends, I can say that for create-and-list networks we see slow linear growth for the create network operation, and linear growth as well for the list networks operation, which is kind of expected: the more resources we have, the more time the Neutron server needs to process them. It's even better for create-and-list routers. As you see, there is a stable response time for create router, independent of the number of resources, and really slow linear growth for list routers. Pretty much the same result for create-and-list subnets: slow linear growth in both cases. For create-and-list ports, here is an aggregated graph, also gradual growth with some peaks. The peaks, I believe, are related to some side effects on the cluster during the test. Security groups: there is actually something to investigate and profile in list security groups. As you can see, it's not quite linear growth, so there is something to look into. For create security groups it's a pretty stable response time, not depending on the number of resources.

Then the so-called Rally scale test with many networks. With this test, on each iteration 100 networks were created with one VM per network, and we did 20 iterations with a concurrency of three. As you can see, there is a really slow response time increase. And it's even better for the Rally scale test with many VMs, where there was one network per iteration, each network with 100 VMs, again 20 iterations and a concurrency of three; you can see that the response time is pretty stable. We probably should have done more iterations and higher concurrency, but we were very limited in time and had to give priority to other tests. Just like with this talk, so I'll pass the ball to Elena, and she'll speak about Shaker and data plane testing.

Thanks, Oleg. Shaker is a distributed data plane testing tool for OpenStack that was developed at Mirantis. Shaker wraps around popular system network testing tools like iperf3, netperf and others. Shaker is able to deploy OpenStack instances and networks in different topologies using Heat. Shaker starts lightweight agents inside the VMs that actually execute the tests and report the results back to a centralized server. In the case of network testing, only master agents are involved, while slave agents are used merely as backends for handling incoming packets. There are three typical data plane testing scenarios. The L2 scenario tests the bandwidth between pairs of VMs in the same subnet. Each instance is deployed on its own compute node, and the test increases the load, starting from one pair, until all available computes are used. The L3 east-west scenario is the same as the previous one, with the only exception that pairs of instances are deployed in different subnets. In the L3 north-south scenario, VMs with master agents are located in one subnet, while VMs with slave agents are reached via their floating IPs.

Our data plane performance testing started on the 200-node lab deployed with a standard configuration, which also means that we were using the standard MTU of 1500. Having run the Shaker test suite, we saw disquietingly low throughput in east-west bidirectional tests: upload and download were each almost 500 megabits per second, which is rather low for a 10-gigabit NIC. This result suggested that it might be reasonable to update the MTU from the default 1500 to 9000, which is a common choice in production installations. By doing so, we were able to increase throughput by almost 7 times, and it reached about 4 gigabits per second in each direction in the same test case.
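Under the hood, a Shaker master agent essentially drives a tool such as iperf3 against its paired VM and reports the numbers back. A minimal sketch of that kind of measurement follows; it assumes iperf3 is installed, that the peer is running `iperf3 -s`, and the peer address is a placeholder. The JSON field names follow iperf3's TCP report format.

```python
import json
import subprocess

def measure_throughput(peer_ip, duration=10):
    """Run an iperf3 TCP test against a peer running `iperf3 -s` and
    return (sent, received) throughput in Gbit/s."""
    out = subprocess.run(
        ["iperf3", "-c", peer_ip, "-t", str(duration), "-J"],  # -J: JSON report
        capture_output=True, text=True, check=True,
    ).stdout
    report = json.loads(out)
    sent = report["end"]["sum_sent"]["bits_per_second"] / 1e9
    received = report["end"]["sum_received"]["bits_per_second"] / 1e9
    return sent, received

if __name__ == "__main__":
    # Placeholder address of the paired VM (same subnet for L2, another subnet for L3).
    tx, rx = measure_throughput("10.1.0.12")
    print(f"sent {tx:.2f} Gbit/s, received {rx:.2f} Gbit/s")
```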
Such a difference in the results (roughly a sevenfold improvement just from changing the MTU) shows that performance to a great extent depends on the lab configuration. Now, if you remember, I mentioned that we actually had two hardware labs, where the second lab had more advanced hardware and, most importantly, more advanced Intel X710 NICs. What's so special about them? Among other things, these NICs allow making fuller use of hardware offloads, which are especially needed when VXLAN segmentation with its overhead of 50 bytes comes in. Hardware offloads allow significantly increasing throughput by reducing load on the CPU.

Now, let's see what difference advanced, offload-capable hardware makes. On the 300-plus node lab we ran a Shaker test with different lab configurations: we experimented with MTUs of 1500 and 9000 and turned hardware offloads on and off. As can be seen on this chart, hardware offloads are most effective with small MTUs, mostly due to the impact of segmentation offload. If you compare columns 1 and 2, you will see a three and a half times throughput increase in bidirectional tests. Increasing the MTU from 1500 to 9000 also gives a significant boost: if you look at columns 2 and 4, you will see a 75% throughput increase. The situation is the same for unidirectional test cases as well, download in our example. Here hardware offloads give a two and a half times throughput increase, and looking at columns 2 and 4 you will see that combining enabled hardware offloads with enabled jumbo frames helps to increase throughput by 41%. These results prove that it makes very much sense to enable hardware offloads and jumbo frames in production environments whenever possible.

Here are some real numbers that we got on this lab. We were able to achieve near line-rate throughput in the L2 and L3 tests with concurrency over 50, which means that there were more than 50 pairs of VMs sending traffic simultaneously. We got 9.8 gigabits per second in download and upload tests, and in bidirectional tests throughput in each direction was over 6 gigabits per second.

Now, let's compare the results we got on the 200-node lab that had less advanced hardware with the results we got on the 300-plus node lab with more advanced hardware. On this chart you can see how average throughput between VMs in the same network changes with increasing concurrency. On the 300-plus node lab throughput remains at line rate even when concurrency reaches 99. Almost the same situation is with the L3 east-west download test, where VMs in different subnets are connected to the same router. Here it can be seen that running the same test on a lab with enabled jumbo frames and supported hardware offloads leads to a significant increase in throughput that stays stable even with high concurrency. L3 north-south performance is still far from perfect, mostly due to the fact that in this scenario, even with DVR, all the traffic still has to go through a controller, which inevitably gets clogged. Apart from that, the resulting throughput depends on many factors, including switch configuration, lab topology (whether nodes are located in the same rack or not) and the MTU in the external network, which in fact must always be assumed to be no more than 1500. The results of the bidirectional tests are, I think, the most important, as in real environments there is usually traffic going both in and out, and that's why it is important that throughput remains stable in both directions. Here we can see that on the 300-plus node lab the average throughput in both directions was almost three times higher than on the 200-node lab with the same MTU of 9000.
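For reference, a quick way to double-check on a node that jumbo frames and VXLAN (UDP tunnel) segmentation offload are actually in effect is to look at the interface MTU in sysfs and at the ethtool feature list. The sketch below makes some assumptions: the interface name is a placeholder, and the exact offload feature names vary between NICs and drivers.

```python
import pathlib
import subprocess

def interface_mtu(iface):
    """Read the interface MTU from sysfs."""
    return int(pathlib.Path(f"/sys/class/net/{iface}/mtu").read_text())

def tunnel_offload_features(iface):
    """Return the `ethtool -k` feature lines that mention UDP tunnel (VXLAN)
    segmentation; the exact feature names depend on the NIC and driver."""
    out = subprocess.run(["ethtool", "-k", iface],
                         capture_output=True, text=True, check=True).stdout
    return [line.strip() for line in out.splitlines() if "udp_tnl" in line]

if __name__ == "__main__":
    iface = "ens1f0"  # placeholder physical interface name
    print("MTU:", interface_mtu(iface))        # expect 9000 with jumbo frames enabled
    for line in tunnel_offload_features(iface):
        print(line)                             # expect the segmentation feature to be "on"
```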
Coming back to the charts: in fact, the average results that we were showing on the previous graphs are often affected by corner cases, when the channel gets stuck and throughput drops significantly. To have a fuller understanding of what throughput is achievable, you can take a look at this chart with the most successful results, where upload and download exceed 7 gigabits per second in each direction on the 378-node lab.

To sum this up, the data plane testing has shown that Neutron DVR plus VXLAN installations are capable of very high, almost line-rate performance. There are two main factors, though: hardware configuration and MTU settings. This means that to get the best performance you need to have a modern, offload-capable NIC and enable jumbo frames. But even on older NICs that don't support all hardware offloads, performance can be improved drastically, which the results we got on the 200-node lab clearly show. The north-south scenario, however, clearly needs improvements, as DVR is not currently truly distributed and in this scenario all traffic goes through the controller, which inevitably gets flooded. And now Oleg will tell you about density testing and share probably the most exciting results that we got.

Right. So with the density test we aimed at three main things. First of all, boot as many VMs as the cloud can manage. Secondly, not only boot them, but ensure that each VM got wired up and got connectivity. Thirdly, verify that the data plane is not affected by high load on the control plane. So essentially the main goal was to load the cluster to death, to see the limits and see where the bottlenecks are, and additionally, of course, to check what happens to the data plane when the control plane breaks. We only had a chance to run this density test on the 200-node cluster. Just to remind you about the hardware: it has three controllers with 20 cores and 128 gigs of RAM, and 196 computes with 6 cores and 32 gigs of RAM. Additionally, one node was taken for health monitoring using Prometheus and Grafana.

Okay, so about the process. For the first version of the density test we used Heat. One Heat stack means creating one network with a subnet, connected by a router to the external network, and also spawning a VM per compute node, so it's actually 196 VMs per stack. Each VM is injected with a script which, upon spawning, fetches the VM's metadata and sends it to an external HTTP server. Thus the server verifies that all VMs responded successfully, were able to get their metadata and got external access. We spawned Heat stacks in batches of 1 to 5, most of the time 5, so one iteration basically means up to 1,000 VMs being spawned almost simultaneously. Between iterations we ran the integrity check, by actually executing the connectivity check which Elena described earlier, to make sure that the data plane of the control group of VMs was still okay. Also, during the test we were constantly monitoring the Grafana dashboard to be able to identify, and possibly fix, any issues with the cluster at early stages. I will talk about the issues a bit later; now about the results. It was a 3, maybe 4 day journey, with over 10 people involved from different teams, and finally we were able to spawn 125 Heat stacks, which is over 24,000 VMs on this cluster. We faced several bugs in different projects, and one important note is that we never lost data plane connectivity of the control group of VMs.
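The per-VM script mentioned above can be very small: fetch the instance metadata from the standard OpenStack metadata endpoint and post it to the external collector. Here is a rough sketch; the collector URL is a placeholder rather than the address used in our tests.

```python
import urllib.request

# Standard OpenStack metadata endpoint, reachable from inside the VM.
METADATA_URL = "http://169.254.169.254/openstack/latest/meta_data.json"
# Placeholder address of the external HTTP server collecting reports.
COLLECTOR_URL = "http://collector.example.com/report"

def report_metadata():
    # Fetching the metadata proves the VM got wired up and the metadata path works.
    with urllib.request.urlopen(METADATA_URL, timeout=30) as resp:
        metadata = resp.read()
    # Posting it to the external server proves external connectivity.
    req = urllib.request.Request(COLLECTOR_URL, data=metadata,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=30)

if __name__ == "__main__":
    report_metadata()
```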
So this is how one of the Grafana screens looked during the test; this is close to the final iterations. It is showing CPU and memory load, as well as load on the database and on the network. These are basically aggregated graphs for controllers and computes. For memory usage you can see that it is close to the limit on computes, while staying pretty stable on controllers. Here are the peaks that correspond to batches of VMs being spawned. And this is how CPU and memory load changed during our test. As you can see, we almost reached the memory limit on compute nodes, which we expected to actually be the limiting factor. But we had to stop the test a bit earlier because of issues with Ceph, which we used in our deployment.

First we faced a bug with the maximum number of allowed placement groups per OSD, so that's Ceph stuff. After this, Ceph monitors started to restart and consume all, if not more, resources on controllers, causing all other services, like RabbitMQ and the OpenStack services, to suffer. After this failure we were unable to recover the cluster even with the help of our test team, so we had to stop the test even before the resources on compute nodes were exhausted. One pretty important note: even when the cluster went crazy, we still got 100% success for the data plane connectivity test for the control group of VMs, which is quite exciting.

Okay, as for other issues: at some point we had to increase ARP table sizes on computes and then on controllers. Then we had to increase the CPU allocation ratio on computes; it's actually a Nova config option which controls how many virtual CPUs can be allocated on a compute node depending on the number of real cores. There were several bugs in Neutron, the most interesting being port create time increasing with scale. This was related to DVR and was actually fixed by a two-line patch, quickly upstreamed and backported. Another interesting issue which deserves attention is OVS agent restart on a loaded compute node which has a lot of VMs, because sometimes the agent may time out while reporting active interfaces back to the server over RPC. This is a well-known issue which has, I believe, two alternative approaches to fix, two alternative patches, and we just need to reach consensus. There was also a messaging bug which affected us quite a lot and took some time to be investigated and fixed by our messaging team; this was related to agents reporting their states to queues consumed by nobody. Also a Nova bug: when you delete a bunch of VMs, like a hundred VMs at once, nova-compute may hang, and after some time the VMs start to be deleted, but in the Nova service list you still see computes reported offline. This was also fixed; the bug is related to Nova's interactions with Ceph.
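For reference, the ARP (neighbor) table limits mentioned above are governed by the kernel's gc_thresh sysctls; the short sketch below just reads the current values. What you actually need to set them to depends on how many ports and VMs a node hosts.

```python
import pathlib

# Kernel neighbor (ARP) table thresholds; gc_thresh3 is the hard limit on entries.
SYSCTL_DIR = pathlib.Path("/proc/sys/net/ipv4/neigh/default")

def neigh_thresholds():
    return {name: int((SYSCTL_DIR / name).read_text())
            for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3")}

if __name__ == "__main__":
    for name, value in neigh_thresholds().items():
        print(name, value)
```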
I believe that's all about the issues. Finally, after our tests we can say that the main outcomes are that no major issues were found in Neutron. All the issues that we found were either already fixed upstream and we just backported them, or we fixed them ourselves and upstreamed and backported them, and one is in progress, as I said, with alternative approaches to fix it. The Rally tests did not reveal any significant issues with Neutron; in fact, there were no threatening trends in the Rally test results. The data plane tests showed pretty stable performance on old hardware: it was demonstrated that good results can be achieved even with old hardware, you just need to adjust the MTU, and on modern hardware with modern network interfaces even line-rate performance can be achieved. Again, an important note: data plane connectivity for the control group of VMs was never lost during our testing, all types of testing. The density test showed that we were able to spawn over 24,000 VMs on a cluster with just three controllers, without serious performance degradation. And given the above, we can say that Neutron is ready for large-scale production deployments of over 350 nodes. By the way, the results and the process are shared in the OpenStack docs, so everybody can go and check them. Okay, that's pretty much all. I will return to the previous slide so everybody can take a picture. That's pretty much all we wanted to share with you today, and we have some minutes for questions. Here is the mic, so if anyone has questions, please raise your hand and I will give you the mic.

Can you hear me? Okay, so a question. 30 slides ago, I didn't want to interrupt, but on the subnet creation and port creation, it wasn't clear how many you were creating and what the throughput of creating ports or subnets was. How many seconds?

Yeah, we did not include that result. Like here? Oh, you mean what was the time of the whole test, right?

How many were you creating? I mean, what's the scale here, the 14? How many subnets were you creating per second? How much time did it take?

This graph does not show this, because we can't include all the graphs that Rally is able to create, but we have links to the results where it can be seen, I believe. There you can build any graph you want.

But what is the scale? Is this 14 seconds for a single subnet creation?

Ah, okay, so this? Yeah, 14 seconds means that it is the maximum time it took to create a subnet during an iteration. Yeah, a single subnet, so you can see some peaks at the end.

So a few months ago at the Open Networking Summit, PayPal gave a lecture saying NSX doesn't scale beyond 2,500 compute nodes because of the control plane, right? Now, the data plane: you showed that it's at line rate or near line rate. But if you're talking here about 10 seconds, 14 seconds to create a single subnet, I don't see how you're saying that's production grade.

So this was 50 or even 100 concurrent threads which were creating subnets, so it's not serialized: 50 or 100 concurrent threads.

So how long did the entire... So this is 14 seconds for 100?

No, this is 14 seconds for one: you have 100 threads, each creating a subnet, and the maximum time in one thread was 14 seconds, when you already have 2,000 subnets created. Well, the average... these graphs do not show the average, but I believe it's around 5, maybe 4. You can see that when the cluster is not loaded with many resources the average time is well below one second, but when we have about 2,000 subnets already created the average time is around 2 or 3 seconds. So for create subnets, yeah, okay, 14 is the maximum time.

Can you please use the mic? Sorry, I thought I could speak loudly enough. Were you able to determine what the bottleneck during these tests was? Was it RabbitMQ across the cluster, or was it Python actually running, or was it the driver talking to Open vSwitch?

I believe that if we hadn't faced this Ceph issue, then at a greater scale we would have run into RPC issues, because the chattiness increases with the number of nodes, so this is clearly one of the big pain points of this architecture.

Can you give the mic over there? So what was the flavor of the instances you were using for the data plane tests?

For the data plane? It was a special Ubuntu image which Shaker creates for its tests, with preinstalled iperf, so I believe it is the smallest Ubuntu image.

Did you also do any kind of tuning of netperf or iperf threads on the instances?
I believe it was the default, yeah, just one thread. We have a Shaker expert, actually a Shaker developer, here.

And did you also have to tune any of the Linux parameters for any of these data plane tests? I'm sorry, again? Sorry? Okay, did you have to tune Linux in any of these data plane tests for better network performance?

We didn't. In the VM or on the compute nodes? No, the only thing that we did was enabling hardware offloads, if they weren't enabled by default, and tuning the MTU via the Neutron config files and increasing it manually on the physical interfaces. Okay, thanks. So there was no CPU pinning or that kind of stuff? Thank you.

There is another question over there. So, is there a specific reason why you disabled the native implementations of the OpenFlow and OVSDB interfaces in Neutron? Is it a performance issue?

Yeah, with the native interfaces, at the time we were testing this, there were a couple of bugs in the native OVS OpenFlow implementation. I'm not sure about OVSDB, but that's how MOS 9.0 deploys it by default: until it's tested, it's switched off. But I believe that in future versions, after some performance testing with the native OVSDB and OpenFlow implementations turned on, it should work.

Okay, if there are no more questions, we are out of time. Thank you all for the questions and for your time.