Okay, let's get started. Hello everybody, my name is Dimitri Stiliadis, I'm the CTO of Nuage Networks, and I hope we're going to talk about some interesting things here today.

You know, this analogy has been going around for a while, right? Everybody has seen the one about pets and cattle, cloud applications versus, if you want, the more traditional workloads. And every time I hear it, the picture that comes into my mind is actually this: there is always a herding dog around the cattle. The herding dog is what keeps the cattle safe, right? It is what keeps them together. And you're like, am I in an OpenStack presentation or are we talking about cattle? Well, the analogy here is that your control plane, whether it's your OpenStack control plane or your network control plane or whatever, is actually the herding dog that is going to keep the cattle of applications safe and operating under performance and overload controls.

So what we set out to do, and what we are going to present here, is some of our experiences and results from trying to find the limits of the OpenStack control plane, and specifically Neutron. A lot of our focus was on these basic questions: is Neutron deployable? Can we actually make it work at production scale? There has been a lot of discussion in the community over the last couple of OpenStack summits about whether Neutron really cuts it; some people argued we should go back to Nova Network. So what is the real situation with Neutron, how far can we scale it, and is Neutron really the problem, or is the problem somewhere else? Our goal was to push the limits of Neutron and figure out how fast we can activate ports in Neutron, how fast we can activate instances that have ports in Neutron, where we can download ACLs and security groups onto them and fully program the whole infrastructure.

As most of you probably know, Neutron is actually two components: there is the core Neutron server, and then there are the plugins. There have been a lot of tests done with the reference implementation, the reference ML2 implementation, and what I'm going to talk about here is not that. Some results have been published about how the reference ML2 implementation scales, and I'll show you some of them, but I'm going to talk about Neutron together with the Nuage VSP plugin, and how these two come together.

Now, some background. The latest performance tests for Neutron were reported by Canonical at the last OpenStack Summit. There is a very interesting blog post about it; I have the link in my presentation, which I will put up on SlideShare to make it public. I definitely recommend reading that blog. It's actually very good, and it provides a lot of detail about where the scalability limits of Neutron are. What Canonical noticed, at least with Icehouse (that's when the tests were done, not with Juno), was that at about 170 instances per compute server, Neutron pretty much stopped responding. The agents just stopped responding; they looked dead. They had to turn off security groups in Neutron in order to scale higher. And after about five or six chassis of the AMD servers they were using, they just gave up. They couldn't scale Neutron anymore.
So they abandoned Neutron, went back to Nova Network, and moved on with that; they just did everything with Nova Network. With that they managed to reach an instance activation rate of about 4.5 instances per second. That's the limit Canonical managed to get, and then they let it keep running up to about 170,000 instances.

The first reaction after you see that is: well, in order to really scale this thing, Canonical was pretty much forced to drop Neutron. And you have to ask the question: does this really make sense? Is Neutron really the problem, or is the problem somewhere else? That's what we tried to figure out here. My view is this: let's not throw the baby out with the bathwater. In reality, the problems in Neutron come from the reference implementation, which a lot of people understand has scalability limits for several reasons, and not really from the Neutron server or from Neutron as the integration API. And there is a big conversation coming up in the Design Summit tomorrow about exactly this question: shall we keep Neutron, the plugins, and the reference implementation together, or shall we split them, and figure out what is best for evolving Neutron in the best interest of the community?

Now, a very short background on the Nuage VSP platform, because that's what we are using in these particular tests. The Nuage VSP platform is very much a network virtualization platform: it creates virtual networks on top of, if you want, an OpenStack cloud. It has three layers, and that's what makes it a little bit different. There is a policy layer that sits at the OpenStack level, where our OpenStack plugin talks to our Virtualized Services Directory and updates the policy there. There is the control plane layer, which scales out: you can start with one controller for about 1,000 servers, and as you add more servers you add controllers, and the controllers are federated together using BGP so you can reach larger and larger capacities. And then there is the forwarding plane, which is essentially our version of Open vSwitch: the same kernel module, with our modifications in the Open vSwitch user space to make everything work.

Now, some fundamental differences from the core implementation. In our implementation there are no agents: no DHCP agent, no layer 3 agent, and no network node for that matter. Everything is distributed; layer 2, layer 3, layer 4, all of it. There aren't even multiple bridges inside the hypervisor, no integration bridge plus tunnel bridge and all these things. There is just one bridge. It's a multi-tenant bridge, but there is only one. The other big difference is that instead of pushing actual Open vSwitch flows down to the hypervisors, we push high-level flows, and those are interpreted and implemented locally by the hypervisors. So we are not sending actual ACLs down there; we are sending higher-level constructs.

So what did we do to test things? Because we didn't want to spend the time putting physical infrastructure together, and we were only testing the control plane, we used Amazon AWS. We ran the Nova controller on a c3.8xlarge instance.
This is pretty much a 64-core machine with local SSD drives. We ran the Neutron server on a similar machine in Amazon, and we ran the Nuage VSD and Nuage VSC in Amazon as well. Then we started compute nodes; obviously these are again virtual machines inside Amazon, not physical machines, but on the compute nodes, in order to avoid the overhead of nested virtualization, we used the libvirt LXC driver. So we are essentially launching containers inside our virtual machines. From a control plane perspective, to understand how Nova and Neutron behave, how they interact with each other and with everything else in this equation, it's pretty much the same.

So we beefed these things up: pretty good servers, lots of memory, local storage for MySQL. Of course we packed everything together on the controller setup, so MySQL, RabbitMQ and all the Nova processes were running on one server, and we had a dedicated server for Neutron, because again our goal was to push the limits. As I said, there is no need for a network node or anything like that, so we didn't have one. The compute nodes were pretty much low end, but they don't matter; once you are launching containers they don't matter.

Once we had the setup up and running we had about 41 compute nodes, and somebody can ask me why 41. The reason is that every time I tried to launch 40 nodes in AWS, at least one of them would fail on an EBS volume, and I could never get 40. So I always started 41, knowing that one of them would never work. You get trained on these things, okay?

So I run this thing. I create 5,000 networks, and I start activating instances randomly, 50 instances at a time, as fast as I can, and I try to figure out what's happening. The expectation is that this thing should work as fast as it can, or at least show how fast it can go. But the first result was a disaster. Things slowed to a crawl, and Nova and Neutron were activating pretty much one instance per second. That's what I could get out of this installation. And I was kind of expecting it: this was starting from nothing, a completely untuned OpenStack by any means. There were timeouts all over the place. When I went to look at the Neutron server and the Nova server, utilization was 10%. Things were idle. Error messages left and right. It was very nice.

So I had to go through a series of optimization steps, and I will try to summarize my adventure here. First of all, I needed to adjust the number of Nova and Neutron workers. The default installation didn't have a lot of workers, so I configured 32 Nova workers and 32 Neutron workers, essentially parallelizing the installation, and this greatly improved things. Things started going up; I was at about five, maybe six instances per second. Then I hit the big Keystone bottleneck. Keystone was the main bottleneck: every API call, every exchange between Nova and Neutron, goes through Keystone. Keystone didn't like that, and everybody was serialized on Keystone, waiting for it to respond. All the tests were with Icehouse, so the easy way out was to grab the Juno Keystone. Juno Keystone has a very easy parameter to increase the number of workers (you can do it with httpd as well) to get parallelism out of Keystone. By bringing the Juno Keystone into the picture, things got much better, and utilization started going up.
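To give a flavor of the knobs involved, this is roughly where those worker settings live in the configuration files of that era. The values are illustrative only, not the exact tuning from these runs, so check the options for your own release:

```ini
# nova.conf -- illustrative values only
[DEFAULT]
osapi_compute_workers = 32    # parallel API workers
metadata_workers = 32

[conductor]
workers = 32

# neutron.conf
[DEFAULT]
api_workers = 32              # parallel API workers
rpc_workers = 32

# keystone.conf (Juno) -- eventlet worker knobs; when running Keystone under
# httpd/mod_wsgi you size the WSGI processes instead
[DEFAULT]
public_workers = 32
admin_workers = 32
```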
My instance rates started going up. Then I hit the MySQL problems. The default configuration had to be adjusted: the number of MySQL connections, all the usual things somebody would do to try to scale the application. It's all knobs here and there that we had to turn in order to make things work.

After several iterations of this (and every time you do these things you have to delete the infrastructure and start again; it's a fun game to play), we managed to activate 4,000 instances in about 10 minutes, a rate of 6.8 instances per second. So we were pretty much at the level Canonical had reported with just Nova Network. Canonical reported 4.5; potentially we had faster processors, better memory, SSD drives, and we were at about 6.8 instances per second. But we were not really happy, because in reality the Neutron server was at 20% utilization and the Nova server was again not doing very much. So we went to the next step: figure out where the bottlenecks are and what is happening in there.

These plots here are the Nova and Neutron utilization. The test we were doing is: launch 100 instances, wait until they finish, launch the next 100, wait until they finish, and so on. You see the curves in the CPU utilization of the Neutron server and the Nova server. What you see is that Nova was operating at about 60% utilization, and Neutron was pretty much sleeping. In reality Neutron is sleeping because the only thing it is doing is taking API calls, doing a very small amount of processing, and passing the calls on to the VSP solution, and then the VSP solution is responsible for all the network implementation. There are no RPC messages, no agents, nothing else going on. Neutron doesn't do anything, so it is pretty much idle. And that is the overall utilization of the Nova server.

Next observation: the Nova scheduler. The Nova scheduler is at 80% utilization, with peaks at 90%. There is a single Nova scheduler process, and it is clearly the bottleneck. Even with 40 servers, when I try to activate VMs as fast as I can, the Nova scheduler becomes the clear bottleneck in this equation, and it is pretty much what limits the ability of the system to activate instances faster. The only way at this point to activate instances faster is to add Nova schedulers, to create more scheduler instances. So that's not good, but it's not the only problem.

The next thing we did was look at the MySQL utilization. That is the MySQL server utilization, and this is actually the most alarming observation here, the worst of all. You see that as time passes and the database holds more and more instances, the utilization we get from MySQL keeps increasing. That shouldn't happen. Come on, we have 4,000 instances there; this is a trivial database, and it shouldn't be creating this kind of load with just 2,000 or 4,000 instances in it. So this made us believe immediately that there is something fundamentally wrong going on between Neutron, Nova and MySQL, and we started peeling the onion to figure out what is going on in there.
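For reference, these are the kinds of MySQL adjustments I mean, sketched with illustrative values rather than the exact tuning from these tests, including the slow-query logging that the next step relies on:

```ini
# my.cnf -- illustrative values only
[mysqld]
max_connections     = 2048            # the stock default is far too low once every service runs 32 workers
slow_query_log      = 1               # record per-query timings for later analysis
slow_query_log_file = /var/lib/mysql/slow.log
long_query_time     = 0.05            # log anything slower than 50 ms
```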
From the MySQL logs we took a look at the average processing time for every query, and there are outliers here. For some of the outliers, my belief right now (I cannot guarantee this is the real explanation) is that they are caused by Amazon EBS volumes that don't always respond quickly, so that is a known issue with the outliers. But if you look at the dense region, which shows for every test the statistics of all the queries MySQL is reporting, we see the same trend: as time goes by, MySQL queries take more and more time to respond. Obviously, this doesn't make sense.

So, back to MySQL for even more analysis. We used the MySQL slow query mechanism: we dumped all the logs, analyzed them, and found, in my latest test where I had activated 20,000 instances, a query that is executed about 20,000 times and takes about 0.06 seconds each time; it is the majority of the queries. This turns out to be a quota query. I was activating all my VMs in the same networks, from the same tenant (everything was the same tenant), so this is the quota query that checks whether the tenant is allowed to activate that many ports. Every time you activate a port, this quota query counts the number of ports and returns the result.

When you find the corresponding code, in the base, core Neutron implementation, you find this snippet: a get-ports query followed by .count(). This is a single line of code that shows how careful you have to be with anything you do in Python. The moment you start using these abstractions (and the abstractions are nice and easy with SQLAlchemy), this is the wrong way to count how many ports there are. If anybody actually read the SQLAlchemy manual, they would see that the right way to do it is with a query on the count function. The difference between this statement and the previous one is an order of magnitude in complexity for MySQL, because the first one produces a subquery, a very complex MySQL statement, while the second one uses MySQL's counting capabilities directly. So by changing a single line of code you can substantially change how your system operates. This is very important to understand. My most important learning out of this exercise was that, as a community, we have to pay a lot of attention to understanding the performance of this code and optimizing it, line by line, at the end of the day. We have to analyze everything to understand how the system is operating.
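To make that concrete, here is a minimal, self-contained sketch of the two counting patterns. The Port model, the table and the in-memory SQLite database are hypothetical stand-ins, not the actual Neutron code; the point is the SQL each pattern generates:

```python
# Two ways of counting rows with SQLAlchemy; the generated SQL differs a lot.
from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Port(Base):
    __tablename__ = "ports"
    id = Column(Integer, primary_key=True)
    tenant_id = Column(String(36))

engine = create_engine("sqlite://", echo=True)   # echo=True prints the SQL actually emitted
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Pattern 1: what the quota check effectively did.  Query.count() wraps the whole
# SELECT in a subquery:
#   SELECT count(*) FROM (SELECT ports.id, ports.tenant_id FROM ports WHERE ...) AS anon_1
n = session.query(Port).filter(Port.tenant_id == "demo").count()

# Pattern 2: the cheaper form, letting the database count directly:
#   SELECT count(ports.id) FROM ports WHERE ports.tenant_id = ...
n = session.query(func.count(Port.id)).filter(Port.tenant_id == "demo").scalar()
```

With echo=True you can see the difference in the emitted statements for yourself, which is exactly the kind of checking I am arguing for.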
Now, on the good news side, the VSD utilization, essentially the Nuage server utilization, was actually going down. Well, it was going down because the more time it took for the VMs to get activated, because of the database problems, the less load the VSD server had. So now I have the opposite trend: my utilization is going down because all the bottlenecks are upstream, in the MySQL database and the Nova scheduler. Of course our utilization is going down; there is nothing alarming over here.

So after we figured that out, we made the modification in Neutron to address the database problem, and I'll show you the results. We didn't go to the trouble of fixing the Nova scheduler, though; we were running out of time. So we said: if the Nova scheduler is the bottleneck, and what we are trying to test is Neutron, then instead of activating VMs with a single port I am going to activate VMs with five ports each. For every VM the Nova scheduler schedules, I am going to give Neutron a load that is five times larger. If I activate 4,000 VMs, that is 20,000 ports; let's see how Neutron behaves. So we modified our test: we start activating instances with five vPorts per instance, again 50 instances at a time, essentially a batch API call that starts five times 50, so 250 Neutron ports at a time. This way we don't need to worry about the Nova scheduler bottleneck anymore; we can sidestep it and push the Neutron server to its limits.

The result: we activated 4,000 instances with 20,000 vPorts in about 9 to 10 minutes; depending on how you run it, you end up with a 9-to-10-minute kind of delay. That is 500 vPorts on every compute server, unlike the Canonical test that got stuck at about 170 per server with the OVS implementation. We were up at 34 ports per second. That's an order of magnitude faster than what had been reported before. The number of instances per second, as I said, was limited by Nova, essentially by the Nova scheduler, and in reality Neutron goes much faster. Neutron was not really our bottleneck. The bottleneck was Nova, not the Neutron server, because, as I explained, it is backed by an SDN solution in the background.

[Audience question] Yes. No, that was the next step. Right, but in order to run faster, yes, we would need to run multiple scheduler instances on multiple servers. That's the next step in the performance work. But for a given server, for a given configuration, what I was trying to say is that the argument that Neutron is the bottleneck, that Neutron is the issue, is not really true.

The Nova utilization was like before, pretty much the same thing. The Neutron server utilization actually started going up; you see all kinds of variation there, because we are now generating much more load, so the Neutron server is utilized much more. And the Nova scheduler is maxed out, pretty much the same as before, at almost 85-90% utilization.

Now, although we applied the patch for that particular statement, unfortunately we didn't solve the whole problem. When you look at these numbers and compare with the previous test, which was doing the exact same thing, these numbers are actually 50% better than the previous ones, but the trend is still here. This means there are other surprises in there that need much more analysis to figure out why this is happening, because I can tell you MySQL does not normally behave like that. There are runaway SQL statements that make MySQL behave like that; 4,000 or 10,000 objects, even counting the Keystone objects and so on, is nothing for a database.

And indeed we found the next one. It turns out that this is the query. I have no idea where this query is created, but this is it, and now I have to trace it back. At this point we ran out of time, but each one of these queries takes about 140 milliseconds, so it is creating a real issue there.
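Just to sketch how one could trace a statement like that back to its origin (this is the approach, not something we ran in these tests): hook SQLAlchemy's cursor-execute event and dump a Python traceback whenever the suspect statement shows up. The statement fragment and log path below are placeholders:

```python
# Log the statement and the Python stack that issued it, so a mystery query
# can be traced back to the code path that generates it.
import traceback

from sqlalchemy import event
from sqlalchemy.engine import Engine

SUSPECT_FRAGMENT = "FROM ports"   # placeholder: fragment of the runaway statement

@event.listens_for(Engine, "before_cursor_execute")
def trace_suspect_query(conn, cursor, statement, parameters, context, executemany):
    if SUSPECT_FRAGMENT in statement:
        with open("/tmp/suspect_queries.log", "a") as log:
            log.write(statement + "\n")
            log.write("".join(traceback.format_stack()) + "\n\n")
```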
I'm sure it's somewhere in the code. I'm sure we'll find it and we'll try to optimize it. So, did we have time for more? Well, we kind of ran out of time, because it was time for the summit. So that's where the adventure stopped.

In conclusion, our experience showed that it takes a lot to actually optimize an OpenStack system for high-throughput production. Getting upstream OpenStack and putting it in doesn't just work like that; it's an iterative process, and it depends a lot on the workload you have, what you are trying to achieve, and why you are trying to achieve it. In this particular case we were just trying to find the limits; it's not necessarily a production type of situation. We were trying to understand the limits: whether we have any limits on our side, what our limits are, and together with that, what the OpenStack limits are. But in order to figure out our limits, it turned out we had to do a lot of work to understand the limits of the deployment in general.

SQLAlchemy: there is a lot of alchemy in there, so be careful. There is a lot of optimization to be had. Any time anybody has chosen to use this kind of abstraction, there is a lot happening under the hood. Look at the SQL statements: for every SQLAlchemy statement, figure out the actual SQL statements, because there can be a lot of surprises there. That was a clear result out of this whole exercise.

And for us, the call to action is: let's spend some time as a community, increasingly, on profiling, understanding the code base, understanding performance, and removing performance bottlenecks. We are at the point now where we are not fighting for this feature or that feature; people are interested in deployments. And you will ask me, why are you interested in so many VMs or instances per second? Well, I'm interested because my customers are telling me about disaster recovery scenarios, and all kinds of situations where they are trying to bring data centers, racks and racks of equipment, up very fast. They want this kind of performance. Yes, it's for the extremes, but it's a business problem they are trying to solve. So our call to action is: let's spend some more time profiling and understanding how OpenStack behaves.

And that is my presentation. I'll be happy to answer any questions anybody has. Yes? We actually submitted it, yes, at least the one SQLAlchemy thing. With...? Ah, no, I don't think I have... You know, we haven't spent too much time talking about them yet, because we just finished producing them last week. But this week is the week; yeah, we'll be talking about it. Any other questions? Thank you very much, everybody.