All right. Welcome. Thank you very much for coming to our talk on yet another gorgeous day in Vancouver. I hope you are having a good time. This is work I've done with my colleagues at IBM Research; they couldn't be here, so I'm going to present it. Let me start with a little bit of motivation, what led us to start this work. Then I'll do a very quick overview of Neutron to level-set, and then we'll get to the main part of the talk, where we discuss network performance and the benchmarking framework we are proposing, look at some empirical results, and close with conclusions and future directions.

So Neutron has been around for some time now. It has reached a point of usability and stability where people have started working with it and using it in production. At some point during the last cycle, on February 2nd, DevStack merged a patch that made Neutron the default networking for OpenStack. As you can see, a lot of Neutron people were extremely excited, only to see the patch get reverted three days later. What happened was that the community thought the default configuration of Neutron was still too complicated for some users and did not closely match what people had been using with nova-network. We need to do some work toward a default configuration that is much closer to what nova-network supports, or that is simpler in general. So there has been a lot of ongoing work; there have been at least three design sessions I have attended where the Neutron and Nova communities discussed what the issues are in making Neutron more accessible to everybody.

Nevertheless, Neutron is your networking in OpenStack. So the first question that comes to mind is: how does it perform? And we are talking about different aspects of Neutron, from the management plane to the control plane to the data plane. We wanted to answer that question, and that motivated our work. What are the important networking metrics we should look at? How do we measure them?

Before I get to answering some of these questions, I want to very briefly talk about Neutron, so we are all on the same page, and then get to the discussion of benchmarking Neutron performance. Neutron provides just the API for network virtualization. It has a core API and some extensions; I have listed the important ones, and there are others, but I want to keep it simple for the sake of discussion. The main abstractions are networks, subnets, and ports. If your VM is connected to a network, you expect it to be able to talk to other VMs connected to the same network. If you want to talk across networks, you expect to create a logical router in Neutron, which provides connectivity services between different networks. By connecting two networks to a router, the VMs can talk to each other across networks, and possibly to the external world. The router extension also provides floating IPs, so your VMs can be reached from the outside. And I have listed security groups as one of the more important extensions, which you need in order to provide the level of security you want at the port level. This virtual abstraction, this API, needs to be realized somehow in the real world, and that is done through the plugin architecture that Neutron supports.
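To make these abstractions concrete, here is a minimal sketch of creating them through python-neutronclient, the Python client of the Juno/Kilo era. The Keystone URL, credentials, names, and the TCP port are placeholders of mine, not details from our setup.

```python
# Minimal sketch of the core abstractions above, driven through
# python-neutronclient (the client of the Juno/Kilo era). The Keystone URL,
# credentials, names, and port number are illustrative placeholders.
from neutronclient.v2_0 import client

neutron = client.Client(username='demo', password='secret',
                        tenant_name='demo',
                        auth_url='http://controller:5000/v2.0')

# A network plus a subnet: VMs on the same network can reach each other.
net = neutron.create_network({'network': {'name': 'net-a'}})['network']
sub = neutron.create_subnet({'subnet': {'network_id': net['id'],
                                        'ip_version': 4,
                                        'cidr': '10.0.1.0/24'}})['subnet']

# A logical router with the subnet attached: connectivity across networks,
# and (with an external gateway set) to the outside world via floating IPs.
router = neutron.create_router({'router': {'name': 'r1'}})['router']
neutron.add_interface_router(router['id'], {'subnet_id': sub['id']})

# A security group rule: per-port filtering, here opening a single TCP port.
sg = neutron.create_security_group(
    {'security_group': {'name': 'bench'}})['security_group']
neutron.create_security_group_rule({'security_group_rule': {
    'security_group_id': sg['id'], 'direction': 'ingress',
    'protocol': 'tcp', 'port_range_min': 5001, 'port_range_max': 5001}})
```

A VM port created on that network and placed in that security group would then get exactly the per-port filtering discussed above.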
So depending on how those plugins realize this set of logical resources, you may end up with different performance properties, and that can be important. In realizing and implementing the Neutron API, you could rely on what the operating system and hypervisor networking stack provide, you could use different types of tunneling, or you could deploy a controller that uses OpenFlow. There are all kinds of technologies you may end up using to implement that API and that abstraction.

There are several backends, plugins, or drivers available right now. The reference implementations are the famous Open vSwitch and Linux Bridge plugins, or, to be more accurate, mechanism drivers used through the ML2 plugin. And there are several open-source options; I have listed a few of them. OpenDaylight has been around for some time; they have presented at this summit and have been integrated with OpenStack. There are also MidoNet, OpenContrail, OFAgent from Ryu, and OVN. So these are open-source options available as plugins and backends. And then there are commercial solutions; there are a few of them, which I decided not to list in case I missed someone.

There are different releases. Even though the API remains the same for the most part, because backward compatibility is a big concern across releases, the enhancements made at the control plane or in the reference implementations have implications, whether you are using Juno or Kilo, or, as some operators do, even earlier releases. And very soon, in a month or so, we will hit the Liberty-1 milestone.

So there are lots of options, and for all of these options there are knobs to tune. The question we wanted to answer is: how can we come up with a framework for seeing how these different options, these different choices we have, perform, whether we use them out of the box or have to tweak and adjust them in our environment? How do they compare? That's why we thought we needed a performance evaluation framework.

So what would be the major components of such a system? Very simple. We thought we need workloads that you apply to your cloud to see, for example, how your Hadoop application performs. A workload can be something like Hadoop running a TeraSort operation, or it can be simpler and synthetic, using the networking tools we currently have: iperf, netperf, and so on. Then, for each of these workloads, you may want different scenarios: you may want to run across 100 compute nodes, or shape how the communication between these nodes happens; if you are using synthetic benchmarks, in a particular pattern, just point-to-point communication or something more. That's what we call different scenarios. Then you need a harness, some way of doing all this, right? Some way of standing up your VMs, making them execute the workload you are interested in, collecting the data, doing the cleanup, and all that. By doing so, you get performance data, which I will talk about shortly. And eventually, you want to come up with some figures of merit, to say that in this particular setup, with these options, with this particular plugin, and in the environment I used, I could provide scalability to a certain level, I could gracefully deal with an increase in load: something that can be translated into a simple number but signifies something much more. That would be the ultimate goal.
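To make the workload/scenario/harness breakdown concrete, here is a hypothetical scenario descriptor sketched in Python; the field names are mine for illustration and are not CBTool's actual configuration format.

```python
# Hypothetical descriptors for the components named above: a workload, a
# communication scenario, and the metrics to collect. The field names are
# illustrative only; they are not CBTool's real configuration format.
from dataclasses import dataclass, field

@dataclass
class Workload:
    name: str               # e.g. 'iperf', 'netperf', 'hadoop-terasort'
    duration_s: int = 60    # how long each measurement runs

@dataclass
class Scenario:
    workload: Workload
    pattern: str = 'point-to-point'   # or 'ring'
    compute_nodes: int = 3
    vms_per_node: int = 1
    flows_per_vm: int = 1
    metrics: list = field(default_factory=lambda:
                          ['throughput', 'latency', 'loss'])

# Example: a ring of single VMs across 3 hosts, 20 TCP flows per VM pair,
# mirroring one of the setups reported later in the talk.
ring = Scenario(Workload('iperf'), pattern='ring',
                compute_nodes=3, flows_per_vm=20)
print(ring)
```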
So what is the performance data, and where does it come from? Obviously, there are different places to collect performance data. The obvious places are the VMs you use to run your workloads, and ideally you also want to collect performance data on the compute nodes hosting those VMs. The things you want to collect are things you are all familiar with: CPU utilization, memory utilization, network activity, and so on. Then there is data you want to collect from your application, to see that you achieved a certain amount of work in so much time, things of that kind. You can do that per VM, and you want to be able to aggregate all of it across your workload.

We wanted an open-source framework, ideally a multi-cloud performance benchmarking tool, where we could compare results across different types of clouds. Whether you run the workload on OpenStack or on something else, you would be able to compare and have a better judgment about what is going on. In terms of networking, we have focused only on OpenStack so far. We want our results to be reproducible: if you have a similar environment and use the same workloads and the same scenarios, you should be able to reproduce the data. And a future goal is to somehow come up with automatic estimation of those figures of merit. Right now I can look at a graph and say, oh, it looks like things are scalable, but we want to automate that as well; that's something we are working toward, not there yet. We also want to automate the network configuration part.

The tool we are using is called CBTool; we call it Cloudbench. It has been around for some time. It is what is being used by the SPEC organization, and hopefully, by the end of the year, SPEC will have a benchmark for the cloud, and it will be utilizing CBTool. It already had non-network-centric workloads in it, developed by my colleagues at IBM Research, and now it has contributors from outside IBM as well. So it's a collective effort to make this part of the SPEC organization's benchmark. We thought this is a good tool with some credibility, and that's the place to start.

I just want to go very quickly over a couple of sample results generated automatically with this tool. You can start different workloads. As difficult as these graphs are to read, the dotted lines are when the workloads arrive, and you can specify that. Then the VMs get instantiated; those are the solid lines. And the measurements, which are latency in this particular example, get taken. You can have concurrent workloads executing according to a particular arrival distribution, and so on. You can also collect the data, as I said, from your VMs or your hosts and keep it for later reference; CPU utilization is the first thing you want to look at.

So, with all that as introduction, what are the network-specific additions we have made? We said, let's start small. We use three obvious networking metrics: throughput, latency, and data loss. We have just a few workloads, based on the standard networking tools: iperf, ping, netperf. And we have a specialized workload that mimics video-over-IP streaming, which is also something we are going to use.
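For the per-VM and per-host counters just mentioned (CPU utilization and network activity), here is a small self-contained sketch of how such sampling can be done on a Linux guest or host; the interface name and the one-second interval are assumptions.

```python
# Sketch of sampling the counters mentioned above on a Linux VM or host:
# CPU utilization from /proc/stat and NIC byte counters from /proc/net/dev.
# The interface name ('eth0') and the 1-second interval are assumptions.
import time

def cpu_times():
    with open('/proc/stat') as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    return fields[3] + fields[4], sum(fields)      # (idle+iowait, total)

def rx_tx_bytes(iface='eth0'):
    with open('/proc/net/dev') as f:
        for line in f:
            if line.strip().startswith(iface + ':'):
                cols = line.split(':')[1].split()
                return int(cols[0]), int(cols[8])  # rx_bytes, tx_bytes
    raise ValueError('interface not found: ' + iface)

idle0, total0 = cpu_times()
rx0, tx0 = rx_tx_bytes()
time.sleep(1.0)
idle1, total1 = cpu_times()
rx1, tx1 = rx_tx_bytes()

cpu_util = 100.0 * (1.0 - float(idle1 - idle0) / (total1 - total0))
print('cpu %%: %.1f  rx Mbit/s: %.1f  tx Mbit/s: %.1f' %
      (cpu_util, (rx1 - rx0) * 8 / 1e6, (tx1 - tx0) * 8 / 1e6))
```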
And we have different scenarios. Again, there are so many things we could do; we are starting with a small set. In terms of networking, we vary the communication between the VMs, limiting it right now to point-to-point communication and communication in a ring. So if you have, let's say, 10 compute nodes, you can have single VMs on them with one flow coming out of each VM, you can specify a scenario with multiple flows between these single VMs, or you can have multiple VMs and multiple flows. These are the scenarios we looked at.

The problem space is rather big, so we used a small subset of it for this talk, and I'm going to present an even smaller set of data, just to give you an idea of where we are. We wanted to use the standard reference plugins, or drivers: OVS and Linux bridge. We have measurements with some open-source controllers, such as OpenDaylight, and with some commercial ones, but we thought we had too much data to include it all, and it's better to get the point across about what we are trying to establish here rather than present an overwhelming amount of data. We wanted to work with the latest releases: most of the data I am going to present comes from Juno, and some from Kilo. We used VLAN, VXLAN, and GRE, whatever is available through those two reference drivers, OVS and Linux bridge. And we wanted to measure all these network metrics within a single network, across networks, going out to the external network, and so on.

The systems we used run standard Ubuntu 14.04; the Linux kernel is 3.13, and the OVS that comes with that distribution is 2.0. The hardware is two-socket Intel Xeon machines with 16 or 32 hyper-threaded cores and 256 gigabytes of memory each. So we have a cluster of these nodes with this software. I'm going to skip the slides showing how the OVS and Linux bridge drivers work, where the security groups are applied, and whether you use tunneling or VLAN, in the interest of time; there have been talks at this summit that went through this already.

So we started with the base setup, the software and hardware I just described. With the Juno release, we said, let's start with VLAN as our baseline and see how close to line rate we can get. With both drivers, OVS and Linux bridge, with one flow, we get around 9.3 gigabits per second on a 10 gigabit per second interface. So we said, that sounds reasonable. Now let's look at the impact of having multiple flows sharing the same interface, look at the impact of security groups, compare VLAN with GRE and VXLAN, and also look at the impact of the distributed virtual router. In the baseline, we use just the L3 agent.

So this is one set of data where iperf is the workload, with 20 TCP flows between each pair of hosts. For this case we have three hosts; we have different sets of tests depending on the number of compute nodes, but this is a representative result. We get around 400 megabits per second per flow, and with 20 flows on each link that gets us to around 8 gigabits per second per link. Linux bridge performs a little better; not by much, but it nevertheless provides better performance. So we said, OK, let's increase the number of concurrent flows, keep the number of VMs at one per host, and let them communicate with each other in a ring, if you remember, as one of our scenarios.
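As a rough sketch of how a ring scenario like this can be driven with plain iperf (version 2), assuming passwordless SSH into the VMs and an iperf server already listening on each one; the addresses, flow count, and duration are placeholders, and this illustrates the traffic pattern rather than CBTool's internals.

```python
# Sketch of driving the ring scenario with plain iperf (v2): each VM sends
# to the next VM in the ring with -P parallel TCP flows. Assumes passwordless
# SSH into the VMs and 'iperf -s' already running on each; the addresses,
# flow count, and duration are placeholders.
import subprocess

vms = ['10.0.1.11', '10.0.1.12', '10.0.1.13']   # one VM per compute node
flows, secs = 20, 60

procs = []
for i, src in enumerate(vms):
    dst = vms[(i + 1) % len(vms)]               # ring: each talks to the next
    cmd = ['ssh', src, 'iperf', '-c', dst,
           '-P', str(flows), '-t', str(secs), '-y', 'C']
    procs.append((src, dst, subprocess.Popen(cmd, stdout=subprocess.PIPE)))

for src, dst, p in procs:
    out, _ = p.communicate()
    # '-y C' emits CSV; the last field of each line is throughput in bits/s.
    for line in out.decode().splitlines():
        print('%s -> %s: %.1f Mbit/s' %
              (src, dst, int(line.split(',')[-1]) / 1e6))
```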
And as you can see, the performance remains between 8 and 9 gigabits per second as we increase the number of flows to around 60. So we pushed further to see how far we could get, and after we passed 120 flows, we noticed a significant drop in the throughput each flow was receiving. That happened in more or less the same fashion for the Linux bridge and OVS drivers.

For pairwise latency, we used standard ping and also netperf UDP round-trip measurements. Again, comparable results. What I'm showing on the left are graphs I generated, but CBTool provides something like what you see on the right, where, when the workload arrives, it takes around a minute or so for the VMs to come up, start communicating, perform the operation, and report the results. The results, and the graphs, are generated automatically.

So what about packet loss? That's where we saw some difference between Linux bridge and OVS. I should have mentioned this earlier: in these plots you see the max, the min, and the average; the bars show the max and min, and you see the 25th-75th percentile box in the middle. If you look at the numbers, statistically they are equivalent; the min-max ranges overlap. But the mean for Linux bridge is much smaller, for this particular test. The amount of data being injected is such that we go beyond the capacity of the network; I just wanted to see how many packets get dropped. And as you can see, Linux bridge does better here. When I say Linux bridge, I'm referring to the Linux bridge mechanism driver with the ML2 plugin. The idea is that we could do all of this with different plugins, commercial or open source, and not necessarily limit ourselves to these two particular reference implementations, which may or may not remain the reference implementations.

So what about throughput across networks? Within a network, the results were as expected: around 400 megabits per second for each flow, with 20 flows getting us to around 8 gigabits per second. When we go across networks, we get around 150 megabits per second per flow. And this is with the L3 agent in Juno. I want to emphasize that this is where we are; this is the data we have collected. Obviously, as you increase the number of networks connected to the router, you can expect further degradation. The problem with the L3 agent is well known to the community, there has been a significant amount of work to address it, and that is an ongoing effort; I will come back to the numbers we have with the distributed virtual router toward the end. Nevertheless, this is the baseline; this is what you get out of the box.

For getting out of the network using floating IPs or network address translation, both outbound and inbound, you again get fairly low performance, and fairly comparable performance between the two drivers. What is missing from these results, what we haven't presented here and still need to analyze, is whether this equivalent performance is achieved with the same amount of CPU utilization, or whether there is a significant difference between the drivers in CPU terms. We have those numbers for all these experiments, but I simply ran out of time to put them in; we will provide them. That is the biggest piece we need to add, and it would be an indication of how these perform under further pressure.
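For the box plots described above (average, min/max bars, and a 25th-75th percentile box), the summary statistics are straightforward to compute; a short sketch over invented per-flow samples:

```python
# Sketch of the summary statistics behind the box plots: mean, min/max, and
# the 25th-75th percentile box, over per-flow samples. The throughput values
# below are invented for illustration.
def percentile(sorted_xs, p):
    k = (len(sorted_xs) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (sorted_xs[hi] - sorted_xs[lo]) * (k - lo)

flow_mbps = sorted([412, 398, 405, 390, 431, 377, 402, 415, 388, 409])
print('min/max : %d / %d' % (flow_mbps[0], flow_mbps[-1]))
print('mean    : %.1f' % (sum(flow_mbps) / float(len(flow_mbps))))
print('p25/p75 : %.1f / %.1f' % (percentile(flow_mbps, 25),
                                 percentile(flow_mbps, 75)))
```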
What about security groups? If the number of flows and the number of VMs communicating with each other is small, there is not much of an impact. There is a significant impact on the control plane, though, and I'm not presenting that data; if you went back to Icehouse and started hundreds of VMs, the system would eventually crash because of that problem. Those problems were addressed in Juno: as you create more and more VMs, it takes more and more time to get them up and running, but the degradation is much more graceful, and I'm sure there have been improvements in Kilo. But with Juno, if you increase the number of flows and the number of VMs communicating with each other, you get a significant drop in throughput. And here we are not talking about hundreds of flows and hundreds of VMs; when I say multiple VMs, I'm talking about 10 or 20, and multiple flows means around 10 flows per VM.

This is a closer look at the data we collect. I want to draw your attention to the fact that you cannot read the scale, but if you could, you would see the scales are different. This data is from Juno again; I want to emphasize that. On the left-hand side, without security groups, you will have to believe me that we get around 800 megabits per second for this particular workload scenario, and the variation in performance across flows is limited. With security groups, not only do we see a drop in performance to something between 300 and 500 megabits per second as opposed to 800, but, as you can see, different flows in the same experiment get very different shares of that bandwidth. So there is much more variability once security groups are added.

All right, I want to leave some time for discussion at the end, so let me hurry up. We have done very little with GRE and VXLAN. As expected, with a single flow and a single VM, you get much lower performance with encapsulation, GRE or VXLAN, pretty much the same. As you increase the number of VMs or flows, you utilize more and more CPU, and beyond a certain point CPU utilization becomes such a bottleneck that performance drops significantly. Again, these numbers are without any hardware offload; that is one of the things one could try. If you have been through the discussions, the community as a whole is aware of all these issues; they are not unknown. There are solutions that let you use encapsulation at one level of your network and VLANs at the lower levels, with hierarchical port binding and things of that kind. And of course there are adapters that do some of the work for you to improve performance.

With DVR, we wanted to test with VLAN. And with VLAN, just to establish the baseline again, we got big, impressive numbers: across networks, we get around 9.2 gigabits per second, very close to what we got within a network earlier. It turns out, though, that in our setup the performance degrades after several tests. So we are talking to the DVR people, those who developed the VLAN support that was just added in Kilo, to figure out what the problem may be and whether there are bugs that need to be addressed.
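Coming back to the security-group variability for a moment: one simple way to quantify the wider spread across flows is the coefficient of variation. A sketch with invented numbers that mirror the ranges quoted above:

```python
# Sketch of quantifying per-flow variability with and without security
# groups, using the coefficient of variation (stddev / mean) across flows.
# The sample values are invented to mirror the ranges quoted above.
import statistics

no_secgroup = [805, 798, 810, 801, 795, 808]      # ~800 Mbit/s, tight spread
with_secgroup = [310, 480, 355, 505, 330, 460]    # 300-500 Mbit/s, wide spread

for label, xs in (('no secgroup', no_secgroup),
                  ('with secgroup', with_secgroup)):
    cv = statistics.stdev(xs) / statistics.mean(xs)
    print('%-14s mean=%.0f Mbit/s  CV=%.2f' %
          (label, statistics.mean(xs), cv))
```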
So what are some of the lessons we learned? In setting up something like this, you have to deal with all the problems you may have run into as an operator standing up a large number of VMs. For benchmarking, if you need hundreds of VMs, you have to decide what to do when some of those VMs don't come up. Do you continue the benchmarking? Does the workload allow that or not? If some of those VMs come up but you don't have network connectivity to them, what do you do? And then, how do you clean up the mess afterwards? These are the things you have to worry about.

You have people, customers, or users who are tied to a particular older distribution for various reasons, and that has significant impact: with certain versions of Linux, the networking stack is at a stage where you will notice significantly lower performance. That's another thing to keep in mind. We have experimented with OVS 2.3, which is the latest, and see slightly better performance.

We were using double VLAN tagging in our setup, and we were puzzled by the low performance we were getting. We still don't know exactly why; people do double tagging, and maybe we weren't doing it right. But we used the tool to look at CPU utilization and noticed that a lot of our machines had cores running at 100%. As soon as we got rid of the double tagging, we got into a better state, where we could reach the maximum, or close to the maximum, rate.

In general, I want to emphasize that this is a framework, and it is at an early stage; we need to add more workloads and scenarios. We presented results for the OVS and Linux bridge drivers, and the performance is, for the most part, comparable. We need to do a better analysis of CPU utilization and see whether there is any difference between these implementations. Security groups in Juno were still a problem; we will run the same tests on Kilo and see if some of those issues are solved. There are also solutions that avoid using iptables altogether, and there was a talk just this morning about that, which may improve the state of affairs.

And with that, where do we want to go? Beyond what I have already said, we want to see if we could use a setup like this to help the development team. For the most part, the community relies on single-node testing; Infra has a multi-node, to be accurate a two-node, testing environment we could use, but that is certainly not suitable for performance testing, since it is done in a virtualized environment: you are already running within VMs, and those are connected to each other through tunnels. We see people in the community identifying things that are probably the source of a problem, doing some testing, and saying, yes, it looks like this is a problem, and probably we can solve it by doing this. This framework could be used while that is happening, to help developers figure out where the bottlenecks are and whether the solutions they are providing really have a significant impact. The way I see it, and I may be totally off, is that it would be used as a third-party kind of CI, applied selectively; if you had to run it for every single patch, you really wouldn't need that level of performance analysis. And with that, I just want to see if there are any questions, and whether you find any of this of interest.

Hi. Thank you for that; that was very good. You've talked a lot about the high-end performance you're seeing, pretty close to saturating a 10-gig connection. As it turns out, when you dig into that, one of the big reasons for it is segmentation offload: you end up with packets averaging tens or hundreds of kilobytes being passed around, and that speeds things up tremendously. If you disable that, you actually drop down to maybe one gigabit per second max. So do you have any plans to add tests like that to your test suite, to cover the use cases where it's not possible to use TSO?

We are open to adding any workloads, so please talk to me, and let's see if we can add that particular workload ourselves, or with your help. We would be more than happy to expand it; as I mentioned, this is a very small set of scenarios, and we hope to expand it. Thank you.
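To illustrate the questioner's scenario, here is a hedged sketch of a pre-test step that disables segmentation offloads with ethtool and restores them afterwards; the interface name and peer address are placeholders, and it needs root on the VM or host under test.

```python
# Hedged sketch of a 'no segmentation offload' test variant: turn TSO/GSO/GRO
# off with ethtool before the run and restore them afterwards. The interface
# name and peer address are placeholders; this must run as root, and the
# ethtool feature flags shown are the standard ones.
import subprocess

IFACE = 'eth0'
FEATURES = ['tso', 'gso', 'gro']

def set_offloads(enabled):
    args = []
    for feat in FEATURES:
        args += [feat, 'on' if enabled else 'off']
    subprocess.check_call(['ethtool', '-K', IFACE] + args)

set_offloads(False)            # run the test without segmentation offload...
try:
    subprocess.check_call(['iperf', '-c', '10.0.1.12', '-t', '30'])
finally:
    set_offloads(True)         # ...then restore the default offload settings
```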
Slightly related: you mentioned hardware offloads, and this is something my company does. I want to be able to measure the offloads, for example RDMA offloads; Ceph is supporting RDMA. Do you plan to start benchmarking that?

That's something we have to decide. All of the machines we had were equipped with adapters that support offload; we didn't use it for this set of tests. It's certainly a possibility. Then the question becomes what kind of environment you want to use, what adapter, what kind of offload.

I'm talking about VM offloads: SR-IOV, VLAN and VXLAN offloads.

Absolutely, definitely, that's something we will consider.

OK, I'll follow up. Thank you.

Yes, please do. Thanks.

Given that you have a background in development in this domain, how complex is it to set up this test infrastructure and get these test results? Is it a one-person, one-week effort?

Extremely easy. Instantiating OpenStack is nothing; it's better than anything else. No, seriously: the tool is rather straightforward. It has a nice GUI, and I would say it is something you will learn to use in a couple of hours, easily. There is a lot we could do to make it easier in terms of introducing new workloads and scenarios, but as it is, it is very straightforward. The bigger part is standing up the OpenStack environment, and for that we use our own tools; that is a much more complex operation than the benchmarking itself.

OK, thanks.

As for publishing results: we don't have that plan, but if this is used for SPEC, that's how things are done there; different users and companies publish their results, and they are collected there. Whether we will end up doing that or not, I really don't know at this point.

Yeah, but definitely that would be helpful, right? And I forgot to ask: what about historical data? Do you have a mechanism to store results and look at, say, last week or last month?

Yes. I'm being told we are out of time, but all the data is stored in a database at the end of the run, and it is preserved. Thank you very much, guys and ladies.