The topic I'm going to present is OpenStack Network Performance Tuning and Design at PayPal. My name is Aihua Li, and I lead Cloud Networking at PayPal. My co-worker Zenghua Fen also contributed to this work; unfortunately he could not come to this OpenStack Summit because of a visa issue, so I'll be presenting the slides.

The plan for the talk is, briefly, to give you an introduction to PayPal and the PayPal cloud, and then to pick one specific issue from each of the following categories: one from the data plane, one from the control plane, and one that arises purely from hyperscale. Then I'll close with a quick summary of what we presented.

First, a brief introduction to PayPal. As you may or may not know, PayPal is a Fortune 500 company with revenue of about 2.6 billion per year. These slides show some of the financial data; if you're interested, you can read through them, but I'm not going to give a line-by-line description. What I can say is that PayPal is a big company with many subsidiaries.

Here I just want to give you some idea of the PayPal cloud. PayPal has a very large-scale private cloud deployment: many regions, many availability zones, more than 500,000 cores, and many, many VMs across all the data centers. Again, this is just to give you a sense of the scale of our private cloud.

With that, I'll jump into the specific issues we've been dealing with. The first challenge we faced is in the data plane, when we deployed the standard ML2 setup. This is the standard ML2 diagram, taken directly from the OpenStack installation guide. The diagram is a little cluttered, so I've used the red block to highlight the area I want to concentrate on. What it shows is that we have extra components along the data path: the VMs are not directly connected to Open vSwitch. As we know, any time you introduce an extra component along the data path, you add latency and degrade throughput. In addition, the Linux bridge is in general much slower than OVS. I think the main reason Neutron chose this model is a security implementation limitation: security is implemented using iptables, and in earlier versions iptables was not compatible with OVS, so the Linux bridge was inserted to give iptables something to hook into. That, I think, is why the standard ML2 deployment uses this model. However, it carries a very high performance cost; later I'll show specific data on the impact.

Let me present our solution first. Basically, we don't want any of the extra components; we just want to connect the VM directly to OVS. This is the simplified model we use inside PayPal, with the intermediate components removed. The immediate question is: how do you handle security? At PayPal, most security is enforced at the physical layer, using physical hardware, and user-defined security rules are very rarely used.
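As a rough illustration of the change (this is not our production code, and every bridge, tap, and veth name below is hypothetical), the standard ML2 "hybrid" path plugs the VM's tap device into an intermediate Linux bridge joined to the integration bridge by a veth pair, while the simplified path adds the tap straight to OVS:

```python
import subprocess

def sh(cmd):
    """Run a shell command, raising on failure."""
    subprocess.run(cmd, shell=True, check=True)

# --- Standard ML2 "hybrid" path (the part we removed) ---
# VM tap -> qbr (Linux bridge, where iptables hooks in)
#        -> qvb/qvo veth pair -> br-int (OVS)
sh("ip tuntap add tap-a mode tap")
sh("brctl addbr qbr-demo")
sh("brctl addif qbr-demo tap-a")
sh("ip link add qvb-demo type veth peer name qvo-demo")
sh("brctl addif qbr-demo qvb-demo")
sh("ovs-vsctl add-port br-int qvo-demo")

# --- Simplified path: VM directly on OVS ---
# The VM tap goes straight into the OVS integration bridge; no extra hops.
sh("ip tuntap add tap-b mode tap")
sh("ovs-vsctl add-port br-int tap-b")
```

Every extra hop on the first path is another kernel handoff per packet, which is where the latency and throughput penalty shown in the performance comparison comes from.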
On the other hand, if we do want to implement security, we actually have multiple alternatives. First, if you want to implement security using iptables, you don't have to put iptables on the data path; you can implement it inside the VM. The second alternative is to make OVS compatible with iptables, and we know the OVS project is actively working in this area. The third alternative is that many of these security rules can be implemented directly in OpenFlow: since OVS is an OpenFlow switch, you can express the rules as flow entries (there's a sketch of this just below).

We have implemented this model and put it into pilot production, and we would love to contribute it upstream if there is interest.

Here's the performance comparison. We picked two performance indicators: throughput and latency. The test we did used a 10G physical link, and we pumped traffic into VMs created by OpenStack. On the throughput side, with the Linux bridge the VMs get around 1 Gbps; after removing the Linux bridge, that immediately jumps to over 6 Gbps, a clear and large difference. On the latency side, with the Linux bridge you see about 55 microseconds; after removing it, latency drops to 15 microseconds. This clearly shows the Linux bridge has a tremendous performance impact.

The next issue I'm going to talk about is on the control plane. Again, take a typical standard ML2 deployment diagram: a Neutron control cluster, and a lot of hypervisors that talk to the Neutron controllers through the message queue. I've highlighted the message queue with the yellow box; that's what we want to focus on. When we first tried to deploy this, on a very early Kilo version, we ran some simulations with Neutron in this ML2 mode and pumped messages at the Neutron controller as hard as we could. At around 45 messages per second, Neutron could not keep up: you can see from the graph that messages start queuing up. That immediately becomes a problem. As one example, in one data center we have 2,500 nodes, and the default report interval is 30 seconds per state update. Do the math and that works out to more than 80 messages per second, already above the threshold.

The immediate band-aid is to tune the state report interval: raise it to 90 seconds and the problem goes away, because a longer interval reduces the message rate. However, we were thinking from a different angle. Every component in your deployment carries maintenance overhead: you have to keep the servers running smoothly and monitor all of it. So instead of just fixing the message rate, we asked: why can't we replace the message queue with standard REST API calls, made on demand instead of as periodic state updates? We did some coding and found that all the information we were getting from the message queue on VM create and VM delete is available through the REST API (the shape of the change is sketched below). We made this change in our own repository, and it works perfectly for us.
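Going back to the OpenFlow alternative for a moment, here is a minimal, stateless sketch (hypothetical MAC address and priorities, not our production rules) of expressing a security-group-style policy as flow entries on the integration bridge. A real implementation would want OVS connection tracking for stateful rules; this only shows the shape of the idea:

```python
import subprocess

BRIDGE = "br-int"
VM_MAC = "fa:16:3e:00:00:01"  # hypothetical MAC of the protected VM

def add_flow(flow):
    """Install one OpenFlow rule via ovs-ofctl."""
    subprocess.run(["ovs-ofctl", "add-flow", BRIDGE, flow], check=True)

# Allow inbound HTTPS to the VM and anything the VM sends out;
# drop all other IP traffic addressed to it. Higher priority wins.
add_flow(f"priority=100,tcp,dl_dst={VM_MAC},tp_dst=443,actions=normal")
add_flow(f"priority=100,ip,dl_src={VM_MAC},actions=normal")
add_flow(f"priority=90,ip,dl_dst={VM_MAC},actions=drop")
```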
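The REST transport change itself lives in our internal repository, so the sketch below is only illustrative: the class and method names are assumptions, and it presumes a Neutron server extended to accept these on-demand calls. The point is the shape of the swap: the agent's RPC-facing API class is replaced by one that makes synchronous HTTP requests, so each call either succeeds or fails immediately, with nothing left sitting in a queue:

```python
import requests

class RestPluginApi:
    """Illustrative stand-in for the agent's RPC plugin API.

    Calls go straight to the Neutron REST API over HTTP instead of
    being published to RabbitMQ, and only when an event (VM create,
    VM delete) actually happens -- no periodic state reports.
    """

    def __init__(self, neutron_url, token):
        self.base = neutron_url.rstrip("/")
        self.headers = {"X-Auth-Token": token}

    def get_device_details(self, port_id):
        # On VM create, fetch the port details on demand.
        resp = requests.get(f"{self.base}/v2.0/ports/{port_id}",
                            headers=self.headers)
        resp.raise_for_status()
        return resp.json()["port"]

    def report_device_up(self, port_id):
        # Report the port active as a one-shot call; assumes a
        # server-side extension that accepts agent status updates.
        resp = requests.put(f"{self.base}/v2.0/ports/{port_id}",
                            headers=self.headers,
                            json={"port": {"status": "ACTIVE"}})
        resp.raise_for_status()
```

Swapping a class like this in behind the existing abstraction is what keeps the agent-side diff small, as the Q&A at the end discusses.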
The benefit we get is not only removing one component from our maintenance burden; we also realized another benefit, which is removing the version dependency between the control plane and the data plane. This is actually a big deal for us because of the size of our deployment. Upgrading all the hypervisors is a big, big task when you have 2,500 nodes, while upgrading the control plane is a relatively simple operation: for Neutron, you really have a cluster of three nodes, a much, much smaller problem. This gives us the flexibility to keep the hypervisors running at some version X, as long as there's no issue, and upgrade the control plane at will, without depending directly on the data plane or hypervisor version.

The last set of issues comes purely from scale. As I mentioned, in one of our data centers we have around 2,500 nodes; I have not heard of a similar size among other OpenStack deployments. Again, I'm just picking some examples of the issues we face. Some bugs don't show up until you reach that size. One typical example: on a Linux system, the default file descriptor limit is around 1024. If you have a few hundred connections, you never notice there's a file descriptor limit; go beyond that and you start finding bugs.

The second class of issues is seemingly trivial performance problems. Something may look trivial at small scale, but multiply a small issue by 1,000 and it becomes non-trivial. The message queue processing rate we saw before is one example: multiply by 2,500 nodes and the message processing rate immediately becomes an issue for us. Another example is the database: you may tune the DB configuration to make it more reliable, and any such tuning changes how fast the DB can write. Again, a trivial issue becomes non-trivial when you multiply by 1,000.

Lastly, I want to ask the OpenStack community to consider how we really test OpenStack at large scale. Our own approach is to run simulations before deploying any new release, because of the size of our deployment. For example, for the ML2 agent, we have written a message generator to simulate how Neutron will perform against the message queue (a toy version of the idea is sketched below). Internally we also did a lot of bug fixing in our deployment, and some of the bugs we found are common to the community: for example, we found a nova-compute retry issue that affects the community as well, so we submitted the bug and also submitted a fix for it.

In addition to bug fixes, we do customization. One example: we added IP margin as an extension to the Neutron API, and then customized the Nova side to use the IP margin as one of the scheduler filtering criteria, so you don't get into the situation where you select a hypervisor and then find out you've run out of IPs. Finally, I want to request that the OpenStack community test each release at scale using some form of simulation, and also specify what scale each release has been certified for; that would be very helpful for us.
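The message generator mentioned above is internal, but a toy version of the idea, assuming a plain RabbitMQ broker reachable on localhost and using the `pika` client (the real oslo.messaging envelope and topology are more involved), looks like this:

```python
import json
import time
import pika

RATE = 85          # messages/sec to simulate, e.g. 2,500 agents / 30 s
DURATION = 60      # seconds to run

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="agent-state-reports")

# A fake agent state report; the real payload is an oslo.messaging RPC.
report = json.dumps({
    "agent_type": "Open vSwitch agent",
    "host": "hypervisor-0001",
    "devices": 40,
})

start = time.time()
sent = 0
while time.time() - start < DURATION:
    channel.basic_publish(exchange="",
                          routing_key="agent-state-reports",
                          body=report.encode())
    sent += 1
    time.sleep(1.0 / RATE)  # crude pacing; sweep RATE to find the knee

print(f"sent {sent} messages in {time.time() - start:.1f}s")
conn.close()
```

Sweeping the rate while watching the queue depth on the broker is how you find the point where the consumer stops keeping up; on early Kilo, that knee showed up for us at around 45 messages per second.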
So here's the summary of what we talked about. We picked one issue from each area: the data plane, the control plane, and large scale. In the data plane, we removed the Linux bridge to gain performance, and we proposed some alternative ways to implement security. In the control plane, we discussed the issue with the message queue, and we proposed using the REST API as an alternative transport driver for the RPC. For the scaling issues, we would really like to work with the OpenStack community on a load tester or load simulator for testing at large scale, and we do bug fixes and customization in our own deployment. Lastly, we would like to work with the community, and we are happy to share our work. With that, that's the presentation, and I welcome any questions.

Q: Regarding removing the Linux bridge, did you evaluate the other alternative solutions for adding security, like flow-based security or something with OVS, before you gave up on doing distributed firewalls on the hypervisor?

A: Well, we did not actually use a distributed firewall, but in the presentation I mentioned a few alternatives we can consider going forward.

Q: And in the end, you decided to give up on this?

A: At this time we are not using user-defined security rules, so we are looking forward; we have the alternatives.

Q: That was my question. Thanks.

A: Thank you.

Q: Hello. Coming back to the RabbitMQ performance issue: did you assess how resiliency was impacted by replacing messages with REST API calls? If you replace a message with a direct API call, you could impact your resiliency and how the solution reacts to losing a message.

A: Let me see if I understand correctly. I don't see us losing much resilience; actually, I think it's the opposite. In my view, using the API is more resilient, because when you issue an API call, it either fails immediately or goes through; there is no other scenario. With the message queue, one of the issues we've had to debug with the Nova team is that sometimes a message just disappears and you don't know where it went. In Nova's case we found even more interesting issues: some messages arrive out of order, and when you put a load balancer in front of the message queue, one message service can go down and come back up, which is very tricky to debug. You see the messages fail over to another service, another node, but somehow this creates a lot of message-ordering issues, and some messages get corrupted. The REST API doesn't have those issues.

Q: OK, thank you.

A: OK, thank you.

Q: Can you tell us more about how you implemented the REST API? Did you implement it as an HTTP REST API?

A: Yes, by REST API I mean HTTP. Basically, I took the Neutron agent code as it is; we did this on the Neutron code base. The RPC API class is actually abstracted pretty well, so all I did was replace it with my own class, which I call the REST API class. There's not much code change on the agent side at all; you just plug in a second driver, so one driver is the message queue API and the other is my REST API. It decouples very well.

Q: OK, but what if some kind of message broadcast is used? That's also a situation; how do you deal with it?
A: When we replaced it with the REST API, we don't have the broadcast issue, because we issue all the REST API calls on demand. They all originate from the agent side; the agent posts to the Neutron controller.

Q: OK, thank you.

A: OK, thank you.

Q: Often when we use messaging, it's for notifications, so it's on demand. With your REST API calls, is there an issue about how frequently you should poll for things and update?

A: Yeah, I think that's a very good question. What I can say is, with a delay of maybe 30 seconds to one minute, I don't see a big issue for us. For example, if you want to change your security rules, doing all the paperwork to go through approval probably takes a day before you can push that; by comparison, a one-minute delay is not a big issue.

Q: OK, got it. So essentially you would still use messaging when you want to create a port or something, but otherwise you'll use this for updates and security rules that don't change often?

A: Basically, we use the REST API when a VM is created or a new event happens.

Q: Thank you.

A: Thank you. OK, thanks.