All right. Welcome, everyone. This afternoon, Sharco and I are going to be talking about some experiences deploying Neutron at scale. My name is Scott Drennan, and I work for Nuage Networks. Sharco is the chief of SDN architecture for China Mobile Cloud, part of the China Mobile Group.

Hello, everyone. My name is Sharco. I'm the chief SDN architect for China Mobile Public Cloud. I designed several large SDN networks in China Mobile Public Cloud and Private Cloud. I'm here with Scott to jointly present the Nuage SDN deployment in China Mobile Public Cloud. I'd like to share the experience we learned from the China Mobile Public Cloud deployment, and we hope this presentation can help your future deployments.

China Mobile Public Cloud is a top-three public cloud provider in China. There are more than 2,000 nodes in our public cloud, and 1,000 compute nodes are using SDN for networking and NFV. We plan to add 3,000 more compute nodes in 2017. China Mobile Public Cloud provides both IaaS and PaaS to end users. Nuage is the SDN provider, offering network services for the VPC: layer 2 service, layer 3 service, firewall as a service, load balancer as a service, VPN as a service, security monitoring, and VPC-to-data-center interconnection.

Here is the SDN and NFV architecture. China Mobile Public Cloud includes three tiers. The first tier is called the service logic management tier. It manages the business logic of the cloud. We call it OP, and it is a portal to manage customer configuration and service logic. We also use the Nuage VSD to manage the network policy. The second tier is... sorry, technical difficulties. Yeah, I can't get the mouse. There we go. All better. The second tier is called the control tier. All the network resources are controlled and managed by this tier, including networks, subnets, ports, security monitoring, security groups, virtual routers, floating IPs, and so on. The last tier is the Nuage SDN and NFV tier. This tier implements and hosts all the SDN network connectivity.

Now, here is the Nuage SDN deployment. We deployed 1,200 compute nodes across two data centers. One is located in Beijing; we call it the North Base data center, and it has 500 compute nodes. The other is in Guangzhou, which we call the South Base data center. The two data centers have exactly the same architecture. For the management tier, we use the China Mobile customized OpenStack Kilo as the CMS. All the network configuration goes through the Neutron API, which calls the Nuage plugin to the Nuage VSD. The VSD pushes the policy down to the VSC controllers, and the VSC then pushes it down to the VRS on each compute node. There are two pairs of SDN controllers deployed in each data center, and they communicate using MP-BGP.

China Mobile Public Cloud offers very competitive network features for the end user. First, we allow end users to self-design their own network in the VPC. Tenants can select network services like subnets, security groups, virtual routers, virtual firewalls, virtual load balancers, floating IPs, NAT, rate limiting, and VPN. Second, we leverage China Mobile's existing VPN backbone to connect the VPC to WAN connections. We support MPLS VPN as a service to connect enterprises to the VPC via MPLS VPN, and we also support multiple WAN links from different operators connecting to the VPC via application-based routing. Third, we support very large network scale in our public cloud.
We support 100 concurrent network provisions, which means 100 end users can provision their VPCs at the same time. Fourth, fully distributed control plane and data plane: no central point, no traffic bottleneck, and node expansion without service interruption.

Now here is a list of what we are offering today in our public cloud. We have VPC service, redundant firewall service, redundant load balancer service, redundant VPN including site-to-site and remote access, NAT, floating IP, and rate limiting. We offer cross-data-center VPC interconnection, and we also support hybrid cloud networking, VPC security monitoring, MPLS VPN interconnection, leased-line access, interworking with IPS, IDS, and WAF, and flow monitoring.

Now let's talk about why we finally deployed Nuage in this field. First, stock Neutron cannot support large-scale deployment: when there are more than 200 compute nodes, the network performance degrades greatly and is very difficult to improve. Second, Neutron node stability cannot meet our requirements. Third, the feature gaps: no large-scale firewall deployment solution, no support for load balancer, VPN, and firewall redundancy, no bare metal solution, and no SSL VPN or MPLS VPN solution. We did spend many months trying to improve stock Neutron, but we found there were still several issues we could not resolve, and they totally blocked our deployment.

Here are some details about the Neutron scaling issues we found during our tests, which we found very difficult to improve. In our test environment, we put 500 compute nodes in the same data center. We deployed five controllers in HA mode, running Nova, Keystone, Glance, Swift, and so on. We deployed three stock Neutron nodes with HA for RabbitMQ, the database, and the Neutron server. We split the NFV functions off the Neutron nodes and put them on dedicated physical servers: two DHCP nodes, five VPN nodes, and five load balancer nodes. On the DHCP nodes we have the OVS agent and the DHCP agent.

Here are some issues we found in our tests. When we reboot the OVS agent on a compute node, it takes 50 minutes to recover the service; this is with 500 compute nodes and 20 VMs per compute node. When we reboot the OVS agent on a DHCP node, it takes 17 minutes to recover the service; this is with 1,000 networks and 1,000 subnets configured. When we reboot the DHCP agent on a DHCP node, it takes 30 minutes to recover the service, again with 1,000 networks and 1,000 subnets. When we reboot the OVS agent on the VPN nodes, it takes 47 minutes to recover the service; this is with 200 routers configured on one node. We were able to use the Nuage SDN network to resolve all these problems. Let's start to talk about how...

All right. Thank you very much, Sharco. Sharco has outlined some of the challenges that China Mobile faced in terms of scaling their deployment. They worked really quite hard at making the stock implementation work, both on their own and with others, before finally giving in and taking a commercial solution. I want to talk a little bit about how the Nuage solution works and about some of the architectural choices we've made that could aid in improving Neutron's scalability in future. As Sharco mentioned, the Nuage solution has three layers in it in addition to the OpenStack connectivity. I want to go into a little bit more detail about how those layers affect the scale-out of the platform and how they improve the communication models over the reference implementation.
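As an aside on how recovery numbers like the ones above can be gathered: here is a minimal sketch using openstacksdk, where the cloud name, host name, and poll interval are illustrative assumptions, and the moment a restarted agent reports alive again is taken as the recovery signal. This is only a control-plane proxy; the figures Sharco quoted measure data-plane service recovery.

```python
# Minimal sketch: measure how long a Neutron agent takes to report alive
# again after an out-of-band restart. The cloud name "cmcc-test", the host
# name, and the poll interval are illustrative assumptions, not values
# from the talk.
import time
import openstack

conn = openstack.connect(cloud="cmcc-test")

def wait_until_alive(agent_host, agent_binary, timeout=3600, interval=10):
    """Poll the Neutron agent list until the given agent reports alive."""
    start = time.time()
    while time.time() - start < timeout:
        for agent in conn.network.agents():
            if (agent.host == agent_host
                    and agent.binary == agent_binary
                    and agent.is_alive):
                return time.time() - start
        time.sleep(interval)
    raise TimeoutError(f"{agent_binary} on {agent_host} did not recover")

# Restart neutron-openvswitch-agent on compute-001 out of band, then:
elapsed = wait_until_alive("compute-001", "neutron-openvswitch-agent")
print(f"agent reported alive after {elapsed:.0f} seconds")
```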
As you can see here, we've got OpenStack talking to the Nuage management plane. When there is a request for a new port, that is passed to the Nuage management plane, the Virtualized Services Directory, and it just sits there. Nothing further happens until a VM is instantiated via Nova. Nova will pass the VM to a given Nova compute node. On the Nova compute node, we have the Nuage VRS, the virtualized routing and switching layer, which extends OVS to provide additional functionality and capability. It sits there listening to Nova, waiting for these requests. The event occurs that a VM is instantiated, and that request is taken from the VRS up to the controller layer, requesting information about what the policy is for that particular port. In some cases, the controller may already know. If it doesn't, it will ask the directory. So this is a single message up from compute to controller and a second message up from controller to directory. The directory passes down all of the information necessary for the policy on that hypervisor, on that virtual machine. Then the Virtualized Services Controller decorates that with information about the networking, because the controller layer is scaled out using multi-protocol BGP and knows the locations of all of the other routers, networks, subnets, and ports on the compute nodes. So that scale-out layer is responsible for a lot of the heavy lifting in terms of connectivity. We have a new VM appearing on a hypervisor. That hypervisor needs to know about it, as do any of the other hypervisors that may need to communicate with it, because we're using VXLAN tunnels to communicate between them. So it's the responsibility of the Virtualized Services Controller to pass that down. That's a much more scalable model than many requests going back to the Neutron SQL database, and we'll look at that in more detail. The other benefit of using standard protocols such as multi-protocol BGP is that, like Sharco said, we can peer with multiple service provider instances, and also between data centers, using these standard protocols. So that allows for a much more flexible connectivity model, fully programmable from Neutron.

Going into a little bit more detail, we have the Nuage plugin talking via REST API to the VSD. There is an XMPP message bus to the scale-out controller layer, and then on each compute node we have a Nuage VRS and a Nuage metadata agent, and I'll come back to why that's important in a minute or two.

So going back to what happens when you try to scale out the OpenStack native implementation: at the control plane you've got the Neutron server, the SQL database, and the RabbitMQ message bus. All of these need to be talked to via multiple messages from any L2 agents on the compute nodes, L3 agents on the Neutron nodes or DVR systems, the DHCP agents, and LBaaS. So there's an awful lot of chatter on the message bus. Within Neutron we've been making incremental improvements release over release, but inherently it's a flawed design, unfortunately. So the approach that we've used is this scale-out using MP-BGP. We have an extremely high performance controller layer. These are quite lightweight, 4 GB, 4 vCPU controllers, and each pair of them can control up to a thousand hypervisors. If you want, you can scale out really quite significantly using that sort of scale. And as Sharco said, at CMCC there's a pair of these in each of the data centers.
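To make the "plugin proxies the API call to an external controller" pattern concrete, here is a minimal sketch of a generic ML2 mechanism driver that forwards port events over REST. The class name, endpoint URL, and payload shape are illustrative assumptions; this shows the general pattern only, not Nuage's actual plugin code.

```python
# Generic sketch of the ML2 "proxy to an external SDN controller" pattern.
# The ExamplePortDriver name, VSD_URL, and payload are hypothetical; they
# only illustrate how a plugin can turn a Neutron event into one northbound
# REST call instead of per-agent RPC fan-out.
import requests
from neutron_lib.plugins.ml2 import api

VSD_URL = "https://sdn-directory.example.com/api"  # placeholder endpoint

class ExamplePortDriver(api.MechanismDriver):
    def initialize(self):
        self.session = requests.Session()

    def create_port_postcommit(self, context):
        # One REST call to the management plane; the controller layer
        # (not Neutron RPC) then fans the policy out to the hypervisors
        # over its own southbound protocol.
        port = context.current
        self.session.post(
            f"{VSD_URL}/ports",
            json={
                "id": port["id"],
                "network_id": port["network_id"],
                "mac_address": port["mac_address"],
                "fixed_ips": port["fixed_ips"],
            },
            timeout=10,
        )
```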
Those controllers are responsible for any of the fan-out, and we're using the same technology that we use in our internet routers, so millions of routes are really a light snack in terms of pushing information down. It scales out really quite nicely.

So that's how things work at a high level. Looking at some of the specific Neutron scale elements that China Mobile encountered, we've got first of all the scaling of routed subnets. In the reference implementation we've got two choices: we can use a centralized network node or a series of centralized network nodes, or we can use DVR. In the centralized case, you're dealing with a centralized bottleneck where all traffic between subnets needs to go up to the network node and back down, and all traffic needing to exit needs to go up to the network node and out. In the DVR case, you're dealing with a bunch of added complexity on every compute node, and every compute node needs a gateway IP on every subnet. That's okay in some cases, but especially for people like China Mobile, where APAC didn't quite get the same generous IPv4 allocation that some people within North America and Europe got, every IP address is valuable. So burning IP addresses, whether they're public IP addresses that are routed or floating IP addresses, is a bad thing. In the Nuage implementation, we fully distribute the routing across all of the nodes, but we don't add the complexity of DVR, and I'll talk about that in a minute. The key point here is that the same gateway IP and MAC is used everywhere. It doesn't matter which compute node on a subnet you happen to be on; there's only one gateway IP.

For floating IP, a lot of the same challenges apply, and I've already talked about most of them. Not only does each compute node require a gateway IP per subnet on DVR, I believe it still requires an IP on each floating IP subnet, which again burns an awful lot of public IPs. There was some work to try and fix that, and to the best of my knowledge it didn't quite make it, but I still hold out hope. On the Nuage side, floating IP is fully distributed, so the one-to-one NAT function occurs at the compute node. That means we do the header rewrite in OVS, and at that point the traffic is already on a floating IP. It can be handed off directly to a gateway and routed out to the internet or wherever else you want.

For security groups, when you add a new security group member, the Neutron server has to push that membership update to each compute that is a member of that tenant. That's a very expensive undertaking if you've got hundreds or thousands of VMs in a tenant, or hundreds of subnets. It used to be that this was also very burdensome on the compute node; that's better now because ipset use has been introduced on the OVS agent, so that's an improvement, but there's still a significant load on the Neutron server and database. In the Nuage implementation, there's a single update from VSD down to the VSC layer, and then the VSC uses its scale-out capabilities to push things to each VRS. One key thing to note here, and I meant to mention it earlier: this isn't a traditional controller in the sense that when a new flow is instantiated on a given compute, the flow resolution needs to go up to the controller and back down. All of that intelligence lives on the compute in the VRS, so the controller is just responsible for pushing new policy information.
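To put the security-group fan-out in concrete terms, here is a hedged openstacksdk sketch of the kind of call that triggers it; the cloud name, group name, and port are made-up values. The API call itself is cheap; the cost in the reference implementation is the RPC notification to every compute node hosting a member port, whereas in the model just described it becomes a single VSD-to-VSC update.

```python
# Illustrative only: the group and rule values are made up. Adding one rule
# or one member is cheap at the API, but in the reference implementation the
# Neutron server must then notify, over RabbitMQ RPC, every compute node
# that hosts a port in the affected security group.
import openstack

conn = openstack.connect(cloud="cmcc-test")  # hypothetical cloud name

sg = conn.network.create_security_group(
    name="app-tier",
    description="allow HTTPS traffic between members of the same group",
)

conn.network.create_security_group_rule(
    security_group_id=sg.id,
    direction="ingress",
    protocol="tcp",
    port_range_min=443,
    port_range_max=443,
    # Remote-group rules are the expensive case: every membership change
    # re-triggers the fan-out to all computes hosting group members.
    remote_group_id=sg.id,
)
```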
I said I was going to come back to metadata. In the reference implementation, metadata from the compute needs to be decorated by the Neutron server in order to have enough information to make the call into Nova, so there's a call from Neutron on the compute to Neutron on the controller to Nova. Again, more load on the Neutron database. If we can avoid that, that's probably a good thing. So in order to do so, we ensure that there's enough information at the compute that the metadata agent can make Nova API calls directly, with no Neutron interaction required.

Firewall as a Service is another very important point for China Mobile. Again, you're looking at a centralized bottleneck at each of your Firewall as a Service nodes, whether for egress traffic or for traffic between subnets. Same constraints. In the Nuage implementation, we fully distribute these firewall rules down to each of the computes, so you can communicate either between subnets through the firewall locally or out into the wide world without needing to go through a Firewall as a Service node.

So what does this mean from a design considerations perspective? I've mentioned, and many of you have seen this slide before, that OVS is complicated. You've got Linux bridges, you've got veth pairs, you've got network namespaces, you've got multiple OVS bridges. That means you've got multiple context switches through the kernel, and you've got multiple agents with independent config paths. Some of you may have been in some of the earlier sessions about how to debug Neutron on the reference implementation, and if you haven't tried it yourself, it's worth trying, because it's really, really hard. It's actually a really interesting learning experience, because this is complex, but it does use a lot of Linux functions in very interesting and creative ways, and you learn a lot trying to walk through the data path. So that's a little side note. In the Nuage case, because we're using a single OVS bridge with a complete series of rules, we can, for any given VM or any given port, just list the rules, and we can also use standard OVS tools like ofproto/trace to say, okay, this packet's dropping, what rule is causing it to drop? So you're only looking in one place on the source compute. You can see where it's left, you can see where it's gone, and you can go to the destination compute. Most of the time you don't need to do that, but from a troubleshooting perspective it's really useful when you do.

So there was a slide that Sharco presented earlier about the Neutron reference performance. I've blathered on for a number of minutes about architecture, but architecture is very nice; what happens when you actually try to compare performance? Here's the same chart that Sharco presented earlier, showing the four test cases, and there were a bunch of additional test cases in various other flavors showing recovery time. Instead of 50 minutes, you're looking at less than a minute to recover from a reboot of a VRS at a compute node. For the DHCP cases I say not applicable, because the DHCP function is distributed; rebooting the OVS agent or the DHCP agent on a DHCP node in the reference implementation corresponds to rebooting the VRS on a compute, and again that's less than a minute. Same thing down the list, less than a minute each time. We can also reboot any VRS node, any VSC node, or multiple VSC nodes, and traffic continues to pass; recovery time is within a minute or two. The VSD is a highly available cluster, so you can reboot VSD nodes, and again traffic continues to pass and things continue to work.
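For reference, here is a small sketch of that single-bridge troubleshooting workflow, driving the standard OVS tools from Python. The bridge name "br-int" and the packet 5-tuple are placeholders for illustration; the actual Nuage bridge and flow-table layout may differ.

```python
# Sketch of the "look in one place" troubleshooting flow: dump the flows on
# the single bridge, then ask ofproto/trace which rule a given packet hits.
# "br-int" and the 5-tuple below are placeholders, not values from the talk.
import subprocess

BRIDGE = "br-int"

def dump_flows(bridge=BRIDGE):
    """List all OpenFlow rules installed on the bridge."""
    return subprocess.run(
        ["ovs-ofctl", "dump-flows", bridge],
        capture_output=True, text=True, check=True,
    ).stdout

def trace_packet(flow, bridge=BRIDGE):
    """Ask OVS which rules a hypothetical packet would match, and why."""
    return subprocess.run(
        ["ovs-appctl", "ofproto/trace", bridge, flow],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    print(dump_flows())
    # Trace a TCP packet from one VM toward another to see where it drops.
    print(trace_packet(
        "in_port=5,tcp,nw_src=10.0.1.10,nw_dst=10.0.2.20,tp_dst=443"
    ))
```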
The net of all this is that there's no failure mode where you're going to be dealing with loss of traffic for large numbers of minutes. Again, you try to avoid this in production situations, but there's always something that goes wrong eventually, and being able to recover quickly is important. We've shown in the past, with containers, tens and hundreds of thousands of containers spinning up and getting connectivity in under ten minutes. So there's a lot of room to scale this while still getting very good recovery and convergence times.

Another key point is that maybe I don't want just one router, and maybe I don't want just one OpenStack instance. How do I manage connectivity in that case? In the reference implementation, you've got lots of L3 agents and lots of Neutron nodes, or in the DVR case even more nodes, and in order to route between them you need to go out to external physical routers. That means those external physical routers need connectivity and routes, and you've got traffic coming out of the rack, up, often to the edge of the data center, and back down again. That also means you're dealing with lots of routes in your physical fabric, which means you need a physical fabric capable of handling those routes. It's a whole lot simpler if you leave all of that to the overlay and ensure that, whether it's floating IP to floating IP, router to router, or OpenStack to OpenStack, you're going through policy and directly from one compute to the next. In the OpenStack-to-OpenStack case, you can either do that with a single VSD layer exposing multiple sites, or you can have multiple VSDs if you want full isolation, and you can still have federated connectivity between a Nuage instance running under Neutron in site one and one in site two.

So I said I would get back to what we should be doing in the reference implementation. We need to do some more work on distributing the Neutron control plane, on making sure it's efficient, on minimizing the centralized network services, on providing efficient intra-tenant and inter-tenant routing, minimizing mass updates, minimizing traffic to that central database, and ensuring that there are ways to simplify OpenStack inter-data-center networking. So hopefully that was helpful, showing how China Mobile has made a really compelling deployment in China using OpenStack. Any questions?

The question is whether this depends on any hardware. No, this does not depend on any hardware. This is purely software, and you can deploy it on top of whatever hardware you want. I threw up that slide with lots of things connected to lots of things, and there's only limited time, so I didn't talk about hybrid cloud beyond multiple OpenStack instances, but yes, being able to connect out to other clouds, whether they're using proprietary hypervisors or public clouds such as AWS or GCE, those are important use cases as well. We are using OVS in the kernel for the data path. We are replacing, or overloading, the OVS user space with extension capabilities. In fact, we did a presentation at the last Open vSwitch conference on how Open vSwitch could have extensions embedded in it to do exactly what we're doing. Right now we extend the OVS user space, but ideally that would be a pluggable model. So there is not an OVS agent. If I said OVS agent in the context of Nuage... oh, in the measurements. In the measurements, I used the same list of failures and then aligned them with the corresponding Nuage failure.
So in the case of the OVS agent, the upstream column said an OVS agent failure; in the Nuage case, that was a restart of the VRS service. There is no OVS agent. Sorry. Next question: upgrades. Each component is upgraded independently. Each Nuage release supports multiple releases of OpenStack, so you have the choice of upgrading OpenStack first or upgrading Nuage first, and each upgrade is entirely hitless until the point where you need to restart the VRS with a new version. In that case, again, you're looking at less than a minute of downtime, basically to restart the service. The next question is, what are the gateway options to get out of OpenStack in this model? There are a number of options. We provide a virtual gateway, and we can also provide a physical gateway. We also interoperate with a number of other hardware gateways: Arista, HP, Dell, Cumulus. We interoperate with all of the major VPN endpoints as well. What did China Mobile use? Oh, what did China Mobile use? In the China Mobile case, they were foresighted enough to buy a Nokia 7750 VPN gateway. All right, we are officially out of time. Thank you very much for your attention.