All right. Good evening, everyone. My name is Vijesh Yashadri. I work in the cloud platform engineering organization at Symantec, and I'm joined on stage by my colleague, Jason Venner, from Mirantis. I'm the chief architect. Oh, sorry. And together, we're going to talk today about the SDN performance and architectural evaluation that we did at Symantec earlier this year.

OK, so here is the agenda for the talk today. We want to start by first giving you an overview of the cloud platform engineering organization and its architecture as it relates to SDN. We then want to switch gears and talk about why SDN is important for us, what the use cases are for which we're looking to use SDN, and what our high-level test plan was for validating the different technologies we looked at. We also then want to talk about our physical test setup. We wanted to do this test at scale, so we will go through our physical environment, the network configuration, the host configuration, and also talk about a software automation framework that we wrote to be able to run a number of the tests. We'll then present the test results themselves. And lastly, probably most importantly, we want to share our key findings and insights from this multi-month effort.

All right, so with that, we're going to kick into the first section. The cloud platform engineering group: the objective for the group is to build a single consolidated cloud infrastructure that provides platform services to host all of Symantec's SaaS applications. So it's important to recognize that we are not building an OpenStack cloud that's optimized for one or a few applications. We're actually building a platform on top of OpenStack that allows us to host the entire portfolio of Symantec's SaaS applications. And this has implications for our SDN strategy. It's also important to point out that, while not on OpenStack, Symantec's cloud infrastructure already supports a wide variety of workloads across security and data management applications, already at scale.

So I'm going to walk through a few examples. This is not an exhaustive list of applications, but enough to give you a flavor of the type of workloads we expect to run on top of the OpenStack infrastructure. First, on the analytics side, we pioneered the concept of reputation-based security, where we are collecting security telemetry data from PCs and mobile devices and sending it to a cloud-based analytics backend. That backend is doing predictive analytics, processing petabytes of information to block threats and come up with reputation scores for files, URLs, and IP addresses. This system is running in production, processing over two trillion log lines, and we have about a dozen applications that are blocking about 40,000 threats a day. So significant compute and storage infrastructure already running in production.

The next one I want to call out is storage. We have consumer and enterprise backup and archival applications. And here's where the diversity of the workloads comes into the picture, because if you look at a backup workload on an OpenStack infrastructure, you have this write-once, rarely-read pattern where you have to ingest tens of petabytes of data coming into your cloud infrastructure, but that data is rarely read back. And that's a different sort of access pattern than analytics.
Finally, on the networking front, we have a hosted email security offering that reroutes email from your corporate network or a device to a Symantec-controlled data center, which allows us to inspect that email for threat detection and policy enforcement. This system in production is scanning over 1 billion emails a day across six different data centers across the globe. So if you look at the sum total of analytics, storage, and compute, I think it's important to recognize that the platform we need to build, and the SDN technology we need to evaluate and stand up, has to support a wide variety of use cases, and it has to run at scale. So with that, the last point I want to make here is that because we're building an OpenStack-based cloud platform, we're standing up a number of storage and analytics services alongside OpenStack. And hence, the concept of secure multi-tenancy is very, very important for us.

So in this slide, I want to talk a little bit about our high-level platform architecture. As I mentioned in the previous slide, we are standing up storage as well as analytics services, bare-metal storage and analytics services, that are running alongside OpenStack. So the idea is that the platform is a collection of loosely coupled services that fit into an overall framework. The important thing for us here is that we're using Keystone as the centralized identity and access control mechanism. And this has two implications for us.

The first implication is around the authentication and authorization model. We're essentially extending the Keystone authentication model into non-OpenStack services as well. So for example, in our platform, you can authenticate to Keystone, get an authentication token, and use that same token both to spin up a VM on Nova and to start a job on Hadoop, for example. So that's the first implication.

The second implication is around tenancy. We need to be able to support multiple levels of tenancy in our platform, so we've extended the concept of domains and projects in Keystone to non-OpenStack services as well. What this allows us to do is carry over the concepts of projects and domains to services like batch analytics, which is Hadoop, or stream processing, which is Storm as a service. And we need to be able to provide isolation and access control for the data that's being stored in our storage and analytics infrastructure.
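To make that first implication concrete, here is a minimal sketch of the token reuse, going directly against the Keystone v2.0 and Nova REST APIs. The endpoint names and credentials are made up, and the platform's Hadoop job-submission API shown at the end is purely hypothetical, a stand-in for illustration rather than an actual Symantec interface.

    import requests

    KEYSTONE = "http://keystone.example.com:5000/v2.0"   # assumed endpoints, not real ones
    NOVA = "http://nova.example.com:8774/v2"
    IMAGE_ID = "glance-image-uuid"                        # placeholder

    # 1. Authenticate once against Keystone and keep the token.
    resp = requests.post(KEYSTONE + "/tokens", json={
        "auth": {"tenantName": "demo",
                 "passwordCredentials": {"username": "demo", "password": "secret"}}})
    access = resp.json()["access"]
    token = access["token"]["id"]
    tenant_id = access["token"]["tenant"]["id"]
    headers = {"X-Auth-Token": token}

    # 2. Use that same token to boot a VM through Nova ...
    requests.post("%s/%s/servers" % (NOVA, tenant_id), headers=headers, json={
        "server": {"name": "worker-1", "imageRef": IMAGE_ID, "flavorRef": "2"}})

    # 3. ... and to submit a job to the platform's Hadoop service (hypothetical endpoint),
    #    which validates the very same Keystone token before accepting the job.
    requests.post("https://hadoop.platform.example.com/v1/jobs",
                  headers=headers, json={"job": {"name": "log-rollup"}})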
All right? Next one. OK. So with that, we're going to switch gears and talk about our SDN objectives and our test plan. Why is SDN important for us? The first use case is that SDN is really our vehicle to provide secure multi-tenancy through strong network isolation. What this specifically means is that we want to provide the mechanism for our security applications to have policy-driven network access, both within a project and across projects. So think about a fairly complex cloud application, and in our case it so happens that not all cloud applications fit your traditional web application tier model, where you have a web tier, an app tier, and a database tier. Most of our applications have a very heavy data processing component, and you can think of them as headless. So when you have an application profile like that, where you have multiple components, they may be exposing an external-facing web service, but they may also be leveraging analytics and storage services in the back end, how do you allow the application teams to define a network topology and to specify network access control policies across the different components? And we want to be able to provide this using REST APIs.

The second, probably equally important, reason is that a lot of our cloud applications also run on partner infrastructure, where we need to integrate with cloud service brokerage tools. Today that's a one-to-one integration, so each cloud application needs to integrate with a particular CSB tool. The idea is that the automated network provisioning SDN gives us lets us provide the right APIs once, for the platform, and they can be leveraged by all Symantec cloud applications as they integrate with the partner's infrastructure, because the entire application and the tenants for those applications can be orchestrated using software APIs.

Now, the other use case for us is that because we're standing up a number of services that are bare metal, both storage and analytics, the so-called east-west traffic needs to transit out of the SDN zone into the bare-metal zone. So if you look at the traffic flow, if you're standing up an external-facing web service, you've got a pipe coming from north to south, but because we're exposing the concept of stateless VMs as a programming model for our applications, the bulk of that traffic will actually transit out of the SDN network into the bare-metal storage and analytics infrastructure. And this is an important use case that we need to be able to test and provide. And then finally, we also want to be able to provide software-defined networking functions like load balancing, DNS, et cetera.

All right, so with those being our SDN objectives, what is the test plan that we used to go through this evaluation? We first started off by coming up with a test strategy for secure multi-tenancy, and there are two aspects to this. First, we want to test network isolation under various SDN configurations. So for example, we want to place two VMs on the same subnet and test for isolation, or put two VMs in different subnets but the same network, or put two VMs in different networks belonging to different projects. So we wanted to look at a number of different configurations and then test isolation in each of those cases. And the way we want to test isolation is to start with a deny-all approach, make sure that deny-all works by default, and then test for allow-specific. What this means is that an application team with a multi-component architecture starts off with the components unable to talk to each other, unless the team specifies network policies that allow them to talk to each other. So we wanted to test both deny-all as well as allow-specific. So that's secure multi-tenancy, the first section.
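Just to illustrate that deny-all and allow-specific pattern, and not as the exact mechanism or rules we used, the same idea can be expressed with Neutron security groups over the REST API: start from a group with no rules, then add only the specific flows the application declares. The endpoint, group name, port, and IDs here are invented for the example.

    import requests

    NEUTRON = "http://neutron.example.com:9696/v2.0"      # assumed endpoint
    headers = {"X-Auth-Token": "keystone-token"}          # token obtained as in the earlier sketch
    WEB_TIER_SG_ID = "uuid-of-the-web-tier-group"         # placeholder

    # Deny-all: create a security group and strip its default rules, so VMs placed in
    # this group can neither initiate nor accept traffic until a rule explicitly allows it.
    sg = requests.post(NEUTRON + "/security-groups", headers=headers,
                       json={"security_group": {"name": "app-tier-default-deny"}}).json()
    sg_id = sg["security_group"]["id"]
    for rule in sg["security_group"]["security_group_rules"]:
        requests.delete("%s/security-group-rules/%s" % (NEUTRON, rule["id"]), headers=headers)

    # Allow-specific: permit only TCP 8080 coming from members of the web-tier group.
    requests.post(NEUTRON + "/security-group-rules", headers=headers, json={
        "security_group_rule": {"security_group_id": sg_id,
                                "direction": "ingress",
                                "protocol": "tcp",
                                "port_range_min": 8080,
                                "port_range_max": 8080,
                                "remote_group_id": WEB_TIER_SG_ID}})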
The second section is around data plane performance, and there are two use cases here. The first one is what we call OpenStack-internal. Because we have a lot of data processing applications, we wanted to simulate a full N-by-N mesh, where we have a client-server setup using TCP and iperf and we are able to simulate full mesh network connectivity. And this is all within an OpenStack cluster, so it's all within an SDN environment. Next slide. The second use case, as I talked about earlier, was about the egress and ingress out of the OpenStack cloud. What we wanted to simulate is to take the same client-server mesh, but move the clients out of the OpenStack cloud, place them on bare metal, and then execute the same tests. And this yielded some interesting results for us. So that's data plane scalability. Third, control plane scalability. We want to test the rate of creation of various networking objects like ports, routers, active flows, et cetera. We also want to know where the solution breaks, and up to that point, verify that the desired functionality works as we would expect. So those are the three sections. And with that, I'm going to turn it over to my colleague, Jason, to talk about the test framework and test results.

Hello, everyone. I'm the chief architect at Mirantis, but they still let me play with hardware. It's a very fun job. The first thing we did: we had a fundamental assumption that Neutron OVS VLAN would be the highest performing SDN solution, so we did a test case using that as a baseline. And what we discovered, which everyone who operates large L2 zones probably knows, is that L2 zones with multiple VLANs that span many ToRs are very painful. You spend most of your time debugging the network and very little time actually using the equipment. So just say no. The other thing we did was we switched out the large L2 zone. We put each rack in its own L2, and we set up routing between the ToRs. Then we deployed the various overlay SDN solutions, and we ran the same test harness across all three environments. And this is really our preferred solution. It provides much better failure isolation. The downside, of course, is you have to pay the packet encapsulation tax, which is about 4%, and we'll come back to that later.

This was my test lab. We had seven racks of 20 dual-socket Xeon servers with between 128 and 256 gig of RAM each. We had two 10-gig NICs plumbed by LACP into a ToR, and two 40-gig connections from each ToR up into our spine. We also, as you'll note off in the bottom right there, in one of our test cases with Contrail, used Juniper MXes for our gateway service, so I stuck them on the slide.

For our test harness, we actually wrote a domain-specific language for writing these large-scale tests, so that we'd be able to execute the test cases consistently across all of the frameworks, as well as build out more and more test cases as our understanding improved of what we wanted to explore. The core of it was: we'd define a test case, it would go into a control VM that would actually instantiate the various resources, both Neutron and Nova, wire them together, then launch our test cases and collect the results. I'll show a rough sketch of what one such run looks like in a moment. Remember, lots of file descriptors are very important for large-scale testing.

Just in case we haven't said it, Neutron OVS VLAN, NSX, and Juniper Contrail were the test cases. And when we were using NSX, we carved off the bottom four servers of the rightmost two racks. Four of them were used for the NSX control servers and four of them were used for Neutron gateway services. Okay, next. I'm gonna kind of whip through these.
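Here is that rough sketch of a harness run: a stripped-down, hypothetical version with placeholder endpoints, credentials, and addresses, and plain ssh plus iperf standing in for our actual DSL. It's meant to show the shape of the orchestration, not our real code.

    import subprocess
    import requests

    NOVA = "http://nova.example.com:8774/v2/tenant-uuid"   # assumed endpoint and tenant
    HEADERS = {"X-Auth-Token": "keystone-token"}           # placeholder credentials
    IMAGE_ID = "glance-image-uuid"
    NET_ID = "neutron-network-uuid"

    def boot(name):
        # Instantiate one test VM on the network under test; returns the Nova server ID.
        body = {"server": {"name": name, "imageRef": IMAGE_ID, "flavorRef": "2",
                           "networks": [{"uuid": NET_ID}]}}
        return requests.post(NOVA + "/servers", headers=HEADERS, json=body).json()["server"]["id"]

    def run(ip, cmd):
        # The control VM drives the test VMs over ssh (keys injected at boot time).
        return subprocess.Popen(["ssh", "-o", "StrictHostKeyChecking=no", ip, cmd],
                                stdout=subprocess.PIPE)

    server_ids = [boot("iperf-vm-%d" % i) for i in range(4)]   # wire up the Nova/Neutron side
    # In the real harness we wait for the VMs to go ACTIVE and read their addresses back
    # from Nova; here the source and sink IPs are simply hard-coded for the sketch.
    sink_ips = ["10.0.0.11", "10.0.0.12"]
    source_ips = ["10.0.1.21", "10.0.1.22"]

    for ip in sink_ips:
        run(ip, "iperf -s -D")                   # one iperf server per sink VM, as a daemon
    clients = [run(src, "iperf -c %s -t 300 -f m" % dst)
               for src, dst in zip(source_ips, sink_ips)]
    results = [c.communicate()[0] for c in clients]   # collect per-pair throughput output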
When we were doing the external testing, we used an entire rack of servers to provide gateway functionality for ourselves. We used this rack here, the third one from the right, as our external rack. We basically carved it out of our OpenStack cluster and ran our many thousands of instances of iperf, sourcing and sinking, on that rack. For Juniper Contrail, we basically took the 10-gig ports off the bottom four servers of that last rack and plumbed them into two Juniper MXes, and we used those for our external routing services. And we had three control servers over here for the OpenContrail services. So here's one of the first telling differences in architecture between the two solutions: I got back a whole rack of servers, and the trade-off is two Juniper routers. I should walk over here.

So we talked about some of the test cases; we could talk about these in more detail. We had a little surprise when we first tested with OpenContrail. They were using their 1.04 version, and deny-all wasn't actually the default; the default was allow-all. And that threw us for a little bit of a loop. They changed it very quickly, but it was a surprise. We had no difficulties plumbing two VMs on different subnets connected to a Neutron router. All of the behavior was as expected; security groups, both CIDR-based and non-CIDR-based, just worked. It was lovely. When we were bridging between different networks, with NSX we had no difficulties. It behaved just the way that Neutron OVS VLAN did. Not surprising really, since the NSX team wrote the Neutron NSX code. I mean, the Neutron code originally. With the Contrail team, we had a little bit of difficulty because they didn't support overlapping IPs, which is one of our test cases. It was just a policy decision. We actually had to fake it, and it took us a little while to figure out how. I believe they've since modified that, but, you know, fun.

So we're gonna jump up to the data plane testing here. For simplicity's sake, we would basically launch somewhere between 60 and a few hundred VMs on a rack, which we would use as either a source or a sink for iperf traffic. Because of the way iperf works, it's one connection per iperf pair, and they're on a standard port, so we got lazy and used one iperf session per VM. For our traditional full mesh test, since we had seven racks, we'd have four racks sending and three racks receiving. That's this picture here. We kind of worked our way up, and it took a while for us to get this to work with every use case, between Neutron bugs, OpenStack bugs, configuration, and, I'm not gonna say there were bugs in the vendor code, but you can guess.

So overall, the bottleneck for us was the ToRs. We saturated the 80-gig links to our ToRs. This was unexpected and surprising. What was even more surprising was that we completely saturated our links and were getting near line rate out of the VMs with 1500-byte packets with both NSX and Contrail. It took work to get line rate out of Neutron OVS VLAN. Real work. So we were very, very surprised. Yeah, we basically blew up the spine. So we were very happy. What else can I say? The key difference is that with Neutron OVS VLAN, the only time we could get wire rate was when we ran jumbo frames, and getting 140 servers, seven ToR switches, two spine switches, and the controllers jumbo-frame clean is not for the faint of heart.
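For context, and only as an assumption-laden sketch since the exact option names vary by switch platform, distro, and OpenStack release, getting a path jumbo-frame clean roughly means raising the MTU consistently at every layer: the ToR and spine ports, the host bonds and bridges, and the guests themselves, something along these lines.

    # Switch side: set the port MTU to 9000 on every ToR and spine interface in the path.
    # Host side, e.g. an Ubuntu /etc/network/interfaces stanza for the LACP bond (assumed layout):
    auto bond0
    iface bond0 inet manual
        bond-slaves eth0 eth1
        mtu 9000
    # Guest side: have the Neutron DHCP agent's dnsmasq push a matching MTU via DHCP option 26,
    # by pointing dnsmasq_config_file in dhcp_agent.ini at a file containing:
    dhcp-option-force=26,9000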
So, going back: remember that when we were doing our external gateway test, we ended up having to make a double hop through the ToR that was providing the external gateway service, so our theoretical peak was 40 gigabits per second. In this horrible diagram up here in the corner, all those blue arrows cover the one ToR that everything had to loop through. And with both NSX and Contrail, we had no difficulty saturating that ToR. We did have a bug in the OVS version that was installed with our NSX, which caused it to fall over after about 10,000 concurrent connections. I believe they've since fixed that; this was in February. Contrail's vRouter, okay, first thing, Contrail doesn't use OVS. The vRouter, which is their equivalent, was completely smooth. We stopped testing at 100,000 concurrent connections through the gateway. It met our performance targets and we didn't push it. Setting these tests up takes a day sometimes, though.

For NSX, we stopped testing after we provisioned about 16,000 networks. The fundamental limiting factor is the amount of RAM in the controllers on the NSX boxes. With Contrail, we gave up at 2,500. The OpenStack tuning was different, we weren't able to create the networks at the rate we needed, and we exceeded our time box. A key learning that we'll talk about later: running OpenStack at scale requires skill and effort. 1,000 subnets per network, no problem. We were testing the number of network ports on NSX and we hit 8,000, and we spent a day trying to figure out why everything failed at that point. It turned out that tucked away in the plugin config file was a hard-coded limit, which we found after we'd wiped the cluster in preparation for the next test. And again, on Contrail, we hit about 5,000 and we just gave up. For NSX, the published limit for the number of networks attached to a virtual router was 10; we had no trouble at 512. Again, we stopped testing. We ran up to about 300 on OpenContrail and we let it go at that point. They were fine. We were very happy.

So this is one of the things I alluded to earlier. You have to tune at every level to get real performance out of your OpenStack clusters. Under 20 nodes, you're not gonna notice it. At 40 nodes, it's gonna hurt. At over 100 nodes, you're gonna be surprised. We even tune at the BIOS level. The other thing is, Neutron in Havana is designed for fairly small scale. I'm gonna be polite and leave it at that. One of the fun things for us was we rolled this out before the token expiration bug between Neutron and Nova compute was fixed, and we were trying to figure out why we couldn't provision VMs at a reasonable rate. We happened to notice that Keystone was accumulating tokens at the rate of 10 a second. So, you know, after a day, we'd have 600,000 tokens sitting there in the cache, and performance was terrible. We put the patch in for that, we changed the expiry times, we started flushing the tokens, and we put the Keystone memcache in. And then you could actually do a nova list on, you know, 1,000 VMs without it taking an eternity, and you could provision 50 VMs at a crack.
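As a rough sketch of what those Keystone changes look like, with sections and values as they appeared in a Havana-era deployment rather than as a recommendation, the token lifetime goes in keystone.conf, the token backend can move to memcache, and expired tokens get flushed on a schedule.

    # keystone.conf on the controllers (assumed Havana-era layout)
    [token]
    expiration = 3600          # shorten the token lifetime to one hour
    # either move the token backend to memcache, or keep SQL and flush it regularly:
    driver = keystone.token.backends.memcache.Token

    # /etc/cron.d/keystone-token-flush : purge expired tokens hourly so the table stays small
    0 * * * * keystone /usr/bin/keystone-manage token_flush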
That held as long as you weren't using the non-CIDR-based security group rules. I don't know what the right word for those is, but the CIDR-based ones are fine, whereas the security groups that reference other security groups became exponentially slow, and after about 300 VMs on a logical network, you couldn't wait long enough for the provisioning to finish.

The other one is: if you're gonna run at speed out of your VMs, you have to tune the network stack in your VM. Pretty much any quick Google search will tell you the basic parameters for tuning the network read and write buffer space, the TCP window size, and various other things. And if you do that, the performance is pretty good. The interesting numbers here: untuned stock Ubuntu or CentOS VMs on our fabric would get 1.8 gigabits per second VM to VM, which seems startlingly bad. If we change the TCP tunings in the kernel, still at 1500-byte packets, we get about 3.8 gigabits per second, which is pretty good. And then if we jump up to jumbo frames, we can run at wire speed. On our overlay solutions we left all the tuning parameters in but kept the 1500-byte packets, and we got wire speed, almost. We were running 32 threads, and I'm saying threads very carefully because these are hyperthreads, so it's really maybe 18 or 20 real cores, and we would burn two or three threads of CPU doing the TCP processing while running at wire speed from the VMs. It was noise. We were very happy.
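The in-guest tuning being referred to is the usual kernel sysctl knobs. As a hedged example of the sort of values involved, not the exact settings we shipped, something like this in the guest image:

    # /etc/sysctl.conf inside the guest (illustrative values)
    net.core.rmem_max = 16777216              # ceiling for socket receive buffers
    net.core.wmem_max = 16777216              # ceiling for socket send buffers
    net.ipv4.tcp_rmem = 4096 87380 16777216   # min / default / max TCP receive buffer
    net.ipv4.tcp_wmem = 4096 65536 16777216   # min / default / max TCP send buffer
    net.ipv4.tcp_window_scaling = 1           # allow TCP windows larger than 64 KB
    net.core.netdev_max_backlog = 30000       # let the kernel queue more packets before dropping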
I think I'll jump over to Vijay. Okay, thank you, Jason. Oops. So I want to conclude with a few closing thoughts here. I want to start off by saying that we believe SDN is a core capability for us to offer a secure multi-tenant cloud platform. And one of our key learnings as we tried to test the overlay solutions for strong network isolation is this: our starting assumption from a threat modeling perspective is that a VM gets compromised, and if it puts its vNIC into promiscuous mode, we want to know what that VM can do on the network. By that I mean what other network services it can access and what other tenants' resources it can probe. And while functionally we were able to validate that the overlay solutions provide that isolation, it turns out that when you stand up and conduct SDN testing at scale, you run into a lot of issues not related directly to SDN. As Jason talked about, a number of the issues we ran into had to do with the fact that deploying a 100-plus node OpenStack cluster and simulating a very high number of concurrent network connections requires a lot of tuning at the host OS level, at the database, at the message queues, and at other components in OpenStack like Keystone, et cetera.

The next key insight for us is that because our use case requires extensive traffic going in and out of the SDN zone, we really had two different choices being offered by the different vendors. NSX uses a host-based approach, which means we're able to use commodity servers to provide gateway services, but it requires additional network configuration, because wherever you place those gateway services, you need to configure your underlying network so that the traffic can bounce from the racks that contain the VMs to the racks where the gateway services are hosted and then onto bare metal. And the same happens when the traffic is coming in. So this requires some additional, fairly complex network configuration that you need to think about as you design software-based gateways for your SDN cloud. Contrail uses a standards-based approach, which allows you to talk to MPLS-enabled routers, but it requires some strong integration with the underlay network configuration. One of the interesting concepts in SDN is that you have a clean contract between the underlay network, which provides essentially basic L3 routing, and all of the software-defined networking services, not just basic networking but also load balancing, DNS, et cetera, that run on top of the underlay network. And in this case, I think you need to watch out for the strong integration between the overlay and the underlay network. This is something we learned when we did this test.

All in all, I want to say that both of the overlay solutions we looked at met our short-term performance and scalability goals. However, we are going to continue to evaluate the SDN space. We expect that this year is going to be a very interesting and important year in the SDN landscape, for two reasons. One is that the Trident 2 chipsets are nearly ubiquitous on most top-of-rack switches right now, and a number of vendors are offering software orchestration capabilities for us to do the VXLAN encap and decap at the top of the rack. And the key thing for us is orchestration through OpenStack, right? We don't want two control planes; we want to orchestrate the networking and the compute infrastructure using the same control plane. So that's an interesting trend that we're watching, and we'll see what capabilities are offered by different vendors. The other interesting thing going on, around VXLAN especially, is on the host, where the host NICs are beginning to offer VXLAN encapsulation and decapsulation on the card itself. What this allows you to do is retain the encap and decap on the host, as opposed to at the top-of-rack switch, and hence distribute that load across your compute farm. So these are the major changes we're watching for. It was a very interesting ride and a huge, multi-month effort with a big team behind it, but hopefully we have shared some of the interesting insights and results we gathered along the way, and now we'll open it up for questions. Let me add one last thing: if your network team is part of the project and you can look at the switches, it's much easier than if they're somewhere off on the other side and everything goes through a ticket-based system.

There are two mics on site, so if you could come up to the mic and ask your question. Hi, I'm yours. I work for Cisco. Sorry, I missed the first five minutes of the talk, so I hope I don't ask something that you already talked about. But two quick questions. You talked about all the different configuration settings that you needed to tune. Is that something that you documented and that you're willing to share with the broader community? And my second question is, these tests that you wrote, are those going to be shared with the community? And perhaps, I'm not sure whether you're familiar with the OpenStack Rally project, but that seems pretty close to it. Do you just wanna take the Rally one, and I'll take the config? We're looking at building a lot of this into Rally. Currently, the last time I spoke with the PTL, Boris, they didn't have the ability to orchestrate tests that required waiting on actions within the VM.
As soon as that capability is in Rally, we will look at building these large-scale network tests into Rally, and then running them on Mirantis OpenStack Express with 70,000 VMs and things like that, or on hundreds of hosts. But we're waiting for that one particular capability to be added. Well, you could help by building that capability into Rally. At Mirantis, a lot of our guys are core Rally committers. Okay, sure. So to your earlier question about the configuration: absolutely. As we talked about earlier, a lot of the config tuning we had to do had nothing to do with Neutron or networking; it had to do with the database and the message queue layer. And this is something that I think other people in the community have also experienced as they have tried to stand up clusters of more than 100 nodes. So we'll certainly look to work with the community to share what we've learned. Thank you.

Prakash from UCLA. Regarding that Keystone token expiry, is it sufficient to just set the expiration to every 24 hours in the keystone.conf file, or do we have to run the token flush every 24 hours through a cron or something? I believe we shortened the token expiration time to one hour, and we ran the token flush every hour to keep the number of tokens being flushed small and the total table size small. So flush every one hour. Yeah. If the table gets large, the flush becomes a user-impacting, SLA-impacting event. Okay. Also, indexes on the tables make a huge difference. Okay. By default, it should be one hour, right? By default, everything should be perfect. LAUGHTER

This is Rajeev from HP. I saw you covered overlapping IP addresses in your test cases. Yes. Is there a case for overlapping MAC addresses also? We didn't try. It would probably work, but we didn't actually try that test case. Would you consider that a valid test case, actually? How valid or relevant would you consider that test case to be? For MAC addresses? Yes. I mean, theoretically it's there in the spec, there can be overlapping MAC addresses, but is that something? I think a lot of these things depend on the implementation you're using. Yeah. And without testing, it's hard to say. I'm of the opinion that we're moving beyond the need for MAC addresses for management, because with these more full-featured SDN solutions, we're no longer relying on physical routing and ARP. Everything is known a priori, so we're creating the flows based on the definition rather than discovering the flows based on the pattern of usage. Yeah, okay, thank you. Thank you. Yeah.

You mentioned that you did this sometime around February and you used OVS 2.1, which I believe at that time was not a released version, right? Since then we have had a released version of 2.1 and even subsequent bug fixes. So based on that, are you considering testing it again with that, or are you looking to the community to do that? Reserving labs of this size takes time and it's a little painful, so the next time we get the lab back, we'll think about retesting. Okay, so just a rough idea, how long do you think that would be? I think we're looking at our options for doing an ongoing evaluation as we prepare to go live towards the end of this year. So yeah, we're keeping our options open on that. Thank you. Yeah. Hi, yeah. My name is Hung Wieng, working for Intel. I have two questions.
The first one is related to your test cases and the second is about your performance tuning, the MTU. If I understand it correctly, you have a test case where you move all your clients outside of OpenStack and keep your servers inside OpenStack. So let me just clarify that: we're not moving all the clients outside OpenStack. What we're exposing is a programming paradigm so that our cloud application developers can build applications that are going to be highly reliable. What I mean by this is that, for the VMs to be stateless and ephemeral, we want to provide storage APIs that allow the application to persist all its data using one of the platform APIs. And this is what allows the VMs to be ephemeral and stateless. And because of that programming model, we are having to route a lot of the traffic from the SDN onto bare metal. So I think we're coming up on the hour. We probably have time for one more question. Do you want to go?

So what was the performance like, what was the comparison? Is it better to run it outside or inside OpenStack? We were getting line rate, essentially. We were getting line rate minus about 4% from the VMs, versus line rate on the hardware. So we're paying about a 4% overhead penalty for the packet encapsulation, just the additional data we had to throw around to manage the tunnels. OK, all right. Thank you. Thank you. Yeah.

Hi, I have a question about the MTU. I noticed that you ran your VMs at 1,500. How do you get that? In my experience with VXLAN and other SDN overlays, I have to reduce the MTU for the data transmission to work. Actually, to clarify, is it 1,500 you just said? We ran 1,470. OK. OK. OK. Thanks, everyone. We're available for questions after this. But thanks for joining us. Thank you.