OK, are we on? Oh, it's working. All right, so we'll go ahead and get started. My name is Robert Tayar. I am a 15-year IT veteran, maybe 20, primarily in networking. We're going to present today on our current server farm topologies, as well as our new CLOS topology. Towards the end, you're going to see some hardcore numbers on VXLAN and EVPN that we did with load generators. I'll start off by covering our requirements and why we went to this topology, because when you go to cloud, you should always ask why. The second person speaking today is Shree, and he'll go through our new security requirements, followed by Jason, who will cover those hardcore numbers.

First slide: Walmart's big. We're all reminded of this. 2.2 million associates; we're bigger than some countries. The reason we bring this up is that as a large company, we also have large processes. This is not how long it takes to do new compute anymore, but when I came in five years ago, it was pretty bad. And for new compute, each team had to be hit in order. I'm not listing all of the teams, but suffice to say, it could be six months, sometimes a year, before you actually got the compute on the floor. Why? As you hit each team, we had processes, the typical processes you see in a Fortune 500: a security review, getting onto the facilities floor, change controls, Clarity. I'm not listing all of the processes, but every single team had to iterate through its own process. So you had extremely talented engineers, working extremely hard, who were buried in process. So why did we go cloud? It had as much to do with process as it did with tech.

Our current server farms: the traditional core/distribution/access model. For the past 15 years, as far back as I can remember, this is what network engineers have designed towards. I'm going to bet most of you have the same type of topology somewhere in your data centers: Nexus 7Ks, 5Ks, and a bunch of FEXes. Sorry, the clicker's gone crazy; I told you, manual. Let me just stick with it. I'm sorry, guys. OK. So, turn-up is resource intensive at every point. It's kind of a bad diagram, but the bottom line is we had to engage a team for the firewall rules, another team for the network services, another team for the network, and a completely different team for the servers. And you had to follow the same process I showed you on the previous slide. So even turning up a new network was resource intensive. All of the network engineers want layer three, because we don't like spanning tree issues, and all of the application and server folks want layer two for clustering. We can't agree on anything. It was approximately $1.1 million before you could put the first compute into the topology, and it took forever to put the first topology in. And large spanning tree domains, which will be a theme on the next slides. Go ahead and click it.

Engineers and the business didn't exactly have the same requirements. I'm not going to go through all of these, but there are a couple of themes. We needed to stop outages. We needed to stop human error, and a lot of that had to do with spanning tree. And we had new security requirements. We had to build dedicated infrastructure for PCI and other types of security, so we would literally put firewalls at the top of the rack: very expensive, overbuilt. And you had to do it every time you wanted a new security domain. Go ahead and click. So why CLOS?
Well, it's very deterministic. No matter where you are in the CLOS topology, you're three hops away. In that traditional 7K/5K/2K model, if I wanted to go across an L3 boundary, you had to go up to the distribution switch; just a lot of hops. And we had problems with organic growth. In other words, the business would come to the engineers, whether on the West Coast or the East Coast or in Bentonville, and they would say, I want compute, and I want compute right now. And we would rush to get the compute in before lockdown in October. And layer two might have gotten a little out of control as we were trying to turn it up quickly. So with a CLOS fabric, it's three hops anywhere. There are no snowflakes. We are using the same access cabinets whether it's Cassandra, Ceph, general compute, name your workload; same top of racks. It's the same. I can go to 12,000 ports, I think it's on the next slide. A lot of ports. I can scale insanely wide.

Our current high-level topology: we use OSPF strictly to learn the endpoints. It is our underlay. That's what I'm responsible for. We have two very talented engineers, one who couldn't be here today named Daniel Justice, and Pavan, who's somewhere in the audience. We are at about 90% no-touch. I don't touch the switches anymore; I'm getting rusty. The underlay is completely auto-deployed. We PXE-boot the switches, it auto-carves the addressing and applies it to the config. Even the /31s that are in this topology are automated. PCI has to traverse the distribution layer between the internal and the external border leaves. MP-BGP is what we use for the control plane. The VXLAN standard does not dictate a particular control plane; we could have used multicast. We preferred BGP because most network engineers know BGP. It's a standard table, much easier to troubleshoot. He'll go back one more real quick, and I'll wrap up. 12,288 compute ports, 12 OpenStack fault domains. All of the L2 and the L3 terminates in the top of rack. I'm going to pass it over to Shree, who's going to focus in on the security requirements, and then we'll get to the performance numbers.

Thank you, Bobby. So today I'm going to be covering some of the security opportunities and challenges that we faced at Walmart and how we tried to solve them in this iteration of our CLOS design. At any big company, there are two competing priorities. You want your developers to be able to move fast, to build services and deploy applications into production quickly. But on the other hand, you also want some sort of governance and process around things getting to production, especially around security. Prior to this, most of the workloads were being handled through physical buildouts, which are slow and process driven. With the new design, we're hoping to ease some of that pain and build security into the deployment process itself. So one of the first things we settled on was that we're going to run multiple security tiers or application tiers on the same physical infrastructure, similar to how public clouds run. You go to Azure, you can run all your workloads on Azure, but you don't have different clouds for your PCI or management workloads versus your web tier. The other requirement was to enable fine-grained access controls between applications and services. The third one was to bake that security process into the application lifecycle. We're not going to touch on this a lot today, but if you want to talk about it, we can do it after the talk.
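(Looking back at the underlay automation Bobby described, a minimal Python sketch of that kind of /31 carving might look like the following. The supernet, link names, and function are illustrative assumptions, not Walmart's actual tooling.)

    import ipaddress

    # Hypothetical supernet reserved for spine<->leaf point-to-point links.
    UNDERLAY_SUPERNET = ipaddress.ip_network("10.255.0.0/24")

    def carve_p2p_links(links):
        """Hand out one /31 per fabric link and return the two interface addresses."""
        plan = {}
        for link, subnet in zip(links, UNDERLAY_SUPERNET.subnets(new_prefix=31)):
            spine_ip, leaf_ip = subnet[0], subnet[1]  # a /31 has exactly two usable addresses
            plan[link] = {"spine": f"{spine_ip}/31", "leaf": f"{leaf_ip}/31"}
        return plan

    if __name__ == "__main__":
        links = [f"spine{s}:leaf{l}" for s in (1, 2) for l in range(1, 5)]
        for link, addrs in carve_p2p_links(links).items():
            print(link, addrs)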
We have an open source tool called OneOps that we use to control some of this policy. The last two are to enable our security team to have visibility into the cloud, to see what's happening, and to have some sort of control over who can change policy, with the right review for policy changes. So before I get into the design and the solution: we're using VRFs to do core segmentation between our security tiers. For people who are not familiar with what a VRF is, here's the Wikipedia definition. I'm not going to read it, but at a high level, from a network point of view, it's a separate routing table running within the same physical router. And for people who are more familiar with Linux, think of it as a Linux namespace, where you have a separate network stack to handle different tiers or applications.

So one of the first requirements was to be able to do network segmentation between applications and to do fine-grained controls between apps. Hopefully the diagram is a little bit clearer on the big screen. We're using the VRF as a security boundary. So we have multiple VRFs based on our security tiers. We might have a VRF for PCI. We would have a VRF for our management traffic. We might have a VRF for internal traffic, and then another VRF for handling web tier or application traffic, which doesn't need a whole lot of security around it. And the reason we did this was to leverage the same network infrastructure but do different things based on security policy. So for instance, the PCI VRF can send traffic out to a firewall or IDS system and a security stack, which is needed to handle that sort of data classification or security tier. We are mostly using stock Neutron with provider and tenant networks and mapping each of those provider networks to one of those VRFs. It's a one-to-one mapping: a provider network, or tenant network, or VLAN maps to one of those security tiers. So a PCI workload would be on a provider VLAN, and that VLAN could be shared between tenants at the same security level.

The three options that we give our users for fine-grained access controls are these. First, the physical firewalls up there. That's done for the most secure workloads, where we might need a web application firewall, an IDS/IPS system, and additional firewall functionality which is not within OpenStack itself. The second option is to use security groups within OpenStack to do fine-grained controls. I'm going to talk a little later about the challenges of using security groups and the issues with trying to do cross-region or cross-tenant access controls. And the last option is host-based or agent-based controls. By this I mean you could use something as simple as Chef or Puppet to control iptables on the box, or you can use agent-based tools. There are multiple agent-based solutions out there that not only do access controls, but will also do file integrity checks and a whole bunch of other things to meet some of your policy requirements.

So some of the challenges that we face in this environment: because it's an ephemeral cloud environment, keeping track of which IP falls into which security tier is difficult. The information is available to you, but there's no good tool to tie this together across all of our regions, across all of our clouds, across bare metal. So that's one of the challenges we face, whereas within a single OpenStack region, you can do fine-grained access controls.
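(To make the security-group option concrete, here is a minimal sketch of a within-region, fine-grained rule using the openstacksdk Python client; the cloud name, group name, port, and CIDR are placeholder assumptions, not Walmart's actual policy.)

    import openstack

    # "walmart-dev" is a hypothetical clouds.yaml entry; credentials/region are assumed.
    conn = openstack.connect(cloud="walmart-dev")

    # Create a group and allow HTTPS only from an internal-tier CIDR (made-up values).
    sg = conn.network.create_security_group(
        name="app-web-ingress",
        description="Allow HTTPS from the internal tier only",
    )
    conn.network.create_security_group_rule(
        security_group_id=sg.id,
        direction="ingress",
        ethertype="IPv4",
        protocol="tcp",
        port_range_min=443,
        port_range_max=443,
        remote_ip_prefix="10.20.0.0/16",
    )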
But as soon as you try going across regions, you need something to tie IP to security tier across regions, and right now we're still exploring a couple of options there. The other issues are all around manageability across multiple regions: having a centralized view of your infrastructure for security, and not having visibility into what traffic flows between your instances or containers across regions. Within a single region, you have that visibility; you can enable functionality within OpenStack to log drops and feed that into your infosec system. And the last one is being able to audit policy across your infrastructure. Right now, it's a little painful to actually do that and have a centralized view across all of your OpenStack regions.

Things we hope will solve some of our pain points: Firewall-as-a-Service. Right now, it's not production ready; I believe the last update was that the APIs are being redone. We hope it's something we can leverage in the future. Another big thing is that our security team needs to be able to tap and see what traffic is flowing in and out of VMs or containers. We've been following Tap-as-a-Service for a while now. I think there have been presentations on it at a couple of summits, but it's still not in upstream Neutron as yet. And the last one is actually being able to virtualize our entire security stack: having a policy around deployment and having that all virtualized with VNFs and orchestrated using OpenStack itself. So now I'm going to hand it over to Jason to go over more of the details on our design.

Yep, so I'm going to talk mostly from a Neutron perspective and how we tied BGP EVPN into Neutron. And I think the important aspect to understand is that there was a pretty aggressive timeline for the solution. As Bobby hit on earlier, we had a lot of Layer 2 issues, and our number one goal was to solve those issues. So we wanted something that had great performance at scale. Being Walmart, any solution we deploy has to be highly scalable, and also something that we could quickly deploy to fix the issues that were touched on earlier. And for that, we're leveraging MP-BGP EVPN. Operationally, it was very easy to integrate. We already have the top of rack switches; everything's monitored. From a Neutron perspective, we're already using Linux Bridge. So being able to abstract this, and not adding all the complexity of OVS in Neutron, allowed us to build the solution and actually get it deployed quickly.

So this is just a high-level overview of what an OpenStack region looks like. As you can see, we're terminating Layer 2 at each top of rack switch, and then we're extending the provider networks with EVPN across multiple racks. And like I said earlier, Neutron is just plumbing in the provider network through Linux Bridge. This is a more detailed view of the compute node: how a VLAN that we tag down from the top of rack switch actually maps into our EVPN on the top of rack, and how the hardware VTEP will route out the traffic. I'm just going to go through each of the steps. On the compute node, we have an instance which is plumbed into the Linux bridge, which is the qbr bridge. Then we have a VLAN sub-interface, which maps to a specific provider network. And that provider network will map into a tenant VRF. So if we wanted to, let's say, launch an instance in the PCI zone, we would just associate it with the provider network that's in the PCI VRF.
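(A minimal sketch of that step with the openstacksdk Python client might look like the following; the cloud entry, network, image, and flavor names are placeholders, and in practice OneOps drives the provisioning rather than a hand-written script.)

    import openstack

    conn = openstack.connect(cloud="walmart-dev")  # hypothetical clouds.yaml entry

    # Find the provider network that lives in the PCI VRF (name is a placeholder).
    net = conn.network.find_network("provider-pci-1301")
    image = conn.compute.find_image("ubuntu-14.04")
    flavor = conn.compute.find_flavor("m1.medium")

    # Launching on that network is what lands the instance in the PCI security tier.
    server = conn.compute.create_server(
        name="pci-app-01",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": net.id}],
    )
    server = conn.compute.wait_for_server(server)
    print(server.name, "is", server.status)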
And then once it hits the top of rack switch, we're associating the VLAN, in this case 1301, to VNI 5001. And as you can see, it lives in a specific VRF. From there, it'll hit the VTEP, and the VTEP will either do Layer 2 bridging, if it's going to another endpoint in the same logical network, or, if it's going to a different VNI, it'll do L3 VXLAN routing. So I touched on this earlier: we wanted to move fast. There are a lot of different solutions that we're looking at longer term, like application-driven networking and integrating with OpenStack via the ML2 plugin. But this was a quick win that we could use to solve our existing issues and operational challenges. And one of the nice things with this solution is we get high performance; I'm going to go into detailed benchmarks showing what kind of performance you could expect from this model. And also distributed routing. We looked at the DVR solution in Neutron, and that adds significant complexity, especially for our compute team to manage. So by leveraging this model, we keep all the networking guys doing the networking, and on the compute side, they just have to worry about Linux Bridge.

So just an overview of the test setup that we conducted the testing with. We used standard HP DL360 servers with 512 gig of memory; this is just our typical hypervisor configuration. For the networking, we used the Cisco 92160 top of rack switches. Each compute node is connected to the top of rack with 10 gig links, and each of the top of racks does ECMP up to two spines at 40 gig. All tests were done on Ubuntu 14.04.

And then just describing the specific tests. We did a couple of tests. One was a TCP and UDP throughput test at varying packet sizes; we wanted to measure the performance at small packet sizes and also at large. For the latency test, we used netperf, doing a TCP request/response test to measure how many transactions per second we could do, so the higher the better. And then we also wanted to stress the scale of the fabric, so we wrote a couple of custom Python scripts that basically do an ARP and send an ICMP packet, and that'll simulate thousands and thousands of endpoints to see how the performance is with large table sizes.

So a couple of the performance tests we did. One was L2 VXLAN: what's the performance within a logical network? Then L3 VXLAN: routing across different VNIs. And the final test was to measure the performance from one VRF to another VRF in the same fabric. And also, as a stress test, we used an application mix that simulated 5,000 unique flows with the varying types of traffic that you would commonly find: SMTP, RDBMS, HTTPS, BitTorrent. That was transferring at around 50 gigabits per second while I conducted all of the L2 VXLAN, L3 VXLAN, and inter-VRF tests, to see if load on the fabric had any impact on performance. And then finally was the 50k table size test using the custom Python scripts, where we also wanted to see if that had any impact.

So this is just an overview of each of the tests. You can see that all the tests were conducted across two endpoints connected to two different top of rack switches, so three hops away, with the exception of the VRF test. The VRF test actually has to go up to the external border leaf, because that's where we're terminating all the VRFs.
And that's where the firewall and security stacks for cross-VRF traffic live. Right between the internal border leaf and external border leaf is where all the security devices will live, so we can apply policy across VRFs and zones. And to start the benchmark for the Layer 2 performance, the first thing I wanted to do was determine the baseline performance of the top of rack switch with just a simple VLAN and two compute nodes connected to that same top of rack switch, so one switch hop. As you see on the throughput test, when it says L2, that's basically the baseline test; that's one hop. Next to it, you have the Layer 2 VXLAN test, and that one is going three hops, so it's going ToR, spine, ToR. So there is a little latency introduced because we're going through more devices. But on the throughput test, you can see the percent decrease; that basically measures whether there was a performance decrease. If it's positive, there was; if it was negative, it was actually slightly faster. And as you can see, the maximum out of all the tests, at different packet sizes, both with TCP and UDP, was 2%. That can just be attributed to the variation between tests; the variation was about 3% to 5% during each test run. On the latency, you can see that the highest was 10% at small packet sizes. That's the TCP request/response test, and the main reason for that I attribute to the additional network hops.

And then on the right side, this is where you can actually see the impact that the table size of 50,000 endpoints or the application mix had on the tests. So while we were doing the performance test, we had two bare metal endpoints that were transmitting traffic between each other, and then we had several other bare metal hosts generating the app mix traffic and also simulating the large table sizes. So the systems under test were not running the app mix. And as you can see, there was really no performance difference, both for throughput and latency.

For the Layer 3 performance, we're measuring the difference between the Layer 2 VXLAN and the Layer 3 routed VXLAN performance, so routing across different VNIs. And as you can see, there is no delta. I mean, right here, we're at like 0%. So we're running essentially at line rate with VXLAN, because the VXLAN is offloaded in the top of rack switch. And again, the application mix and the large table sizes didn't have any impact on performance. And the final test was the inter-VRF test, measuring that performance, and even there, there was no drop in throughput. The only thing that increased was the latency, and that could be a result of going through additional network hops, because, as I showed before, it has to go all the way up to the border leaf.

So in summary: this solution, with Linux Bridge using MP-BGP EVPN and VXLAN offload, performs very close to bare metal speeds. There was no noticeable performance decrease. And we know it scales to at least 50,000 endpoints. At one point, we even had 90,000 MAC addresses in the fabric, and there was no noticeable issue; everything kept running and working. So with that, I'm going to lead into questions.
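(The custom endpoint-simulation scripts weren't described in detail, but a rough sketch of the general approach, using Scapy to announce a fake MAC/IP with a gratuitous ARP and then source an ICMP echo from it, could look like this; the interface, addressing, endpoint count, and target are assumptions, and it needs root to send raw frames.)

    from scapy.all import ARP, Ether, ICMP, IP, sendp

    IFACE = "eth1"            # interface facing the fabric (assumption)
    TARGET = "172.16.0.10"    # a real endpoint on the far side of the fabric (assumption)

    def fake_endpoint(i):
        """Return a made-up MAC/IP pair for simulated endpoint number i."""
        mac = "02:00:00:%02x:%02x:%02x" % (i >> 16 & 0xFF, i >> 8 & 0xFF, i & 0xFF)
        ip = "172.16.%d.%d" % (1 + i // 250, 1 + i % 250)
        return mac, ip

    for i in range(50_000):
        mac, ip = fake_endpoint(i)
        # Gratuitous ARP so the top of rack learns the MAC/IP and advertises it via EVPN.
        sendp(Ether(src=mac, dst="ff:ff:ff:ff:ff:ff") /
              ARP(op=2, hwsrc=mac, psrc=ip, hwdst="ff:ff:ff:ff:ff:ff", pdst=ip),
              iface=IFACE, verbose=False)
        # ICMP echo sourced from the fake endpoint, to generate routed traffic as well.
        sendp(Ether(src=mac) / IP(src=ip, dst=TARGET) / ICMP(),
              iface=IFACE, verbose=False)

Repeating this for tens of thousands of fake endpoints is what pushes the fabric's MAC and host tables toward the 50k sizes used in the tests.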
So first off, this is probably one of the best use case sessions I've been to all week, so thank you for that. On the testing, did you do testing between VMs, either in the same tenant or between tenants, on a single physical server, and look at the load and latency there as well?

So for all the testing, one of the goals was to measure the performance of the fabric itself, and I wanted to eliminate as much in the data path as possible, because we're using Linux Bridge today; we already know the performance of it, it's already tested. So all the tests were on bare metal servers with 802.1Q interfaces, and I would run the test between two bare metal hosts that lived in two different racks with two different top of racks.

One point of clarification: I think it was stated that the VRFs lived up on the external border leaves. If we go up to the external border leaves, those are the switches at the very top. The fabric actually ends at the internal border leaves. And we're playing a little old network trick, because I'm an old grizzled network guy, and we worked with our vendor: we're actually trunking across all VRFs up to the external border leaf. So the external border leaf is not VRF-aware, right? The fabric actually ends at the internal border leaf. That's just a layer three hop; it's trunked through the firewall, layer three out. Thank you.

Question over here on this mic. Can you talk a little bit about failure modes, resiliency, any testing you've done there, maybe on the BGP part in particular?

No, we ran out of time. That's probably the next thing that Jason and I have to test. We have tested some failures, but I'd rather not go into it until we have XE up and running and I can give you some hardcore bench numbers. So yeah, that's where we're at. That's the next testing we'll do.

I have a question. In your design, there should be an SDN controller or a switch manager in the fabric, right?

Yeah, so there's that. So when you say a controller, you mean something Neutron is talking to as a plug-in?

But as far as I know, a Neutron plug-in cannot talk to the switches and tell the switch, you need a VRF, you need a bridge domain, or something.

Yeah, so for that, that's a good question. To add, let's say, a new VRF, or provision a new VLAN that's mapped to a VNI, we were initially leveraging Cisco VTS, which will actually talk to all of the top of rack and border leaf switches. So we just go in there and add a new provider network, say it's associated with this VRF, and it'll pull from a VLAN pool that we specified and provision it across. For the future, we're looking at moving all of that configuration and automation to Ansible, so every time we want to add a new VRF or a new provider logical network, we'll just use Ansible. And then longer term is to move to a controller model.

OK, I have heard of VTS, and therefore the other question: you have shown the test results for data plane performance, but if you introduce a controller, the biggest problem is control plane performance. Since you have two databases, right, the OpenStack database and the VTS database, there could be problems like the databases becoming inconsistent, or transactions not going through correctly.

No, so as a clarification, we're a very large company, and so we have to contend with a lot of different types of workloads.
So the one thing about us working with a controller: we are not using VTS the way VTS was intended to be used. We're kind of beating up our vendor to make it work how we want it to, and they've been phenomenal about working with us. We're using it as configuration management only, right? It's literally just pushing out, here's the VLAN in the appropriate VRF. If I personally have to figure out which VLAN goes in which VRF very quickly, it can get very kludgy in the config if you do it by hand. So it's just a way to automate the configs. OK, thank you.

Question about how you guys are spanning VLANs using VXLAN across different racks. Are you doing anything to egress locally from a rack which does not have the default gateway pinned down on it?

Explain a little further, I'm sorry.

So you guys are taking a provider network VLAN and putting it across multiple racks, right? Those top of racks are going to be your L2 or L3 terminations.

Well, you're talking about just a straight-up Layer 2 VLAN, like a cluster VLAN? Yeah. Yeah, we do that; there's no difference. We either make it an SVI or we don't. So the internal border leaf either has an SVI for it, or, if it doesn't, the Layer 2 stays local. So yes, those numbers that he was testing for the L2, that was just a VLAN with no SVI.

So the SVIs are all the way up on the internal border leaves for the provider networks?

Yes. Well, I was going to say, we're using the distributed anycast router function, so the SVIs really live in every top of rack as well.

Hello? Yeah. So the question I wanted to ask is not so much on the technology side. I think I heard this correctly, that you guys were on an L2 topology. So what was the change you had to undergo, from an architecture and operations team perspective, to go directly to MP-BGP EVPN? And I'll ask the second question after this.

That's what we're about to find out.

And the second question, which I think the gentleman out there already answered, is that right now it's not controlled by OpenStack. So there's nobody from your app developer team talking to a Neutron plug-in into some SDN controller that will control this. Or do workload requests come in through some sort of ticket, and then somebody does something about it?

Yeah. So right now we have an abstraction layer above our OpenStack regions; it's called OneOps. So once we provision a new provider network in our fabric here, we can orchestrate adding it into OneOps, and then OneOps will actually talk to OpenStack and provision the instances. And we are currently not doing this by hand. VTS has APIs, so if I've got to push a new network, we've already scripted it. Everything that you see us doing is completely scripted. We need to clean it up; the two engineers I spoke about, Pavan and Daniel, that's what we have them working on. So we'll have more good stuff at the next OpenStack Summit.

So the request would still be coming in from your app team to somebody, and that somebody will then execute the scripts, correct?

Not quite, right? So we're going to have different VLANs, right, different security tiers. And then within our PaaS platform, tenants can be mapped to, say, the PCI tier, and then the PaaS tier is automatically going to launch your instances on the right network. But like he said, all of the automation around building those VLANs and tying them to VRFs is handled through the physical layer.
That we're going to know ahead of time, in terms of when do we need it, are we running out of space, do we need more? And that's done by the underlay team. Then, once our PaaS platform has the information, it's hidden from the customers; they just launch instances, and they happen to land on the right network. Thank you.

Hi. I've got three questions. The first one is with regards to Linux Bridge and the decision to go to Linux Bridge instead of OVS. The second question is with regards to latency: what was your test environment for that, and how did you measure latency? And the third question is related to packet loss. With your traffic generation, when you're going through the architecture, whether it's north-south or east-west, how are you measuring that?

All right. So the first one: why did we choose Linux Bridge? We're deployed on the Liberty release of OpenStack, and one of the things is we already use Linux Bridge, so everyone's comfortable with it; it's proven in production. And then also OVS, it's fixed now, but whenever you restarted the L2 agent, it would drop all your flows when you're leveraging a provider network. From my understanding that's now resolved in Mitaka, but we're still on the Liberty release.

Then the second one, how do we measure the latency? We used netperf; it's an open source benchmarking tool. So just doing a simple TCP request/response test between two bare metal endpoints. One was running the netperf server, and another was the client, and it measured the time for a request to go from client to server, and then from server back to the client. So more transactions per second actually meant the latency on that test was lower (roughly, round-trip latency is the inverse of the transaction rate); the higher numbers on that one were actually better. And then what was the third question?

The packet loss.

Yeah, so for the TCP throughput test, there was no packet loss on those, because it's using the TCP window. But for the UDP test, that's probably the one you're mainly referring to. When we were running the test, the packet loss was minimal; it was about 1% to 5% from what I was seeing. And the goal of the UDP test was really to see, hey, how much bandwidth can I throw through the pipe, and how does that compare to the Layer 2 baseline test? And for the UDP traffic, I didn't notice a difference from the Layer 2 test without VXLAN, to the Layer 2 VXLAN test, to the Layer 3 VXLAN test. So it was consistent across all the tests.

Sorry, just one more question. On the spines and the leaves, was it all white box, and what were you running?

We are going to test other topologies; this was the first one that we tested. We have a pretty big partnership with a vendor that supports all of our stores, and you saw how many stores we have. So those were Cisco Nexus 9508s for the spines, and the top of racks were 92160s, brand new switches. They have 1, 10, and 25 gig to the servers, and 40 and 100 gig native uplinks. So pretty beefy switches; they performed very well. But we are also looking at Arista, and we're also looking at Cumulus. So the same testing that you see, we're doing that in our lab, literally like next week.

Are your next plans to start looking at network monitoring and visibility?

That's a huge topic. That's a great question, and the answer is yes. Thank you. Actually, I'll add on to that.
Do remember we're very much an enterprise, so we've been using HP NNMi and StatSeeker, and we realized that we needed to look at some open systems. Our other org, our compatriots on the West Coast — I wish we'd had one of them up here to answer that question; maybe we can answer it after this. But yes, there are a lot of open systems, a lot of network management options. It's very impressive. Right, we're out of time, folks. Thank you for coming. Thank you.