Hi, my name is Justin Pettit. I'm one of the core developers on the Open vSwitch project, and I'm here to talk about Open vSwitch and the intelligent edge. For the purposes of this discussion, we're going to call the hypervisor the edge. In this diagram you can see the virtual machines running inside the hypervisor with OVS, and this constitutes the edge before packets are sent out into the greater network. The edge is really a unique position within the network. It has the benefit of isolation that a guest agent doing this sort of protection does not have, and it has greater context than an in-network device. An in-network device either has to infer state from packet headers, which can be easily spoofed, or rely on tags, which are usually limited in size. Quite often you'll see VLAN tags used, and those have only 12 bits of data to carry all the context about a flow. And, as I mentioned, if you run an agent in the guest, which has a lot more context, you risk having that agent attacked because it's co-located with the code it's monitoring. Another advantage is that you get to enforce policies earlier, because enforcement happens right next to where the traffic is being generated. In a lot of cloud environments you have oversubscribed links, so it's nice to enforce policy before you send the traffic out on the link. And often the guests are not trusted, so it's better to enforce policy at the hypervisor, where you know where the traffic is coming from, than out in the network, where that can be more difficult to determine. You also have the ability for different parts of the system to communicate with each other.
For example, if you have a service VM running an intrusion detection system, and it detects some sort of problem, it can communicate with the networking subsystem (for example, OVS) and isolate the traffic from the system that's been determined to be compromised. So this is an ideal location for network control and visibility. We're able to infer state by observing traffic as it goes by, and we also have the ability to do some introspection by looking inside the guest if you have a small amount of code running there as an agent; we'll talk about both of these a little later. Also, if you're running an overlay network, where you're taking traffic and sticking it into a tunnel, that encapsulation is actually done at the edge, so we know the mapping of the logical traffic to the physical. If we want to do something with that traffic, we know what it will look like on the wire, which can be more difficult once it's in the tunnel and out in the fabric. The first two bullet points I have here about modifying behavior are things you can currently do in OVS. Since we're at the ingress and egress of the tunnel, we can enforce policy at the very beginning or at the very end, as the traffic goes into or comes out of the tunnel. We can also modify bits in the inner or the outer packet. For example, if we wanted to do some DSCP marking, we can mark the outer packet based on information we saw in the inner packet; I'll show some examples of that later on. The last three are things we've been investigating as additional things we can do at the edge. One of them is TCP pacing. Right now, if TCP traffic is being sent from a VM, it may be handed to the NIC as large TSO segments.
The NIC then breaks them up into MTU-sized packets and jams them out on the network as fast as possible. We've talked to a couple of NIC vendors about doing what we call TCP pacing: you still send these large segments to the NIC, but rather than sending the resulting packets all at once, the NIC paces them out a little so that you don't fill up buffers along the way. We've also looked at doing sort of the opposite. Sometimes, when you have many TCP stacks communicating through a single box, you can end up with synchronization of their TCP state, so everyone ramps up and backs off in the same way and you end up filling the buffers. By introducing a little bit of jitter every once in a while, you can break that up. Another option is flowlets, from a research paper, where the idea is that you can send a single TCP stream over multiple links without reordering: by judging the round-trip time, you only switch to a different link once a gap longer than the round-trip time has passed, so the segments stay in order. These are all areas where we're actively looking at adding support. In terms of inferring state, this is something commonly done today on switches by people implementing policies. The most common case: OpenStack will often give you the MAC address or the IP address, but if it doesn't, you can learn the MAC and IP the first time the VM comes up and starts communicating. Something we don't currently support in OVS, but people have written daemons for, is IGMP and DHCP snooping, so that you can dynamically learn the multicast group or the IP address being used by the guest VM. And then you can obviously see which pairs of systems are talking, and you can see their flow characteristics.
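As a concrete illustration of learning dynamically at the edge, the kind of rule a daemon might install looks something like this, in the spirit of the classic OVS learn-action MAC-learning example (the bridge name is hypothetical):

```shell
# Every packet teaches table 1 which port its source MAC was seen on,
# with a 300-second timeout; traffic is then resubmitted to table 1,
# where the learned entries can be matched.
ovs-ofctl add-flow br0 "table=0,actions=learn(table=1,hard_timeout=300,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IN_PORT[]->NXM_NX_REG0[0..15]),resubmit(,1)"
```

The same learn-action machinery can be pointed at other header fields, which is how a snooping daemon could populate state without user-space involvement on every packet.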
This is the advantage of being at the edge and seeing all of the traffic your guests produce. A newer area we've started looking at is guest introspection. The idea there is that you have an agent running in the VM, and it communicates with a daemon running in the hypervisor; that daemon can query the agent to find out what's going on inside the guest. I mentioned before that there's some risk in running agents in the guest, and that's still true. But the idea with this agent is that the code is relatively small and it doesn't do enforcement; it just returns information. And if the hypervisor is hosting it, the hypervisor can mark the agent's pages read-only, since the agent doesn't have to maintain any state, it's just retrieving data, and the hypervisor will fault if anything in the guest tries to modify those pages. So you do have some code running there, but it's relatively small and it's protected by the hypervisor. The types of data such an agent can get include the users on the system. It can look at all traffic and determine, for each flow, who is generating it: both the user and the application. And when you identify the application, you can determine not just that it happens to be Firefox or IIS, but the exact version. For example, you could pull the SHA1 hash of the binary, and there are services like Bit9 that can take that hash and tell you which version it corresponds to. If you know that version has a vulnerability, you can enforce your policies based on that information.
In addition to all of this identity information, you can look at data transfer rates and socket queue depth, so that if a particular application is sending traffic more quickly than it can be put on the network, you can identify that; we'll talk about that a little later. And there are general system characteristics, for example the kernel version or anything else you'd want to query about the system. As you can imagine, there are quite a few applications for this state. For QoS and load balancing, you can now make decisions based on the user or the application, at much finer granularity than before. You can also select traffic to be sent to different middleboxes depending on what the application is, which is something NFV is doing. And you can implement better firewalls and elephant flow detection; the rest of the talk will be about these last two bullet points. Currently in OVS, there are basically two ways to implement a firewall. The first, which we added just a couple of releases ago, is the ability to match on TCP flags. To enforce a policy this way, you write your rules based on the SYN flag, so you either allow or deny traffic based on SYN, and then you allow all ACKs and RSTs through. Doing this is fast: one of the things we introduced in version 1.10 is megaflows, which allow wildcarding in the kernel, so flag matching can be pushed down into the kernel and it's very fast. This is also how a lot of hardware ASICs implement their firewall policies, because they can't maintain state for all the flows going through the system, so they match on flags instead. The problem, as you can imagine, is that we're not actually keeping track of the flows.
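The flag-based policy just described might look something like this as OpenFlow rules (bridge and port numbers are hypothetical):

```shell
# Drop inbound connection attempts (SYN set, ACK clear) arriving on the
# uplink port, while letting everything else, including ACKs and RSTs, pass.
ovs-ofctl add-flow br0 "priority=100,tcp,in_port=1,tcp_flags=+syn-ack,actions=drop"
ovs-ofctl add-flow br0 "priority=0,actions=normal"
```

Because `tcp_flags` is an ordinary match field, rules like these can be wildcarded into megaflows and stay in the kernel fast path.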
Since you're allowing any packet with ACK or RST through, certain kinds of scanning can get through; there are certainly security implications of doing that. It also only works with TCP. The other way, which I believe is how OpenStack currently implements firewalls, is to use the learn action extension we have in OVS. There, you implement your policy on port 80, let's say, allowing the traffic through, and the learn action dynamically writes a new OpenFlow rule based on the five-tuple, in the reverse direction, to allow the return traffic. That's much more secure, because you have the full five-tuple when the SYN goes through, so when you reverse it, you know the return traffic can only belong to something that was initiated and trusted. The problem is that the performance is really bad. Because it's actually creating OpenFlow flows, we can't cache flow creation in the kernel, so all new flow setups have to go up to user space, and that's orders of magnitude slower than if they could be handled in the kernel. A drawback of both approaches is that they have no support for related flows. For example, with an FTP connection you're monitoring the control channel, and when you download a file, FTP opens a data channel over a new TCP connection. Neither approach has any way of knowing that the data channel was created and that a hole needs to be opened for it. They also don't do any TCP window enforcement to make sure the sequence numbers are valid within a TCP session. So what we're doing is adding the ability to do connection tracking in OVS, by leveraging the conntrack module. This is actually what iptables uses to do its firewalling, and it will allow stateful tracking of flows.
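The learn-action approach described above can be sketched roughly like this (names, numbers, and timeouts are hypothetical; a real OpenStack policy is considerably more involved):

```shell
# When an outbound SYN to port 80 is allowed, learn() installs a
# higher-priority rule matching the reverse five-tuple, so only return
# traffic for connections we initiated is admitted.
ovs-ofctl add-flow br0 "table=0,priority=100,tcp,tp_dst=80,tcp_flags=+syn-ack,actions=learn(table=0,priority=110,idle_timeout=60,eth_type=0x800,nw_proto=6,NXM_OF_IP_SRC[]=NXM_OF_IP_DST[],NXM_OF_IP_DST[]=NXM_OF_IP_SRC[],NXM_OF_TCP_SRC[]=NXM_OF_TCP_DST[],NXM_OF_TCP_DST[]=NXM_OF_TCP_SRC[]),normal"
ovs-ofctl add-flow br0 "table=0,priority=50,tcp,actions=drop"
```

Every new connection triggers a flow installation through user space, which is exactly the performance problem described above.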
So it will do things like check the sequence numbers. It will also support ALGs, which allow those related data channels. For the FTP example I mentioned earlier, you allow the control channel through, the ALG actually parses that control channel, and it lets us know that a new flow is related to the original one, so we're able to open up a connection for it. A few of the protocols supported in Linux are FTP, TFTP, and SIP, but there are a lot of others as well. As you can imagine, this is much better than what we were doing before, and it will have much better performance. You're able to push your policies down into the kernel with megaflows, and all of the flow state stays in the kernel, because the conntrack module runs there. So once you've established that, for example, port 80 is allowed and any established flow should be let through, that can be wildcarded and stay in the kernel. To do this, we're adding a new OpenFlow extension action that says: send the packet to the conntrack module. When you do that, the packet gets recirculated back through the pipeline, and you get another chance to look at it and match on flags related to the connection state. The way the conntrack module works, it has connection state indicating whether a flow is new, established, or related, and you'll be able to implement your policies with that. We actually have a prototype working. It's sitting in one of my branches right now and not quite ready to be shared, but we expect to ship this in OVS by the end of the year. And if you take this connection tracking we're adding, plus the guest introspection and state inference we talked about earlier, we can implement a really advanced firewall.
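To give a feel for it, here is a sketch of the kind of rules this enables, written in the `ct()`/`ct_state` syntax that eventually shipped in OVS (the bridge name is hypothetical):

```shell
# Untracked IP packets are sent through conntrack and recirculated to table 1.
ovs-ofctl add-flow br0 "table=0,priority=100,ip,actions=ct(table=1)"
# New connections to port 80 are committed to the tracker and allowed.
ovs-ofctl add-flow br0 "table=1,priority=100,tcp,tp_dst=80,ct_state=+trk+new,actions=ct(commit),normal"
# Packets of established connections, and related ones (e.g. FTP data
# channels recognized by an ALG), pass; everything else is dropped.
ovs-ofctl add-flow br0 "table=1,priority=100,ip,ct_state=+trk+est,actions=normal"
ovs-ofctl add-flow br0 "table=1,priority=100,ip,ct_state=+trk+rel,actions=normal"
ovs-ofctl add-flow br0 "table=1,priority=0,actions=drop"
```

The `+est` and `+rel` rules are fully wildcardable, which is why the established-flow fast path can live entirely in the kernel.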
Because we know precisely which user or which application is generating the traffic, we can make very fine-grained decisions instead of just basic ones based on the five-tuple. Elephant flows are the last topic we'll go over. There was some research done in data centers indicating that the majority of flows in the network are short-lived, which they call mice, but the majority of packets belong to long-lived flows, which are called elephants. The mice tend to be bursty and latency-sensitive, while the elephants transfer large amounts of data but are less concerned about latency. As the elephants send their traffic, they can fill up the network buffers, which introduces latency for the mice: the mice come through after the elephants have filled those buffers and have to wait for the elephants to make their way through. You can imagine, say, a hypothetical bank running a backup. The backup fills up the network buffers as its traffic flows through, while the mice flows might be database transactions that need to get through very quickly, and those mice get slowed down as the elephants fill things up. One of the nice things about being at the edge is that, because we're the ones taking the traffic and putting it into a tunnel, we're able to affect the underlay based on what we see in the overlay. We've defined multiple detection mechanisms that we've been playing with. The first is rate and time: we keep track of all the flows going through OVS, how many bytes they've sent or at what rate, and how long each flow has been alive.
So, for example, you could say: if this flow has sent 500 kilobytes of traffic over the last 10 seconds, it's an elephant, and once a flow has done that, we mark it as one. Another thing we've looked at is large TCP segments. The way TCP works, you have slow start: the sender starts off with only a little data outstanding, and each time it gets an acknowledgement that the other side received it, it opens up a little more. So the amount of outstanding data in the segments grows the longer the connection has been sending. What we can do is look for when those segments start getting large; once they reach a certain size, we know the connection is probably an elephant flow. This requires much less state, because all we need to do is look at individual transmissions and check how large the segment is. If it's, say, 32K, we know it's gone through slow start and is probably an elephant. Finally, another option is guest introspection: we query the guest to find out whether a flow is an elephant. There's a paper called Mahout that did this. They looked at the socket buffer depth for applications in the guest to see how deep the buffers had gotten. The idea is that if an application is generating data at a greater rate than can actually be put on the wire, it's an elephant flow. So if we have that guest introspection ability, we can use this mechanism too. We haven't actually implemented the introspection-based detection, but it would be an option, assuming introspection is running in the guest.
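A minimal sketch of the rate-and-time check, using the thresholds from the example above (a real implementation would track per-flow byte counters in the datapath; here the numbers are passed in directly):

```shell
#!/bin/sh
# Classify a flow as an elephant if its average rate exceeds the
# threshold rate of 500 KB per 10 seconds.
THRESHOLD_BYTES=500000   # 500 kilobytes...
WINDOW_SECONDS=10        # ...over 10 seconds

classify() {
    bytes=$1   # bytes the flow has sent
    age=$2     # seconds the flow has been alive
    # Compare average rates with integer math: bytes/age >= threshold/window
    if [ $((bytes * WINDOW_SECONDS)) -ge $((THRESHOLD_BYTES * age)) ]; then
        echo elephant
    else
        echo mouse
    fi
}

classify 600000 8   # sustained 75 KB/s, over the threshold
classify 40000 8    # short, bursty flow
```

The segment-size mechanism trades this per-flow accounting for a single per-packet size check, which is why it needs so much less state.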
Once we've determined that a flow is an elephant, there are multiple things the network can do. One is to continue sending traffic through the same port but use different queues, and drain those queues differently, so that, for example, you drain the mice more quickly than the elephants. You could route the elephants differently from the mice through your existing network, choosing different ECMP links, for example. Or, if you had an optical network, you could route the elephants along the optical path and leave the mice on the traditional network. Or you could just mark the traffic and let the underlay decide: we've identified that it's an elephant, and the underlay, if it's intelligent enough to know what to do, can do whatever it wants. This is a picture of a typical NSX deployment from VMware. You have the two hypervisors and an NSX control cluster managing everything. If VM1 and VM2 are on the same logical network, their traffic is placed in a tunnel by Open vSwitch and sent between them. And if you're using VXLAN, even though we say the traffic is in "a tunnel," there are actually different tunnels depending on the traffic: VXLAN uses different source ports based on a hash of the inner packet. That way, an elephant flow and a mouse flow will actually look like different flows on the wire, even though they're both running over VXLAN. As I mentioned, Open vSwitch is really an optimal location for handling elephant flows, because it has a flow-level view of all of the traffic the VMs on the hypervisor are generating and receiving, and it knows the mapping between the logical and physical addresses.
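The separate-queues option mentioned a moment ago could be configured along these lines (the interface, the rates, and the matched five-tuple are all hypothetical):

```shell
# Create an HTB QoS on the uplink with a mice queue (guaranteed rate)
# and a capped elephant queue.
ovs-vsctl set port eth0 qos=@qos -- \
    --id=@qos create qos type=linux-htb other-config:max-rate=1000000000 \
        queues:0=@mice queues:1=@elephants -- \
    --id=@mice create queue other-config:min-rate=500000000 -- \
    --id=@elephants create queue other-config:max-rate=800000000
# A flow the detector has classified as an elephant is steered into queue 1.
ovs-ofctl add-flow br0 "priority=100,tcp,nw_src=10.0.0.5,tp_src=45678,actions=set_queue:1,normal"
```

The same `set_queue` mechanism works whether the queues are drained by the hypervisor's own scheduler or mapped onto hardware queues further out in the fabric.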
We've developed this so that detection and action occur separately, so that we can evolve them independently, and different users can choose different detection mechanisms or different actions. Currently, the code we've been developing supports the two detection mechanisms I described: rate and time, and large segment size. Once we've detected that a flow is an elephant, we support two actions. One is to mark the DSCP bits in the outer IP header, so the underlay is aware that this packet belongs to an elephant and should treat it differently. The other is to record it in OVSDB, the configuration database that OVS supports: we write an identification of the flow, as it will look on the wire, into an OVSDB column, and an underlay agent is able to respond to that. The first example uses that underlay agent. Here, say hypervisor one is generating traffic and sending it to hypervisor two through these two switches. When an elephant has been identified, it's written to the column in OVSDB. There's an NSX elephant agent running up here, monitoring the OVSDBs in both hypervisors, and it triggers when that column changes. When it learns that there's a new elephant flow, it makes an API call to the SDN controller to inform it, and the SDN controller then programs all the switches under its control to treat those flows differently. In this approach there's no marking of packets; the elephants are identified purely through this SDN agent. This is something we actually developed and showed as a technology preview at an HP conference in December, so this is something we got working with HP.
Something we've been working on with other hardware vendors is identifying elephant flows with DSCP markings. We actually mark the outer IP header, and the hardware switches are configured to treat traffic with that DSCP value differently. If they were configured, for example, to route traffic differently depending on whether the DSCP value is set, you could imagine two links here, with the mice going over one and the elephants over the other. We're also working on an Internet-Draft describing recommended DSCP values. We've done some testing of this with Cumulus Networks. We gave them a modified OVS that detects elephant flows by counting the number of bytes for each flow; you can configure the threshold at which a flow is classified as an elephant, and once that line is crossed, we start marking those flows, or the tunnels carrying them, with a DSCP value that indicates it. The Cumulus switches were configured to place the elephant-marked flows into an alternate queue. This is the test setup that was used. The VMs are sending traffic to the system at the bottom, which is the traffic sink. The modified vSwitch is running here and sends its traffic over a 10-gig link to the Cumulus switch, which sends all of the traffic over a one-gig link to another Cumulus switch, which has a 10-gig link down to the sink. So you can see that the one-gig link is a bottleneck. Once a flow was identified as an elephant, you could imagine sending it over a separate 10-gig link, but what we did instead was put it into a different queue and treat those packets differently.
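The DSCP marking described here might be configured roughly as follows (the interface and bridge names, the matched five-tuple, and the codepoint are hypothetical):

```shell
# Have the VXLAN tunnel copy the inner ToS/DSCP bits into the outer IP header.
ovs-vsctl set interface vxlan0 options:tos=inherit
# Once a flow is classified as an elephant, mark its packets before
# encapsulation; mod_nw_tos takes the full ToS byte, so 40 here is DSCP 10.
ovs-ofctl add-flow br-int "priority=100,tcp,nw_src=10.0.0.5,tp_src=45678,actions=mod_nw_tos:40,normal"
```

With `tos=inherit`, the marking set on the inner packet shows up in the outer header, which is what the hardware switches in the fabric actually classify on.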
So they're going over the same port, but elephants are in one queue and the mice are in another. We used nuttcp to generate the elephants, and ping with a small interval for the mice. These are the results. In green on the left is the elephant flow, and this axis is bandwidth in megabits per second; if you can't see it, it goes from 500 to 1,000. In blue are the mice flows, and this axis is the latency as measured at the mice, from 0 to 10 milliseconds; the horizontal axis is time. You can see that while the elephant flows are running and we introduce the mice, the latency is relatively high and variable, and once we kill the elephant, it drops to under half a millisecond. With elephant flow detection enabled, we've started the mice, the pings, here, and now we introduce the elephant flows. The elephant numbers haven't changed much, but the mice are barely affected by the elephants at all, which was the problem before. Here it is in tabular form, which is a little easier to read. If you only run the elephants, the throughput was 941 megabits per second. If you only ran the mice, latency was under half a millisecond. Without detection, the elephants weren't affected, but the latency of the mice went up by about two and a half milliseconds. And with elephant flow detection, you lose a little bit of elephant throughput, but the latency of the mice is barely affected. I had hoped to have the code ready to share with everybody by the time of this talk; unfortunately it didn't come together in time. But I am planning on sharing it, and I'll push it up to GitHub in a branch next week, most likely.
The code that Cumulus and some of the others used before was written in user space, and we decided that wasn't really the best way to do it; we had to do things like disable megaflows. So I've re-implemented it in the kernel module, which has much better performance characteristics and doesn't require disabling megaflows. It will support both the threshold-based detection we've been talking about and the TSO segment-size mechanism. The idea of this code is really just something to try out, to see if it's worthwhile and what customers actually find useful. We may then go forward and put it into upstream OVS, but right now the code is written more as a development platform for trying out new detection mechanisms and actions in OVS. Here are a couple of references to documents that might be interesting. I mentioned the data center measurements showing that most flows are mice but most data is in elephants; that's in the first paper. Martin Casado and I wrote a blog post about elephants and mice, which is the second one. And the third one describes the work we did with HP on the SDN approach. And that's it. These are some presentations that VMware is also hosting; they asked me to put that up. Are there any questions?

Q: You started off by saying there would be some sort of an agent that communicates with the guest OS. To have that kind of paradigm, is it possible to do in OVS and OpenStack, or do we have to use a VMware ESX kind of platform?

A: Well, I think the only place I've seen it right now is in VMware; when you install VMware Tools, that's where that sort of agent would be installed. So I suspect it's something OVS could make use of if there were an open source implementation, but I don't know of one right now.
Q: So your change right now is specific to OVS, and it doesn't have the detection and firewall mechanisms that you talked about in the beginning?

A: The connection tracking I mentioned will be part of the open source OVS. The guest introspection is something being developed at VMware.

Q: I have a question regarding the TCP pacing. At the edge, if there are multiple switches and the TCP flows are going over multiple switches, how does the TCP pacing work?

A: Well, it's an active area of research, but the idea is that by spreading out the traffic, it should smooth things out. There's no coordination that would be done to know what the effect is on the network; it's something we'd investigate to see if it's practical and useful.

Q: Right, because it may drastically degrade performance too. Instead of improving things, it may degrade them: one switch tries to improve while another doesn't, and when the next packet arrives at that switch and the buffer isn't allocated, there's a chance of it getting dropped.

A: Right, right.

Q: Another question: with IPv6, which has a built-in path MTU mechanism, is there any advantage in using TCP pacing?

A: Yeah, I don't know, actually; I haven't thought about that, so I'm not sure.

Q: My name is Keith Edwards from Skyline ATS. With the idea of adding jitter to the elephant, you're essentially purposely degrading the quality of the elephant, right?

A: Well, the adding of jitter wasn't necessarily tied to the elephants; that was sort of separate. But yes, the idea is to introduce some latency.

Q: So introducing jitter would degrade the performance, degrading the elephant in favor of the mouse a little bit.

A: Well, the jitter wasn't really tied to the elephants in particular.
But yes, if you wanted to do that, then that would be true.

Q: So the bottom-line question: what if the elephant is really important?

A: Well, I think then it's up to the network operator, and that's why we're experimenting with this. We've talked with some financial groups, and what they actually want to do, once they identify an elephant, is send it over a separate optical link.

Q: A couple of questions. Is the conntrack work actually available on GitHub in a branch?

A: No, it's something I did as a proof of concept right now, and it's not really end-user consumable, because it has so many restrictions; it doesn't handle fragment reassembly, for example. So I didn't push it to a branch. I could send a document describing it, or I could push it to a branch, but it's not useful right now.

Q: A couple of other questions. Since the kernel is now moving to nftables, is this going to get deprecated at some point? Is netfilter, the old way of doing it, going to go away from the kernel, and what are the plans for that in OVS?

A: Yeah, I actually don't know. I'm familiar with nftables, but I don't know what the kernel community's long-term plan is for conntrack. I know there's a kernel networking conference in Germany coming up this summer, the netdev and network plumbers events, and some of us are planning on going; I think we'll probably have some discussions about that there.

Q: OK, is there a sense of performance? One of the issues we have with conntrack is that the connection setup rate takes a major hit when you enable it. If you're going to do that in OVS, do you have a sense of what sort of overhead that's going to be, and what's the maximum number of flows you can have in the kernel before you see degradation?

A: Yeah, I'm not sure. We haven't done any performance testing.
Obviously we're going to be limited by whatever the conntrack module is capable of, but I think the idea is that you get much better security from this. And because it's all flow-based, if you have some flows you're very concerned about, you could imagine using two different mechanisms: you could decide that some subset of the flows should go through conntrack and another shouldn't, based on policy. But yes, we're going to be limited by conntrack, and if it ends up being a major hit, we've contributed improvements to parts of the kernel in the past; it's the kind of thing we might want to look into, if it's possible.

Q: OK, great. Thank you.

A: Thanks.

Q: Two quick questions. First, for the elephant flow detection, it seems you rely heavily on TCP. What about elephant flows that aren't TCP? Say they're UDP-based, or RTSP, or some video stream?

A: Right. Obviously the segment-size detection wouldn't work, but doing it based on the amount of data transferred for the five-tuple would work for UDP.

Q: OK. The second thing: when you were talking about the firewall implementation, you mentioned the performance hit when a flow has to go up to user space and then back to the kernel. Does that change if OVS is hooked up with DPDK, which Intel has been working on, since that bypasses the kernel?

A: In the case of DPDK, the kernel is entirely bypassed, so you wouldn't get the connection tracker, and you wouldn't get the QoS that we use from tc. You end up having to implement all of those things in user space with DPDK.

Q: Thank you.

A: Thanks. All right, thank you. I'll be up here if you have any other questions.