Hello, everyone. I'm Gal Sagie. I've been working on the networking side of OpenStack for the last two years now, and presenting with me is Russell Bryant. Hi, everybody. I'm Russell Bryant. I've been contributing to OpenStack in one form or another for a little over five years now, with more of a networking focus in the last two years. So I'm really happy. It's a topic that I've been wanting to share for quite some time now, and I'm pretty happy that it got selected and we get to show it to you. So I want to first share how it all began for me when I started to work on Neutron. It happened around two years ago, when I joined a large company. They have a large OpenStack public cloud deployment, and they were having some challenges in their network. The network didn't scale. They were hitting problems with security groups and all sorts of issues. And they came to me and said, okay, please take a look at Neutron. Take a look at the insides of OpenStack, and either fix this problem, find us a new solution, or just write something on your own. And I started to look around in Neutron, in OpenStack, and it's challenging. It's hard to even find an up-to-date document that explains not what it's supposed to be doing, but how it's actually doing it. And the first thing that I noticed, and this is a misconception that I still hear even from experienced people: Neutron is not the reference implementation. Neutron is not the layer two and layer three agents and the OVS implementation. That is just one implementation of Neutron. Neutron, in its essence, is an API and a database layer. And there are around 65, I think even more, drivers and implementations of Neutron in the world. And each of these implementations does things a little bit differently. So I started looking, trying to examine and see how things work and what solution is best for our use cases.
And again, it's hard to understand how each solution addresses all the challenges and problems that I mentioned. And this is why we are doing this talk. It's important to understand that we are not going to do a comparison between the solutions. You're not going to finish this talk knowing there is one great Neutron solution, one fit for all your needs. It's a little harder than that. It really depends on your use cases, on your traffic patterns, on what you're expecting from your environment. Each of these solutions has its pros and cons. Our goal is to highlight all the important things that you should examine and look for in a solution, whether you're choosing one or building one. We're obviously going to focus on open source solutions, simply because they are open and we can see how they do everything. But everything that we are going to say is also relevant for the proprietary solutions. So one place you can start, if you just want to get an idea of what's out there, is the user survey. This is some data from the latest OpenStack user survey, published within the last week or two. And one of the questions is: what Neutron drivers are you using? Now, the survey isn't perfect. Let me point out a couple of weird things about this. There's one that I'm not sure is a problem with the data or the website, but the first two things here on the left are something called 764 and 956. And I'm not sure what networking solutions those are. Maybe it's a startup that's still in stealth and has some early customers, I'm not sure. But I think it's just a bug in the analytics tool they're using to show this. That's one thing. And there's another odd one on here, which is Modular Layer 2 plug-in, and that isn't quite comparable to the rest. The rest are networking vendors or open networking technologies.
And this one's a little bit different. That's a part of Neutron. That's the most common driver interface that all the rest of these things use to plug in. So showing it as an option alongside the rest is a little confusing. It shows it as 32%. I suspect it's drastically higher than that, and people were just confused. With all that said, we can still get a lot of insight out of this information. One is you immediately get an idea that there's a huge number of different solutions out there. It points to the fact that there's quite a wide variety of use cases, that people have varying needs, and these different things suit different environments better. And there's also an obvious most popular choice that we should pay close attention to. That would be the ML2 OVS driver, also referred to as the reference implementation. That is the most commonly used thing. It's not a surprise. It's been part of Neutron from the beginning. It's the default in nearly all the OpenStack deployment tools. And it does suit most people to a degree. There are also quite a few that are using other things. But one of the things I would draw from this is: no matter where you're going, however special your use cases are, you should really understand how the default one works. You should understand what it's capable of, what its shortcomings are, and understand the components, because it's going to be referred to in documentation and examples and used as a comparison point in a lot of cases. So I would make sure you at least understand that. And Gal talked about some of his journey as he got into OpenStack networking. For me, this is where I started. I've been involved in writing an alternative, but the first thing I did was study this closely and try to understand very well how it worked, to get some inspiration for what I would do differently. So it's easy to jump right into technology.
But I think there's a lot of other things you should be aware of when you're going to select one of these things. Some of it's maturity. If you look across this huge number of things out there, there's quite a range of states of maturity, and there's a lot of questions you might want to ask. This is particularly true of the open source ones that have been talked about a lot from the very beginning, where there's perhaps even more presentations than code. You need to dig into it and figure out: what stage of development is it in, and how long has it been around? Are people actually using it in production? Or is it experimental? Is it even ready to be used in a lab? Sometimes things are talked about that aren't really ready even for lab testing. This stuff can be hard to figure out directly, but you can look for good signs of maturity. For example, an upgrade story. A good, solid solution should be able to tell you, one, that it's upgradable, and two, give very clear instructions for how to do so. And ideally, a good idea of how that's going to work without giving you any downtime. Certainly no downtime on the data plane, but we even want no downtime on the control plane. We don't want your end users, the users of the Neutron API, to experience downtime. So what's their status there? Do they support that already? Do they have a vision for how to get there? That's a really good indicator of maturity. CI test coverage is really important in the OpenStack community. There are some projects and solutions that have really great test coverage running in OpenStack CI, and some that have absolutely miserable test coverage. It's quite a range, and it's worth looking. And especially if you're considering one of the commercial vendors, you should hold them to a better standard than some have met. Is there good documentation? That's another good sign of maturity.
Project velocity: there are some things that get talked about at, like, every conference, that are so hyped up, and then you go look and it looks like a dead project where there have been two commits in the last year. It's really odd, so you have to try to figure out what the heck that means. Or is it actually a vibrant community moving quickly with lots of changes? And then, what does the community look like? So, degrees of openness. Gal mentioned that we want to focus more on open source solutions. But to be clear, there are lots of proprietary ones, and actually there are a lot of good ones out there. People use them for good reasons, whether it's the tooling they come with, performance, whatever it is. There are good reasons. Don't discount them. That ecosystem exists with merit. But when we go to the open side of things, it's not enough to just say it's open source. There's a lot more to openness than that. In OpenStack, when we talk about what open means, we have this document that describes the four opens, the different aspects of what we think makes a healthy open community. So open source is part of it. You can have code that's open, but that only means so much. You can just throw code over the wall and publish it somewhere, but that doesn't mean there's any way to actually get involved. The design: how does the project move forward? Where do people actually talk about the future, and can you be involved in that discussion? That's not true everywhere. How about the development, where the code is written? Is that a process you can participate in? Can you submit patches? Can you review other people's code? Or is it effectively being done behind closed doors and then published somewhere? And the community: at the end of the day, who's in charge? How do decisions get made? Is it a level playing field across organizations? Or is it an open project but effectively controlled by one company?
I tend to think of that as a sign of a less healthy open community. So think about how all those things add up in terms of making a long-term bet on a technology and a project, and whether you're willing to take risks if you start to see any bad signs. So, some more comments on community. I talked about how I think it's important for a healthy community to be a level playing field, so people from any organization have an opportunity to get involved, have influence, and play a part in where the project goes. There's a website called Stackalytics, very commonly used in the OpenStack community, to take a look at some statistics about contributors to OpenStack projects in terms of code reviews and also commits. And these are a couple of graphs of the organizations contributing to the core Neutron repository. It's a huge development community, lots of people. And pretty quickly looking at these charts, you get an idea that it's actually a very nicely, evenly balanced community. There's no one organization that dominates. I think the piece of the chart that's 18% is "other". So there's a huge number of companies and it's well balanced. But if you start digging into other things in there, you can see the other end of the spectrum, where it's open in a sense, but 99% of the work is done by one company. Now, that's not necessarily a bad thing, but it's worth asking some questions. It could be something that's brand new. When things are brand new, they're usually championed by a person or a small team, or at least a team at one company. That's quite common. So maybe it's just new enough and it's still growing and getting the momentum to start attracting lots of other organizations to jump in. It also could be that it's been around a while but it's not actually gaining that traction, and the interest isn't coming from other organizations. So maybe the technology is not as interesting as they had hoped.
Now, it can also be a sign of an unhealthy community. There are mature projects with a very large number of contributors that look like this, and that usually means it's not really a community that's conducive to other people joining in. So I would look at it; that is a bad sign, and you might want to ask questions. So, control plane. Russell touched on some important things to look at when you're examining a solution in terms of the community and the companies working on it. But we also want, of course, to evaluate a solution according to the technology. And control plane in this context means how your solution takes the Neutron constructs, whether you create them in the Horizon UI, use the API, or use any orchestration system, and how it propagates all of this topology and security policy to the endpoints. And the control plane is, as we will see with a few examples, a very critical point in any solution. And as we will see, there are many ways to implement this rather simple-sounding task. Now, it's important to also notice, when you're examining a solution, that some of them use other libraries and frameworks, for example RabbitMQ for message queuing, and so on. The same examination you're doing on the solution itself, you should of course also apply to the third-party frameworks it uses, because, for example, if it uses a messaging framework that is not reliable, your solution needs to compensate for that. So there are many different approaches to the problem of propagating the control plane. The two high-level categories that we see in solutions are the centralized ones, which cover the logically centralized SDN controllers, like OpenDaylight or ONOS, that control a set of virtual switches with OpenFlow; and the distributed solutions, which have local agents, like the reference implementation, or local controllers sitting on each hypervisor, like OVN, Dragonflow, and MidoNet.
And these solutions differ; even within the distributed category there are different subsets of solutions. There are some that sync the control plane by propagating this information over an RPC message queue, like RabbitMQ or ZeroMQ. And there are solutions that use database-driven approaches, where you have a logically centralized database and all the agents or local controllers synchronize this information from it. Now, it's very important to also understand what kind of information is being propagated. You could have solutions that propagate every change to every model in Neutron. You could have solutions that propagate compiled OpenFlow flows. You could have solutions that propagate policy or some other abstraction that they define. And this is, of course, important because it defines the amount of information that needs to be propagated and how reliably it needs to be propagated, and it affects a lot of the characteristics of your solution. Now, I'm going to speak about one particular pain point when it comes to the control plane. A usual OpenStack Neutron deployment looks like this: you have a few Neutron API servers, usually working in an active-active cluster, with a Galera SQL cluster for the Neutron DB. So you have one database that holds all the Neutron constructs that you define in the UI or through the API. And then you have your SDN solution, and it doesn't matter if it's an SDN controller or any other solution that uses an external database; I think most solutions do, maybe besides the reference implementation. And then these two databases need to reflect each other. Of course, your solution might add new features, but when you are looking at the basic Neutron connectivity and security policy, you are supposed to be able to derive one database from the other.
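As mentioned above, solutions differ in what they actually ship to the endpoints: some propagate the Neutron objects themselves and let each local controller compile them, while others ship pre-compiled flows. The contrast can be sketched roughly like this. This is purely illustrative Python; the port dictionary and the OpenFlow-style flow strings are simplified assumptions, not any project's real wire format.

```python
# Illustrative sketch: two control-plane propagation strategies for
# the same (hypothetical, simplified) Neutron port object.

def propagate_object(port):
    """Model-driven: ship the raw Neutron object; each local
    controller compiles it into flows on its own."""
    return {"type": "port", "data": port}

def propagate_flows(port):
    """Flow-driven: a central controller pre-compiles OpenFlow-style
    rules and ships only those to the virtual switches."""
    return [
        # Deliver traffic destined for the port's MAC out its ofport.
        "priority=100,dl_dst=%s,actions=output:%d"
        % (port["mac_address"], port["ofport"]),
        # Anti-spoofing: only accept the port's own MAC/IP on ingress.
        "priority=100,in_port=%d,dl_src=%s,nw_src=%s,actions=NORMAL"
        % (port["ofport"], port["mac_address"], port["ip_address"]),
    ]

port = {"id": "p1", "mac_address": "fa:16:3e:aa:bb:cc",
        "ip_address": "10.0.0.5", "ofport": 7}
```

The trade-off the talk describes shows up directly here: the object payload is small and stable but pushes compilation work to every node, while the flow payload grows with the topology and makes the central compiler a critical, reliability-sensitive component.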
Now, the seemingly simple job of keeping them synchronized and reliable, when you could have many API servers getting requests and updates on the same entities, is a hard job. During the time that I worked on Dragonflow I met a lot of people and saw a lot of solutions to this problem, and sometimes people say, okay, this is a rare case, it will never happen in production; there's a 1% chance it will happen, let's not address it. I can share my experience in the networking world: in production, everything that can go wrong will go wrong. You'll see the craziest things that you didn't even think were possible, and it's very important that your solution knows how to handle and recover from these problems. Now, even in the distributed case, if you don't have an additional database, you still have the local agents or local controllers spread across the hypervisors, and it's important that they also stay synced with the latest information, and that this happens reliably, because, as I mentioned, we have a few Neutron API servers, so things can arrive out of order. So it's very important that we keep things reliable. I won't go over all the possible solutions; we have journaling, we have DB versioning, we have synchronizing threads, and each one has its pros and cons. But please look at how your solution solves these problems and ask the right questions. The control plane implementation actually affects your solution's scale: how many hypervisors it can support in one OpenStack cluster, how much time it takes for changes to propagate to the entire system, how many changes you can do per second.
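The DB-versioning approach mentioned above can be sketched in a few lines. The idea is that each object carries a monotonically increasing version number, so a local controller can compute what it is missing and safely discard stale, out-of-order updates. This is a generic sketch of the pattern under that assumption; the function and field names are made up for illustration.

```python
# Sketch of a DB-versioning reconciliation pass between the Neutron
# database and a local controller's cache. Both arguments map object
# UUID -> {"version": int, ...}.

def reconcile(neutron_db, local_cache):
    """Compute the operations a local controller must apply so that
    local_cache catches up with neutron_db."""
    ops = []
    for uuid, obj in neutron_db.items():
        cached = local_cache.get(uuid)
        if cached is None:
            ops.append(("create", uuid))
        elif obj["version"] > cached["version"]:
            ops.append(("update", uuid))
        # An older or equal version is stale or a no-op: out-of-order
        # messages from multiple API servers are safely ignored.
    for uuid in local_cache:
        if uuid not in neutron_db:
            ops.append(("delete", uuid))
    return sorted(ops)
```

A periodic pass like this is also what lets a solution recover after the "craziest things" happen: whatever state the cache ended up in, the next reconciliation converges it back to the source of truth.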
The rate of these changes matters, and of course you want to look for a solution that has done benchmarking and testing and can show you these numbers. You don't want to wait and test it in your production environment; you want to test it up front and understand the implications of using a certain solution. So I've touched on the control plane, which, as I mentioned, is very critical to examine. For the data plane: we have our control elements propagating information to the endpoints, and then we have the entities that the traffic actually traverses; those are our data plane entities. And the one thing people usually look at is performance: how fast the solution can be, how many packets per second it can process, and so on. That's important, but there are other things to notice. For example, how much resource your solution takes from each hypervisor or server in terms of CPU and memory. If your solution takes 10 cores from each server to operate, that's 10 cores that could be used for user workloads. So pay attention to this. We want a data plane solution that doesn't lose traffic if control entities are restarted. So pay attention to how your data plane entities work with your control plane entities; you want to achieve zero traffic loss as much as possible. Also, some solutions have integration with specific, proprietary hardware. If your solution works only with specific hardware, that's normally a bad sign, but you should know about it. And some solutions have integration with smart NICs or other kinds of hardware. So definitely look and see how your solution can work better, and check whether the benchmarking numbers they quote are for the normal case or for the case when using all this sophisticated hardware. Now the last point; I left it as the last point, but to me it's a very critical point that is missing in most solutions today, and that's visibility.
Remember that we said that everything that can go wrong will go wrong. And when it does, you will want visibility, monitoring, and debugging capabilities in your solution. You want to be able to tap traffic and understand what's flowing there, to be able to see and understand the problems. And this is one area where, to me, we need to keep working, because it's still missing. So, does anyone not know what Open vSwitch is? Okay, that's a good sign. Open vSwitch is pretty much the de facto standard data plane entity in most solutions. It's an OpenFlow virtual switch. It can also work in normal mode, but most solutions use the OpenFlow pipeline that packets traverse. And it also has a DPDK version. DPDK, for those who don't know, is a set of libraries that lets you process packets in a poll-mode way in user space, and it lets you achieve much higher performance numbers than the normal kernel stack. But the important thing to notice with these two is that if you want OVS-DPDK, there might be some feature-parity differences compared to normal OVS. So if your solution uses specific features and those features are needed, make sure they are also in the OVS-DPDK version if you plan to use it. Now, there are other variations of using OVS; MidoNet is an interesting approach: they kept the kernel module but changed the user space implementation. I'm not going to get into right or wrong here, but just be familiar with this, understand the limits, and ask about it if you're using these solutions. I didn't touch on it before, but there is a set of layer-three-only solutions available today. Two examples are Project Calico and OpenContrail. Basically, you have routing from the VMs or containers that you use. Whether you pick a layer two or layer three solution truly depends a lot on your applications and on your physical network as well.
And I'm not going to get into it, but one thing to notice is that originally the Neutron model's concept of a network was a layer two broadcast domain. There has been some work in the Neutron community on routed networks to support the notion of layer three networks, but you should definitely look at how the solution you pick, if you pick a layer three solution, maps what you configure in Neutron to the actual deployment. So make sure that the behavior is what you expect it to be. Another issue that I sometimes see in these solutions is tenant isolation and overlapping IPs. Some solutions expose the VM and tenant IPs outside, which means you cannot have overlapping IPs, for example. Some solutions solve this in one way or another, some don't; it's definitely something to pay attention to. So, who knows what this animal is? You get a prize at the end. It's okay, I don't either. But apparently that's the eBPF icon. eBPF, has anyone heard about it? You can raise your hand so I can see around. Okay, so eBPF is a really cool technology. In a nutshell, it lets you run JIT-compiled code in a safe way inside the kernel and attach that code to certain events that can happen in the kernel. There is a big open source project called IOVisor that gives you examples and shows you how to use this technology. To me, it's still a bit complicated. It's not something that's going to work for you and that you're going to understand out of the box, but it's definitely something you should be looking at for the future, and it's maturing and showing great possibilities. In this regard, by the way, there is some work in the community to change the Open vSwitch datapath kernel module to an eBPF implementation. This would definitely make feature insertion into the project easier. So these are not necessarily competing things; they also complement each other.
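The overlapping-IP concern above can be made concrete: if tenant addresses are routed directly in a shared layer-3 domain, two tenants using the same subnet collide. A quick way to check for that, sketched with Python's standard ipaddress module (the tenant data here is hypothetical):

```python
import ipaddress

# Sketch: detect overlapping subnets across tenants, the situation a
# layer-3-only design that exposes tenant IPs directly cannot tolerate.

def find_overlaps(subnets_by_tenant):
    """Return sorted pairs of tenants whose subnets overlap.

    subnets_by_tenant maps tenant name -> list of CIDR strings.
    """
    flat = [(tenant, ipaddress.ip_network(cidr))
            for tenant, cidrs in subnets_by_tenant.items()
            for cidr in cidrs]
    overlaps = set()
    for i, (t1, n1) in enumerate(flat):
        for t2, n2 in flat[i + 1:]:
            # Overlap within a single tenant's own networks is fine
            # here; only cross-tenant collisions break isolation.
            if t1 != t2 and n1.overlaps(n2):
                overlaps.add(tuple(sorted((t1, t2))))
    return sorted(overlaps)
```

Solutions that keep overlapping IPs working do so by adding a per-tenant namespace of some kind (VRFs, encapsulation, NAT) so that a check like this never has to fail for the user.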
Yeah, the work happening in the Open vSwitch project we covered in a little more detail in a talk on OVN yesterday, but one of the portions was about what's happening in OVS and the work around BPF, looking at porting the OVS data path to BPF instead of its current custom kernel module. There are a lot more differences. We talked about these broad differences, the control plane and the data plane, and then there's this huge host of features that either not every solution implements, or if they do, they implement them in very different ways, and that can have pretty drastic implications on performance. One of them is SNAT/DNAT. Is it supported at all? It probably is; I'm not sure anything doesn't support it. It's pretty commonly needed for Neutron, but it's implemented in very different ways. One of the key differences is: is it done in a distributed way? And not just that, but which parts can be distributed or not? That's complicated, and it varies quite a bit across solutions. So you should understand, one, where you need NAT and how you intend to use Neutron, and then how that applies to the solutions you're looking at, because it could become quite a performance bottleneck. Another one is routing. How is L3 routing done? Is it centralized? A lot of things centralize all of that in a single place. That's how the Neutron reference implementation originally worked. It eventually gained some distributed capabilities, but not all solutions can do that, while some solutions do it entirely distributed. But also, how is the routing implemented? Some projects implement it purely as OVS flows for performance reasons. A more traditional way to do it, if you're using OVS, would be to layer that with network namespaces and use Linux routing instead. So there are different ways that can be done.
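Whatever form the implementation takes, the core decision a virtual router makes is a longest-prefix match over its route table. A minimal sketch, again in plain Python for illustration; real solutions compile this decision into OVS flows or delegate it to the Linux routing stack in a namespace, exactly as described above.

```python
import ipaddress

# Sketch of longest-prefix-match route lookup, the decision at the
# heart of any L3 routing implementation, centralized or distributed.

def lookup(routes, dst):
    """routes: list of (CIDR string, next_hop) pairs.
    Returns the next hop of the most specific matching route,
    or None if nothing matches."""
    dst = ipaddress.ip_address(dst)
    best = None
    for cidr, next_hop in routes:
        net = ipaddress.ip_network(cidr)
        # Prefer the match with the longest prefix (most specific).
        if dst in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, next_hop)
    return best[1] if best else None

routes = [("0.0.0.0/0", "gateway"),
          ("10.0.0.0/8", "east"),
          ("10.1.0.0/16", "west")]
```

The distributed-versus-centralized question is then simply where this table lives and is evaluated: once on a network node that all traffic hairpins through, or replicated on every hypervisor.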
Gateways: presumably you're going to be integrating your Neutron networks with your physical networks out to the world, and you should understand the gateway solution. Different solutions have different ways of doing L2 or L3 gateways. With L2 gateways, maybe they're software based and you can run those on Linux hosts, or there are hardware-based L2 gateways; some solutions can integrate with top-of-rack switches as your gateways. That's interesting, but not all solutions support it. Also, particularly with L3 gateways, what does the HA story look like? If a host or device that you're using as a gateway goes down, how is high availability achieved? That differs quite a bit across solutions. Related to gateways, some solutions take advantage of this to make it easier to connect networks across multiple OpenStack clouds, and not everyone has really looked into that problem space. So if that's important to you, you should start asking questions. Neutron has a core set of APIs and lots of additional optional APIs; load balancing as a service and VPN as a service are two common ones that have been around for a long time. Not all solutions support these things. So, do you care? And if so, how does your solution do it? And service function chaining, this is a more recent one, and there's an enormous amount of variation here. I'd say most things aren't even touching it yet, but there are a few solutions doing very interesting work here, and they're doing it in very, very different ways. So I would say there's not necessarily broad consensus on exactly how this should be done, but you should dive in there and see; it's an area of innovation being pushed through multiple communities. Containers are another space. OpenStack has traditionally been around VMs, and now we can do bare metal as a service. Containers are obviously a critically important workload and part of data centers going forward. So how does your solution map to that world?
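The idea behind service function chaining mentioned above can be sketched simply: traffic matched by a classifier is steered through an ordered list of service functions before normal forwarding. This is purely a conceptual illustration; the names and the pipeline model here are assumptions, not the Neutron SFC API (which the solutions discussed implement in very different ways).

```python
# Conceptual sketch of service function chaining: a classifier decides
# which packets are steered through an ordered chain of services.

def build_chain(classifier, service_functions):
    """Return a steering function: given a packet (a dict of header
    fields), it returns the ordered hops the packet will traverse."""
    def steer(packet):
        if all(packet.get(k) == v for k, v in classifier.items()):
            # Matched traffic visits every service function in order.
            return list(service_functions) + ["destination"]
        # Everything else is forwarded normally.
        return ["destination"]
    return steer

# Example: send HTTP traffic through a firewall, an IDS, and a proxy.
chain = build_chain({"dst_port": 80}, ["firewall", "ids", "proxy"])
```

What differs wildly between solutions is how this steering is realized in the data plane (flow rewrites, NSH headers, overlay tunnels), which is exactly why it is worth digging into each one's approach.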
If you're gonna invest in a networking solution for OpenStack, presumably you're making a bet on it and you wanna be using it for a long time, for as much of your workloads as you can. So in the OpenStack case, we have a project called Kuryr. I think there was a talk dedicated to that earlier. Kuryr is a project that provides glue between the container ecosystem and OpenStack. It allows the container platforms to talk to Neutron directly to program the network. There are some benefits to that, in performance actually, and also in having a good integrated networking story. So perhaps you're not just running VMs on OpenStack, but you're running containers within those VMs, and you can now have Kuryr as a path to having Neutron provide the networking for both of those layers, the VMs and the containers within those VMs, without having to build a second level of virtual networking at the container layer. But if that mode is interesting to you, then you should figure out which solutions support it, because actually most do not support that mode. Hopefully more will catch up and implement the interfaces needed to provide it, but it's still a newer thing. And also, speaking of containers, if you're getting into this, are you using Kubernetes? Then maybe you also wanna see what they're doing in that community. Are they providing any sort of direct integration? Do you have needs where you might wanna have Kubernetes talking to your networking solution directly, and do they have any story around that? There's quite a range of status in that space. So with that, we can take some questions. So we did a good job, no questions. All right, thank you very much. Thank you.