Welcome to Tokyo! All right, good morning, and thank you. This is the Open vSwitch with Open Virtual Networking session, by people that don't work on Open vSwitch — yet. That's the subtitle you can't read, but that's basically us. I'm Sean Lim, a principal engineer at Time Warner Cable; I work with Dave and a whole bunch of our other team members on the OpenStack team, and my focus is generally Neutron and Nova. And I'm Dave, and I work at Time Warner Cable as well. We've had an OpenStack cloud up for a year and a quarter now and have brought it up to Kilo, and partially Liberty. I'm the lead engineer for compute, but I also work on a lot of other things, including the networking. I'd like to give a shout-out to the Denver OpenStack meetup leaders who also joined our Colorado crowd — we're both from Colorado, and I also lead the Colorado OpenStack meetup. I'd like to say up front, before Dave jumps into the heart of everything, that this is probably going to be a little more of an overview than some of you were expecting. I see a smattering of faces here who probably know way more about some of these subjects than we do, but we are operators of it, and we've started spiking OVN, so we're delving into a few things for the first time. What we want to do is cover some of the problems we've seen with OVS, some of the things we've seen with OVS and Neutron integration, and then go over some of our interesting material on OVN. Yeah, and he stole a little bit of my thunder there. I've had West Nile virus for the last six weeks — I just got better in time to get here, but really not to prep — so Sean bears none of the responsibility for our lack of preparation; I bear all of it. All the mistakes are mine, and all the omissions are mine. We'll also give a huge shout-out to the Open vSwitch and Neutron team: they did a stellar presentation on Tuesday. If you have not had a chance to see it, I'll show you a
link at the end; it'll go into a lot more depth on OVN than we'll be able to. All right. So OVS 2.4.0 is basically the release that's out with Liberty — I think all of the distributions are also providing that version of OVS now. It includes Open Virtual Networking's "toaster oven" — I love their names; OVN is sometimes called "oven," and this is the ready-for-test release. But Open vSwitch itself also includes a lot of improvements: DPDK, which is basically fast, hardware-enabled networking; Geneve support for additional tunneling; Docker support; some default port changes; and bash completion — that's my favorite one. That's your favorite one, all right. There are also, obviously, some OpenStack improvements that are utilizing this. When we proposed this talk we were going to talk about Kilo and Liberty — I know Liberty is released, but there are a lot of people still running Juno, or not quite running Kilo yet, so we'll talk a little bit about that. So if you aren't going the OVN route and spiking that, and you're using the standard, everyday Kilo/Liberty layer-2 agent, these are two things you want to look at and start turning on. We haven't rolled these out in production, but I have spiked it essentially all the way up to Kilo. Most of these calls were native calls that were wrapped in rootwrap, and when you end up on the control-node side — the network-node side of the world — and you have a huge number of routers and you have to rebuild them, it takes over 10 to 15 minutes for 50 to 60 routers. There are a lot of causes of that, but a big chunk of it is making those inefficient rootwrap/sudo calls. So as of Kilo, the ovs-vsctl calls all have a native Python interface. I'm not sure I saw the six-to-seven-times speedup the blueprint claimed, but it was definitely noticeable, and I haven't experienced any issues with switching that over
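The switch being described is a single option in the Neutron OVS agent's config file. As a rough sketch — the option name here follows the Kilo-era blueprint (`ovsdb_interface`, with a similar `of_interface = native` arriving for OpenFlow in Liberty), so verify it against the release you actually run:

```ini
# Hypothetical excerpt from the Neutron OVS agent config
# (e.g. /etc/neutron/plugins/ml2/openvswitch_agent.ini)
[OVS]
# Default is "vsctl", which shells out through rootwrap/sudo for every
# call; "native" keeps an OVSDB connection open from Python instead.
ovsdb_interface = native
```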
It's a one-line switch in the config file, and we're getting close to rolling it out in production as well. And that last bullet is probably the most important if you're an operator. This has been a long-standing issue for us: if you had to restart OVS, you took a significant outage on your control plane. It wasn't so bad on a compute node, because there just isn't as much running there, but on a control plane an OVS L2 agent outage was painful, and it was on the startup side. You could actually shut down the agent and everything ran fine until about five minutes later, when the MAC learning ran out; but on startup the decision-making was "I don't know what to do, so just flush everything." On our control nodes that amounted to about 2,500 to 3,000 flows, and that just takes a lot of time to rebuild — and during that time we have customer outages. So this is really, really important to us. Absolutely. Yeah, this next one is one I ran into sideways; it's since been fixed in Kilo. Essentially, when you create a port in OVS — if you've looked underneath the hood at the Linux namespaces — that port then gets added to a namespace. That's kind of a semi-legal operation, or has been until recently: it just works, and nobody really notices it. During the gate tests for DVR this started causing kernel faults, and as I stumbled onto this, I realized that some of the cleanup we were doing for the upgrade to Kilo, and some of the other operations we were doing, were producing kernel faults that traced back to this. It is fixed in Kilo, but it also required OVS kernel support. Kernel support, yes — it was out in kernel 4.0, but I think everybody has backported it; it's got a supported stream. The outcome, if you ever see it, is first of all that kernel faults in the logs are not cool, and a little bit scary; but also that those ports in OVS just don't work anymore, and you end up removing them by hand and getting them re-created from scratch
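That by-hand cleanup is just plain ovs-vsctl surgery. A rough sketch of what it might look like for one wedged port — the port name is made up, and the exact recovery steps depend on how the port was plugged in the first place:

```shell
# Hypothetical example: "tap1234abcd" stands in for a port whose
# datapath entry went bad after the namespace kernel fault.

# An ofport of -1 is a common sign the datapath entry is broken
ovs-vsctl --columns=name,ofport list Interface tap1234abcd

# Drop the broken port from the integration bridge; re-creating it is
# usually left to the L2 agent (or a nova/neutron port re-plug)
ovs-vsctl --if-exists del-port br-int tap1234abcd
```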
So it creates operational problems. This is what we do at Time Warner Cable with Open vSwitch — Sean actually wrote this slide, so go ahead. Sure, okay. We have a pretty simple use case. Up front, two years ago when we put things in place, VXLAN support was not a given — that's a little bit of why we decided to go with OVS versus Linux bridge: better VXLAN support. And behind the scenes, I knew that because we're a cable company, with our cloud integrating with lots of other entities in the company, we were going to have to do some fun networking at the edge of OpenStack. That has actually come true, and OVS is providing us an easier means to connect to things on the edge — Neutron may not support it, but we can at least put it in place. The rest of our use case is pretty normal traffic: we aren't trying to do anything with video streaming or non-TCP/UDP protocols, so it's pretty average, and we aren't pushing crazy bandwidth, so we don't use DPDK. Yeah — a lot of our usage is more like business usage than video-streaming usage. We'll get there, but right now we've got plenty of uptake inside the company with just your web-page type of access; we're not video streaming at the current time. If you ended up seeing the incredible Rackspace demo that went viral in Paris — the "ludicrous speed" demo — we were caught by that, and it was exciting for us to see that other people were having the same horrid problems. As we went and started scaling up the cloud, we started seeing a ton of crashes in OVS — an "operationally, in production, what's going on?" level of crashes — and it didn't trace back to anything. So basically, as soon as we got back from Paris, that very week, we used Ansible and upgraded all of our production, with minimal testing I would say, because we were super confident. Well, we built our own packages — that was one of the first times we had to build our own
packages — the distro wasn't ready to do this yet, and upstream wasn't sure they had a root cause, so they couldn't convince us they had one such that we could just use a backport. It turns out they didn't actually have it at that point; we found that out later. So we basically had to move forward, and I think with OVS that's probably the right answer: move forward when you can, because with OVS everything gets better. Yeah, the upgrade was actually relatively painless. We have pain points on our network nodes because we use legacy routers at this point, and that requires those flow flushes, so we had to take a little bit of an outage — or at least somewhat work around it — but the upgrade itself was pretty painless. And since then, we've realized that we'd come to blame everything on OVS, and now we can't, so we actually have to solve things. We've actually gotten better at troubleshooting, because we no longer had an obvious smoking gun anywhere — we couldn't just always point at OVS. Actually, we pointed at Sean instead. I got to sleep, finally. But we did end up with a massive number of automation scripts to do crazy cleanup things, and those now just sit there gathering dust, which is incredible for us to see. We've had solid performance since then, so let's go on. All right, so the 2.4.0 release notes include some deprecations you need to be aware of, mostly for going beyond that release. I can't say I'm intimately familiar with these, but I did see them and I want to point them out. The gre64 tunnel type is deprecated and will be removed in 2.5, so just be aware of that if you're using it — we're not. Let's see: the external network bridge option for the L3 agent has been deprecated in favor of a bridge mapping. Yes, I'm reading this slide to you, but that's what I do. I have more to say on the external one — okay, go ahead. With the external one: one of our customer requests is, you know, we've got a huge
subnet of our own and we want to put it into OpenStack. We call this service bring-your-own-IP — BYOIP, exactly — and it's gone viral in the company, which is kind of silly. The external bridge networking option was the way you mapped your connection to the external world; this change, and related ones, allow us to plumb multiple subnets — even L2 subnets, and L3 — out of OVS, and it actually helps us realize BYOIP (I can't say it) for our customers. Cool. Okay, so: Open Virtual Networking. Again, a big shout-out to Justin, Ben, Russell and Kyle for their talk on Tuesday morning. It was up on YouTube by Wednesday morning, so you can go watch it — you could just leave right now if you want, because that would be a better use of your time. OVN is in the master tree, so if you do a git clone of Open vSwitch, you get Open Virtual Networking for free. It has these goals: L2/L3 network virtualization, logical routers, multiple tunnel overlays — they added Geneve — and it works anywhere that Open vSwitch works. That's really important: a lot of the reference architectures for OpenStack rely on Open vSwitch, and so OVN will work anywhere Open vSwitch is working. I can't say that strongly enough. That's one of the big, easy gets of OVN: it works if OVS works. Here's the architecture, and this is a big change. I'm not going to do it justice, but I'm going to talk through it. This particular diagram comes out of the build — when you build OVN there's an ovn-architecture man page, and this comes straight out of it. At the top is your cloud management system; you can just substitute OpenStack for "CMS," because OpenStack is the only one that's been ported so far — this was basically built out of the box for OpenStack. So there's an OVN plugin that goes into OpenStack, and then there's a northbound database, and that's what OpenStack talks to. The ovn-northd daemon will then talk to the
southbound database, and then inside each of your hypervisors — whatever that hypervisor is, libvirt/KVM, etc. — you'll have an ovn-controller, and inside there you'll have your normal vSwitch and your normal ovsdb-server. So you're writing in kind of a high-level language at the very top, against the northbound database, and then the actual flows get pushed in at the ovn-controller level inside your hypervisor. The OpenStack plugin is writing kind of a meta-flow, and the actual OpenFlow is written in the hypervisor. What I didn't have, and probably should have up here, is a view of how things were before OVN — or if you're not yet using OVN. There's actually some additional complexity that basically gets abstracted away, and that's probably the biggest value I've seen out of OVN so far: it simplifies things. It gets rid of the Linux bridge that handles iptables today, and you basically get closer to the metal, so you get faster. On one of our first slides I was talking about Neutron integration with native Python for ovs-vsctl, and for ovs-ofctl in Liberty — that all goes away and gets replaced: just do this. Yeah, right. So — no, no, sorry, it's just very bright; I'm blocking the light, I'm not blocking you. No: OVN relies on OVS. OVS is still there, and OVS will continue working the way it has always worked. If you're running OVS today, you should upgrade to 2.4.0 — it'll just keep working the same way, so your L2 and L3 agents will work the same way they do today, maybe better in Liberty. But if you want to move away from that architecture to a different one, this is what you'll find opportunistically available: inside the OVS Git tree there is an OVN directory, and there are some READMEs in there that tell you how to build it and how to play with it. You don't have to actually deploy it — you can just go play with it, and you
can learn about this architecture very simply by doing a git pull, doing the autotools stuff — configure, make — and then you can make a sandbox where you can start playing with OVN. We'll talk a little bit about that on the last slide. And one of the big improvements is this: if you've ever trawled through the Neutron code — the OVS L2 agent, the L3 agent — you'll come down at some point to where all those OVS commands are wrapped: add a port, delete a port, add this flow. It's hugely enumerated and messy to follow, and if you've ever tried to troubleshoot it backwards up the stack, it's super painful. Those use cases were encapsulated — so the ten commands it might take to create a port and plumb it out in Neutron are all rolled up into OVN and taken out of the Neutron code. OVN is also utilizing some of those OVS improvements. In an upcoming release — I think there are branches for this already, and again this came out of the release notes, so my apologies, and all the credit belongs to Russell Bryant — it does require some newer kernel stuff, but it again gets rid of the iptables-type rules: security groups are implemented as well, and it just cuts out several stages of the pipeline, so everything should flow faster in Neutron as a result. Russell Bryant put up a blog entry last week that talks about these new security groups. From a user's perspective, security groups work the same way; it's just how they're implemented — via ACLs — that changes. If you've ever followed where security groups are placed, and the 27 different hops the traffic has to do — I don't know, I'm super-exaggerating — yeah, you want to wash your hands after tracing traffic through there. It collapses that down to a much more manageable troubleshooting scheme. That's a good question — yes. All right, so I told you there
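The build-and-sandbox loop just described, plus a first logical switch, looks roughly like this. This is a sketch, not gospel: the `--ovn` sandbox flag and the `ovn-nbctl` verbs have changed names between trees (2.4-era builds used `lswitch-add`/`lport-add` where newer ones use `ls-add`/`lsp-add`), so check the READMEs in your own checkout:

```shell
# Build OVS+OVN from a fresh checkout of the master tree
git clone https://github.com/openvswitch/ovs.git
cd ovs
./boot.sh && ./configure && make

# Start the self-contained sandbox with the OVN daemons enabled
make sandbox SANDBOXFLAGS="--ovn"

# Inside the sandbox shell: describe intent against the northbound DB...
ovn-nbctl ls-add sw0
ovn-nbctl lsp-add sw0 sw0-port1
ovn-nbctl lsp-set-addresses sw0-port1 "00:00:00:00:00:01"

# ...and watch ovn-northd / ovn-controller turn it into OpenFlow
ovn-sbctl lflow-list
ovs-ofctl dump-flows br-int
```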
was a more-info slide coming, so here it is. You just want to check out OVS and do the regular build, and inside there there'll be an OVN directory where you can actually build and play with OVN. It comes with a sandbox area so you can create ports, create flows, and watch the meta-flows fall down into the OpenFlow world and get implemented. But more importantly, you really want to watch the OVN talk from two days ago — it's right there on YouTube. They did an excellent job of covering this in much more detail; they are the developers, and they know it way better than anybody external ever could. It is a good use of 38 minutes of your time. Let's see, what else — play with it in DevStack. This is something I haven't gotten around to yet, but I would highly recommend it: this link will take you to a page where you can set up DevStack in a multi-node environment and actually run OVN, and see how it's supposed to work before you go to the work of actually deploying it and changing all your Puppet and Ansible and Juju and whatever else you use. But you do want to start using this. There are also some other competing technologies — I'm not going to say that OVN is going to replace sliced bread, or be world peace. There are a number of other solutions that are also helping to flatten that stack; I think Astara, the new project from Akanda, will also help in that environment. So be aware that there are other technologies. Oh, I had one more thing — yeah. In addition: at Time Warner we don't really run DevStack; everything's automated, and we can actually put a copy of our cloud in our cloud — nested clouds throughout the stack. We can bring up a subset, or our entire cloud, inside our cloud and do all our testing; it's the way we do all our development and all our testing. We use Vagrant with the OpenStack provider, I think it's called, so we can deploy any of our architecture on top of our cloud and do next-level development. I don't know — did we do a lightning talk on that this week? Nobody thinks we did a lightning talk on that this week, but if you've got questions about what we call "vdev," our virtual development environment, feel free to grab any of us. The advantage there was that on one of the long flights over here I was able to start hacking together — and I say hacking, because it was mostly on our automation side — OVN in a place in our cloud, and I made quite a bit of progress. I'm pretty pleased with the way things are coming together, but there are still some upstream patches I had to do, and most of the work, to be honest, was on our automation side, so that I wasn't just manually hacking on 27 virtual machines. It's worked pretty well. Okay, that's all the material we have prepared; we can take questions — we may not be able to answer them, but we can start. Go ahead, Florian — you're just way too far back there for me, and the mic — okay, I'm going to repeat that for the audience and for the video: in a virtual development environment, how do you basically nest floating-IP-type information? It is a problem. The short answer is that for a lot of the work we do, we just don't run into it, because a lot of the time we're setting up our automation and things that don't actually rely on a nested-virtualization floating IP — we can set up the networks, you just can't get out. So the main solution is we ignored the problem. Yeah, we ignored the problem. We do have a fix — and we do do live migration and everything inside our virtual development environment — but we're just not able to drill into it from outside: they can get out, but you can't get in. So the recent solution that we've
done since we upgraded to Kilo is — yep, yep — what's that? — turn off port security and then allow it to plumb out properly. You still have to have something external — another virtual machine, a device on there — to test against. The other way we were doing it, for a long time at least, is the neutron-debug port-create, which allows you to do some funky magic to get down to that. And Florian, see me any time after lunch — or after ten o'clock today — I have a Frisbee for you for being the first one to ask a question. And this gentleman over here who asked a question: you can also find me any time after this meeting, and I've got a Frisbee. What's that? Okay — no, you're not ringing any bells up here, so the short answer is no. Another question? Go ahead. So, okay, my take on it is that there will always be the DevStack defaults, and those are what's going to be the best tested; but anything that gets gate requirements in — and I think all three of those you mentioned have gating requirements — should work. It's really giving you a broader menu to pick from. They have different resources behind them, obviously; if you need commercial support, you might have to go a different way with one than with another. It's great that there are a lot of people solving this problem, because there are lots of ways to skin this cat. If you need switch-level hardware support — old-school, big-iron network hardware support — that's going to favor one of those solutions; an OS running on metal, like on a Broadcom chipset, is going to favor another one, or a couple, of those solutions; and if you want a pure software implementation with no metal at the top, you might want to go this way or other ways. That's going to skew your perspective, right? We are a large IP shop — well, anybody that's in cable these days is
basically an IP shop — and so we've got a lot of custom, or expensive, networking gear that we would like to be able to plumb into, and that kind of skews our thinking; but we also want to be able to go forward and do a pure software deploy on that same kind of gear, and we're doing both, so that definitely flavors our thinking. The first thing I always think about — since I take the calls as well; our whole team does, we're a real DevOps team, so you spend a few weeks every year getting yelled at, which really teaches you quickly — is that operational efficiency and ease of troubleshooting come first and foremost. Anything that collapses the problem set down and allows us to troubleshoot faster is always more of a shoo-in. Now, we have, as is usual at big companies, pretty well-defined silos. These are breaking down pretty quickly for us, but there's still a large networking silo — networking expertise — so any solution that crosses those bounds, outside of OpenStack or outside of what we should be controlling, is a danger area, I guess. We've actually started going down the Contrail/OpenContrail route. I think it's got a lot of benefits — we're a huge Juniper shop, and there's a lot of integration with Neutron as well, although it rips most of it out — and there are some compelling arguments for it, especially on those edge conditions where customers want to plumb things in and out of OpenStack: it would make that a lot easier, and it also creates a single dashboard for multiple teams to look at and control, and that's a good thing. But there's a ton of complexity behind it, and a ton of extra operational cost in manpower and training and getting used to it, and I would say we just aren't quite there yet. So we're always looking for solutions that we can deploy tomorrow, or
three months later, or in the next release of OpenStack, that really, fundamentally reduce the problems we have with OpenStack. One more follow-on comment on that: Kyle Mestery, the PTL during the Liberty cycle, has reiterated and re-reiterated that Neutron is now a platform, not a solution — it's a place to make solutions. So be aware that there are a lot of valid solutions, and there will be more as time goes on. Don't think of Neutron as a solution; think of Neutron as a platform for networking that people are bringing solutions to. Actually, one more thing on that: I really like some of the things ODL (OpenDaylight) is doing, and I was trying to set up a project using ODL to map our entire network infrastructure in OpenStack, which I just didn't have time to complete. My feedback on that is: at a certain point you have to be a technical expert; you need to dig in, have time to maintain it and spend on it, and become the expert, because it's not out-of-the-box, flip-a-few-switches-and-go — there's a ton of setup time. And that goes back to simplicity and ease of troubleshooting when something goes wrong with ODL, or when ODL causes some impact on my OpenStack instance — which did happen, up front, until I learned the bits to flip. There's nothing wrong with a lot of these tools; it's what your organization is trained for, and how much technical ability and time you have to introduce them. Maybe that's way too practical. Another question — and then: distributed virtual router, DVR. "DVR" is a very overloaded term, it turns out: we have internal projects where DVR is digital video recording, because of what we are — and it's not only that, it's cloud DVR — so it's very overloaded and we have to context-switch. Sure. The short answer is: I keep going down that path every release and following it, and then I feel like there's not enough operational tooling in Neutron to help troubleshoot as it is, and
now we've just distributed that problem around. I have three network nodes, and I know where to go for most of those problems; I don't want to have 200 places to go look across. There are a few bugs in there that I'm following, and at this point, for us, it's not ready. There are also something like two blueprints to make it even more distributed — I don't know if you saw those; there were sessions this week on that as well — so it's definitely something we're keeping in touch with. But if I had to put my finger to the wind and guess, I think it'll be two releases for us before we're ready to take that plunge. I think it's going in the right direction — slower than everybody wants, obviously, me too, but it's the way it is. Gentleman in the middle: [inaudible]. I don't — there's some reference information in the documentation, and in the blog post, that talks about the speedups, but we have not done any performance evaluations. I was happy to get it running on the plane flight over, so I don't have anything yet — but if I can get it put together in Neutron, in our environment, in 12 hours, that's pretty good. Another question — yes, ma'am? [Inaudible question mentioning the kernel.] Okay, where are we here — there we go. The best resource I have is the talk Rackspace gave at the Paris summit, called, you know, taking Open vSwitch to ludicrous speed — the key word is "ludicrous." They outline everything in there; it was amazing. So search for that term — that's why I put "ludicrous" up there: if you search for "ludicrous" plus "Rackspace" and "OpenStack," you'll probably find the right talk, and that covers everything they presented, which I found true when we upgraded. As for that other piece — we haven't tested it, or even exercised it; I can't say we've even brought it up, and we certainly haven't tested it
we're certainly not — I was unfamiliar with the term DPDK until last week. Versus which? I think, because it simplifies the pipeline, it probably improves it; I don't know whether it still falls apart at scale or not — we haven't done that test. Two things on that. One — and this is the reason I put "normal use cases" up here — is that we're not trying to press for traffic. Where we gate right now on standard Neutron: we're using two 10-gig bonds, so we've got a 20-gig pipe for tenant traffic. We have some weird interaction that I'm troubleshooting right now between security groups and VXLAN — don't ask me why, because I don't know yet. If you take a straight tunnel, I can push about six to eight gigs through a pipe, which I thought was pretty good, and I figure I can tweak that a little; as soon as I add security groups to the mix, I drop down to about six gig. So with a reworked security-group solution we may get some of that back — I want to understand how that works. I suspect — my plane-flight hacking aside — that OVN will decrease the overhead of making changes to the system; but fundamentally, if you're seeing performance problems in OVS, those are going to remain. Thank you for your participation. I have one more talk later today I'll give a little shout-out to: it's about OpenStack trivia — I'm slightly better prepared for it, and I promise Frisbees. And again, if you asked a question today and you would like a Frisbee, come find me in an hour. We'll take questions up here, and thank you guys very much for coming. Also, Time Warner Cable has another talk coming up — I think right next door, upstairs — or no, downstairs: using baling wire to fix your production OpenStack.