Excellent. Hi everybody. My name is Sean Lynn and I'm a lead engineer at Time Warner Cable. Thanks for coming today. We're going to talk a little bit about Time Warner's Neutron implementation and some of my general precepts coming out of that, about different approaches you can take to Neutron. If I can get a little measure of the audience: who's running Neutron in production today? Excellent. Who's on Nova Network? And who's just getting into OpenStack? A fair measure of everybody. Excellent.

Up front, I'd like to talk a little bit about discernment, and that's finding the feature set that's right for you. Neutron officially took over from Nova Network in this last release, and there's a lot of hearsay and conjecture out there that may scare people off. There are some use cases it definitely can't fulfill, I think, but in general, you may go online while researching a new implementation or trying to move from Nova Network to Neutron and read things like: it doesn't scale, it's unstable, all these other things. And then you go on the mailing list and you find out, oh, somebody can't get it to work, and there are trivial questions. So up front, there's a lot of scary information out there.

I just want to share with you what we did. We actually have Neutron running with an ML2 VXLAN overlay in production today. We've upgraded from Havana, which was more of a Nova Network flat-networking style, to Icehouse, which brought in the ML2 VXLAN overlay, and then we went from Icehouse to Juno. We also have L2 population in the mix. We have standard legacy routers. We're currently tracking a bug upstream: because we're using L2 pop, the HA router doesn't quite work yet, or rather, it works on one router and not the others. But we're working on that. It's a pretty straightforward setup. And we're adding new services all the time. Load Balancing as a Service is going to go into beta production for us, which is select customers, next month, and DNS as a Service is in production now.

So some of this is about traditional networking, what SDN means, and what adopting Neutron means. It's kind of a cultural divide, and there are some notions, in my mind, that you need to dispel about what Neutron will or won't provide. I'm not an SDN purist. I find Neutron fits the problem space I need to solve. I don't think there's any grand unified SDN theory; history has proven those really don't work, in physics or in IT. But SDN, or Neutron, is a great ecosystem as far as I'm concerned. Implementing it, though, and we've had several people from Time Warner talk about this, implementing OpenStack as a whole and getting buy-in to put Neutron in your environment really requires executive support. We have a fantastic executive team that fully bought into OpenStack and fully bought into transforming the culture at the company. As you can imagine, service providers are not historically the most fast-moving, forward-thinking companies, and we're trying to change that.

Moving to SDN, and even to Neutron, is a process. If you haven't been on ipspace.net, there's a really cool video by Matt Oswalt that talks about the pyramid of programmable awesomeness. It's actually a good way to frame thinking about implementing SDN and implementing automation in your networks, and I'd highly recommend it. He starts at the bottom of the pyramid: have a process for provisioning, have a process for configuring, and then go to full programmability. SDN is at the end of this process, not at the beginning.
Don't try to boil the ocean. Neutron is neat this way: it's a self-contained ecosystem. The edge is its problem; it's not well integrated at the edge.

So as you start to look at rolling out Neutron, upgrading from Nova Network, or really getting into OpenStack in the first place, you're on one side of this curve or the other. You're either playing with fire or you're sitting freezing in ice. Ultimately, your organizational structure should largely determine where you need to be on this curve. What I mean by that is, if you want to pull from trunk, the team you're working with that's dedicated to Neutron needs to be very much a development-organized team. You need to have your software processes down; CI/CD is not optional at that point. On the far end of this are people who don't have that background, who come from a more traditional networking background, and maybe installing from packages is the best way to go there. In the middle is where Time Warner lives, and we're trying to move more towards the innovators. Really, we're at early mainstream. I would say we install from packages right now, but we're moving away from that. We're rolling our own packages with Python and virtual environments, and that allows us increasing flexibility and a move towards trunk.

So when we sat down and looked at what we wanted out of Neutron, we picked out some basic design principles, and you'll go through this too. Do you want to do straight VLAN or do you want to go tunnel-based? Floating IP or flat? That's basically floating IPs versus Nova Network-style flat networking. Do you have requirements for other networking-as-a-service features? If you want Load Balancing as a Service, you've essentially already brought in the need for floating IPs of some sort or another. Do you need IPv6 or not? And my recommendation up front, and I didn't have this at the time, is don't stray far from a reference architecture. The new Neutron networking guide is actually a decent place to start: full-on configs and a description of the process to implement them. I did not have that, and it was a big stumbling block up front.

We made decisions. By the time we were on Icehouse, we had VXLAN implemented with L2 population. That gives our tenants all the knobs and dials they need to create their own networks, create their own routers, and stitch things together, which at Time Warner they had never been able to do before. We decided on floating IPs. We quickly added those other services. We'll come back to IPv6: at the time, the Icehouse support was there, but we did not feel it was ready. It's not in our environment now, but it's definitely on the end of our roadmap.

So up front, one of the first things you have to do is figure out your HA strategy. There's a bunch of required services behind the scenes, RabbitMQ and MySQL. Get these together before you put Neutron on top of them. If they're rickety, or if you don't have them in place in an HA manner, then your Neutron and your networking are going to suffer. We experienced a few outages because of this. Up front, we made some decisions while moving very fast. Implementation, from design to equipment purchase to rolling it out, was about six months, with hundreds of servers and multiple teams involved. So obviously, we're going to make a few mistakes or have an architecture that doesn't scale.
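To ground those design decisions, here is a minimal sketch of the kind of ML2 configuration that setup maps to, assuming the stock Open vSwitch agent. The section names follow the standard ml2_conf.ini layout; the VNI range and flag values are illustrative, not our production settings.

```ini
# ml2_conf.ini -- illustrative sketch of an ML2 VXLAN overlay with L2 population
[ml2]
type_drivers = flat,vxlan
tenant_network_types = vxlan
mechanism_drivers = openvswitch,l2population

[ml2_type_vxlan]
vni_ranges = 10000:19999        # hypothetical VNI range

[agent]
tunnel_types = vxlan
l2_population = True            # pre-populate forwarding entries, cut broadcast flooding
arp_responder = True            # commonly paired with l2_population; verify for your release
```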
We planned ahead, and we are currently moving to a version 2 architecture. So we've built our cloud, and now we're migrating our cloud to brand new hardware infrastructure, and switching infrastructure as well. On top of that, we're increasing our per-server network capacity from 20-gig to 40-gig links.

One of the big things I can caution you on up front is service distribution: how you're going to implement your virtual routers, how you're going to make tenant DHCP HA, and how you're going to lay out your Neutron servers. I'd suggest starting slow, learning, and migrating to increasingly better processes from there. We currently have three network nodes per data center, and we're about at the point where we need to break that up in a more fine-grained way; because our CI/CD processes are pretty good at this point, we have the ability to do that with almost no outage.

We use legacy routers. DVR was not there, in fact none of the HA for routers was there, when we first implemented. And like I said before, there's an L2 population bug that's preventing us from going full HA. Currently, we keep spiking DVR and tracking the upstream progress. I don't feel that we're quite ready to adopt it. I feel it's much more stable and much more interesting and compelling in Kilo. But it increases the complexity of your system, since there's a router, and more complexity, on every compute node, and I don't believe the operational tooling is there, or that we have it at Time Warner at this point, to just fully change over to that environment. So we're not quite ready for DVR at this second. And then make sure you run multiple DHCP agents for the tenant networks, or you'll be getting crazy calls in the middle of the night as instances come up without IPs.

Up front, when you're designing your Neutron implementation, make sure you design upgrading as part of the design process. Upgrades are a pain point in OpenStack overall, but once the network goes out, you have big problems. So you need to really think about how you're designing your system, have your CI/CD in place, test things as much as possible, and set up your development environment to do this, so that when you go to production you're not causing user outages.

All that being said and done, there's a certain amount of mystery, of reading the tea leaves, that still has to be done with Neutron. The tests are getting better. Every release is a major increase in not just functionality but stability of the product. We definitely, as a community, have a say in this, but at a certain point, if you're going to bring this into your large organization, you need to consider which features you're going to pull in, and it really is a little bit of reading the tea leaves at the end of the day.

There really are scale limits. I don't know what they are; different people's tests come out at different points, depending on your workload. We have a heterogeneous workload, anybody can put anything on it. We do have some gates with different bandwidth quotas, so we kind of know what to expect coming into our environment, and I can look ahead and read the tea leaves on how fast we need to expand. We really haven't hit any limits. It honestly is a worry of mine that we suddenly have a need for increased capacity that Neutron can't scale past. I don't think we're nearly there, and I hope not to be wrong on that. All the testing that we've done, and all the stability and code refactoring that I've seen, it seems like we're in a nice sweet spot right now at Time Warner.
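Coming back to the multiple-DHCP-agents point for a moment: a minimal sketch of the server-side knob involved, assuming a standard neutron.conf; the value shown is illustrative.

```ini
# neutron.conf (neutron-server) -- illustrative
[DEFAULT]
# Schedule each tenant network onto more than one DHCP agent, so a single
# network-node failure doesn't leave instances unable to get an IP.
dhcp_agents_per_network = 2
```

You can then check which agents are hosting a given network with the era's CLI, something like `neutron dhcp-agent-list-hosting-net <network-id>`.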
That being said, you can really help yourself out by having a capacity plan, writing it down, and then coming up with monitoring and trending products, or internal monitoring and trending, to track that over time and see if you were right or wrong.

But it's not all about the technology, and it's not all about processes and your design. Implementing Neutron requires some different team structures, in my mind. OpenStack as a whole was the first time, at least at Time Warner, where the networking guy, the systems guy, the storage guy, and the security guys were forced to sit in one room and speak the same language, and in some cases work at the same speed. It was an interesting challenge, but fun.

Up front, when you're designing Neutron, or your OpenStack implementation as a whole, take whatever time you think you need for design and quadruple it. You've got so many different teams involved in this that it's just impossible to move swiftly. You have to have everybody's viewpoint and buy-in to a certain extent. When I say expect an impedance mismatch in timeframes, it's my polite way of saying that some other teams in your company might not move as fast as you do, and you have to work with that. We were super lucky, and I'll jump down to the last one, to have huge executive support on this. We were forced to get into a room, we were forced onto timeframes, and we were forced to communicate properly with each other, and it was really fantastic. But one of the big things is that you're bringing together teams that have never worked in each other's domain. So as you're designing, as you're building out these teams, it's a learning process. It's not just cultural; it's technical skill that you need to translate. And the last thing I'd say is, during your design processes, you've got to get everybody together. We had two weeks of whiteboarding sessions, and I feel that was largely successful because everybody could step up to the whiteboard, scribble things out, and say, no, that's wrong, you can't do it that way. So everybody had a say, and it wasn't a disembodied voice over the phone for the most part.

Jumping around a little bit: if you're thinking about implementing Neutron, there's a very peculiar skill set involved. It seems more and more common, but you need to know traditional networking, and you need to have really good Linux skills. And you need to learn some new skills. We implemented Open vSwitch, and I spent many, many days looking through flow tables and learning that up front. It's not hard, it's just complex; it takes time, and you have to budget that into your implementation. Don't think that you can implement Neutron and never be in the code. You will be there. You will run into some weird bug where there's nobody else to reach out to, and you need to track it down yourself. So I would say Python and Neutron API skills are a must as well. That kind of spans a large domain.

Soft skills, and these are my opinions, I guess: you need to have patience working with many different teams. They don't all work at your speed. It all comes together in the end, but have executive support and learn to team-build, I guess. I put the sense of humor in there because, as I was sitting through all our design processes, I kept flashing back to the Key & Peele skit where Obama is talking very calmly and has a guy behind him, his anger translator, saying the outburst-y things. As long as I thought about that, I could laugh a little bit.
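On that flow-table learning curve: here is a sketch of the kind of day-one inspection involved, using standard Open vSwitch and iproute2 commands. The bridge and namespace names follow the stock OVS agent conventions; the router UUID is a placeholder.

```sh
# Dump the flows the OVS agent programs on the tunnel bridge; with a VXLAN
# overlay plus L2 population, these tables encode where each remote MAC lives.
ovs-ofctl dump-flows br-tun

# Walk the rest of the packet path: the integration bridge and its wiring.
ovs-ofctl dump-flows br-int
ovs-vsctl show

# On a network node, the router and DHCP namespaces are the next hop to check.
# Substitute a real router UUID for the placeholder.
ip netns list
ip netns exec qrouter-<router-uuid> ip addr
ip netns exec qrouter-<router-uuid> iptables -t nat -L -n
```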
So: everything's up and running, and you've had some pain points, but it's up and running. And now the customer calls start coming in, and your servers go down, and unexpected things happen, and they will. Here are a few things we ran into; some of these are very recent and some of them are ongoing right now.

In fact, an ongoing one: everything's a network problem, including when your message queuing hiccups. Technically, it's not our RabbitMQ message queue that's hiccuping at this point; it's the new oslo.messaging handling of it. If you haven't followed that whole saga, basically up until Juno there was a fork of the Rabbit messaging code embedded inside Neutron. Juno changed that and moved to the mainline library, and the mainline oslo.messaging ability to properly hold connections and reconnect with Rabbit is not quite baked. This is fixed in Kilo, and we're starting to bring down that code; you'll hear it called keepalive. A bunch of operators have gotten together and started fully testing that out, and that's helping us out a lot.

You have a running cloud, but I've got to tell you, operational tooling is poor in Neutron. Expect to be dealing with that from day one. You need to be able to go through the entire packet path and know it inside and out. Develop your own custom tools; there's nothing out there. If somebody knows of something, please let me know, but I think everybody hand-rolls their own tooling for this. You need to be fast with it, you need to be efficient, and you need to get your logging in there.

There is a call to arms on this. On Monday there was a talk by Carl Baldwin and Rossella Sblendido, which is very good; I recommend you watch it on the YouTubes. It was on the L2 and L3 agents, Juno to Kilo, if I remember. It talks about the refactoring, the performance improvements, all the good stuff you need to know while upgrading, and it also gives you an idea of where you're left behind if you stay on Juno. But there is a call to arms for operators and other developers to start jumping in and recommending tooling, recommending points in the code that can feed back into some sort of operational goodness, because right now it's weak; you're kind of flying blind on a day-to-day basis.

Another thing, one that will be fixed by Liberty, and this is my impetus to move away from distro packages and start bringing in Python virtual environments: the Open vSwitch agents and L3 agents are super heavy-handed. If you're running 50 or 60 virtual routers on a control node and you have to upgrade the L3 agent, or you have to restart the Open vSwitch agent, the genius of the Open vSwitch agent is that it flushes something like 2,500 flows, all of them, out of there. Your customers are dead in the water, and then it takes five minutes rebuilding it: nope, don't need that one; yep, need that one. Your customers are dead in the water. How do I know this? It happened. There are upstream fixes in Liberty that should help us out, and I'm looking to put them in production far before they're out in the real world.
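On the keepalive work mentioned a moment ago: a minimal sketch of the Kilo-era oslo.messaging knobs involved, assuming the RabbitMQ driver; the values shown are the sort of thing operators were testing, not a recommendation.

```ini
# neutron.conf (and any other service talking to Rabbit) -- illustrative
[oslo_messaging_rabbit]
# Kilo heartbeats: detect a dead Rabbit connection and reconnect, rather than
# hanging on a half-open TCP session after a network blip.
heartbeat_timeout_threshold = 60
heartbeat_rate = 2
```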
The other thing we ran into is VXLAN hard timeouts. When tunnels are stood up, VXLAN tunnels specifically, there's a hard timeout, and then there's messaging behind the scenes that says, oh, yeah, I still need that tunnel. Well, if something's wrong with your oslo.messaging or Rabbit and those messages get missed, then all of a sudden you start getting weird calls from customers saying these two boxes can talk to each other, but not to the third one, but if I jump to this guy and go back... It's a mess. There's no fix for that; it's just something I learned as part of the real world.

Early on, not anymore, we had problems with Open vSwitch. Distro packaging was back on 2.0 or 2.1, and we had crashes in Open vSwitch. Inexplicable crashes, no warning. Rackspace spoke about this in great detail at the last summit; watch that talk. Definitely upgrade, even if you have to get out of sync with your distro.

I think the other thing we're finding is that we're getting more and more requests for things at the edge of OpenStack, at the edge of Neutron, and that's all hand-wired together. Things like bring-your-own-IP, we're getting asked for that, and tunnels between two different data centers for legitimate reasons. Things like that are all customer requests. Neutron's not good at that. I'm not sure it's really in the mandate that it needs to be at this point, but realize that if you need it, it's a weakness of Neutron.

The corner cases we care about are, and not necessarily in this order: IPv6, which I've been following, its patch sets and its stability, for a year, and I think we're about ready to pull it in. I'm a little bit leery about giving it to the customers, because I think once they're in for a little bit, they're going to want everything, and I've got to make sure it's fully baked. And as a whole, in our OpenStack environment, we're moving closer and closer to upstream: the ability to pull in patches and new features on demand, test them out, and maintain them outside the vendor packaging system, avoiding the morass of dependencies, essentially. We haven't gone there with Neutron, but some of the things I've heard during this summit, and some of the patches we've been following, make it almost a requirement for us to move there. I think we will go there, and it'll definitely improve our flexibility.

Another interesting one, and I put this in as a small bullet point: we're talking about Neutron, but we also talked a little bit about working with your traditional networking teams. We're currently on a project with our traditional networking team to automate the underlay networking. We are moving to Juniper networking gear, and there's a lot of opportunity to automate switch configs and router configs via Jinja2 templates, so we're going that way. We're in the process of beta testing that with them. As part of that, we're using the standard workflow that we normally use; it's a software workflow. Anybody can commit a change. They'll be looking at the same set files they know and love. They'll check it in, three different groups need to plus-one it to agree on it, and there'll be a one-touch Jenkins job that deploys it when our operations team needs to roll it out. No more cut-and-paste from a Word doc, and much more testing. Shocking, I know.

Primarily, what we're after is this: we have the ability in our software environment to retrofit any of our nodes into anything else on demand, but our hardware configs can't keep up with that. So in 20 minutes we can rebuild a compute node into something else, but then we wait three days for the network port configs to catch up. This really enables that: we can plan, submit the fix to the switch configs as part of that plan, get it approved, and get rolling in shorter measure.
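For flavor, here is a minimal sketch of what that Jinja2-driven underlay automation can look like, rendering a Junos-style port snippet in Python. The template text, variable names, and interface values are all hypothetical, not Time Warner's actual templates or workflow.

```python
# Illustrative sketch: render a Junos-style "set" snippet from a Jinja2 template.
from jinja2 import Template

# Hypothetical top-of-rack port template; real templates would live in version
# control, get reviewed, and be deployed by a one-touch Jenkins job.
PORT_TEMPLATE = Template(
    'set interfaces {{ port }} description "{{ role }}"\n'
    'set interfaces {{ port }} unit 0 family ethernet-switching '
    'vlan members {{ vlan }}\n'
)

# Re-roling a compute node becomes a data change plus a review, not a
# hand-edited switch config.
print(PORT_TEMPLATE.render(port='xe-0/0/12', role='compute-net', vlan='overlay'))
```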
As I look back on a year, it seems like a much longer time than a year, in a good way. You need to engage your network and security teams early. I said that we had two different week-long sessions in design; I would say about 65% of that time would go down the rabbit hole of networking in short order. Everything's related to networking, right? New storage is networking-based; Swift and Ceph require networking. OpenStack seems kind of fast and loose at first, I think, to the traditional networking teams, and there's some time you need to build in to really describe what changes need to be made. So just build in extra time.

Looking back on our implementation, one thing I can say is, as you build up your new networking team, your SDN team, whatever you want to call it, match where you're at on that curve with the skill competencies that you have. We have a very, very good development-centric team, and that's why we're able to move towards trunk. Not everybody might be in that position, but you need to lower your risk by being at the right place on that curve.

I didn't have this, and it would have saved me weeks: start with one of those Neutron reference architectures. The guide's getting better and better, and if you don't like the guide, the community's there; give feedback. I believe it was several people from Time Warner and a bunch of other people, two operators' conferences ago, who kick-started the networking guide movement, and they've done a good job with it. It's way more information than has been out there in the past.

If you're standing up Neutron, stand up monitoring and tooling at the same time. Don't think that'll come later. Calls late at night get really old. We have been a little slow in putting our tooling online, mainly because it keeps evolving so fast, but we will be better about doing things like... We have a script that does router checking throughout all the environments and tells you when floating IPs seem to be down, and for us it has been a really good early indicator of things going wrong. If it's just one floating IP, you can call the customer up; if it's a whole region of floating IPs, you've got bigger problems, and it's way nicer to have this up-front alarm go off before the customer calls start coming in. We'll get better, and we'll definitely share that.
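A minimal sketch of the kind of floating-IP check just described, using the era's python-neutronclient plus a plain ping. The credentials and endpoint are placeholders, and a real version would feed an alerting system instead of printing.

```python
#!/usr/bin/env python
# Illustrative floating-IP reachability sweep; auth values are placeholders.
import subprocess
from neutronclient.v2_0 import client

neutron = client.Client(
    username='monitor', password='secret',               # hypothetical
    tenant_name='ops', auth_url='http://keystone:5000/v2.0')

down = []
for fip in neutron.list_floatingips()['floatingips']:
    addr = fip['floating_ip_address']
    # One ping with a short timeout; we only want a coarse up/down signal.
    if subprocess.call(['ping', '-c', '1', '-W', '2', addr]) != 0:
        down.append(addr)

# One dead floating IP is a customer call; a whole region of them is an outage.
if down:
    print('unreachable floating IPs: ' + ', '.join(down))
else:
    print('all floating IPs responding')
```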
At this point, we had some performance problems as we got into hundreds of routers, but those are quickly being fixed. When these slides are posted online, feel free to use the links. I did not put in the Key & Peele link, because it's uncensored, so you can look that up on the Googles. And I'd like to thank you all for coming. I'll take any questions that you have, and please use the mic.

Any reason you didn't consider Linux Bridge for your VXLAN encap on the host?

That's a good question. I felt like at that time the capabilities of Open vSwitch were coming on par, and we don't have any wire-speed requirements, so OVS fit a sweet spot for us, but we could have equally gone with Linux Bridge. Thanks. Good presentation. Thank you.

Thank you, Sean. What has your experience been with the upgrades from, say, Icehouse to Juno, or from one of the early releases to the newer releases, especially Neutron, and OpenStack in general?

Yeah. Go to his talk tomorrow, number one. It's getting easier and easier. It is a recognized pain point, and every single project is working on it. Your operational maturity on this, as far as knowing which services interact with what, even inter-service, that's what makes for a much easier, cleaner upgrade path. Our Havana to Icehouse was not pretty. Our Icehouse to Juno was actually quite comfortable in comparison, and there's still room for improvement, but it's a pain point.

Hello. I totally agree with your points about Neutron; we also use Neutron in our production, and I have several questions. First, can we go back to the HA strategy slide? Mm-hmm. You use legacy routers with agent HA, right? Yep. So, in my opinion, the community's HA has some problems: if your MQ is not very stable, or the queues are blocked, the HA will totally perform like a mess. How do you deal with that?

We have agent HA enabled, which we've had no problem with. Maybe we just haven't used it appropriately, or inappropriately, who knows. We haven't been able to use the HA router, which is slightly different, that's the VRRP one, because we're blocked by an L2 pop bug. So I can't speak to that. I know that as soon as we had one router working, and we kept trying to fail over and figure out why things weren't working, we realized L2 population was busted, and we backed out of that for a little bit.

Okay. And my second question is about attacks. Have you been attacked in your cloud? In real life, we get many DDoS and other attacks, which drive our network routers to high load so that normal traffic can't flow.

Yeah. Part of that is mitigated by the fact that we've limited access in front of our floating IPs. Our agreement with the security, networking, and firewall teams, and with our customers, is that within the Time Warner networks, and there's a bunch of them, it's a free-for-all; it's only security groups at that point. If you want to go beyond that and put a production app in the mix, then you need to call out the firewall team. So Neutron's not acting at the edge. That would scare me.

The last question is about DVR. Do you use DVR, and what do you think of it? Thank you.

We do not. Initially, in Juno, I did a spike on it and started tracking the bugs that were in it and, more importantly, looking at the tooling that was there to help troubleshoot when things went wrong, as well as the migration process from legacy routers. I just was not comfortable with the whole ecosystem. I haven't tried Kilo yet, but that list of my gripes has diminished a lot. Thank you.

Do you have any experience or opinions on Neutron, particularly the reference implementation versus Akanda? I do not. I would love to help you out, but I don't have any.

Hi. I was going to ask if you used security groups, but you answered the gentleman earlier and said that you had. Did you do that early on? Security groups? So, the iptables rules on the Linux bridge interfaces. Day one, that was part of the... How's that working for you? Is it just working?

It's just working. That being said, we have some application teams that are used to having a list of like five billion single /32 IP addresses to load in there; before Juno and ipsets, that was impossible. We also have run into a couple of user issues where we've had to call them up and say they're doing things inefficiently, I guess is the best way to put it. We kind of police that a little bit.

Our environment is Icehouse, and we have particular problems with hypervisor-to-hypervisor or same-hypervisor VMs; that just doesn't work. So have you found that that's fixed later on, or is that...
Yeah, there were some oddities in Icehouse that I don't see anymore. Great, thanks very much.

Hello, you mentioned this VXLAN timeout issue. How did you diagnose it, first of all, and how did you fix it?

Diagnosis was late at night in a self-inflicted outage. Tunnels just started dropping and ping tests started failing over a five-minute period, and it literally was the timer going off; networks would just fail out over a five-minute period, really smoothly though. So as other people started restarting and working on other things, I looked down in there and started thinking, it's got to be a timeout. Because I knew the flow tables at that point, I looked in the flow tables for something that said timeout, quickly read up on the spec, and was like, oh yeah, that's what's happening. It's hard-coded in there. There is no fix unless you roll your own; there's no way to set it in parameters. It might be something to submit upstream.

Is this recurring for you?

It can be, but only if we're doing things that... I would say mostly this is related to the heavy-handedness of those L2 and L3 agents. I think when that's fixed, this is a non-issue.

The problem is it flushes all those tables, so its neighbor would time out by the time the routers were built up again, right? Just for reference, we were running into a similar issue, and we got rid of L2 population and our problem went away. Just for reference.

I will look at that as well.

I have two questions related to scale, on two different topics. One is scale at the physical networking level. You mentioned that you are using Juniper fabrics. Is that a flat layer 2 network across the whole front side of your Neutron?

Leaf-spine. Leaf-spine. I don't even remember what this... yeah, let's do the answer: it's a completely L3 Clos network.

So it's layer 3 at the top of the rack? Yeah. Are you putting all your Neutron gateways in a single rack?

Not day one. That would be a lovely thing to do, but no.

So you're spreading them across different racks? Yeah. So is it different floating IP pools on each one of the Neutron gateways?

No, we didn't break it up that way, no.

So I'm curious how you're managing one floating IP pool across three different Neutron gateways.

Remember, there's NAT in the mix. So we're basically pinning the floating IPs up on our MX router end of it, and then it comes down, gets routed appropriately underneath the hood, and gets translated into the local address, unless I've missed your question.

Okay. The other question is scale at the Neutron level. I keep hearing from the vendors that they're trying to fix Neutron's scaling. So did you try going beyond three Neutron gateways, to see whether it stops scaling at a certain point?

Before we do any expansion, we typically spike that. I have not done it recently, especially with the exciting Kilo refactorings and fixes, so I don't have any measure of that. I don't feel like it's a database problem anymore; it's a Rabbit messaging problem. It's still pretty chatty underneath the hood, and there'll be fixes in Liberty and beyond that consolidate that down. But it's super chatty. Thank you.

Any other questions? Feel free to come up here. Thank you guys for coming.