Yeah, hello everybody, welcome to the next customer-focused session. My name is Michael Holterland, I'm a sales engineer and I've been working for Dell for 10 years. This is a joint session, a customer-focused session, and here with me is Felix Schuren. Felix? Hi everyone, I'm Felix Schuren. I've been with my company for more than 15 years now. I've been a sysadmin, a network engineer and many other things in technology, and I've recently been made enterprise architect. Back to Michael. Thank you. So yeah, as I said, this is the next customer example, and the customer name is Host Europe. This is again a networking-focused session, but before we start with Felix's input, a few words on what Dell is doing in the OpenStack space. What Dell brings to the table is a lot of experience. I think everybody is aware of the OpenStack timetable with the different versions of OpenStack, and this is the Dell OpenStack timeline. Dell was one of the first vendors to support the OpenStack project; we have been in the project since the Austin release summit, so we have supported OpenStack since the beginning, as one of the only A-brand vendors. And what else have we done? Dell, for example, provided one of the first papers on bootstrapping OpenStack clouds.
That's what we did in 2011 with the Bexar release. Dell also developed its own deployment tool, called Crowbar. We open-sourced Crowbar and wrote several papers on deploying OpenStack with Crowbar. Dell also brought OpenStack to EMEA; this was announced, for example, at the World Hosting Day in Germany in 2012. What else has Dell done? Well, we also built a paper together with SUSE, one of the major Linux distributions, and launched a SUSE-based OpenStack solution. One of the key components in the OpenStack space is software-defined storage based on Ceph. We have worked with Inktank since Inktank was founded; Dell supports Inktank and Ceph, and we wrote several papers in the Ceph space as well. So that's also what we bring to the table: a lot of experience in the Ceph space. And one of the last big players we work together with is Red Hat, so there is also a joint reference architecture available with Red Hat. During the OpenStack time frame we also supported Ceph with the second Ceph Day in EMEA, hosted in Frankfurt in the Dell office. So that's what Dell is doing in the OpenStack space: we have supported OpenStack since the beginning, with a lot of experience in this space. Reference architectures are important for our enterprise customers. They ask for a fully supported solution, software, infrastructure and services, and tested solutions based on OpenStack with all major distributions. As a single point of contact, Dell delivers a fully supported OpenStack solution depending on the flavor you prefer. We work with all major OpenStack distributions, beginning with Canonical, over SUSE, but also with Red Hat.
So we give you the freedom of choice as to which distribution you want to work with, and we have reference papers available. One example: one of our latest partners in the OpenStack space is Red Hat. We developed a complete go-to-market with Red Hat, beginning with a proof of concept, a pre-configured solution, which can grow to a pilot system and then up to production. This is fully tested and pre-configured by Dell. Another important part of the overall OpenStack solution is Ceph, software-defined storage 2.0. So what is Ceph? Ceph is a software-defined storage solution; it provides block and object storage, it scales massively into petabytes, and it is open source, so it fits very well with the OpenStack approach. Dell supports Inktank, the company that has supported and developed Ceph since Inktank was founded. What we are doing in the Ceph space: reference architectures, optimized configurations, but also I/O guides for what you really need to size a Ceph solution. That's how Dell supports its customers with Ceph. And last, the Dell Solution Center. It's great support for really testing OpenStack in a productive way. It's free of charge for all customers, it's pre-sales support from Dell, and we do briefings and design workshops. We also give customers the chance to test OpenStack and Ceph for several weeks before they buy a solution, or before they decide which way to go with their OpenStack approach. So that's what Dell is bringing to the table, but this is a customer session, so: Felix. Yeah, here we go. Thank you, Michael, for the introduction. I didn't want to talk about our company really. We are a large hosting company in Europe.
We do a lot of hosting; we have tens of thousands of virtual machines. I didn't get the exact figures, sadly, but it's a lot. Michael introduced me, so let's get to the meat, which is why you are here. We did a bit of development on OpenStack internally, trying to get it running, and one of the things we came across was problems with Neutron node failover. I know that usually, when people set these things up, they are so happy it works that they just have a single Neutron node. We tried to at least have some redundancy, and when we did failover tests, it didn't work directly; it worked after a while. So in general it was working, but the way we had set it up was for it to use gratuitous ARP, and after a lot of debugging it turned out that our internet router didn't support gratuitous ARP, because it is a security risk: you are overwriting the ARP table with information from the outside. The fix in theory was simple but vendor-dependent: we could only enable it on a physical interface, which means it doesn't work for just one network; it will then work for every connected VLAN on that interface, where we might have different security policies. This was really bad in our case, but it might save you a bit of time if you come across something like this. During setup of our environment we also ran into trouble with Ceph, and it's not so much Ceph itself but the network surrounding Ceph, where MTU issues were a problem. The fact that we were using more than one network was a problem, and a lot of the IP addressing was really problematic, because we weren't doing a single flat layer 2, a single flat network, since we were afraid of scaling issues, and that proved very, very troublesome. Regions, and the concept of assigning fixed IP pools to regions for the floatings, is a pet peeve of mine, something I don't like, something that's really a bit of a waste, and it compounds the problem I want to talk about. In essence, "don't touch the network" is what we usually found. People set up OpenStack, and OpenStack is complex; there are a lot of things that happen that need to work, and people are so happy that it finally works. They never look at the network again, or they sort of broke the network many times in between and at some point it works, so: don't touch the network, never again. The problem, of course, is that networks are complex beasts, especially at scale, and if you don't touch the network, if you don't fix your underlying network, then it's going to blow up at some point. It really will. Now, this is what I suspect is a very common network design for an OpenStack installation coming out of the first round of DevOps. It really is a flat network, which I'm pretty certain most of you will have run at some point or are still running. I even have two Neutron nodes in this, which isn't the norm. And for this, let's say we have a Juniper EX4200 as internet router. I guess this is a fairly common model which many of you might have looked at for medium-sized deployments, because it supports 16k ARP entries.
It's a fairly well-sized box, and we are going to run 8,000 floating IPs off of it, so we're at 50% capacity in theory. A quick reminder, or really a primer, on how most switches and routers work these days: you've got a CPU, which is really a fairly standard-spec PC, talking to special hardware, which does the actual forwarding. The CPU side houses all the tables, routing tables, ARP tables and all these things, and pushes information through an internal link towards the actual forwarding engines, which do the packet forwarding. So we've got 8,000 floatings, or a bit more; it's a /19 at about 75% usage at some point. And the trouble, in my mind, is really ARP. With an uptime of an hour, we've got roughly two ARP requests per second just to maintain the 6,000 addresses in use. Some of them will auto-refresh while traffic is happening, but it's not a lot. So where's the problem? The problem isn't with the used IPs; the problem is with the unused IPs. Because we have configured our router with the /19 as a directly attached network, whenever traffic comes in for any IP out of that network, it needs to find out the MAC address; it needs to ARP. And that means roughly every second, it depends a bit on the implementation, but as a rough guideline, every second you will send one broadcast packet asking: hey, is anyone responsible for this IP address?
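As a rough sanity check, the arithmetic behind this can be sketched in a few lines of Python, using the figures from the talk and the one-request-per-unanswered-address-per-second guideline just mentioned:

```python
# Back-of-envelope ARP load for a directly attached /19,
# using the figures from the talk.
SUBNET_BITS = 19
total = 2 ** (32 - SUBNET_BITS)       # 8192 addresses in a /19
floatings = 8000                      # assigned floating IPs
in_use = 6000                         # actually answering ARP
unused = floatings - in_use           # addresses that never answer

# Steady state: ~2 ARP refreshes/s to maintain the in-use entries.
steady_pps = 2

# Under a sweep of the whole range, the router broadcasts roughly once
# per second for every address that never answers:
scan_pps = unused * 1                 # ~2000 ARP broadcasts per second

# Compare with the EX4200's internal CPU-to-forwarding limit from the talk:
internal_limit_pps = 1000
print(scan_pps, internal_limit_pps, scan_pps > internal_limit_pps)
```

So a simple port scan pushes the router's self-generated ARP traffic to double its own internal control-path budget.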
And that's per host. If someone runs a port scan across my 8,000-floating range, that's pretty much every host, and I know that the large ranges of virtualization providers are interesting scan targets, because a lot of things come up, and they might come up with default passwords or some other bad idea. Going back to the EX4200, because I know it fairly well: that internal link actually has a hard forwarding limit of 1,000 packets per second, and at least two years ago the vendor didn't put any class of service on it. So you've got traffic hitting this thing, the CPU needs to generate ARP requests, and you've got 2,000 free IPs, so that's 2,000 ARPs per second going through that link, which also carries your statistics, your monitoring, your internal routing sessions, maybe your external routing sessions. What happens, and this has happened at some point, is that you start to get connection drops; you can start to lose connectivity in your IGP. This is a denial-of-service vector which most router vendors don't think of: these boxes are fairly well protected against traffic coming from the outside into the routing engine, but here the device is DoSing itself, because it is generating these packets. What I'm really annoyed with, when it comes to ARP and the implementation currently chosen for most OpenStack installations, is this: why are we asking a question to which we know the answer? We know which Neutron node is responsible for a specific floating IP; we've assigned it. So why do we ask the network, or the router, to do broadcasting? Which is essentially shouting into the room, interrupting everybody, to ask: hey, who's responsible for this? I actually know who's responsible; I was just too lazy to tell the router, or I didn't put it into the router. That is a point which I attribute to the fact that it only starts to show its ugly face at scale. Once you start to hit a couple thousand machines, this starts to become a problem; if you hit five figures, it becomes a real problem. It's really problematic. Before OpenStack, a couple of years ago, we had a similar situation with ARP traffic for virtual machines, 40,000-odd machines, and it was really killing our routers, really hurting them, and we converted them to a routed setup. Essentially, since we know where we want traffic to go, we know this floating needs to go to this Neutron node, or, if you think about DVR, to which compute node it needs to go, so instead of having our internet router ask, we tell it. And the best way to tell routers how to reach IP next hops is BGP, the Border Gateway Protocol. BGP is a technique that has been used in the internet since forever, and it scales a lot higher than ARP does. If you make this relatively simple change, and I'm getting to how to get that information there, what you do is: on this machine you no longer configure the /19 as a directly attached interface. The router no longer ARPs; instead, for every floating that is assigned, this guy gets told, and it knows where to send it. The only information it needs to find out is the MAC address of the Neutron node, but that's just per Neutron node, not per floating IP, so it's several orders of magnitude that you save. And this is a purely internal operation, and you can filter it; nobody on the internet can force it, so port scans will not trigger any ARP here. You can stop this from happening. Now, the obvious solution might be to say: right, we've got the controller, or some sort of controller, some orchestration device, why doesn't that talk to the internet router? But one of the design ideas is that the OpenStack installation should continue to run even when the controlling components are down. We can't change state then, but it continues to work as it was, so that isn't the greatest idea. One of the ways to do it is route reflection, a standard technique in routing, and this is with DVR in mind already. Each compute node, and in this case it's a Neutron component within it, tells these special boxes, and the special boxes are really just BGP route reflectors: a common type of network device, fairly easy to implement, a standard vanilla setup. What each node essentially says is: right, I'm responsible for this floating IP. It has this internal information anyway; the floating is routed to the internal bridges and the tap devices, and on export it simply changes the information and says: actually, send that to me, send it to my address. The only job of these reflectors is to collect and then consolidate this information and send it to the actual internet router. And a nice side effect: in many, many companies the internet router is run by the networks team, while the OpenStack installation is run by a DevOps team or some other function, a sysadmin team. The networks team, and I used to be in a network team for quite a while, is usually quite reclusive: yeah, just give me a ticket, it'll be done next week. Because when you're running the core, stability is paramount, since you're supporting everything; it just works very, very differently.
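A minimal sketch of what a per-node BGP speaker could feed to its routing daemon on assignment and migration. The talk does not name a specific daemon; this assumes ExaBGP's text command API, and the addresses and function names are illustrative only:

```python
# Hypothetical glue between a compute/neutron node and ExaBGP's text API.
# Each floating IP becomes a /32 host route with this node as next hop.

def announce(floating_ip: str, next_hop: str) -> str:
    """Command advertising a host route for one floating IP."""
    return f"announce route {floating_ip}/32 next-hop {next_hop}"

def withdraw(floating_ip: str) -> str:
    """Command sent when the floating IP migrates away from this node."""
    return f"withdraw route {floating_ip}/32"

# On migration: the old node withdraws, the new node announces, the route
# reflectors propagate the change to the internet router. No broadcast;
# only the BGP sessions involved see the update.
print(announce("203.0.113.5", "10.0.0.11"))
print(withdraw("203.0.113.5"))
```

In a real deployment these lines would be written to the stdin of an ExaBGP process (or an equivalent daemon) peering with the route reflectors.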
So you could think of not using these reflectors and simply having each of the nodes talk directly to your internet router, but every time you add a compute node you'd need a config change on that device, and your network teams won't let you do that in most instances. The workaround is: you set up these reflectors, you control them together with your compute nodes, with Puppet, with Chef, with whatever, and if you have a new compute node you put it in here. The nice thing, if you do this, is what actually happens on a migration, rather than what typically happens with gratuitous ARP. Gratuitous ARP really is horrible, because you just say something and pray it works. You have no idea if anybody actually listened to your change, to you telling them: hey, please use a new MAC address. You don't know; you just assume it's hopefully going to work out, and as I mentioned initially, it didn't work out for us, because our router simply ignored these messages. What happens instead is: I have this VM, and it's going to migrate over here. On migration, the old side sends a route withdrawal, and in BGP each speaker has an active TCP session to these devices, so it's a stateful protocol; the current state is known. On migration, one side sends a withdrawal, the other says: hey, I've got a new route. The information gets here, the routing table is updated, and traffic gets sent to the new device, and you didn't need any broadcast. None of the others even know something happened, and they weren't bothered. You didn't shout into the room saying: hey, I'm responsible for this now. You simply told all the relevant parties and left the rest alone. That brings me to what we really learned while setting up our OpenStack clusters. Layer 2 domains are dangerous. They really, really are. L2 is almost a binary thing: it works, and then it's broken at some point, and when it breaks, you're lost.
You've got pretty much no idea why it broke. So for us it really is always about scaling down, keeping layer 2 domains as small as possible. We use layer 2 where we must, but we don't want to. That ties into ARP, which I see as a necessary evil. ARP is horrible in so many ways: it's a security risk, it's a risk to the infrastructure, and it's really bothersome, just shouting into the room and having thousands of devices going: oh, no, no, it's not for me, okay, yeah, fine. So cutting down on ARP, if you want to scale, is a really good idea, I think. Gratuitous ARP: don't do it, if you can't get to a routed setup quickly. One important thing instead of GARP: if you can, use virtual MAC addresses for redundancy. If you use a virtual MAC address, or you take over the MAC address together with the IP on a failover, you don't need to update the ARP tables. The only thing you need to update, by sending an appropriately crafted Ethernet frame, is the switching devices, so they learn a new switching path, and that's a standard feature of Ethernet switching and usually works. Usually works, whereas GARP sometimes works, not always. One thing we also did, which was very, very helpful, is an out-of-band management network. So we've got our compute nodes and the storage nodes and everything on 10G, but with a 1G connection from the side. That is troublesome to a certain extent with automated deployment and all the other things, because it messes most of the systems up: it's not a flat network, it's not a single network, there are multiple of them.
They have multiple ways to do stuff. But the actual magic there is really two ip rule entries on Linux boxes; it's not that hard, and it was a tremendous help, because we could debug without shooting ourselves in the foot and without needing to go the KVM route, some keyboard-emulation thing. Routing, in my mind, solves everything. Now, I come from a network engineering background, and routing sort of is in my blood, but over the years it's always been ARP and layer 2: it seems simple, but once it's broken you're lost, you can't do much of anything. With routing, that's a lot better supported by the vendors. It's a lot simpler to debug, because it's stateful: you can look up information and do all these wonderful things we need to do when operating these systems. And the one thing we should always prepare for is: what happens when it fails, when it doesn't work? Routing, layer 3, really is the one thing that will help you there. So yeah, that's sort of what I wanted to say: ARP, if at all possible, don't do it; only use it when you must. And this opens the space for questions, if you have some, please. The question is: how are you affected by IPv6? IPv6 is only making it worse. IPv6 neighbor discovery is really the same thing as ARP, and the problem with IPv6 is the recommendation that you should use a /48 if you can, at least a /64, on the links, and that's just madness for your routers. In fact, a lot of the core networks, the core transfer networks, don't put a /64 on their transfer links; they put a /127 if they can, just to cut down on the problem of port scans triggering all these neighbor discovery entries. It works just like ARP. And in routing, the beauty is that it works identically; you don't need to change anything. You might have to run multiprotocol BGP, which has been around for a long time, to say: right, this is a v4 address, this is a v6 address.
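The out-of-band policy-routing trick mentioned a moment ago can be sketched roughly like this. All interface names, addresses and the table name are made-up examples, not taken from the talk:

```shell
# Give management traffic its own routing table:
echo "100 mgmt" >> /etc/iproute2/rt_tables

# Rule: anything sourced from the management address uses that table...
ip rule add from 192.0.2.10/32 table mgmt priority 100

# ...and that table sends it back out of the 1G management interface,
# independent of the default table used by the 10G fabric:
ip route add 192.0.2.0/24 dev eth2 src 192.0.2.10 table mgmt
ip route add default via 192.0.2.1 dev eth2 table mgmt
```

The effect is that replies to anything arriving on the management NIC leave via the management NIC again, so the out-of-band path keeps working even when the main fabric is broken.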
So the only downside with using routing, really, is that most routers aren't optimized for host routes: you've got single-IP prefixes that the router needs to forward, which sometimes is a problem in the forwarding plane. But ask your router vendor; with modern kit it shouldn't be. Other questions? Yes, please. How would this problem change if your tenants were running overlay networks? With overlay networks you have an entry and an exit point into the overlay, and that's where you would run the routing component, where you would say: right, please send traffic for this public network, whether it's a single floating or a longer subnet prefix, to me. That's essentially all you are telling the underlay network, because the internet router really is our underlay, and that's where the problem is. So a lot of what I talked about, pretty much all of it, isn't so much inherent to OpenStack; it's the network infrastructure surrounding it which is suffering because of some of the current design decisions. In terms of an overlay, you've got all the usual overlay stuff to do: you need to set up your tunnels, you need the encapsulation, you need to raise the MTU and all of that. There are interesting ideas, Contrail for instance, on how to do this with routing as well, and I agree with many of those ideas. But this is about getting traffic to the correct entry point into an overlay, and whether it's an actual overlay or just a single compute node with a one-to-one mapping doesn't really matter; it's a technique you use to get traffic to the overlay entry point. Yes, please. Okay, so the question is around the separation between running your network traditionally and having a DevOps team running OpenStack, and when network problems happen within OpenStack, how to deal with debugging and who to go to, is that correct? Yeah, absolutely agreed.
So, in short, it's a problem, and it means that you need network knowledge in your DevOps teams, in your OpenStack teams. You really need network engineering knowledge in there, because if you're running overlays the way I've seen it in most places, you have your networks team, and they don't have a clue about OpenStack, while your OpenStack team does all the tunneling and all the encapsulation and runs the overlay. That means your classic network ops team only sees compute-to-compute or compute-to-Neutron traffic and can't help you with what's inside, and the whole overlay thing is fairly complex. You really need network engineering knowledge in the DevOps team that runs OpenStack; that's the only solution. In my mind, the current network team sort of evolves into a role similar to that of the internet carriers: right, I'm packet pushing, just give me your packets, I'll deliver them; whatever you do inside is your responsibility; my job is just to pass frames correctly and send IP packets correctly. So you really do, I think, need the network engineering expertise within the OpenStack team. Oh, I think there was a question there. Unfortunately, no; I did have one, but it sort of deteriorated over the last week because the hardware was too old. I'll be happy to talk a bit more; I have some more slides which I couldn't fit in here.
So just hit me up; I'm here at least till Wednesday, and I'll be happy to share some more thoughts. The question is around the physical network hardware and how to integrate and orchestrate it. The answer is: we didn't. We set up the physical infrastructure statically and rely on the overlay technologies, partly because we have this traditional separation with the network team. So we said: right, let's use the network team as the packet forwarding engine, in a way, have them set things up in their old ways, and then use overlays and do the networking ourselves. There was a question, yeah. So the question is: did we consider using an SDN controller to solve our issues? Yes, we did look at some SDN controllers, but most of them, in my book, were trying to do too many things and were, in a way, too prescriptive. To me it's about keeping the modularity and the openness. So we did consider them but didn't find them applicable to us. Exactly, and the other issue we had is that the management side of most of the SDN controllers we saw would fall over at a certain scaling point anyway, so we didn't feel they were scalable enough, and they were too prescriptive for us. That's why we didn't pick them. At the moment we're using Juno, fairly vanilla, with some optimizations I can't talk about yet, but we started with a vanilla Juno setup. Come again? Why we used such a big subnet mask, the /19 I was talking about: realistically we typically use several /21s rather than one big /19, but it doesn't actually matter to the router, because it's the same router and the same amount of ARP traffic. Yes, we do have multiple routers, but the traditional setup we have is a set of large data center routers per site, and we didn't want the management overhead of having multiple internet routers in place. There are ways to sort of scale around it a bit, but that doesn't solve the actual problem, and I think it's not that difficult to solve the actual underlying problem, the brokenness of ARP, in a different fashion. The current production environment of OpenStack? I would have to ask my colleagues. The OpenStack setup isn't that large; we have only recently started deploying it. The routing solution we have had in place for years, seven years or so, I think, and it carries 60,000 to 80,000 routes just for virtual machines, and that works. I don't see any more questions raised, so... no, I actually lied, I didn't bring cake, but I sure hope that our hosts have some. You might have luck there. So thank you for attending, and enjoy the rest of the week. Thank you.