Well, after three days of parties I can see people are going to be straggling in for a while, but I think I'll get started. The title of this talk was "Flexible Networking at Large Scale," but people have different ideas of just what large scale is. Some people consider a 50-node or 100-node cluster something they might aspire to and call large, but Yahoo is a little bigger than that. What we're actually talking about is mega scale. By mega scale I mean lots of compute nodes, lots and lots of VMs, lots and lots of network bandwidth, and lots and lots of traffic between tenants as well as from tenants. It's not everyone facing outward, pumping data into the greater world; it's lots of things talking to each other and then coming up with something that they project out into the great beyond.

So, our goals in accommodating mega scale. Three goals, probably goals we all have, but at mega scale things are a bit different. We need reliability: Yahoo is large, and depending on who you ask we're between first and fourth in terms of page views on the web. We have many properties supporting what our CEO calls the daily habits, things like news, sports, weather and so on, and people come to depend on those, so we need reliability as well as scale. We need flexibility: those daily habits are supported by many subsystems, and there are lots of things behind them having to do with serving advertising and various other functions. And then we need to keep things as simple as possible. The problem is that you can come up with lots of complex systems, but the bigger you go with scale, the harder that gets; it's not linear as you scale larger. So we try to keep things as simple as possible as we grow.

Our strategy is to build a highly performant network backbone that everything plugs into. We use OpenStack; OpenStack is at the middle of this as far as providing all the compute resources. We augment that with automation. Unfortunately, a lot of this is external to OpenStack, in part because we already have many systems in existence. We've been around for 19 years; we've accreted a lot of management systems and so on, we have to integrate with those, and they sometimes do jobs that are appropriate to our scale but may not otherwise be common. And then we design our systems so that they're made of components that are essentially disposable. What I mean by that is that any given VM can go out of existence at a moment's notice. That's something we already achieve, because we have to have reliability and we have to have scale, so the tendency is to have many, many computers all doing the same task; there are going to be failures, and the software is designed to accommodate them. Moving that into OpenStack, into a cloud, is facilitated because there is no problem with having a VM go down, and we leverage that in terms of providing our flexibility. We also have a lot of systems management infrastructure that we've created, and by bringing that into OpenStack, as I will talk about, we find that we already have a lot of the pieces that are necessary.

Now, when you're small scale, and I'm sure Yahoo was like this back in the 1990s, your networks are very simple. It's a layer 2 design.
It's essentially a wire with intelligence. Ethernet was originally, back in the old days, just one long cable with transceivers plugged into it, and that model grew as we created switches and other equipment, but nonetheless it's a wire with intelligence. That makes it relatively cheap to build and fairly easy to manage, and it allows great flexibility of solutions, which is not necessarily a good thing as you grow. You have issues with things like broadcast storms once you grow larger and larger. People leverage things like multicast, which can only scale to a particular size. Various other issues seem to emerge when you rely on having a layer 2 domain. But it's often the way people start out, and of course once you have a system on the ground you want to grow it, so by simply making the wire longer and longer you continue to scale until you finally reach a point where something melts down.

Now, in the cloud world, one of the advantages of having a large layer 2 domain is that if you're doing things like live migration or various other techniques of combining your compute resources, you have IP mobility: an IP can move to any hypervisor on your system without issue. So it's conceptually simple, but it has limits, and we discovered some of those limits long ago. As an aside, I am a software person, not a network person. I started with OpenStack after being a C++ programmer for 20 years and then getting into the clouds. My networking folks tell me that there are limitations with the hardware, there are limitations with the various management protocols within the L2 domain, and there are other issues which I already mentioned, things like broadcasts and other network phenomena as it grows.

Now, if you go to a layer 3 network, a lot of those problems go away, because essentially you've partitioned off a bunch of L2 domains. You don't have the sorts of issues of large L2 domains, but of course you've limited your flexibility. There are some potential solutions, and there are probably a dozen vendors here that will provide them in one way or another. One of the big hot ones now is software-defined networking: simulate a larger L2 domain on top of an L3 domain. One of the ways of doing that is an overlay, where you encapsulate all your packets, and that creates some issues as you scale larger, because you're creating a large control plane. If you have network issues, if your network gets partitioned, you can wind up with big trouble. Although we've done some experimentation with software-defined networking, we can't really grow it as big as we need to for mega scale. There are other solutions that don't necessarily involve encapsulation, but once again we don't think they're quite there yet, at least not for our purposes.

So our solution is to come up with as big an L3 backplane as we can produce. At this sort of scale the new hotness is to use what's called a Clos design. It's named after a fellow named Charles Clos, who developed a way of interconnecting telephone switches back in the 1950s.
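To make the scaling property of that design a little more concrete, here is a minimal sketch of the port arithmetic behind a simple two-tier (leaf-spine) Clos fabric. The switch port counts used here are purely illustrative assumptions, not figures from this talk.

```python
# A minimal sketch of the port math behind a two-tier (leaf-spine) Clos fabric.
# The port counts below are illustrative assumptions, not figures from the talk.

def clos_capacity(leaf_ports: int, spine_ports: int, uplinks_per_leaf: int):
    """Compute server capacity and oversubscription for a simple leaf-spine fabric."""
    downlinks_per_leaf = leaf_ports - uplinks_per_leaf   # ports left for servers
    max_leaves = spine_ports                              # each leaf uses one port per spine
    max_spines = uplinks_per_leaf                         # each leaf has one uplink per spine
    servers = downlinks_per_leaf * max_leaves             # total server-facing ports
    oversubscription = downlinks_per_leaf / uplinks_per_leaf
    return servers, max_spines, oversubscription

# Example: hypothetical 64-port leaf switches with 16 uplinks each, 64-port spines.
servers, spines, ratio = clos_capacity(leaf_ports=64, spine_ports=64, uplinks_per_leaf=16)
print(f"{servers} server ports across {spines} spines, {ratio:.1f}:1 oversubscription")
# -> 3072 server ports across 16 spines, 3.0:1 oversubscription
```

The point of the exercise is simply that capacity grows by adding more leaves and spines rather than by stretching one L2 domain, which is what makes the fabric attractive at mega scale.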
And this was recently rediscovered by networking folks. It's basically an interconnection matrix that allows you to scale to almost arbitrary size and yet allows up to line rate from any port to any port. Along with this, or as part of this, each cabinet of hypervisors is its own subnet, its own L2 domain. As I said before, that's restrictive in some ways, but it also gives us advantages, as we'll see. Because this is such a high-bandwidth backplane, it supports our needs for massive east-west as well as north-south traffic. We pump out an awful lot of data, but all that data has gone through processing as it passes from node to node before it gets pumped out, so the fact that we have a very high-bandwidth backplane is quite useful. Also, if we do go to an overlay solution later, and we continue to look into that, we will already have an ideal backplane to put that overlay onto.

So this kind of gives you a picture of what it looks like: everything is more or less connected to everywhere. By the way, there are a number of areas here I am not talking about, problems that we solve, things like how you do network security with this kind of a setup. I'm not going to touch those; we are working on them, but in this talk we're mostly dealing with bandwidth. This also allows us to add greater robustness, because everything is nicely distributed. We use availability zones, as I'm sure a lot of our larger OpenStack users do. Our experience is that things tend to fail along power distribution units, and so we try to spread things between them, so that we can have a power failure and wind up with sufficient capacity to continue.

Of course, there are problems with going layer 3. There is no IP mobility. If you actually need to move a VM from one hypervisor to another, and that hypervisor happens to be in another rack, the IP address has to change. We actually are working on ways of doing that, but we don't want to. Because much of our application stack is written in such a way that it can be terminated and restarted harmlessly, we don't think we'll have to do this much, but there definitely will be cases, especially having to do with maintenance and so on, where we might want to move stuff around. So we are working on ways of doing migration and simultaneously having the VM change its IP if necessary.

This also brings in some complexity, some things OpenStack is not quite set up to deal with yet. Those have to do with things like implementing rack awareness: since each rack is its own subnet, that obviously needs to factor in as IP addresses are assigned to VMs. And there are other issues that come up, and you can imagine what those are.

So this doesn't sound very flexible, and admittedly we've had to give up a lot of flexibility. I think those of you who work for public cloud providers are saying that this would be a nightmare, because after all every VM is precious.
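As a rough illustration of what rack awareness means in practice, here is a minimal sketch of the kind of host-to-rack-to-subnet mapping that placement and IP assignment would need to consult when each cabinet is its own subnet. The data structures, hostnames, and address ranges are hypothetical stand-ins, not Yahoo's actual code or addressing plan.

```python
import ipaddress

# Hypothetical inventory: each rack (cabinet) is its own L2 domain / subnet.
RACK_SUBNETS = {
    "rack-a01": ipaddress.ip_network("10.20.1.0/24"),
    "rack-a02": ipaddress.ip_network("10.20.2.0/24"),
}
HOST_TO_RACK = {
    "hv-0101": "rack-a01",
    "hv-0102": "rack-a01",
    "hv-0201": "rack-a02",
}

def subnet_for_host(hostname: str) -> ipaddress.IPv4Network:
    """Return the only subnet an instance on this hypervisor can draw an IP from."""
    return RACK_SUBNETS[HOST_TO_RACK[hostname]]

def hosts_with_free_ips(used_ips_by_rack: dict) -> list:
    """Toy placement filter: keep hypervisors whose rack subnet still has addresses left."""
    usable = []
    for host, rack in HOST_TO_RACK.items():
        capacity = RACK_SUBNETS[rack].num_addresses - 2  # drop network/broadcast addresses
        if used_ips_by_rack.get(rack, 0) < capacity:
            usable.append(host)
    return usable

print(subnet_for_host("hv-0201"))              # 10.20.2.0/24
print(hosts_with_free_ips({"rack-a01": 254}))  # ['hv-0201'] - rack-a01 is full
```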
Well, we can't work that way. So what we do, and what we have long done, but now integrating OpenStack into the process, is use load balancing everywhere. Obviously no one machine is going to handle practically any given situation, so just about everything, internally and externally, lives behind a VIP, and what we are doing is integrating that existing capability with OpenStack. When you're sitting behind a VIP, your IP is irrelevant, as long as the load balancer knows how to get to you. VMs can come up, VMs can go down, and as long as you coordinate that with your load balancers, from the VIP's perspective nothing has changed. As I said, we already do this, not just for scale but for high availability, and in fact we use load balancers within the OpenStack control plane itself. As I showed back here, the control nodes for OpenStack are also in their own availability zones, and of course we use load balancing to provide that high availability.

The third component is a concept that we call service groups. A service group is essentially a bunch of VMs behind a VIP; that's one way of looking at it. It's a group of VMs that implement some sort of a service, some logical component of our stack. At this point we are implementing them external to OpenStack, because there just isn't quite the leverage within OpenStack to do this, not quite the tools, and also because, admittedly, we have a lot of infrastructure external to OpenStack that we have to integrate with. As I said, 19 years of legacy to deal with here.

So a service group is a group of VMs behind a VIP. They're all running the same application; any one VM could evaporate and things will continue to run. They implement a web service API. We have web services everywhere; all of our components are implemented as web services. We kind of got the REST web service religion a number of years ago, and it has been very helpful in terms of integrating things and making it easy for developers.

So the basic idea is that somebody creates a service group. As we create one of our products, the necessary service groups are set up, along with things like the necessary resources to support them: how much load balancing is going to be required, what's the maximum number of IPs that are going to sit behind the VIP, and so on.
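To give a feel for what coordinating with the load balancers looks like, here is a minimal sketch of register and deregister calls against a load balancer manager's REST API as instances come and go within a service group. The endpoint, payload, and service-group tag are hypothetical stand-ins for illustration, not the actual Yahoo or vendor API.

```python
import requests

# Hypothetical load balancer management endpoint; not a real vendor API.
LB_API = "https://lb-manager.example.internal/api/v1"

def register_member(service_group_tag: str, vm_ip: str, port: int = 80) -> None:
    """Add a newly booted VM to the VIP pool for its service group."""
    requests.post(
        f"{LB_API}/service-groups/{service_group_tag}/members",
        json={"address": vm_ip, "port": port},
        timeout=5,
    ).raise_for_status()

def deregister_member(service_group_tag: str, vm_ip: str) -> None:
    """Remove a VM from the VIP pool as it is shut down."""
    requests.delete(
        f"{LB_API}/service-groups/{service_group_tag}/members/{vm_ip}",
        timeout=5,
    ).raise_for_status()

# Usage, conceptually tied to the instance-creation and instance-deletion hooks:
# register_member("sg-news-frontend", "10.20.2.17")
# deregister_member("sg-news-frontend", "10.20.2.17")
```

Because every consumer addresses the VIP rather than individual instances, nothing upstream has to change when members are swapped in and out this way.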
So we have an external system that takes care of that. It creates a unique tag that's associated with those resources. Then, when we integrate that with OpenStack, what happens is this: a request for resources is made, either by a user (in this case, I guess it would be an administrator for a particular project) or by an application framework that is performing a function such as elasticity, where we're growing and shrinking in the cloud depending on the demands of the moment. Think of something like March Madness for sports, or some big news event, or the holiday rush. Things tend to be cyclical, but sometimes unexpected things happen as well. Elasticity is very important to us, and we're leveraging OpenStack to implement it.

After a request passes through the front end, a call is made to the Nova API on the appropriate cluster. That tag, that unique identifier, is attached to it and passes through as the instances are allocated. At the moment, late in that process, where the network, or the IP address in particular, is selected for that instance, the tag is passed back into the external system. The external system recognizes the service group, and because it has been given the IP address, it's able to inform the load balancers that this VM is coming up, that it's part of this service group, and that it's going to be associated with this VIP. Then the process goes on, and as you'll see, by using this association to control the load balancers, we create a rather seamless way of adding and removing resources as needed.

The way we have done this is with three points of integration with OpenStack, as you've seen. We have our front-end integration, where requests are processed prior to being fed to OpenStack. We have the integration at that one point in instance creation where the IP address is finally known. And the third is when a VM is actually shut down; we also need to inform the system at that point, so the VM is removed from the VIP and so forth. We've patched OpenStack, essentially only in the networking code at instance creation and instance deletion. We've also patched it so that scheduling is subnet aware, so that when an instance is brought up on a particular hypervisor, the pool of IPs available to that hypervisor is the source of the IP.

So, in doing all this, what is our relationship to OpenStack? What are we trying to do here? Well, we're trying to minimize the amount of patching. Some of the things we're doing, such as integrating with Nova Network or Neutron at the point of IP address assignment, are things a lot of people want to do, because they have external systems that need to track VMs and the IPs of VMs. Other people may want to do similar things with load balancing, as we are. So we want to eliminate the amount of patching, maybe provide plugins, maybe provide a tagging mechanism. In fact, we are talking with other folks who have similar issues, who are doing similar things with L3 to the top of rack as we are, using tagging for network selection. There actually is a blueprint out there from eBay, and we're working with them.
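As a rough sketch of what that instance-creation integration point could look like, here is a wrapper around the network allocation call (later in the Q&A the speaker mentions monkey-patching allocate_for_instance) that reports the service-group tag and the newly assigned IP back to an external system, which can then update the load balancers. The endpoint, the metadata key, the signatures, and the traversal of the returned network info are simplified assumptions for illustration, not the actual proprietary patch.

```python
import requests

EXTERNAL_SG_API = "https://sg-manager.example.internal/api/v1"  # hypothetical endpoint

def notify_ip_assigned(service_group_tag, instance_uuid, ip_address):
    """Tell the external service-group system which IP this instance received."""
    requests.post(
        f"{EXTERNAL_SG_API}/service-groups/{service_group_tag}/instances",
        json={"instance": instance_uuid, "ip": ip_address},
        timeout=5,
    ).raise_for_status()

def wrap_allocate_for_instance(original_allocate):
    """Wrap the network API's allocation call so the tag/IP pair is reported back."""
    def allocate_for_instance(context, instance, **kwargs):
        nw_info = original_allocate(context, instance, **kwargs)
        # "service_group_tag" is an assumed metadata key carried in from the front end.
        tag = instance.get("metadata", {}).get("service_group_tag")
        if tag:
            # Simplified traversal of the returned network info model.
            for vif in nw_info:
                for fixed in vif.fixed_ips():
                    notify_ip_assigned(tag, instance["uuid"], fixed["address"])
        return nw_info
    return allocate_for_instance

# The wrapper would be installed by monkey-patching the network API class at startup;
# a matching wrapper on the deallocation path would drive the VIP removal on deletion.
```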
We're talking to other people as well, because as you get to mega scale this kind of L3 breakup is very common. So: contribute; replace a lot of our custom external pieces with as much community code as we can; use things like Heat and perhaps Congress for automation; eventually integrate with LBaaS; and of course continue to share our experiences. One of us will probably be up in front talking to you again next conference, or the conference after that, telling you how this is all going.

There are, of course, a lot of complications that anyone trying to do mega scale encounters. OpenStack clusters don't exist in a vacuum. You have lots of external systems, and frankly it tends to get into politics and other things: you have different organizations that need to work with the OpenStack group. It's quite different from a public cloud, because essentially you really are a service, but you're a service to your own company.

So I've pretty much reached the end, which will give time for questions. For anyone who wants to know more about this, remember the proviso, though: I'm a software guy. If you ask me anything very deep about networking, I'll just shrug my shoulders or something. The takeaway is that there are a lot of unique issues as you grow out to mega scale. You're dealing with things in large aggregations; you're not dealing with one-on-one sorts of things. As I said at the outset, and I consider this to be key, you have to keep pushing towards simplicity, because complexity doesn't scale. But one of the nice things, as we've seen, is that mega scale allows us to create this large backbone, something that you wouldn't do at a smaller scale, and to do it in a cost-effective way. Recovering our flexibility is then a matter of doing better automation and replacing or integrating with external management infrastructure. All right, come to the mic, please.

I'm guessing you found queuing mechanisms are a good conduit for configuration. I'm just wondering which queuing services you considered, for example Rabbit versus ZeroMQ.

We use Rabbit. However, we do recognize that the message queue is one of the big limiting factors in growing a cluster. But we support multiple clusters, and one of the things that our external automation takes care of is the interaction between those clusters.

Hi. Is your smallest layer 3 subnet at the rack level or at the hypervisor level? You mentioned that you have IP mobility within the rack.

Yes, we do, and we employ that: if, say, hardware is showing signs of failing within the rack, we can migrate off that particular hypervisor to another within the rack. But oftentimes, especially when we're doing upgrades, the entire rack is going to have to transition, and in that case, if we happen to have valuable VMs on there, and like I said, we try to avoid that if at all possible, then we would have to re-IP as we move to another cabinet. And that's tricky. We've only just started experimenting with it, because it basically involves shutting down and modifying the image, or some other trick. Live migration at this point we aren't even looking at, but we are looking at migrating VMs and changing their IPs if they contain valuable state.

Within the rack, how are you doing the live migration?
Is it block-level live migration or shared storage?

Yeah, I don't know the details offhand, sorry. The reason that we don't use shared storage, mainly, is that if you were to break out of a VM on the hypervisor, you would actually have access to a shared resource, so effectively you could do nasty things to a lot of hypervisors. So yes, we're not using shared storage.

So when you're migrating within the rack, it's then block-level live migration?

Yes. Okay.

I'm wondering what kind of complexity you have encountered with the overlay experiment. Can you elaborate in detail which specific areas you would like to see improved in those overlay networks to meet your mega scale needs?

You'll have to speak up; I have a very loud air conditioner vent right above me.

The question is about the specifics you have encountered when experimenting with overlay networking at your mega scale, and what aspects you would like to see improved, in OpenStack or in the overlay networks.

Oh, you mentioned overlay networks. Well, one of the issues with overlays is that once you reach a particular size you start running into problems, and by size I am talking about the kind of mega scale we're discussing. When we ask vendors about supporting 1,000 IPs at this kind of bandwidth, the answer is, well, they're not quite there yet. I want to make clear we're not rejecting overlay as a solution; we're just saying that right now it would limit our scale. There are also issues with behavior if a network is partitioned: can it recover from that, and various other questions. We want to see a little more experience in the community with overlays and with their characteristics before we try to scale them as big as we would need them. But I think they're getting there, at some point in the future.

Yes. So here comes a network question.

Okay, I'll try.

Have you looked at, or do you have any interest in, dynamic route injection out of the VMs into the network to move IPs around?

Explain to me what you mean by dynamic route injection and I might be able to answer your question.

You mentioned that moving a VM from one rack to another requires you to re-IP the VM because it lives in a different layer 3 domain. Routing protocol injection would allow you to move that IP address anywhere in the data center without re-IPing anything. You just have to get a gateway to, ideally probably using BGP, inject the routes into the network.

Yeah, okay. I think one of the problems with that is it adds a lot of state and complexity to the network. It's a solution, but it's a solution to a problem we're trying to avoid, so we haven't really looked into it, I don't think, other than to say we should talk later. Okay, ask me next time.

What networking service are you using, Nova Network or Neutron?

We are using Nova Network. Our experience, and our read on the community, is that Neutron does not yet scale sufficiently, and there may be HA issues. Under stress, Neutron has a tendency to lock up, we have heard; we've only done a modest amount of experimentation with Neutron ourselves. We know it's the future; we hope they get there, and we would like them to get there quickly. But right now we're integrated with Nova Network.

Are you able to talk a little bit more about the patches that you use? You said you patch OpenStack. What services are you patching? Are you patching the Nova scheduler, and...
Are those patches publicly available on GitHub? Are you contributing those patches back to the community?

The way we're currently doing it is that we've monkey-patched the network API, so allocate_for_instance, if you're familiar with the code, and its counterpart on instance deletion. Essentially those touch our infrastructure: they call out to our infrastructure, inform it of what's going on, get a read back from the infrastructure, and then the instance creation is completed from there. Those patches are proprietary at this point. We would like to have an architecture that allows us to plug in that sort of thing and just not do monkey patching or any of that kind of stuff, but right now that is the main change that we've made. It's actually a relatively simple patch; it calls out to a bunch of our own Python code from those two functions.

Okay, we are just about done, I think. Maybe one more question. All right, oh, okay, he's just sitting down. Okay. All right. Thank you.