Okay, everyone, welcome. Welcome to the seminar for this year; I think this is the 15th year we are having a seminar. I'm Guru Parulkar. I serve as the Executive Director of ONRC. And again, welcome to the first seminar of the academic year. It's my pleasure and honor that we have JR giving the first talk of the year. JR is one of the few people whose reputation I got to know first; only after several years did I finally get to meet him. He has had a reputation for designing the best switching systems for many, many years. That is how I got to hear his name, and then finally got to meet him. He has been in the networking space from the very beginning, and he has built some of the most popular switching systems, or switches, that we use in the industry. And now he's up to something even more exciting, which he will tell us about. So, welcome.

Thanks, Guru. Actually, what I came to talk about today is a little bit less about what we're doing at Cumulus and a little bit more about what's going on in the industry around networking. I'm sure you guys don't really want to hear product pitches; you're more interested in things that you might either be able to leverage or be exposed to. Clearly, most of you know that Stanford has been involved in a lot of research around what I call high-fidelity networking, and there are use cases for high-fidelity networking. But in addition, there are a lot of use cases for what we call bulk networking, where you set up a high-capacity IP fabric and you want it to be easy to deploy, easy to manage, easy to use. I've been involved in a lot of those; a lot of the MSDCs, what we call the Massive Scale Data Centers, deploy this type of fabric. As opposed to something super high fidelity, it's kind of a chaotic environment: they set up a big high-capacity network, and they just want to let it roll and get out of the way. So what I want to do today is talk about some of the things we've seen as we've been deploying these fabrics, and some of the technologies that are evolving in the industry to help make it easier. Some of them are things we're championing or leading; some are things we're doing along with others in the industry. So you'll start being exposed to it. But the fundamental question to ask yourself is this: if you're a sysadmin and your responsibility is to get that Hadoop cluster up tomorrow, how do I deploy 100 switches in 5 minutes? Because I don't have time; I've got to get Hadoop up and running. I don't want to worry about my network. I just want it there, done, monitored, right? So that's really what we're going to cover today: what does it take to pull that off? There are two small pieces of overview we'll go through first. One is what I mean by a high-capacity fabric. The second is a little bit about Cumulus and Cumulus Linux, so you understand where I'm coming from. Then we'll get down into the technical details. Yes?

Sorry, if I can just make a snide comment: my project is the Mininet network emulator, which is a network virtualization system, and I would say, well, use Mininet. You just say "Mininet, start a 100-switch network" and it starts up.

Exactly. You're pretty much right on point. Actually, it's a good point; we're going to get to that down the stretch, so thanks for bringing it up. Don't be sorry, it's completely on point.
The type of fabrics that people are building for things like Hadoop clusters, large data warehousing, three-tier applications, the modern cloud-oriented data centers, even the app-oriented data centers like a Google or an Amazon, are what we call leaf-spine or fat-tree networks; they are starting to build those out, or have been building them out for a long time. I'm presuming most of you guys know how that works, so this is the picture; if you want more details, feel free to talk to me afterwards.

With apologies, even though I used to work at Arista, which talks about leaf-spine all the time: can you explain what that actually means?

Yeah. I think the term is kind of arbitrary; honestly, I think it's kind of stupid. It's a folded Clos network, really. You take the classic Clos telephone network, with its X number of stages to get a full-bisection-bandwidth fabric, and you cut it in half and fold it over, and you're done. That's really what people do. For some reason (actually, I think a lot of it is a Cisco-ism, in that Cisco always has these terms for the layers in the data center so they can talk about all their products that fit into a particular layer) the terms leaf and spine got invented so I can say this is a leaf switch or this is a spine switch, or these are the things that leaf switches do and these are the things that spine switches do. Realistically, what I've found is... oops, am I going the wrong way? Ah, thank you. Realistically, what I've found is that in a modern data center there's very little difference between what a leaf and a spine do in terms of function, and very little difference in which physical platform can work in either job. It's completely scale dependent. We have customers building leaf-spine architectures using modular systems in the leaf and modular systems in the spine. We have customers using pizza-box switches in the leaf and pizza-box switches in the spine, building five-tier folded Clos networks, and it works out fine. It really comes down to what they want to build and why they want to do it.

So, a little bit of background on Cumulus; I'm going to try to get through this part really fast. Is it animated? Slightly. Pardon? I don't think so; I never rehearse my slides, I just push the buttons. I'm sure a lot of you are familiar, and Nick has been beating it into a lot of your heads, with the fact that traditional networking gear is pretty much locked up as badly as Fort Knox. What we realized at Cumulus is that networking silicon, or not just the silicon but the hardware, starts to look very similar across the board. And in the context of similar-looking networking hardware, you recognize that the world has seen that type of transition before; that's what gave rise to the modern server operating system, like Windows Server or any Linux distribution. So in the Cumulus context, we build a Linux distribution specifically targeting networking-oriented hardware. You said you worked at Arista, so: it's much like EOS. EOS has characteristics that Cumulus Linux doesn't have and vice versa, but we also have a lot of overlap.
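To make the folded-Clos picture concrete, here is a hedged sketch of my own (not a slide from the talk) of a two-tier leaf-spine fabric, written in graphviz dot notation, the same graph format the talk comes back to later for topology files. Every leaf has one link to every spine, so any leaf-to-leaf path is two hops:

    /* Two-tier leaf-spine (folded Clos): every leaf links to every spine. */
    graph leafspine {
        spine0 -- leaf0;  spine0 -- leaf1;  spine0 -- leaf2;
        spine1 -- leaf0;  spine1 -- leaf1;  spine1 -- leaf2;
    }

As a standard capacity count (again mine, not a claim from the talk): with k-port switches, each leaf spends k/2 ports on hosts and k/2 on uplinks, so you need k/2 spines, each spine can feed k leaves, and a non-blocking two-tier fabric tops out at about k²/2 host ports.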
When we talk about networking-oriented hardware: this is a picture of a piece of networking hardware. We have a customer that had 15 vendors (incumbents, ODMs, everybody) bring their hardware in; they took the tops of the boxes off, laid them all out, and recognized: I don't know whose is whose. This stuff all looks the same. So we have a few hardware suppliers we're working with now, and we're constantly bringing more online. What you're seeing is that the silicon underneath, from a Broadcom or a Mellanox or an Intel, falls into the same vein as an x86 server CPU platform. A lot of the IP is embedded in that silicon, and different people can take that silicon, put it on a board, and get it to market using different cost structures, different market geographies, and different levels of reliability.

The last thing about Cumulus before we get to the cool stuff: as we go through and talk about all this, we literally are Linux. The architecture of the software is that the kernel maintains the networking state, just like it does on a server platform, and we synchronize that kernel state with the switching silicon. That allows all the user-space apps to work, effectively untouched, on top of it. That's super powerful for us. When a customer comes and says, hey, I want to use a different routing protocol, we're able to turn it on and it runs. If they want to use a different configuration management engine, or a monitoring engine, any of these things, they can deploy it immediately on top of this platform. We spend a lot of time with the Linux kernel community making sure the kernel advances in a way that's interesting for the broader networking market, not just for server-specific platforms.

So, now on to installing, or deploying, high-capacity fabrics: your 100 switches in five minutes. The first thing that you... wait, something happened; I think I deleted a slide by accident. Oh well, sorry, let's start from here. What happens at some point in time is that these switches show up in a cardboard box, you take them out, you put them into a rack, and you have to do something with them. At that point, the very first thing you need to do is make sure they have the right image on them. In the context of an open networking environment, or ecosystem, we've been working with a bunch of what I'll call the independent software companies in the market (Pica8, Big Switch, Facebook's OCP) to develop an install environment, ONIE, the Open Network Install Environment, that ships from the factory on networking hardware. What it does is take the management system that lives on there and give it an environment to wake up, go find the networking OS that it wants, and install it. It's completely networking-OS agnostic: it can install Cumulus Linux, or it can install any other network OS if you want it to. All the hardware provides is that install environment. It's the moral equivalent of iPXE on a server platform. This picture is kind of confusing: a server has apps, an operating system, and a BIOS; ONIE only lives here as a kind of install mechanism. What makes it slightly interesting is that it uses all of the modern awesomeness we've learned from deploying servers at scale. It doesn't use TFTP, it doesn't use any of those mechanisms; it uses HTTP. It can be credentialed. There are interesting discovery waterfalls. You can run pre- and post-install scripts.
All the types of things you expect from a modern installer.

So you had to build that install environment to differentiate from Cisco?

Yes. Cisco's buttoned up tighter than anything else. You know, if you look at it from a customer's perspective, or actually the ecosystem's perspective: a piece of hardware, like we said, comes in a box; you open the box, you put it in a rack. Somehow or another it got shipped to you, which means someone had to manufacture that piece of hardware, someone had to stock it and distribute it, and a reseller had to resell it. And if you look at the economies of scale of hardware (starting from the server environment, focusing there), you recognize it's really easy for someone to manufacture something and get it into a distributor when a lot of different people can use it for a lot of different purposes. So having that install environment is super powerful. Our hardware partners obviously support us, but they also support Big Switch, and they're also supporting Pica8. It opens up that ecosystem around the hardware supply chain, and it allows the customer to do what you'd expect: you unwrap the switch, you plug something into the management port, you power on the system, and then that install environment goes off and discovers the network operating system. I put Cumulus Linux on the slide, but it could be anything else. It installs that network operating system and then starts the provisioning steps.

How do you bootstrap the management network? You can have 100 switches out of the box, but you don't have the management network.

Yeah, it's an interesting premise. Our experience right now is that most people really do still keep that management network, and they come in and build it up separately as they go. That's generally how it gets done.

The recursion has to end somewhere.

Yeah, exactly: the classic tidal wave problem. Did I miss a slide in there? Okay. After we go through that network install environment, what we typically do is have a post-install script that installs something interesting like Puppet or Chef, and from there we use that to automate what it takes to bring the system into the network. It will set up credentialing, monitoring, configuration management, all of those pieces. Let me just double-check here... yeah, here we go: some examples of what this ends up looking like. This is one of the simple configurations; this is a DHCP file. I don't know how many of you have ever set up DHCP, but inevitably you just go through and set a couple of options in the DHCP configuration file. Those get picked up by the installer when it comes up, and after it does its base operations, it knows to go off to the next step and find the provisioning level. There it would typically do something pretty straightforward, like install Puppet from whatever the backing repository is, set up some host information, and then restart the Puppet agent and let the Puppet agent take off from there. The Puppet agent will then take care of things like adding and deleting users, setting up authentication and audit privileges, license management: all the back-end processes that are necessary to get the equipment into a network and running.
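To make that concrete, here is a hedged reconstruction (mine, not the actual slide) of the two pieces just described. First, the couple of DHCP options: in ISC dhcpd syntax, one common way to hand the installer a URL is the default-url option; the exact option ONIE consumes depends on its discovery waterfall, and the addresses and filename here are made up:

    # dhcpd.conf sketch: point freshly racked switches at an installer image.
    option default-url code 114 = text;

    subnet 192.0.2.0 netmask 255.255.255.0 {
      range 192.0.2.100 192.0.2.200;
      # fetched over HTTP, not TFTP
      option default-url "http://192.0.2.10/onie-installer-x86_64.bin";
    }

Second, a post-install script in the spirit of the one described (install Puppet from the backing repository, set up some host information, restart the agent), where the hostname line is a stand-in for however a site maps switches to names:

    #!/bin/sh
    # Post-install hook: bootstrap Puppet and let it take over.
    apt-get update
    apt-get install -y puppet
    echo "leaf0.pod1.example.com" > /etc/hostname    # made-up host info
    service puppet restart
    # From here Puppet handles users, auth, monitoring, licenses, etc.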
Have you guys had much experience with any of these configuration management tools? Pardon? Kickstart, PXE boot, Puppet, Chef, CFEngine. So are you guys using those? Are you provisioning servers with those?

Yeah. At the moment we're using them to, for instance, provision a whole cluster of hypervisors, and to install additional software packages for people to customize it.

Do you ever do tests on just powering on the cluster, to see how long it takes to provision a set of hypervisors?

Actually, it doesn't take too long. Most of the time is in, for instance, formatting the file system; that's usually where most of the time goes. Other than that, it's a pretty light install that you need just for the basics, and that happens within, like, 50 minutes.

Right, and that's what we're seeing. Right now we're talking about the part where a piece of hardware gets basically set up. All of the larger customers that we're dealing with, even the middle-sized customers, are automating that process right now, and they get to the point where they can go into a new pod, whatever it is at a certain scale, turn these things on, and bring the network up within a half hour at most. Which is pretty impressive, because it hasn't always been like that. Let me move this out of the way. Okay.

There's another piece. If you know the classic configuration management paradigm in the context of a server, what you typically get to is that every server is the same, so you basically give them all the same files. If it's an Apache server, they all hopefully get the same Apache configuration files. It's reasonably rudimentary. Networking devices, in the past, have often been unique. I'll pick on Cisco, since I don't think anybody from Cisco is here. The classic model, and some of you may actually have lived this, is that people generate these massive Perl hairballs that generate config files for a fabric of switches. They do that to set up bridging, and they do it to set up routing. Those configuration hairballs come from a couple of different dimensions. One of them is that routing protocols and routing protocol configuration are a pretty interesting dichotomy. On one side, routing protocols are great: they're really good at discovering things, finding out when things are broken, and reconfiguring themselves around that. Unfortunately, in the configuration of routers, we as humans have added a little nuance: we also use the routing protocol to check the topology for us. So we do really kind of dumb things, where on a high port-count router we'll say: here's a small subnet, I'm supposed to connect to the other person on the other end of that subnet, and I'm going to run something really nice like OSPF that can discover things and bring them up really simply. But I'm going to do it over that little subnet, because having that subnet lets me check that the person on the other end agrees that the subnet is supposed to be on whatever link I'm connected to. That way I've proved that the two of us are in sync with each other on our configuration. And that's just a nightmare, especially when you start deploying fabrics at scale and you want to get stuff up and running in a very small amount of time, for two reasons. One is that you have to get all that configuration correct, and it's phenomenally complex; a sketch of that per-link style follows below.
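For flavor, a hedged Quagga-style sketch (my own, not from the talk) of the classic per-link configuration those Perl hairballs exist to generate. Every point-to-point fabric link gets its own tiny subnet, both ends have to agree on it exactly, and this repeats for every link in the fabric:

    ! Classic per-link OSPF config: one /30 per point-to-point link.
    interface swp1
     ip address 10.1.0.1/30     ! peer must be 10.1.0.2/30, or no adjacency
    interface swp2
     ip address 10.1.0.5/30     ! next link, next /30, and so on
    router ospf
     network 10.1.0.0/30 area 0
     network 10.1.0.4/30 area 0

Get any one of those numbers wrong on either end and the adjacency simply never forms.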
And the second problem is that when something is wrong, it's typically not reported back in a very obvious way. I don't know how many of you guys are router heads, but if you look at any of the interesting routing protocols, especially the ones used in the data center, OSPF and BGP, there are ways to make them run so they're effectively plug and play. A great example is OSPF with unnumbered interfaces. You can take OSPF routers, configure them with unnumbered interfaces on point-to-point links, and plug them all together; they discover each other and come up instantaneously. It's really sweet: just like a bridge, but you get a nice IP fabric. So, really awesome, except for one tiny little problem: what if I wasn't supposed to be connected to David? That's not good. How do I make that work? How do I deal with that situation? Because the old complex way had one benefit: I could test that I got what I expected out of my network.

Excuse me: there was decade-or-so-old work in the IETF to extract asserting adjacency, and maintaining it over time, from the routing protocol, so that you don't have to do it within ISIS and OSPF and yada yada. Did that ever go anywhere?

No, nothing with much force. It never went anywhere.

When you stand back and look at it, what the customer inevitably has is a topology graph. They know: this is what my topology is supposed to look like; this is my wiring diagram. How do I make sure my wiring diagram is correct? Once my wiring diagram is correct, I'm willing to let everything go off and run. And what you'll find is that every major data center that I know of (and there are a lot of them) has some piece of homegrown or partially homegrown technology where they statically check the wiring map as part of the provisioning steps of their network. One concrete example, and I can't give you their name: they wire up all the switches, then go through on every switch and check all the LLDP neighbors to see if everything is what they expected it to be. If the LLDP neighbors are correct, they go through and download configurations onto those devices and start enabling them. If the LLDP neighbors aren't correct, they raise an alert and someone goes and figures out what got wired incorrectly. There are a couple of fundamental problems with that. A: it's a 26-state state machine across hundreds of switches at a time, which you can imagine is rather error prone at best. And B: that happens at build time, when they're building out a pod. If someone later comes along and changes a link here or there, that whole mechanism is now for naught; it doesn't exist anymore. Others run it periodically, checking every five minutes or whatever, but it's still typically homegrown technology.

We recognized both of those things: A, the routing protocol can be super easy to configure, and B, most people want to check against a cabling plan, a well-known topology map. So we put together a piece of technology called the Prescriptive Topology Manager, or Module. It's actually out on GitHub, so it's an open source project, as is that network install environment we talked about. So if it's interesting, you can use it anywhere; it's not tied to our equipment. And if you're interested, you can contribute to it. And it does a really simple thing.
You build a topology graph: if you have this topology, then, say, switch one, port one is supposed to be connected to M1, port three. You draw that topology graph, and the syntax we chose is the dot format. Are you guys familiar with the dot format? Some are nodding yes, some are not. It's a really cool format in that it's graph-oriented, and there are tons of visualization and editing tools you can use to make your job a lot easier. It folds into almost all of the network graphing tools, Microsoft Visio included; it's very ubiquitous. Anyway, you take this dot file and you put it on every device in your topology, so every one of these networking devices gets the same dot file, which is really cool. Then, every time a link changes state from down to up, we go off and check the neighbor against the graph. If I'm M1 and port three transitions to link up, I go check my graph and ask: hey, who's supposed to be on the other end of this? If it's S1, as expected, I take a certain set of actions, the thumbs-up actions. If it says it's not S1, I take another set of actions, like call David and say: David, come here and fix this cabling for me. And that, in the end, is what people are looking for in testing out their topologies. I presume most of you are familiar with ifupdown; this is one of the dynamic ifupdown mechanisms, based on link transitions.

I've got a little demonstration. Slightly hokey, but still useful. Okay. In the demonstration I have four switches connected in a basic folded Clos topology: spine0, spine1, leaf0, leaf1. And this is the topology file as expressed: you have the device and its interface, and what other device and interface it's connected to. In our demo we use that same topology file, and we have a set of actions: one set says if you fail, write out a failure message; and if it's the correct topology, write out a passing message. So let's see here. In theory, over here, if I did everything correctly, the log file for this stuff... and we should have seen it already; right there, we have the log message that this device is not connected to the right switch. Looks like I messed up the host name in my switch. But it reports that the connection was not correct: it was expected to be connected to a particular interface on a particular host, and it got a different host and interface combination than expected. Why that happened, I don't know, but it did. So let's come down here and set another one up; we wait with bated breath, and we should get another log message here. There we go. And this one also errored, because, as I was telling you, I put this demo together starting at about 11, sorry. But it's kind of a leading indicator of how easy this kind of stuff is: when the CEO can put something together like this in half an hour, somebody who actually knows what they're doing can get it done much faster and a lot more robustly.
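For reference, a hedged reconstruction of what the demo's files plausibly look like; these are my guesses at the details, not copies of the slides. The prescribed topology is a dot graph with one edge per cable, and the same file goes on all four switches:

    /* topology.dot -- the prescribed wiring for the 4-switch demo */
    graph demo {
        "spine0":"swp1" -- "leaf0":"swp1";
        "spine0":"swp2" -- "leaf1":"swp1";
        "spine1":"swp1" -- "leaf0":"swp2";
        "spine1":"swp2" -- "leaf1":"swp2";
    }

And the pass/fail actions could be as simple as a link-up hook along these lines, where ptm_check is a hypothetical stand-in for however you query PTM's verdict on an interface:

    #!/bin/sh
    # Hypothetical link-transition hook: log pass/fail like the demo does.
    if ptm_check "$IFACE"; then
        logger "PTM pass: $IFACE is cabled as prescribed"
    else
        logger "PTM fail: $IFACE has an unexpected LLDP neighbor"
    fi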
Anyway: if I understand you correctly, you brought some links up, and this script ran before it actually enabled connectivity on the link; it validated whether connectivity was correct, flagged an error, and therefore that link is not actually brought in. Is that correct?

It's brought up from the kernel's perspective, but it's not put into the topology. Typically, with those scripts I showed you, at that point customers will do things like: there's carrier on the link, maybe there are some packets, but it's not in the routing; it's not routed.

Which means it solves your problem of: I don't want X to talk to Y, but I also want to be able to discover that.

Exactly. And you can imagine, this was a super simple description of what you do on the pass and the fail, but you're right on point. In the more complex cases, you enable routing protocols if it's correct; if it's wrong, you go off and manage your alerting system, which might be something in syslog, or oftentimes people will put together a totally separate collection tree for bad events and set up notification around that directly.

When you set it up initially, does it have a separate control plane?

No, it just uses this; it's a standard provisioned text file. So there's no central control plane; it's all distributed, like you'd expect out of a routing-protocol-oriented system.

Okay, I guess we've got some stuff I can talk to. I was hoping to put some things in the demonstration around monitoring and troubleshooting. Generally, the last big trend that we're seeing, and I'd imagine you guys probably leverage this pretty regularly, is around monitoring. The tool that people have used for quite some time, and still use pretty extensively, is SNMP for network monitoring. It's problematic in a couple of different dimensions. One of them is that its poll times are exceptionally slow. The other part is that it has really high overhead across the system: the protocol itself is high overhead, and all of the agents, the servers, and the clients are very high overhead. What our customers are trying to do, and I think the customers of the other, more modern networking vendors too, is move toward push-based protocols: take statistics and monitoring information off of the networking devices and push them back to standard collectors that are centralized. One of the big advantages that gives you (actually, there are a ton of advantages, but one that seems innocuous and is actually pretty cool) is that you can separate things like environmentals from things like performance. Call them physical things: is this box too hot, are the power supplies broken, did a link go down; versus, do I have a congestion event where I'm dropping a lot of frames on an interface. Those all go to different people; the same person doesn't manage each of those problems in a large data center. So what you're able to do is set up a collection instance for each of those separate pieces and push it out to a monitoring system that's tuned for that particular user, and they're able to make progress and move forward. So it becomes really powerful, and you can use your monitoring framework to help drive your organizational efficiency.
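Since the switch is literally Linux, that push model can be as plain as a periodic script that scrapes kernel counters and sends them off-box. A minimal sketch, assuming a Graphite-style plaintext collector; the collector hostname and port, the metric naming, and the interface list are all made up, and environmentals would push to a different collector the same way:

    #!/usr/bin/env python3
    # Push interface counters from /sys/class/net to a performance collector.
    import socket, time

    PERF_COLLECTOR = ("perf-collector.example.com", 2003)   # hypothetical

    def counter(iface, name):
        with open("/sys/class/net/%s/statistics/%s" % (iface, name)) as f:
            return int(f.read())

    def push(lines, addr):
        with socket.create_connection(addr, timeout=5) as s:
            s.sendall("".join(lines).encode())

    now, host, lines = int(time.time()), socket.gethostname(), []
    for iface in ("swp1", "swp2"):                 # fabric-facing ports
        for c in ("rx_bytes", "tx_bytes", "rx_dropped"):
            lines.append("fabric.%s.%s.%s %d %d\n"
                         % (host, iface, c, counter(iface, c), now))
    push(lines, PERF_COLLECTOR)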
So that's all I have. Any questions?

In that graph file, can you use regular expressions? Like, a leaf should connect to a spine, and I don't care which port of the spine goes to which port, but these are my rough rules for my topology. Can I express things like that, and do you see value in that?

So, we've talked about it a lot; I don't know if it's implemented right now, but that has definitely been discussed. You want to be able to wildcard different pieces, and not just wildcard, but put regular expressions in there. There's no fundamental reason not to do that. Again, it's an open source system, so you can bring it up yourself; it's not like you have to trudge through a vendor or something like that.

JR, what do you think about pushing open source?

So, most of what we do is open source. We talked about this open network installer; that's an open source project. I went through it really fast, but there's a GitHub repository for it that you can get to. The Prescriptive Topology Manager is also open source. And we work with Bird, Quagga, the kernel, the spanning tree daemons; we're doing work with all these communities, and our fundamental theory is that you push all that stuff upstream.

I have two questions. The first question is: do you see this being deployed in a carrier environment, or a data center environment, or both? And what are the implications with respect to network security?

On your network security question: that's always one where I could end up down a rathole. When you say the implications for network security, what do you mean? It's an interesting argument, and I've heard people go either way. The advantage is that if there's an exploit, it will get fixed. If you watch the Linux security updates occurring in the kernel, they're constant, and that's not because they've all been exploited: someone found it, they fixed it, it's done. So you're kind of getting past all of that. The counter-argument is that there's a really large networking company that lives down the street that is super secretive about all of their security exploits. It'll happen that some customers get it fixed immediately, because they found it, and there are other customers just trudging along with all these exploits lying around, with no transparency as to what to expect.

And my first question was about whether this has been deployed in a carrier environment.

It can be. You know, Telco is always a tough one. We're working with some Telcos on deploying this for data-center-type needs, for CDN infrastructure, and some of them want to do some cloud, over-the-top services, so we're working with them in that context. As for figuring out how to use it for, like, their main switching fabric, I'd say that's a ways out. There's no fundamental reason why not, over time, but it's not quite ready yet. We're doing work on MPLS, bringing that up to snuff as well, and I think once we get there it'll start to get more interesting.

When people deploy your automation and configuration, are there some interesting screw-ups, where it all goes wrong?

I mean, inevitably, when you do deployments, things go weird. You get this ticket that says: hey, I can't ping from this host to this host, what the hell's going on? Realistically, so far, most of those are cases where we've gone through and found that they've just misconfigured their hosts. Dumb things: if you guys are familiar with deploying servers, they inevitably have some sort of an IPMI interface on them, and those get IP addresses. Even though the server itself might be a Linux platform that you know how to automate around, the IPMI platform, like a Dell DRAC, is kind of Linux-y but also kind of embedded-OS-y, so it doesn't follow a lot of these paradigms. So it's really easy to mess up
and misconfigure those things. I haven't really had anybody... yeah, I mean, inevitably you have the opportunity for human error, but I haven't seen any human error go super haywire.

What assigns names to the switches?

Inevitably, it's part of the provisioning process; that's when it gets its name.

So ONIE does that?

Well, ONIE itself is the install environment. Like I said, it can install Cumulus Linux, or it can install another OS; it goes through and does that install process, and then it also allows for a post-install step. That post-install will do things that can be as simple as statically assigning an IP address or a host name, or running an install, like I showed earlier: install Puppet, and let Puppet come in and do all of this.

To follow up on that, it's kind of an interesting problem that you've posed here. I have a bunch of switches; presumably I wire them all up correctly; and then, when they boot, it seems to me that each switch has to determine which switch it is and where it actually sits. It's got the topology spec, but it doesn't know which switch it is or what level it's at. That must be part of the provisioning. They all boot off a single image, and when they come up they don't know where they are, what they're connected to, or what they should call themselves.

And therein lies the one place where you actually have to know what you're doing. What we've typically seen is that people have some mechanism. We have an extension to DHCP, for instance, where you can run DHCP on a switch, plug a port into it, and based on which port you're plugged into, you'll always get the same IP address and host name out of it. You talk about waterfalling: if you really want to go to the depth of waterfalling the management network out of all of this, it's actually pretty straightforward, and it bootstraps itself in exactly the same tidal-wave process, going forward from one back-end instance, with every device downstream given its name based on what it's plugged into. So you can go through it, and it works out that way.

Okay, so you can name these things depending on where they're plugged into the management network. But I want to tell you where they belong in the topology. What do you need to know for the topology file when it breaks?

It's not just when it breaks. Just to do the test, I need to know: am I S1, or am I not S1?

The interesting thing is, I think what you end up with is a graph matching problem: you end up taking the topology graph of what's actually installed and the dot file of what should be installed, and you have to match those together. It's an interesting algorithmic problem, which you have to solve somehow, and it's certainly eminently soluble as long as you haven't totally messed things up. You basically say: well, it's graph on graph; okay, this part of your graph matches, we'll assume the matching parts are right, and then, based on that, these are all the errors.

Right. And in a normal deployment scenario, people have an asset database. There has to be an anchor somewhere: a serial number, and you can easily go from there, or a MAC address in your DHCP setup, because you associate a certain configuration with that MAC address.
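The per-switch version of that check, which is all the prescriptive approach needs once each box knows its own name, is much simpler than whole-fabric graph matching: diff the prescribed neighbor table against what LLDP actually observed. A minimal sketch with made-up data shapes (the real ptmd consumes the dot file and LLDP directly):

    #!/usr/bin/env python3
    # Compare prescribed wiring for this switch against observed LLDP peers.
    prescribed = {"swp1": ("spine0", "swp1"),    # local port -> (peer, port)
                  "swp2": ("spine1", "swp1")}
    observed   = {"swp1": ("spine0", "swp1"),
                  "swp2": ("spine0", "swp2")}    # oops: cabled to wrong spine

    for port, want in sorted(prescribed.items()):
        got = observed.get(port)
        if got == want:
            print("pass: %s wired as prescribed" % port)
        else:
            print("fail: %s expected %s, saw %s" % (port, want, got))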
That's painful, though: you have to know the MAC address, and they have to take some action.

It actually can be much more powerful when you get into it. As part of the ONIE bring-up, it passes a whole bunch of system information back in its initial requests, things like: this is my serial number, this is my manufacturer. The information it can glean about itself, it passes back. So we have a demo where we go through it completely based on serial numbers, exactly what you said: this serial number is supposed to be this device, and the whole thing works forward based on that.

But that's terrible.

Different strokes for different folks. Some people want to do it that way; some want to scan the MAC address and do it that way. It really depends on who you are and what you want to do.

If I'm Facebook, I probably don't want to do that. I probably don't care about the serial number; I probably just want to replace the device in the rack if it has a red light on it.

I can't speak for Facebook, but I can speak for a couple of others that disagree with you. Like I said, a lot of the big guys do things that you wouldn't expect them to do. You might not want to do it, but accounting wants to do it.

If you want to match that kind of topology, there are two ways I can look at it: I can look at it this way, and I can turn it around, and in both cases you'll get the same result for the topology. So if I want to know which specific switch that is, I have to have some kind of an ID.

You can just pick one; either one works.

So, even at an order-of-magnitude level, have you established what you still need to do by hand for the management network to come up, so that you can bootstrap all the rest off of that? The management network bootstraps in a tidal way, exactly the same process, but you still have to have these keys: it's either based on this guy's MAC address or IP address, or something that says I have a different config file for Peter versus Paul, or for the position of where Peter is versus the position of where Paul is. What are the precursors you need in order for the whole machinery that you've opened up to bring the production network up with nearly zero touch? It'd be interesting to see what somebody has to do without which you can't even bootstrap, to get some idea of whether that's a small fraction of the whole old way of doing it, versus tough no matter what.

I hear what you're saying. We don't have that; I haven't studied it; it's an interesting question to ask. What we know anecdotally is that it's substantially less than half. Some of our customers are pretty secretive as well, but we know, just from the head count, the operational head count that they use to deploy these things, how much better it is.

So how about Quagga and Bird? You said people can install them. Are they now matured enough that people are using them in a very scalable production environment? And the ones you can download, do people customize them?

People use both of them, Quagga and Bird, pretty extensively in different environments.

I'm talking about the scalability of your high-capacity fabric: how does this compare to people doing an SDN centralized controller?

I'm not going to pick that fight, sorry; that's a different presentation than today's.

Regarding your tidal wave: do you see this changing how people structure their network teams? Do you really think it's feasible to have a production
network where some guy with a cart just plugs in a switch, and it self-validates into production?

I mean, if you talk to Arista people, I'm sure they're seeing it in certain places with zero-touch provisioning; we're working on it really hard.

For us, there's a case where a switch fails somewhere on the network. Instead of having one of the networking people come in and replace it, we basically just tell people, even though they're not that experienced in networking: go pick this piece of hardware off the rack, plug it in there, change this IP address or this MAC address, and just boot it, and all the rest gets handled.

And what we're seeing is that in that last step, you don't even configure anything. They send someone in, they swap it out, everything's cabled back the way it used to be, and it all comes up. And if it's not right, they send people to go figure out what that person plugged in wrong. You know what I'm saying?

So when you want to upgrade a device, do you have to go through ONIE again and start the whole process over?

No, no. Well, you can; some people like that premise of going through it and saying, I want to reinstall everything completely from scratch. But it's a Linux platform, so there are a lot of different ways to upgrade; apt-get works on these things. We also kind of maintain the premise that most networking devices have, of having golden image slots, so you can put a new golden image in a golden image slot and then reboot. So we try to keep this as flexible as possible. And all the technologies we've talked about right here are not specific to our OS; they're leverageable and usable by any OS.

So, your original question was a 100-switch network in 5 minutes. My take is that maybe you've solved the easy problem, but I wonder about what I consider the hard problem, which is actually plugging the things in: getting them all powered, actually plugging in all the cables. Now, I've seen some really new things; at ONS they had an OpenFlow-controlled robot which went down into the racks and actually plugged in optical cables, and used knot theory to make sure they didn't all tangle themselves. So my question is: have you tried to address the problem of actually wiring the stuff up, which would be really hard and take many hours?

No, we haven't. The reason why we haven't is, let's go all the way back to here: when we look at a reasonably modern distribution, one of the reasons we chose this kind of model for our business is that people want to buy that thing, that whole thing, pre-set-up from a supplier, whether it's Dell or Quanta or Accton or Supermicro or whoever. Those racks still have to be plugged in somehow, and that's probably someone going around and plugging things in before the thing is delivered. Synnex's Hyve, they do all this stuff for OCP and Facebook; they do it for those kinds of people. You can just go through the list, and that's how people are buying these things.

I guess there aren't that many connections to plug in when you get a new rack; you have to do that right at this level here.

Right, this is kind of the top-of-rack switch at this level. Someone had to wire that stuff up through a cable tray or something; it's not like it's some loose cable lying around. So the number of errors starts to fall off as you go.

So, we have time for one last question.

You spoke about 100 switches. How do you keep it going past that, for operations? Can you speak about
diagnostic tools past that?

Yeah. So, I talked about monitoring; it's taken the industry kind of too long to get going on that. The monitoring architecture that people are moving to is what I call parallel frameworks. I want to create one framework to monitor physical things: power supply status, environmentals. I want to create another one for things like links, or maybe optical transceiver power. What you end up with, on modern networking equipment, is people putting together a management framework that makes sense for each discipline that has to troubleshoot that set of problems, and pushing that stuff back into collectors. One of the advantages, again, of that modern system is that a lot of the same collector bases, the back ends, are getting information from the applications as well as the hypervisors and the OSes, and the customers are able to cross-correlate events. The simple way of looking at it: the app works fine at 1pm, the app is broken at 4pm; what the hell happened between 1 and 4pm, right? If you look at the historical way, the way that worked was: the app at 4pm has got messed-up connectivity, so I'm going to call some networking engineer. Hey, networking engineer, my app's messed up. The networking engineer starts from scratch and tries to figure out what the heck went on. But by folding back into the same framework everybody else is going into, you can actually go back and recreate timelines. And there are a number of people (I know the MSDCs are doing homegrown things, and there are some people working on startups and commercial offerings) going through and pruning connectivity graphs: if the app was talking to these other endpoints, what other things along the way might have been affected across that time frame? So now I can cull what I'm going to pay attention to and find the one thing that caused my app to be messed up, and, just as important, the things that didn't, so I understand the behavior.

So the question is, as you go forward with this, is there some standard diagnostic or orchestration framework that every one of these can actually plug into?

Not that I'm seeing right now. What I'm still seeing is the classic: everything goes into a common data store, and everybody goes back and looks at it. And the reason why is that it's kind of like herding cats. Every app developer wants to push back some information, whatever they want to push back: I don't want to know about your framework, I'm going to push back this stuff. The network person is going to do this; the hypervisor is pushing back its information. Trying to get all those people to standardize on some format is kind of a losing proposition. So I think, realistically, what I'm seeing (maybe the vendors can speak to it some) is people just figuring out how to normalize the content as it comes in, or going back through and searching on the things that are meaningful to you, and creating plugins for whatever your different endpoints are.

Okay, thank you very much.