Hello, everyone. I'm Baptiste Jonglez from Inria, in France. I'm going to present what I call our journey into Neutron and Networking Generic Switch. What we did with this, and are still doing, is bare-metal network reconfiguration for a large-scale research platform. I'll go into more detail, but first, what we do at Inria and in STACK. Inria is the main national research center for computer science in France, with almost 4,000 researchers in all areas of computer science. I'm part of the STACK research group, a small part of Inria, and we do research on infrastructure, so OpenStack is part of our research. People from our group have given presentations at OpenStack Summits and then OpenInfra Summits since 2016. I discovered that there were apparently two summits in 2018; I'm not completely sure about that. We try to do applied research and to give back to the community; that's an important aspect of our group. This work is part of a research platform called Grid'5000. We just celebrated the 20th anniversary of the platform recently, so it's a long-term effort. The goal is to provide experimenters, scientists, with a tool to perform large-scale experimental research. In practice, we provide bare-metal access: researchers can use the platform, reserve some resources, and get actual physical servers. We do energy monitoring, you can change the OS, you can reconfigure the network, you can do a lot of things that are useful for science. We have many different kinds of hardware: ARM, PPC, FPGAs, whatever you want. Since the start of the project, about 2,500 scientific papers have used the platform, so it has an impact. The scale may not seem that big now, we have 750 nodes, but for research we still call this a large-scale platform, even though private clouds can be much bigger than this. There are similar platforms elsewhere; this one is just for France.
Okay, so we are now transitioning from this 20-year platform, Grid'5000, to a new one, called the SLICES Research Infrastructure. This is a much bigger project: it's a European project, funded for, I think, 30 years. It's really the next generation of large-scale platforms. One change is that Grid'5000 was mostly focused on servers, doing VM research, cloud research, and so on, whereas here we want to experiment on the whole edge-to-cloud continuum. That means being able to do experiments like we do on our current platform, with complete access to the hardware, monitoring everywhere, and so on, but on the whole infrastructure: sensors, small servers, big servers, network, everything. This does not exist yet; it's an ongoing project. Just an example of what people do with Grid'5000 currently: a colleague from Grenoble, at CNRS, did some network boot optimization. He wants to boot physical hardware, bare metal, and to optimize the network access. He has an experimental system, NFS with caching, let's say. So he used the platform to experiment with this tool and different methods. And this is something you cannot do with virtual machines, for instance; you need actual physical hardware. Another experiment, which also involved networking: this was a project that used the platform through a European portal, let's say, for next-generation internet architecture. They designed some kind of new way to do routing on the internet, and they wanted to test it at scale. So they used some resources on Grid'5000, some resources in Germany, in Belgium, in Amsterdam, and they could interconnect all those resources using dedicated networks; these are research facilities. So they could test routing in this complex network with actual latency and so on. It's really important to be able to do this kind of research on the actual system.
And this is an example where the user actually needs dedicated networks. So this is the goal of the presentation: how do we provide researchers, like in the two examples I showed, with isolated networks? We do that through network reconfiguration, and the goal is to make it fast. As I said, we provide bare metal. If you want to isolate some nodes, what we do is provide a VLAN to the user and allow the user to put their physical nodes into this VLAN. To do that, we have our physical network and we need to reconfigure it dynamically. We have a pretty large network, let's say. What's important is that there are many small experiments that only need a few nodes, but some of the large-scale experiments can need 50 to 100 nodes, which means 50 to 100 physical ports to reconfigure. And users don't want to wait 10 minutes for this reconfiguration to complete. Ideally, we'd like to do everything in less than a minute, or even less if possible. We already have a tool to do that, called KaVLAN. It uses SNMP or SSH to reconfigure the switches. It's written in Erlang, so it can parallelize things. But we ended up being limited, because you need to define everything statically: there are not enough VLANs for some experiments, and if you want to add more VLANs, you need to do a lot of reconfiguration, because DHCP, DNS, it's all static. That's a lot of overhead. So we wanted to modernize this tool and make it more dynamic: instead of having, let's say, 10 static VLANs for everybody, we would be able to create VLANs on demand, potentially as many as people need, and automate all the DHCP and DNS parts. So we evaluated a bit. When KaVLAN was started, there were basically no other projects doing this, but we looked at what exists today, and actually Neutron is a pretty good fit. The way we map our problem onto Neutron is that people need VLANs.
So we just say: okay, a Neutron network, that's a VLAN; that's already supported. When you want to put a machine into a VLAN, that's actually a Neutron port; you can represent it like this. And then from there, you get subnets, IP allocation, automated DHCP, and so on. So this fulfills our needs pretty well. But there's one missing piece, which is the physical reconfiguration, and Neutron itself does not do that. So we found three plugins to perform this kind of reconfiguration. One using Ansible, which did not look very active. One which relies on NETCONF; we are not very familiar with it, so we did not use that one. And then the last one, Networking Generic Switch (NGS). It's actively maintained, it uses simple technologies, let's say, just Netmiko over SSH, it supports many hardware vendors, and it's used by Ironic, which means it will probably stay maintained. So we decided to use NGS. Just a quick overview of how this all works together. We don't use Ironic, by the way; we have our own solution for bare-metal deployments, so we use Neutron directly. The clients call Neutron to create networks, ports, whatever. Then through ML2, Neutron asks NGS: okay, can you bind this port? Can you create this network? Do whatever you have to do. And then what NGS does is connect through SSH to the actual switch and perform the action. So that's pretty simple. The thing is, in the API call you have to define which switch you want to reconfigure. So here we added a layer of security, let's say, so that you can only access the switches and the ports that belong to your experiment. And you can also see that you could have several concurrent changes to the same switch, which can be a problem with some hardware. There's already something that exists in NGS for this, which is locking: basically you can say, okay, I want only one operation per switch at a time.
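To make this concrete, here is roughly what a per-switch NGS configuration with the locking mechanism looks like in `ml2_conf.ini`. This is a sketch: the switch name, addresses, and paths are placeholders, and the exact option names should be checked against the networking-generic-switch documentation for your release.

```ini
# ml2_conf.ini (sketch; values are placeholders)
[ngs_coordination]
# Distributed lock backend used for the per-switch locking
backend_url = etcd3+http://192.0.2.50:2379

[genericswitch:my-dell-switch]
device_type = netmiko_dell_force10
ip = 192.0.2.10
username = admin
key_file = /etc/neutron/switch_ssh_key
# Allow at most 3 concurrent sessions to this switch
ngs_max_connections = 3
```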
So if you have several, the first one will proceed and the second one will take the lock afterwards. But you can also define a level of parallelism: you can say, no more than three concurrent operations. So that's already there. So we decided to test it: just configure it, give it access to a switch, create a port. And that takes 24 seconds. So we're like, wait, we need to create potentially hundreds of ports; that's a bit too long. So we optimized things. Netmiko is the library that actually does the communication over SSH, and it was not that well optimized, so we went down to eight seconds by optimizing timers. Then the SSH connection itself was slow, because, I mean, a switch is not very fast; we changed the key exchange algorithm and that's a bit faster. And then, okay, we did other things. Nine months later, we went back to the project and it was back to 16 seconds. So, okay, what changed? Netmiko had been updated, and they had changed everything related to timing, so we had to optimize again. We pushed that upstream so that we hopefully won't have regressions like this in the future. And so we went down to two seconds, which is pretty good. But for our needs, it's still too much: that would take almost two minutes for 50 ports. So we needed to optimize even more, and at this point, to get better performance, we needed to change the design a bit. Just to understand what happens, we did some measurements. When you create a port, here is what happens. Neutron does some stuff; okay, we don't really control that. Then you establish an SSH connection; that takes a bit of time. Some internal Netmiko stuff. Then the actual configuration. And at the end, you want to save the configuration on the switch, which also takes a long time. So the idea is to avoid doing all that for every port individually and to do some batching, right? Ideally, we would do every step only once, except configuring the port itself.
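A small back-of-the-envelope model shows why batching pays off so much. The per-step timings here are made-up placeholders, not our measured numbers; the point is only that the fixed costs (SSH connect, session setup, config save) are paid once per batch instead of once per port.

```python
# Toy cost model for switch reconfiguration: fixed per-connection costs
# vs. per-port configuration cost. All timings are illustrative placeholders.
CONNECT = 1.0    # establish SSH session (seconds)
SETUP = 0.3      # library/session setup
CONFIGURE = 0.2  # configure one port
SAVE = 1.5       # save switch configuration

def sequential(n_ports):
    """One full connect/configure/save cycle per port."""
    return n_ports * (CONNECT + SETUP + CONFIGURE + SAVE)

def batched(n_ports):
    """Connect, set up and save once; only the configure step repeats."""
    return CONNECT + SETUP + n_ports * CONFIGURE + SAVE

print(round(sequential(50), 1))  # 150.0 seconds
print(round(batched(50), 1))     # 12.8 seconds
```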
And we could configure, say, 10 ports at the same time, which should improve things a lot. We are not the only ones thinking about this, obviously. In the meantime, people from StackHPC did some work on batching as well. I made this figure by reading their code, so hopefully it's correct, but no guarantee. The way their solution works, they leverage etcd for coordination. The difficult part is that you can have several different processes trying to do things in parallel, and you want to coordinate them and make sure they use a common SSH connection, let's say. To do that, they basically decide that one Neutron process takes a lock on a switch, and every operation, for a given amount of time, will go through this process. So once we know which process it is, everybody pushes to a common queue, and then this thread looks at the queue and executes many commands in a row. Okay? That's the idea. We did our own design at the same time, let's say. Our design is both simpler and more challenging, because what we did is introduce an agent. That makes the coordination much simpler: before, you had several processes trying to coordinate; now, all processes talk to a single agent. The agent gets all the requests, and then it can optimize them any way it wants. Basically, we have a queue for each switch, and we have a dispatcher that tells each thread what it has to do, and each thread just spins and performs all the commands that are in its queue. It's actually a bit more complex than this, but I won't go into the full details. To optimize things even further, we decided to use the bulk API in Neutron. That means you can do a single API call and create 100 ports if you want. We thought it would be more efficient than having hundreds of parallel requests.
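The per-switch queue idea can be sketched in a few lines. This is a very reduced model, not the real agent: all names here are hypothetical, and the real implementation also handles reconnects, errors, result reporting, and RPC. One worker thread per switch drains whatever is queued and sends it as a single batch, so the connect/save costs are paid once per batch.

```python
import queue
import threading

class SwitchWorker:
    """One worker per switch: drains its queue and sends the queued
    commands as a single batch, amortizing connect/save costs."""

    def __init__(self, name, send_batch):
        self.name = name
        self.q = queue.Queue()
        self.send_batch = send_batch  # would wrap a real SSH session
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, command):
        self.q.put(command)

    def _run(self):
        while True:
            batch = [self.q.get()]   # block until there is work
            while True:              # then drain everything pending
                try:
                    batch.append(self.q.get_nowait())
                except queue.Empty:
                    break
            self.send_batch(self.name, batch)

# Demo with a fake backend: submit 5 port configurations, observe
# that all 5 commands are executed (possibly grouped into batches).
batches = []
done = threading.Event()

def fake_send(switch, cmds):
    batches.append((switch, cmds))
    if sum(len(c) for _, c in batches) >= 5:
        done.set()

w = SwitchWorker("switch-1", fake_send)
for port in range(5):
    w.submit(f"interface eth{port}; switchport access vlan 100")
done.wait(timeout=5)
print(sum(len(c) for _, c in batches))  # 5 commands total
```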
But the problem is that you then need asynchronous operations, otherwise it's not really useful. So we implemented asynchronous operations. You'll see it's fast, but in the end it's probably too complex; we'll see. Okay, so that was the design. The batching solution from StackHPC has been upstream for a few months now. Our own design we have implemented, but it's not upstream yet; I think it's been six months, but it's working quite well. Oops, that was a bit too fast. Okay, so to test how fast we can go, we used Rally, which is an OpenStack benchmarking project. I did not know about Rally before, but it really works well. Rally talks to Neutron to ask it to reconfigure switches in parallel, and we measure the total time it takes. Here it's important to understand the measurement we take, because it's not that obvious. What we do is create a lot of ports at the same time, and the question is: how do you measure the time it takes? Do you look at individual ports? Do you take the sum, an average? I mean, it's not that well defined. So we decided to look at the total time from the first request to the last answer, let's say. The advantage of this is that we can compare many different designs, asynchronous or synchronous: in the end, what you care about is, you start sending requests, and how much time does it take until you get the last answer? And that's it. So, just to make sure everything works, we first create ports sequentially. It's not efficient, but it lets us see what happens. "Default" is plain NGS without doing anything. "Lock": with a lock, we make sure there is only one operation at a time. "Batch" is the code from StackHPC. And "agent" is our solution. Here we are a bit faster, but that's not really significant. Since we create ports sequentially, what I believe happens is that we are still able to do some batching, because we try to save the config.
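For reference, the bulk API mentioned a moment ago works by posting a list of resources in a single request: Neutron accepts a plural `ports` key in a `POST /v2.0/ports` call. A request body might look roughly like this (the network ID and port names are placeholders):

```json
{
  "ports": [
    {"network_id": "NET_UUID", "name": "node-1-eth0"},
    {"network_id": "NET_UUID", "name": "node-2-eth0"},
    {"network_id": "NET_UUID", "name": "node-3-eth0"}
  ]
}
```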
And that overlaps with the next request, so basically we're a bit lucky here. Okay, same experiment, configuring ports sequentially on a single switch, but this time we don't save the configuration. And here you see that, obviously, even with no batching you save a lot, because you don't have to save the configuration each time. And all solutions perform basically the same. Okay, now the interesting part. We again create five ports, but this time we create them in parallel. So here we can optimize: we can batch, we can save the config only once at the end, and so on. Obviously, if you lock down to only one operation at a time, you get the same performance as before: you make it sequential. If you increase the number of locks, you get some parallelism. "Lock 2" means two parallel SSH sessions doing reconfiguration, and so on up to five. That speeds things up a bit, but it does not scale that much, because at some point you cannot add more parallel SSH connections, obviously. The batching code works quite well. And we have two variants here: one is our design with parallel requests, and the last one uses the bulk API, so the client does a single API call to configure all ports at the same time. This is more efficient, as we expected. Okay, now we move to a slightly larger scale, 20 ports, and basically we get the same results: batching is pretty good, much better than doing things in parallel or sequentially, and our solution works quite well, especially with the bulk API. In the end, in this experiment, it took about 13 seconds to configure 20 ports on a single switch. We believe that's pretty good: compared to our very first try, which was really, really poor, it's 40 times faster. So we think that's quite good. How much time do we have? Okay, so the next steps: obviously, we would like to...
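The "first request to last answer" metric described earlier is simple to compute from per-request timestamps; a sketch, with made-up timestamps, just to pin down the definition:

```python
# Each entry: (request_sent, response_received) timestamps in seconds.
# Values are illustrative, not measured data.
timings = [(0.0, 2.5), (0.1, 3.0), (0.2, 2.8), (0.3, 3.4)]

first_request = min(start for start, _ in timings)
last_answer = max(end for _, end in timings)
total = last_answer - first_request
print(total)  # 3.4
```

Unlike a per-port average, this definition is meaningful for both synchronous and asynchronous designs, which is why it lets us compare them directly.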
So we upstreamed some stuff, but we would like to upstream the big design change. And to be honest, as I said, asynchronous operations are probably too big a change to upstream, because Ironic would need to adapt to this new design, so probably not easy to do. The implementation is also kind of difficult, because you answer the client before you actually do the changes, and then you change the status of the port afterwards. So yeah, race conditions, this kind of thing; it's not easy. So probably we will try to go back to a synchronous model. The bulk API is nice, but less useful if it's synchronous. And then, about the design of the agent: for now we use RPC to communicate with the agent. Maybe we'll switch to etcd, like in the StackHPC design, if it's easier; we'll see. And one thing we have to watch is Netmiko: it's quite complex, so you can quickly get performance regressions. Okay, so all that was about creating ports, which is the main operation we want to do. But actually, when you create a network, you also have work to do. That's because creating a network allocates a VLAN, and you have to actually configure this VLAN on the switches. And the thing is, you need to add the VLAN on all your switches, because you don't know in advance which ones will be used by the user. Currently this is completely sequential, and remember, we have 32 switches, so that just takes ages. In our agent we parallelize all this: we have one thread to configure each switch, so it's fully parallelized. But still, if you have a failure somewhere... I mean, our network is distributed across different sites, so if one site goes down, it will affect all other sites, and that doesn't sound good. So we have a zone mechanism, still not completely finished, to be able to create the network in a single site only, when that's all that's needed, let's say. Again, I won't go into that much detail.
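The parallel VLAN creation with zones can be sketched like this. The switch inventory, site names, and function names are all hypothetical; the real per-switch step is an SSH reconfiguration, stubbed out here.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inventory: which zone (site) each switch belongs to.
SWITCHES = {
    "sw-nancy-1": "nancy", "sw-nancy-2": "nancy",
    "sw-lyon-1": "lyon", "sw-grenoble-1": "grenoble",
}

def create_vlan(switch, vlan_id):
    """Stand-in for the real per-switch SSH reconfiguration."""
    return f"{switch}: vlan {vlan_id} created"

def create_network(vlan_id, zone=None):
    """Configure the VLAN on every switch, in parallel; if a zone is
    given, touch only that site's switches (the zone mechanism)."""
    targets = [s for s, z in SWITCHES.items() if zone is None or z == zone]
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        return list(pool.map(lambda s: create_vlan(s, vlan_id), targets))

print(len(create_network(100)))           # all 4 switches touched
print(len(create_network(100, "nancy")))  # only the 2 Nancy switches
```

With one worker per switch, the total time is bounded by the slowest switch instead of the sum of all of them, which is why the sequential version takes so much longer.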
But actually, you could also imagine just pre-creating all the VLANs everywhere, which would be even simpler. We still haven't decided which way we'll go. Okay, so this is what we upstreamed, on the left, and what remains to be upstreamed, on the right. Yeah, that's it. This work was done by me and a colleague at Inria; we did both the design and the development, we are a small team, actually. I'd like to thank John Garbutt, because we talked about batching and the design, and Thomas, because he helped me make the batching code work two days before the presentation, so I could compare things. So thank you. Thank you all. Any questions, if we have time? Oh, yeah, there's one. [Audience, paraphrased: how does the user know which switch and port to reconfigure?] Oh, so actually we cheat a bit, because we have another API in front of Neutron, and this other API has a mapping. The user says, I want to configure this node, so that's just the name of the node, and we map this to a physical switch and a physical port. That's also where we do the access control. [Audience:] Neutron? I think I've heard of it, but I haven't looked... isn't this based on C++? No? Cool, thanks. [Audience:] You mentioned Dell OS9, so I was assuming you're running Dell switches. Yeah. Have you tried SONiC? That's... I think we are supposed to get one to test, yes. [Audience:] The problem with SONiC was OSPFv3 for IPv6. IPv6? Yeah, there's no OSPFv3; it's not even available. It's doing just BGP, not OSPFv3, especially for IPv6. Sorry, go ahead. [Audience:] Maybe I missed the beginning of the presentation; I'm wondering why you are not using a mech driver for this, a mech driver to build network ports and stuff. So is that big stuff? No, no. A mech driver, I mean, it's an extension for Neutron to manage the physical execution of how to create ports, networks, and stuff. I never heard about this. [Audience:] There are vendor drivers for networking, for example; Cisco is delivering a mech driver.
[Audience:] Of course, for delivering virtual networks, not for Ironic. Oh, okay. [Audience:] It's great for virtual environments, but I'm wondering if it's possible to use a mech driver? So we have a constraint, which is that we need to use VLANs, and I'm not sure those mech drivers can handle VLANs if it's virtualized networking; I don't know. Actually, the requirement is to give layer-2 connectivity to the user across the whole network. So we could actually use virtual networking, we could use VXLANs and so on. The reason we don't is that we have several different vendors, because it's a national infrastructure, so we don't decide what each lab buys, basically; they have public contracts and so on. So we need to support everything, and the lowest common denominator is VLANs. I mean, every network admin knows how to use VLANs, so that's really the simplest way. But that does make things more complex, because you need to configure each... Yeah. Thanks.