First of all, for anybody who knows me, and I know there are a couple in the audience who do, I apologise for wearing a shirt. My wife made me promise not to look too much like a geek up on stage. My name is Dave Pigott, and I'm the tech lead for the Linaro LAVA lab. I'm going to be talking today about the challenges in deploying test devices at every scale, from the very low end up to the very high end, from IoT all the way up to server class. First I'll talk about LAVA itself: how it works, how we deploy our test images, and what we do if a board gets bricked while we're deploying a test image. Then the issues surrounding connectivity, and the physical constraints on deployment: everything in the Cambridge lab is in racks, and that produces challenges. I'll talk about administration, how we manage the devices and manage access to the lab, then look at how we test things out before we actually deploy new software, new boards, whatever, and finally at the whole concept of very large-scale deployments going forward.

So, what is LAVA? It's the Linaro Automated Validation Architecture. The first implementation started in late 2010. LAVA does not define what tests you run; you can run any test you like. It just lets you deploy a test image, so a kernel and a root filesystem, or if it's an IoT device just the Zephyr image or whatever, then specify what tests you want to run, run them, and gather the results. The first iteration was successful, it was good, but it had some limitations, and when Neil Williams joined the company back in 2014, he said: we need to completely change this. So we built a second iteration. You wouldn't believe how many meetings we had about what we should call the second iteration; we came up with "V2". That was rolled out fully into the Linaro lab in Cambridge last year, and we've got rid completely of V1 and its history.
There is a database archive, so we do still have access to the old data if we want it. So, what are the challenges when you're trying to do automated testing? Ideally, the device should boot when you apply power, because if someone has to press a button, that gets a little bit boring, challenging. We had conversations going way back about having little robot fingers to press buttons; that's not the way we solve that sort of problem. You're going to have multiple devices connected to one server. LAVA is, well, not quite a federated environment, but it will be; it's a master/slave environment, with one master which dispatches jobs to slave servers, "dispatchers", in fact, we call them. If you've got, say, ten devices connected to one server, you have to be able to uniquely identify each device in several different ways. If it's a USB serial device, the USB serial number needs to be unique so that you can get a unique serial connection. If it's a fastboot-type USB device, its fastboot ID has to be uniquely identifiable. We use serial concentrators, which allow us to just telnet onto a board through the concentrator, but not everything has a standard old RS-232 serial port. You need a serial connection to the device so you can interact with it. LAVA does everything, OK, there's a caveat to that, but I'll come back to it: 99% of all LAVA interaction is through a serial interface. We don't rely on SSH, we don't rely on a network; we just have to have some sort of serial access, and it has to be reliable. That in itself can be challenging, but I'll go back to that caveat. With IoT devices, you don't have an interaction layer at all: you flash an image, then watch the serial output for the results, parse them, and they get collated and reported back up. These are the things we have to address with every device.
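On Linux dispatchers, the unique-USB-serial requirement is usually met with the stable symlinks udev creates under /dev/serial/by-id. Here's a minimal sketch of the idea; the FTDI-style entry names and serial numbers are made up for illustration, and this is not the actual LAVA device-type code.

```python
# Sketch: resolve a device's stable serial port from its unique USB
# serial number, using the standard Linux /dev/serial/by-id layout.
# Entry names below are illustrative, not from a real lab.
import os

BY_ID_DIR = "/dev/serial/by-id"

def find_serial_port(usb_serial, listing=None):
    """Return the stable device path whose by-id name contains the
    given USB serial number; fail loudly if it's absent or ambiguous,
    because two boards answering to one serial number is exactly the
    infrastructure error you want to catch early."""
    if listing is None:
        listing = os.listdir(BY_ID_DIR)
    matches = [name for name in listing if usb_serial in name]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one port for {usb_serial!r}, got {matches}")
    return os.path.join(BY_ID_DIR, matches[0])
```

This is why cheap serial adapters that all report the same (or no) serial number are unusable at scale: the lookup above becomes ambiguous as soon as you rack a second one.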
So, what if the board gets bricked? The first thing we have is remote power control: we use software-controllable PDUs with a PDU control abstraction layer, which I wrote. It's an SNMP client which talks to our PDUs, but there's no reason why you couldn't add support for other PDU types. In fact, there was a predecessor to pdu-control that basically did everything over a serial connection to the PDUs, written by Matt; I don't know if Matt's here, but I shall name-check him. The other thing: when you submit a job to LAVA with some tests, the job can fail for a number of interesting, challenging reasons, one of which is an infrastructure problem. If LAVA detects a problem with the serial connectivity, with the fastboot flashing, with the network, whatever, we run a health check, a known-good job with a known-good image, and if the board then fails the health check, we take it out of the pool, offline. I mentioned the whole idea of a robot finger to press buttons. What we've ended up doing is sourcing some Ethernet-controlled relays for emulating the push of a button. Wherever the button is, you just put a wire on each side of it, bring those out to the relay, and then, through another abstraction layer, you control the relay: you say, I want that relay to go off for two seconds and then back on, or the other way around, depending on the device. Then there's the question of what happens if the board is completely bricked and even the firmware is not reliable. Well, you can reflash the firmware on some devices, not all. Some devices just give you that out of the box; some require a relay connection to put them into a recovery state; and on other devices it's just not possible. So it is completely board-dependent, and we have to come up with solutions on a per-board basis.
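To give a flavour of what such abstraction layers boil down to, here is a minimal sketch in Python. The OID is the APC-style switched-PDU outlet control table as I remember it; the hostnames, the community string, and the exact shape of the real pdu-control tool are assumptions for illustration, not its actual code.

```python
# Sketch of PDU power control (via net-snmp's snmpset CLI) and a
# relay "button press". OIDs and names are illustrative assumptions.
import subprocess
import time

# APC-style switched-PDU outlet control OID; other vendors use
# different MIBs, which is exactly why an abstraction layer helps.
OUTLET_CTL_OID = "1.3.6.1.4.1.318.1.1.4.4.2.1.3"
ON, OFF = 1, 2  # integer outlet states in the assumed MIB

def snmp_set_args(pdu_host, outlet, state, community="private"):
    """Build the 'snmpset' command line to switch one outlet."""
    return ["snmpset", "-v1", "-c", community, pdu_host,
            f"{OUTLET_CTL_OID}.{outlet}", "i", str(state)]

def power_cycle(pdu_host, outlet, run=subprocess.run, sleep=time.sleep):
    """Hard-reset a board: outlet off, pause, outlet on.
    'run' and 'sleep' are injectable so the logic is testable."""
    run(snmp_set_args(pdu_host, outlet, OFF), check=True)
    sleep(5)
    run(snmp_set_args(pdu_host, outlet, ON), check=True)

def press_button(set_relay, seconds=2, sleep=time.sleep):
    """Emulate a button push with an Ethernet-controlled relay wired
    across the button: close the contacts, wait, open them again."""
    set_relay(True)   # relay closed == button held down
    sleep(seconds)
    set_relay(False)  # relay open == button released
```

The injectable `run`/`set_relay` callables stand in for whatever transport a given lab uses; swapping the command builder is all it takes to support a different PDU vendor.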
And all of that feeds into the fact that we have to do solder mods, which, if you're going to deploy, as we have, around 200 devices in a lab, doesn't scale well. Lots and lots of solder mods don't scale, there's the danger that you might somehow break the board, and the solder joints themselves can be flaky. This came up yesterday: there's a thing called the SD mux that has been floating around for about all the time I've been at Linaro, which is eight years. The idea is that you plug something into the SD slot on a device; you can write an image onto it from the server, then switch it so that to the device it just looks like an SD card. I don't know how many iterations of this we've had. It started in Orlando in 2011. I won't mention the company or the name of the guy, but he turned up with this thing, and it was like, oh, this is brilliant, we've solved all our problems, except it would work with one board and no others. We had another iteration the next year, a thing called the LAVA Multi-Probe, the LMP, which had an SD mux. It worked most of the time, and then performance degraded over a period of a couple of hours, not even days. We've had other iterations sent to us as contributions, and there's another one which has just turned up, which I heard about in the last couple of days. Maybe that will help. If we could find the perfect SD mux, it would solve an awful lot of the problems for our low-end deployments, but we're not there yet. It has been a bit of a nightmare: seven, eight years of trying to get something that works. The other thing we've learned, rather painfully, over time, and a lot of this has come up in the last two years: we have a deployment called LKFT, the Linaro Kernel Functional Test. It's an isolated LAVA instance in the Cambridge lab, and we had a relatively high failure rate in terms of infrastructure.
And those failures were serial, USB connectivity for fastboot, networking, you name it. So one of the first things we did was go and get really high-quality serial cables. I hate to name-check somebody, but we buy FTDI cables, and they are so reliable. Likewise, for USB connections for fastboot and the like, we bought shielded USB cables. They're much more expensive, but who knew? When we first started the project, our infrastructure error rate was running at around 30%; it is now below 1%. We have 99% infrastructure reliability, which was like a dream when we started this project. The other thing: we had a challenge which goes back a long, long way, the USB hubs. We have spent a lot of money on USB hubs over the years, because all the devices need to be connected, they need to be available on USB, and we had reliability issues. It didn't matter how much you spent on the USB hub; after a period of time, the kernel on the server it was connected to would just start to go, "I don't know that there's even anything connected any more." The other challenge, particularly when the 96Boards project started delivering hardware to us, was that there's only one USB controller on the Consumer Edition 96Boards, and there's no Ethernet, so we needed USB Ethernet. But you also need USB on the OTG port for flashing your images, and using the OTG port disables the other USB ports. If you physically have the board at your desk, that's fine: you just go, oh, I'll unplug the OTG, and that flips over and enables the other USB ports. You can't do that in an automated environment. So, going back nearly two years, somebody at Arm sent me an email saying, oh, there's this little company that makes really good USB hubs, and they're here in Cambridge. Having bought numerous USB hubs over the years, I thought, yeah, OK, I'm sure this is going to be fantastic, we'll take a look. So I contacted this company.
They're called Cambrionix, up at the St John's Innovation Centre in Cambridge, and I asked if we could borrow one to test it out. Sure, no problem, they said, and they sent one. It was coming up to Christmas, and I remember thinking, oh, I'll just unpack it and take a look. It's a very industrial-looking thing with 15 USB ports; the claim on the box is that it guarantees 2.1 amps maximum load per port, and it comes with a very, very big power supply. All I did was plug it into my laptop, and lo and behold, it turned up as a serial device. OK? So I did a little hackery with a ser2net config, telnetted onto it, and got a command line from a USB hub. I typed "help", and it came up with a load of things I could do, one of which was to control the power on a port: I could control the power on all the ports, or just one port at a time. A little Python later, I had an abstraction layer that allowed us to say, I want that port on, or, sorry, in sync mode, or off. Typically we either power a port off or put it in sync mode, because we need data, not just power. The script allows all sorts of other things; you can find out the state of a port, for example, just to be very sure of where you are. What's more, it's unbelievably reliable. In a year and a half, two years, I have not once had the kernel going, "I don't know anything about any USB devices any more." So I then went back to Cambrionix and said, this is great, I'll buy ten, and then he told me how much they were. They're a bit more expensive than most USB hubs. I did buy ten, and we now have something like 30 in the lab. Spending probably about eight times more than you would on a high-quality USB hub has been absolutely worth it for the reliability.
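That abstraction layer really comes down to writing text commands to the hub's serial device. Here is a sketch of the idea; I'm quoting the command syntax ("mode sync 5" and so on) loosely from memory, so treat the exact strings, and the mode names, as assumptions rather than the hub's documented protocol.

```python
# Sketch of a managed-hub port abstraction. The command strings are
# assumed, not the hub vendor's documented protocol.
class HubPort:
    """One port on a Cambrionix-style managed hub that presents a
    command line over a serial device. 'send' is any callable that
    writes one command line (e.g. a pyserial write); it's injected
    here so the logic is testable without hardware."""

    MODES = ("sync", "charge", "off")  # assumed mode names

    def __init__(self, send, port):
        self.send = send
        self.port = port

    def set_mode(self, mode):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.send(f"mode {mode} {self.port}\r\n")

    def query_state(self):
        """Ask the hub to report this port's state."""
        self.send(f"state {self.port}\r\n")

# Typical lab usage: drop the data link, then bring it back in sync
# mode before flashing, which is what a physical replug would do.
sent = []
port = HubPort(sent.append, 5)
port.set_mode("off")
port.set_mode("sync")
```

Being able to do that replug in software is what makes the single-USB-controller 96Boards problem tractable in an automated lab.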
So if you are trying to do any sort of large-scale deployment, it is worth spending that money, because if you don't, you're going to end up with 30% failure rates, that sort of nonsense. Then there's the physical aspect of how you're going to deploy something. You get a wide range of form factors: IoT boards can be tiny, and you've got servers which come in 2U, in one case 3U, forms. Anything with a relatively small form factor we tend to just put on monitor shelves, fixing the boards onto the shelf using various techniques. Then there are boards that come in odd form factors. An example is the Versatile Express platform: it comes in a box, a very nice-looking box, but when you're trying to do large-scale deployments, it's not great. So you re-house it in a 1U case, which costs you about £40, and you end up being able to deploy many, many more into the equivalent amount of space in a rack. One of the things we are doing now, because of LKFT, is looking at how we can get more of the small-form-factor boards into one case, and we started doing this manually; I'll come to the future of that towards the end of the presentation. The next problem: there are actually five, six LAVA instances in the lab. One for networking, one for just general day-to-day testing, LKFT, the power management group have their own, the LITE group, our IoT group, have their own instance, and then we've got a couple of staging instances. You have to manage the configuration of all of this, because all those tools I was talking about, the SNMP pdu-control, the USB hub control, all of that stuff has to get onto each of the dispatchers so that it's available for LAVA to use. So all of that server and LAVA configuration we hold in a Salt repo.
I don't know if everybody is aware of Salt, but it's a very good configuration management tool. Basically you have one central repository, and then you push the changes out to all the different dispatchers from the one master server. We use Ansible for user account management. I've got the links to the various bits and pieces: if you look at lavalab.git you'll see all of those, they're in shared lab scripts, with all the stuff specifically for that lovely control and monitoring, and in Ansible there's all the user account management. Added to that is VPN access, because sometimes people don't just want to submit jobs remotely, or via a bot or whatever; they want to be able to actually talk to a board. There's a thing called a LAVA hacking session which allows you to do that: you submit a job as a hacking session, it deploys the image you want, gets the board powered up, and then gives you SSH access, as long as you are within the lab network. For that you need VPN; you can do it with remote SSH access into a gateway box, but the principle is still the same. Then there are people who want to do other types of testing, where they may want to do lots of reboots, and a hacking session stops the moment the board reboots: you're kicked out. If you want to get onto a board and play around a lot more, we give you SSH access through the VPN onto the control server, and then there's a thing called the "develop LXC" script. It just runs a container, brings the device up, and passes all the device information through to the container, and you can do what you want then: you can telnet onto the board to get serial access, you can flash anything you want onto it, anything you want. That goes outside of LAVA, but it allows another layer of interaction, and our power management working group tend to use that quite a lot, because they're testing out all sorts of weird things that happen during boot while they're doing power measurement.
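To give a flavour of what "one central repo, pushed to every dispatcher" looks like in Salt, here is a minimal illustrative state; the package names, file paths and source paths are invented for the example, not taken from the real lavalab repo.

```yaml
# Illustrative Salt state: make sure the lab tooling and its config
# are present on every dispatcher. All names here are hypothetical.
lab-tools:
  pkg.installed:
    - pkgs:
      - snmp        # client tools for the PDU control layer
      - ser2net     # serial-over-network access to boards

/etc/lab/pdu-control.conf:
  file.managed:
    - source: salt://lab/pdu-control.conf
```

Running `salt '*' state.apply` from the master then converges every dispatcher on the same configuration, which is what keeps five or six instances' worth of tooling consistent.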
I mentioned staging instances briefly. We have two: one for the main LAVA production instance and one specifically for LKFT, the functional test. It's very important, because there's a new release of LAVA practically every month, and you can't just install the new release and go, oh, I'm sure everything will be fine, because often it's not; something fundamental may be broken. So we have at least one instance of every device type in our staging instances, and that way we get to test out the new releases of LAVA. Also, and this is really important for LKFT, when there's new firmware available, a new bootloader available, you don't just put it on a production instance and go, I'm sure everything will be fine, again, because it never is. It really isn't. We have had so many, "oh, this fixes everything, don't worry", and we've gone, OK: it fixes everything; it has broken everything but the one thing it fixed. So if we get new firmware, we always test it out in the staging instance, and we test it practically to destruction, because we have to. We have to provide a service that is high-reliability, high-availability, so we cannot take those risks; we have to be risk-averse. We're providing a service. So, I mentioned large-scale deployments: we're looking for ways to scale up massively. We're working with a third party to fit 16 boards of the 96Boards CE form factor into 1U, and that is more of a challenge than you'd think, because you've got to get in the power to drive all those boards, and all the serial, and all we want is the ability to plug in one network cable, one power cable and one USB cable, with everything else done within that enclosure. That's actually the latest design template I've got from the third party; if anybody wants to talk to me about it in detail, I can do so outside the session. If we can do that, 16 boards in 1U, we could have hundreds of boards in one rack.
Reliability is going to be key in this, but I have great faith in the company doing it, and the scalability is going to be enormous. It's going to be a huge benefit to us going forward, because we have requirements where we will need hundreds of a particular board type, or even mixes of board types; it doesn't have to be just one board type in an enclosure, and as long as they're the right form factor, you can have a complete mix. That's where we're going. Those are the challenges we've faced, and that is the end of my presentation. Are there any questions? There are microphones here and here. That's on.

Q: Is there any particular reason why you use Salt and Ansible separately? I believe you could use only one of them, right?

A: It's history. Originally we used Salt for everything, and that was initiated by Andy Doan, going back a few years. Then Ansible became the thing being used by other areas within Linaro. I have a project to migrate everything to Ansible, but everything is in Salt at the moment, and it works. It's one of those things: yeah, I don't want to change it. Anyone else? Thank you very much.