Welcome to my talk about lessons learned from automated testing and continuous integration and delivery. A quick introduction: I joined Toradex in 2011 and spearheaded the embedded Linux adoption there. We introduced an upstream-first policy, and we are amongst the top-10 contributors to U-Boot and the Linux kernel, as Tim confirmed again. We also have an industrial embedded Linux platform called Torizon; it's all based on mainline technology.

What will we cover today? We'll quickly look at the goal of any such automated testing and CI/CD effort, then at the landscape we have nowadays at Toradex in that area, and of course at the lessons learned. Then I will open the discussion; I'm also interested to learn what your experience is.

So what is the goal of this whole automated testing and CI/CD stuff? When we set out about three years ago to improve on that, our goal was to deliver more often and better-tested software, and of course, in that process, to catch any regressions. As far as automated testing is concerned, any kind of manual testing is very cumbersome. And since we work upstream-first, we also wanted to make sure that the quality of the patches we send to the upstream mailing lists is already decent, so you don't annoy the maintainers all the time. Such things can be done quite easily with this kind of infrastructure.

Let's have a look at what that infrastructure landscape looks like. For source code management (SCM) we use GitLab. As build infrastructure we use Jenkins, also pretty standard stuff. For DevOps artifact and binary storage we use JFrog Artifactory, and for the automated testing we have a LAVA board farm. Of course, that infrastructure does not stand on its own; you can link it to further infrastructure. As this picture shows, if you use something like Slack, you can integrate it so you get nice notifications about whether your builds and tests are all working or not. So when the team comes in in the morning, they already know what to look at and where to concentrate. We also use the whole Atlassian stack, Confluence and Jira, and there are integrations there too, so you get some emails automatically, like the Slack integration shown here.

So far we run this whole infrastructure on premise. One reason is the board farm: it will require local attention and monitoring anyway, it keeps being extended, and we add new SKUs all the time, so we really have to maintain it ourselves. Another reason is that when you run hundreds of Yocto Project builds every night, like we do, that is not for free in a commercial cloud; you can just as well buy a bunch of servers for your own server room and run them there. And finally, you basically lose the knowledge about maintaining such infrastructure if you run it in the cloud. At least for us as a hardware vendor, you will need some of that knowledge anyway, because you want to run it on your own hardware and you have to maintain that yourself in any case.

Now, GitLab is one of the widely used SCMs nowadays. Any source code or configuration change requires a merge request, and for such merge requests we have a review process, so it at least needs a reviewer and an approver. And one thing we do with our upstream-first approach is an internal pre-approval process, like I mentioned earlier. That way you can easily run checkpatch and all this stuff on it automatically, and your whole team knows who is working on what and can give feedback on those patches before we even send them to the mailing list.
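To illustrate the idea, here is a minimal sketch of such an automatic checkpatch stage in GitLab CI. This is not our actual configuration; the container image and the way the commit range is derived are assumptions.

```yaml
# .gitlab-ci.yml (sketch): run the kernel's checkpatch.pl on every
# commit a merge request adds, before a human reviewer looks at it.
checkpatch:
  image: debian:stable            # assumed; git and perl must be available
  variables:
    GIT_DEPTH: "0"                # full clone so all MR commits are present
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - git fetch origin "$CI_MERGE_REQUEST_TARGET_BRANCH_NAME"
    # Feed each new commit as a patch into checkpatch; fail on findings.
    - |
      for c in $(git rev-list "origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME..HEAD"); do
        git format-patch -1 --stdout "$c" | ./scripts/checkpatch.pl - || exit 1
      done
```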
Then, of course, the GitLab CI stuff can catch regressions. Nowadays most upstream projects already have some form of automated testing or CI/CD available; U-Boot, for example, also uses GitLab, so they even provide you all of that. One thing you have to be careful with: we have to cope with various branches and versions of things, and then it might get tricky to have one CI configuration that works for everything. You will probably have to put in quite some integration effort: different tooling, maybe different toolchain versions, depending on what downstream versions of things you might still have in there.

I have some screenshots here, but my plan, if the VPN holds out, is actually to show you this live. Here I am logged in over the VPN; this is our GitLab instance for my team. For example, for the upstreaming work we have a project that continuously synchronizes with the latest upstream master. Right now there is no merge request open, but in the history of merge requests you can see all the stuff we submit. For the latest module, the Texas Instruments-based one, I worked on the initial support, so you can find the merge request, and when we submit it we also link it, so you can actually see where it went on the mailing list. One advantage is that it's all nicely documented in one place now, and one idea is that at some point we might make more of that information available to the public, so customers could see what we're working on.

Now to the actual builds. Of course you can also build things in GitLab, but we don't build entire Yocto Project images from within GitLab. For that we have the Jenkins build infrastructure, which basically takes you from source code to the artifacts and then also triggers the automated tests on those builds. Again I have some screenshots, but we can look at it live over the VPN: this is our Jenkins instance. You can see we have different projects; right now I'm showing the Toradex reference image, which is basically the lower-layer BSP Yocto Project stuff. We have various branches, and here, for example, we can look at master, the cutting-edge Yocto branch. If I click here you can see I actually ran one while at the conference. We usually run the master builds once a week, while the stable branches run every night. I can click on a concrete instance and you see all the targets we have and all the builds that have been run; we will see more of that later.

Like I said, the artifacts are then stored in Artifactory. This is mainly very useful as we now also do this industrial embedded Linux distribution: that of course has automated OTA updates and all this kind of stuff integrated, and it pulls everything from Artifactory.
Artifactory can of course also handle things like SBOMs, and we also store all the test reports in there, so you have all of this nicely together. Again, here are some screenshots, but live: when I click on a test report, for example, you see it is all pulled together and stored in Artifactory. It's also linked, so you can click further; here you can see, for example, that on a QuadXPlus there was some issue with audio. We will get to that in a moment.

For the actual testing we have a LAVA automated-testing board farm. We organize it in shelves: as we have different families of SoMs, one shelf carries a whole bunch of carrier boards from the same family. Such a shelf is controlled by a so-called shelf controller. Right now this is still a Cypress USB microcontroller, which takes care of the power, recovery-mode and reset signals and all this kind of stuff. It also integrates a USB hub and even FTDI USB-to-serial adapters. Then we usually have an Ethernet switch on there and, like I said, the carrier boards. The goal of this design was that one such shelf only needs one Ethernet cable, one power cable and one USB cable, and is otherwise self-contained. We also use the same infrastructure for validation and verification purposes, for example in a temperature chamber: as a hardware provider we of course want to make sure that our stuff runs across the whole temperature range. Then you need the power supplies, of course, and there is a lot of accompanying infrastructure for testing: for every hardware interface you want to test, you need some kind of device or loopback that you can actually test against.

In our LAVA farm the goal is that we start off in recovery mode, so we really start from scratch: there is nothing on the modules, and then we flash the entire Yocto Project image, so that it boots as close to real life as possible (see the job sketch after this walkthrough).

Again, here is a screenshot, but if I click here, where we left off, you get to the LAVA instance and it shows you exactly what went wrong; you can see that this audio test had some kind of a problem. If you scroll through the whole test run, starting from the top, it actually starts with flashing the image: it creates the whole LAVA overlay, then eventually does the recovery-mode flashing with the uuu tool, and then eventually comes the regular boot. So you see there is a lot of setup involved in this.
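As promised, here is roughly the shape of such a job. This is a minimal sketch, not one of our real jobs: the device type, URLs and test repository are made up, and I'm assuming LAVA's "flasher" deploy method for the recovery-mode flashing.

```yaml
# Sketch of a LAVA v2 job: flash a blank module from recovery mode,
# boot the freshly written image, then run a test suite against it.
device_type: some-toradex-som          # hypothetical device type
job_name: nightly BSP image test
visibility: public
timeouts:
  job:
    minutes: 60

actions:
  - deploy:
      to: flasher                      # custom recovery-mode flashing
      images:
        image:
          url: https://artifacts.example.com/nightly/image.wic   # made up
  - boot:
      method: minimal
      prompts:
        - "login:"
  - test:
      definitions:
        - repository: https://example.com/lava-tests.git          # made up
          from: git
          path: tests/audio-loopback.yaml
          name: audio-loopback
```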
Here I have some impressions from the board farm; there is one such shelf shown. This particular one, with all the heat sinks, carries Ixora carrier boards with Apalis modules in them. This is the top view of it; in the upper-left corner you can see the shelf controller, which we are currently in the process of redesigning. Like I said, the current one is still fairly low-level with an old-school Cypress USB microcontroller; the new one will use an STM32, which will also feature Ethernet, because all that USB is a bit of a limitation in this case. It also integrates a USB hub; like I said, the goal was to have only one USB cable per shelf. The other cables you see do the whole power on/off, recovery-mode and reset handling.

Here is a close-up of one of these Ixora boards, and you can see how tightly all the cables go to it. On the upper left are the UARTs: the upper one is the console UART, which goes to an FTDI adapter; the second UART below it has a loopback so you can do loopback testing. Then we have further interface signals connected: up here we have audio, where you just loop line-out back to microphone or line-in; on USB you can have some USB storage device; the same for SD card, where an SD card is plugged in; on HDMI we have one of these fake-EDID dongles, so it even brings up the whole graphics stack; and on the right side there is another connector with various signals which we loop back, for example SPI, and we have some stuff on I2C. These are the power supplies; you need quite a couple of kilowatts of them. We now have close to a hundred test SKUs in there, so a hundred modules that get tested in every round; that is what is shown here with all these shelves.

Okay, lessons learned. We have now been doing this quite intensely for two, three years. One thing: 99% reliability is not really enough; the whole processes and infrastructure must be reliable. You will always have some sporadic issues on some devices, and the problem is that if you scale up to really a hundred devices, such failures scale up as well, and if you have problems with your stuff, the stakeholders will lose trust in the whole infrastructure.

One way to go about that is to make the infrastructure very resilient. Even if some hardware issues remain, and like I said you will always have some (even SD cards eventually start failing and have to be replaced), the overall test infrastructure really needs to cope with such single failures. It can't be that the whole test run aborts; it has to overcome the failure and continue as much as possible.

Another thing we saw is that scaling is not totally linear: the more boards you have, the more different issues you will see. If you have a single device on your desk, flashing it is easy, no problem. But if you have a hundred connected to a server, well, USB is nice, you just take a bunch of hubs and hook it all up, but you will see that eventually the server's USB stack, for example, might go bad. We saw that about once a week the server's USB stack just crashed, and then none of your tests really work, and you have to put a lot of effort into investigating and making sure it recovers. One thing we did there is a health check: in such a health check you, for example, reset the whole USB stack, things like that, so that you really recover from any situation that can arise. In the end you only find these things out through the real-world issues you see, so you will have to put a lot of effort and time into investigating every such issue and fixing it.
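We didn't show our actual health check, but the kind of USB-stack reset meant above looks roughly like the following. This is a sketch only, assuming an xhci host controller on the server; I've expressed it as an Ansible task in the spirit of the configuration management discussed next.

```yaml
# Sketch: recover a wedged USB stack by unbinding and rebinding the
# xhci PCI host controller (paths assume a typical x86 server).
- name: Reset the USB stack on a board-farm server
  become: true
  ansible.builtin.shell: |
    for dev in /sys/bus/pci/drivers/xhci_hcd/0000:*; do
      id="$(basename "$dev")"
      echo "$id" > /sys/bus/pci/drivers/xhci_hcd/unbind
      sleep 1
      echo "$id" > /sys/bus/pci/drivers/xhci_hcd/bind
    done
```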
Then a very important thing is to manage this whole infrastructure just as you manage code. Every device you add to the fleet really needs to go through a proper process. The other advantage of doing it that way is that you basically document your whole setup by design: you can just go in there and see that on that day we added this device; it's all nicely documented. We also use Ansible to set up and manage the whole infrastructure, so even the setup of the infrastructure itself is reproducible (a small sketch of what that can look like follows below).

Then, of course, it's also about people, just like anything else. It takes a lot of teamwork: you need different people with different skill sets working on such an infrastructure, and you cannot just blindly outsource that to somebody; you really have to work together. One key thing was that the software and infrastructure teams had to collaborate very tightly, and only then did we get to the point where reliability and everything improved dramatically. If you just ping-pong issues between teams, you will likely never succeed; you really have to work together and just do the stuff.

And finally, such test infrastructure is not just a project; it's an entire process, kind of an organic system. You cannot just set it and forget it: you will have to constantly work on it and maintain it. Versions of things change, and you will have to adjust your tests because maybe something even changes upstream; theoretically the ABI never changes, but that's not quite how it works in practice. So you will have to constantly maintain it; it needs love over time, basically.
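As mentioned, we keep the setup reproducible with Ansible. Our playbooks aren't public, so this is just a hypothetical sketch of the idea: the host group, variables and file names are made up.

```yaml
# Hypothetical playbook: rebuild a board-farm worker from scratch.
- hosts: board_farm_workers
  become: true
  tasks:
    - name: Install the LAVA dispatcher and serial-console tooling
      ansible.builtin.apt:
        name:
          - lava-dispatcher
          - ser2net
        state: present

    - name: Deploy the serial-port mapping for this shelf
      ansible.builtin.copy:
        src: "files/{{ inventory_hostname }}-ser2net.yaml"   # made up
        dest: /etc/ser2net.yaml
      notify: Restart ser2net

  handlers:
    - name: Restart ser2net
      ansible.builtin.service:
        name: ser2net
        state: restarted
```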
That's basically it, and I would actually like to open the discussion. I'm also interested to hear feedback from the audience about similar setups: what is your experience, what are your lessons learned, things like that. Anybody?

Maybe we'll start with one question from the stream. Sure, yeah. The question is: why are there two build systems, GitLab and Jenkins? Is that historical, or is there some technical reason for it?

I think it's actually a little bit of both. We had the Jenkins build infrastructure even, I don't know, ten years ago, when we still used just plain Git with cgit or something like that, so we didn't really have any of these modern SCMs in that sense. Another reason, I guess, is that of course one could do full Yocto Project builds even in GitLab CI, but Jenkins is just really nicely suited to such stuff. Okay, thanks. Thank you.

Yeah, I think there are some more. Let's start here. Hi, thanks for your presentation. We'd like to know what your approach is regarding caching on the CI, for example the Yocto caches.

Yeah, caching we definitely do, and we also do some kind of priming. I mean, you saw the whole landscape we have: we build for a massive amount of different targets. One thing you will see is that even if you have a huge cluster and you just start building for all of them at once, they will all build the same things, which doesn't make much sense. So we prime it: you start building for just one 32-bit architecture and one 64-bit architecture, which puts all the common stuff into the cache, and only then do you start building further things much more in parallel. That is also something you find out over time: you can put in twice as many servers and start building everything together, but it will actually not be faster, whereas if you do it a bit intelligently and prime some builds, you will be five times faster even without doubling the servers (a small sketch of this priming idea follows). And it's just the regular sstate cache stuff from the Yocto Project; I'm not fully sure on that detail, to be honest, but I think it's the regular sstate cache.
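Our priming runs in Jenkins, but just to illustrate the shape of the idea, here is a hypothetical sketch in GitLab CI syntax; the stage, job, machine and image names are all made up, and the Yocto build-environment setup is omitted.

```yaml
# Sketch: prime the shared sstate cache with one build per
# architecture, then fan out to all machines in parallel.
stages: [prime, build]

prime-arm32:
  stage: prime
  script:
    - MACHINE=some-arm32-machine bitbake reference-minimal-image

prime-arm64:
  stage: prime
  script:
    - MACHINE=some-arm64-machine bitbake reference-minimal-image

# The wide fan-out now mostly gets sstate cache hits instead of
# every job rebuilding the same common components at once.
build-all:
  stage: build
  parallel:
    matrix:
      - MACHINE: [machine-a, machine-b, machine-c]   # placeholders
  script:
    - MACHINE="$MACHINE" bitbake reference-multimedia-image
```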
Thank you for your talk. Can you give us a rough estimate of how much coding versus testing you are doing right now? You seem to have a very mature testing suite, and if I wanted to bring that to my company, how many resources would that take, exactly?

I know what you mean. It definitely needs a serious amount of resources, basically money. At times we had three to four people working more or less full-time on this infrastructure. Toradex nowadays has of course lots of people, but they are not all R&D people; on the software side we have maybe 20 people in different teams. But you need maybe three to four people who really start working on such infrastructure, otherwise you will never get to such a state.

I have two questions. In case a bug is detected by your CI infrastructure, do you also report it upstream if it affects upstream?

Yes, of course. Nowadays, with these master builds, it can happen that a new RC comes out and that very night, or the next, we already have it tested, and that way we sometimes found things even quicker than the Linaro guys. We had, for example, some SPI bug that showed up in our farm; in the morning when we came in, we hadn't seen this thing quite yet, and then you look into it and you send it to the mailing list, of course.

Nice. I have one more question, about the sstate cache: do you have a per-server sstate cache, or do you share it between the different builds?

I think we share it for one entire build. But the build can run on multiple servers, right? Yeah, I think we have some shared storage. Okay, thanks.

Hello, thank you for the talk. Do you have any experience with simulating hardware interactions, for example if you need to test the scenario of plugging or unplugging a USB stick into the device under test? Do you do that, and if so, how?

No, this is actually one of the further things we want to have a look at. We do test USB, but more or less in a static fashion right now. You maybe also saw my USB talk: I'm a big fan of this role-switching stuff, and it would be very interesting to also test such things automatically, but that requires more hardware, and that is one reason we are now working on the next-generation shelf controller, which will allow much more of such advanced use cases.

Hello, thank you for your talk, first of all. You said that you put the system in recovery mode and re-flash the whole system, and that, for me, is correct when you test the distribution. But you also have OTA, so do you have any clue on how to test that?

That's actually a good point, and I don't think we do any special kind of automated testing for that right now. So you mean that you have a certain state and then incrementally update, just like a customer would also update? I'm from the low-level BSP team, and I think our Torizon team does some level of that, as far as I know. One thing they do for sure is this whole container deployment stuff, because in Torizon you have the whole application-layer stuff in containers: you basically start off with a minimal image, which is actually also a Yocto Project build, and then you also deploy further things. But I don't know for sure whether we do any OTA automated testing yet; maybe Drew would know. So they're building it up. Yeah, I think they're really working on that, exactly. But it's a good point, of course: if you really have a product that is out at customers, you want to test as close to the real-life use cases as possible.

Yeah, first I'd like to say how Yocto upstream is using the sstate cache. We have Michael Halstead, who actually knows it better than me, so ask him for details, but it's a shared NFS server which every build machine reads from and writes to, so it's completely shared between all the builds and it's kind of continuously growing. That's how everybody in the room should set up their Yocto builds to really get the most out of the cache. And my question is, actually, I'd like to pick on Jenkins just a little bit more: if you were building this kind of setup today, would you still choose Jenkins, or would you choose something else?
That's a cool question; I haven't thought about it, to be honest. I'm also open to hearing other approaches; maybe the state-of-the-art approach would be different nowadays. What would your suggestion be, for example?

I don't know; I just know that Jenkins does not spark joy, but still everybody is using it because they set it up 20 years ago and they're stuck with it. That's the only reason. Maybe there should be some next-generation project that takes a more modern approach.

I agree, but I'm not aware of one. I can only say that Yocto upstream is using Buildbot, but I don't know whether that would scale to a real product organization.

Thanks for the talk. You said that you re-initialize the boards on every test, so I was wondering how you do that with LAVA, because I was under the impression that it's mainly focused on uploading a kernel and a root file system.

Yes, we basically had to add our own kind of wrapper for that, that is true. The classical LAVA approach is not like that; I believe that's the regular stuff.

I would just like to point out that we are actually using GitLab CI for building OpenEmbedded images. Yes, one could definitely do that; it is definitely possible, and I'm not seeing any significant limitations with it. Like I said, it was somewhat historic for us: we already had this Jenkins stuff when we introduced GitLab, so we just left the whole Yocto building there. It's working fine; we already went through that pain you mentioned, basically, and once it's running, don't touch a running system. I mean, yeah, you have to maintain it, but on the other hand...

A question at the front here. Now it's really boring, because, well, I use GitLab CI for Yocto builds as well; there's absolutely no reason not to do that, it's perfect. Very good.

Okay, I think we... anyway, yeah, thanks for the talk. I would just continue on GitLab CI: as a single-person company without employees, I just use GitLab CI connected to my own Kubernetes cluster, and then I use a virtual network to my development boards, and with network-controlled power supplies I can even power them on and off automatically. Really good experience with that. Very good, yeah. Thank you. Great, thank you very much.