Thank you, everyone, for coming and listening to my talk. I'm going to talk a bit about how to do long-term maintenance for embedded Linux systems, and by long-term I mean something like 10 years or more.

So you understand where I'm coming from, let me say a bit about myself. I've been doing Linux for about 20 years, first as a user, of course. Then in 2008 I started as a freelancer at Openmoko, a company that built a Linux-based phone. It didn't turn out like Android or the iPhone, but it was very interesting and I learned many things there; I also used OpenEmbedded for that. Since then I've worked at a company that builds GSM networks for cruise ships. At first I thought that was really interesting because I could spend my vacations on board, but in the end the software I deployed there over satellite links didn't fail, so I never needed to go there to debug it. A bit sad, but it worked out well. And for the last four years I've worked at Pengutronix. We do embedded Linux systems for customers in Germany, Europe and beyond, in many different markets.

To get an impression of the audience here, please raise your hands. Who of you has developed embedded Linux systems? About 90%. And who of those has systems that are now in the field? Maybe half. The same system in the field for longer than five years? Ten years? Two people? So it is a small niche. And of those few people: whose systems run software, kernel version, glibc and so on, that is still maintained by upstream? No one. That's what I feared, and it means I don't even need to ask the follow-up question. Those systems certainly had vulnerabilities that needed fixing; fixing them within a day, a week, maybe a month at the very most would be acceptable, but on unmaintained versions it usually doesn't happen at all.

Some context for the rest of the presentation. Our recommendations and observations are based on smaller development teams, like our customers and our own teams. By smaller, I mean fewer than 10 people working on the kernel and the platform. Larger teams work a bit differently; some of what follows may apply there as well, but not necessarily identically. Most of our customers have custom hardware, they don't simply run on PCs, so they need customizations at the lowest level of the system. Those customizations move them away from the mainstream server distributions like Debian and Red Hat, and they need to maintain them. Also, most of our customers don't just build one product and sell it for 15 years; they update the product every few years and have many products in the field which need to be maintained and supported. These are lessons learned over the last 15 years at Pengutronix, where we've mostly focused on mainline-based development: having as little difference as possible between the product that is shipped and what is in the mainline kernel and the other upstream projects used on those embedded systems. If anything is unclear, just raise your hand; I think we'll have enough time for discussion later.

A traditional embedded systems lifecycle, which we've seen at many of our customers, is this: you just take a kernel, either from your vendor or from mainline
if it's an SoC that is well supported. You take your build system, add some user-space software, customize it, add your application, do some testing, and you're done. That's what most people hope. What then happens is that they have a maintenance phase of 15 years, and they hope there are no platform changes. Maybe they plan to update their own application because they want to add new features, or because regulatory changes need to be applied to existing products. So they have a plan to update their application, but they usually don't have a plan to update the base system. And that's necessary.

These are some statistics from the CVE vulnerability database for some important components which are probably in all of your embedded systems. I won't go over all of them, obviously, but the takeaway is that you get about 100 CVEs per year in those central components. Not all of them are remote code execution, of course, but many of those classed as denial of service are actually critical for embedded systems: if someone can remotely crash the kernel, your product doesn't work anymore, and maybe it performs some critical function. So it's a critical bug for you, and denial of service is much more prevalent than remote code execution. We've also seen those very large denial-of-service attacks in the last few weeks, originating from digital video recorders, security cameras and so on: all Linux-based embedded devices which haven't been updated in years and are vulnerable to automated exploitation, so people can assemble hundreds of thousands of those devices to perform denial-of-service attacks.

But it's not always unintentional errors, which obviously happen in the mainline kernel as well and need to be fixed. Vendor kernels have never been through the mainline review process, so there has been no review, and you get things like the backdoor that was found in the Allwinner kernel. Allwinner posted their source code on GitHub and people built systems from it, but nobody really looked at what Allwinner had changed. In the default configuration there was a file in the proc filesystem: you just wrote the string "rootmydevice" to it, and your process was changed to root. I'll show below just how little that took. It was not found for about one to two years, and it was in shipped products all that time. What I want to emphasize with this is that you cannot trust your vendor to do the correct thing, because apart from the one or two engineers who made the change, nobody has looked at that code. This would never happen if the patch were posted on the Linux kernel mailing list; people would notice that it is not acceptable. So be wary of what your vendor gives you in enablement BSPs or demo BSPs. Don't just use them for products.

Some observations from our customer projects: as I said, many vendors don't care about security, but they also don't care about maintenance. Usually they design a new SoC and develop the Linux software for it, because they need some sort of Linux support to sell those chips. But as soon as the Linux support is far enough along that the vendor can show all the hardware in the SoC working, the interest drops rapidly.
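To show how little the Allwinner backdoor took to trigger, here is a minimal sketch in Python. The proc path is the one given in public reports of the issue and may differ between BSP releases; treat it as an assumption.

```python
# Minimal sketch of how the Allwinner "rootmydevice" backdoor was
# reportedly triggered. The proc path below comes from public reports
# and may vary between BSP releases. Any unprivileged process could
# perform this write; afterwards it reportedly ran as root (uid 0).
PROC_PATH = "/proc/sunxi_debug/sunxi_debug"  # path as publicly reported

with open(PROC_PATH, "w") as f:
    f.write("rootmydevice")  # magic string checked by the vendor patch
```

No authentication, no capability check: exactly the kind of change that would never survive review on the kernel mailing list.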
So, coming back to vendor kernels: maybe you get another update after one or two years to a newer kernel version, but even then they usually take years from the point where they start developing on some fixed version to the point where they declare it stable, and by then it's already more or less obsolete. If you then start to develop on that base and add your own patches for maybe half a year or a year, it's even more obsolete.

Then some customers thought: we'll use Red Hat or Debian as the base system and add some customizations. That can work for some things, especially if it's based on x86. But those projects don't have established workflows for maintaining a delta against those server distributions, so each project is basically on its own to build infrastructure to keep those changes in sync, to update to newer Debian releases, and so on.

Then there's something you also often see, which was discussed on the long-term stable initiative list before the last kernel summit: we have these long-term stable kernels, the stable kernels released by Greg Kroah-Hartman and others, and embedded developers usually select one of those to build their product on. But after that release, they don't follow the stable release train anymore. They think selecting a stable kernel gives them some benefit, but they don't actually apply the security patches which are then released on top of that stable kernel. So you basically get the worst of both worlds: you start with an already aging kernel, and you don't get security patches either. And since you're one of the only people running exactly that kernel, you don't get the benefit of testing by many users, so that fails as well.

Even the people who have realized that they need to update are often afraid of it, because they don't have a process to test that their product still works after updating the kernel or updating glibc. So they are afraid, or they can't justify the amount of work needed to update the system, and they don't. And we have systems running 2.6 kernels in the field. We've also seen it happen more often lately that vendors basically read about their device in the news, and it's usually not good news. Your reputation is then probably damaged for years to come.

So it all comes down to this: we need to do continuous maintenance, because we have critical vulnerabilities which we need to fix. Usually only a small fraction, maybe one or two percent, are actually critical for a given product, because many don't apply: we may not have the affected configuration, we may not use the affected network protocols, and so on. But some are left, and we need to handle them. Also, most upstream projects, like the Linux kernel, OpenSSL and glibc, do have maintenance releases, but they only maintain them for two to five years, which is not enough for the 10 to 15 years we need in these projects. So we need to move to newer versions, to newer stable releases. And again, the server distros like Debian and Red Hat are not built for this kind of system, which, as we heard in the last talk, needs to be updated without any intervention and keep running for years. There is no admin who regularly logs into the system, checks that everything is all right and fixes what is broken. Their focus is simply different from what we need here. So we need to continually follow the newer versions.
What some people suggest instead is backporting. You select some kernel, you test it, it works for you, and then you monitor what happens upstream: maybe there are new features, maybe there is a security fix, and so on. In the beginning it's easy: you apply those changes to your older kernel, get the fixes, and everything looks all right. But what happens in practice is that after some time you are no longer in the maintenance window of the upstream projects, so they no longer tell you whether the version you use is actually affected by a given bug; you need to figure that out by yourself. At that point it becomes much more difficult to find out whether a fix is really relevant to you. And you pile up patches and changes: the backports accumulate on your kernel, your glibc and so on, and you have to maintain them, because you made them yourself and nobody else is using or testing them. So all the benefit you actually get from using open source software, namely that other people run exactly the same software and have tested it, is completely lost. For every bug you find, you're on your own. That's unsustainable.

So let's take a step back: what do we actually want for these systems? From the moment we notice that we need to do something, we need to be able to apply the fix and get it out to all the devices in a rather short time. I'm talking about a week, hopefully less, maybe a bit more, but not months, because the time from the announcement of a vulnerability to automated exploitation is getting shorter. Obviously we don't want to break things when we deploy fixes; maybe a fix is invasive and could break something, so we need to be able to test that it doesn't break our application. And as small as our teams are, we need enough resources to do that: it shouldn't be too much work, and it should be predictable, so we know months and years in advance how much time we need to invest. When we have multiple projects, we don't want to do this work for each project individually; we want to share the work. Otherwise, after 10 years, if I release a product every two years, I have five products, and if that means five times the work, I can't handle it anymore. And the upstream communities are not lazy either: they develop new features, and that might be interesting for us, because we can ship new features to existing products and keep our customers happy.

From the projects we've done with our customers, and given what we've seen doesn't work, basically only one approach remains: we always need to stay on releases maintained by the upstream projects, because we don't have the manpower to do that maintenance ourselves. We need to rely on the community and do this work together, by using the current stable releases from upstream, and we don't want a large delta against the upstream projects. What you can and should do is disable every feature you don't need: if such a feature is affected by some vulnerability, you're not, and you can skip those updates. And the kernel now has lots of new hardening features which reduce the impact of vulnerabilities; you should enable them.
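To make that concrete, here is a minimal sketch of a check you could run in CI against your kernel .config. The option list is illustrative, not exhaustive; which hardening options exist and make sense depends on your kernel version and threat model.

```python
#!/usr/bin/env python3
"""Minimal sketch: check a kernel .config for hardening options in CI.
The option list is illustrative, not exhaustive; which options exist
and make sense depends on kernel version and threat model."""
import sys

WANTED = [
    "CONFIG_STRICT_KERNEL_RWX",     # write-protect kernel text/rodata
    "CONFIG_RANDOMIZE_BASE",        # KASLR
    "CONFIG_HARDENED_USERCOPY",     # bounds-check user/kernel copies
    "CONFIG_SLAB_FREELIST_RANDOM",  # randomize heap allocation order
]

def check(config_path):
    """Return {option: enabled?} for every option in WANTED."""
    with open(config_path) as f:
        enabled = {line.strip() for line in f}
    return {opt: f"{opt}=y" in enabled for opt in WANTED}

if __name__ == "__main__":
    results = check(sys.argv[1])
    for opt, on in results.items():
        print(f"{'ok  ' if on else 'MISS'} {opt}")
    sys.exit(0 if all(results.values()) else 1)
```

Running a check like this on every build keeps the "disable what you don't need, enable the hardening" decision from silently regressing across updates.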
You review security announcements regularly; that can be planned, maybe one day every week or so depending on how large your project is, and you can schedule time for it, so it's not too much work. There are communities doing this work: you can subscribe to the CVE mailing lists and get updates. If your system is current enough that you can simply act on those announcements, it's not too much work.

For everything you do on your system, you have processes which are well tested during development and perhaps proven in older products, especially for building your software, so that you always know exactly what software is running on each system. If you don't know what's running out there, you can't fix it anyway. So you need to be sure that you can rebuild the identical software, apply a fix and ship that fix. You want automated testing, because if you update your system repeatedly, you don't want to test manually. And you want automated deployment, because everything you do manually takes time and will go wrong from time to time. As I said, each software release should define the complete system, down to the last bit. That doesn't always work, but it's the goal we should aim for, because then we know whether some vulnerability affects us. And make sure you can update everything in the field: if we can't update the kernel, or we do verified boot, find a problem in the bootloader and can't update the bootloader, then we're screwed anyway. It's not that difficult, but it's something you need to set up at the beginning.

So this is my suggestion for how such a workflow should look; we've done this with customers, and it works. The basic changes for most customers happen during the development phase, not during the maintenance phase, because that's the timeframe in which you lay the groundwork for the rest of the process to work. Other people have said this, and I'll say it again: submit your changes to the mainline kernel. Most SoCs, apart from those in mobile phones, which have lifetimes of two years or so anyway, are well supported in mainline for the long term. Most systems which don't use 3D graphics can run with very, very few patches, and only those changes are the part you need to maintain, update and verify on every update; and you're going to do many updates. Reducing that amount is the key to making the rest of the process work. Automate the processes; that's easy enough with continuous integration like Jenkins, automated testing and so on. Do that during development so that you know it works and you're familiar with it, and run it as much as possible to prove that it works for you. And don't choose a stable kernel when you start developing. Base your work on the most recent Linux kernel from Git, and target a stable kernel only when you go into testing, up to the product release. That way you basically save one update cycle and you are current at that point, which means you don't have to spend that effort just getting onto the current state immediately after the product release.

So our suggestion is: once a year seems to be a reasonable timeframe to update all software in your system, kernel, build system, user space, glibc, OpenSSL and so on, to a version that is supported by the upstream projects for the rest of the year, because otherwise you have some timeframe where you are on your own.
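As a sketch of how that yearly check can be automated: kernel.org publishes a releases.json feed describing which stable and longterm series are still maintained. The field names below are my assumption based on that feed as published; verify them against the live feed before relying on this.

```python
#!/usr/bin/env python3
"""Minimal sketch: fail a CI job if the kernel series the BSP tracks
has gone end-of-life. Field names ("releases", "moniker", "version",
"iseol") are assumptions based on the feed as published; verify them
against https://www.kernel.org/releases.json before relying on this."""
import json
import sys
import urllib.request

BSP_SERIES = "4.4"  # hypothetical: the stable series your product tracks

def series(version):
    """Reduce a release like '4.4.30' to its series, '4.4'."""
    return ".".join(version.split(".")[:2])

with urllib.request.urlopen("https://www.kernel.org/releases.json") as r:
    feed = json.load(r)

maintained = {
    series(rel["version"])
    for rel in feed["releases"]
    if rel["moniker"] in ("stable", "longterm") and not rel.get("iseol")
}

if BSP_SERIES in maintained:
    print(f"kernel {BSP_SERIES} is still maintained upstream")
else:
    print(f"WARNING: kernel {BSP_SERIES} is EOL, schedule the update now")
    sys.exit(1)
```

If this runs in the same CI job that builds your BSP, an end-of-life base version shows up as a failing build instead of as a surprise a year later.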
So update your system once per year. While you're at it, check whether you have unfinished patches which might now be possible to get upstream, because the kernel has evolved and the subsystems have improved, so you may be able to get your remaining changes in. And run your automated test suite, so you don't introduce regressions, and maybe improve the test suite if you find gaps. Updating at that point doesn't mean you need to deploy that system: maybe over the year you haven't had any security vulnerability that affected your customers, and you don't see any while updating. Then you can decide to just put the update on the shelf and keep it ready in case you need it.

More often than that, you need to check that your system is secure. Look at the release announcements from the upstream projects, maybe do smaller updates if there's a new stable kernel release and so on, integrate them and let your automated system test them, so that you always know that your source code, your BSP and so on are in a state that works. And when you see a security announcement, look at it, check whether it actually affects you, and then decide whether you need to do an incident response, that is, whether you need to get the fix into the field. But that step is then easy: it's just a small patch, you add it to your BSP, build again, test again and click the deploy button, because you've exercised this pipeline so many times during development that production is no different.

Here are some tools we have used, or that you could use, for some of these parts, and just a few words on each of them. Jenkins 2 updated Jenkins with the new pipeline workflow, which makes it much easier to do system integration in Jenkins. You should have a look at it; it maybe takes a week or so to get acquainted with, and it helps a lot. It seems to be the standard system for continuous integration, and that is basically key to software quality. For test automation there's a nice project by the Linaro people called LAVA, which stands for Linaro Automated Validation Architecture. It's basically a web-controllable server which connects to all your boards and can run tests on them. You connect those boards to a power switch, a serial-to-LAN converter and optionally some other things, and then LAVA can deploy the images Jenkins built, test them, run your integration tests and your unit tests on those systems, and automatically tell you overnight whether something broke or whether your fix is correct. I'll show a small sketch of driving it from CI below.

For updating, as we heard in the last talk, you want redundant boot, so you can switch between two systems and fall back if you find a problem in the field that you haven't seen in testing. That can always happen, so you need a way back. You can do that with many bootloaders: barebox has an integrated mechanism with algorithms to decide when the system is good and when it has failed, and it's customizable; you can do similar things with U-Boot and GRUB through scripting; and UEFI has a boot order defined through variables, which is available on all current PCs and also on ARM64, at least on the server systems. Then there are many, many projects for software updates and recovery; I think we also have talks about those in the next few days. Here are just some of the ones I've looked at. Choose one of them; that should be enough to get the process working.
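Back to LAVA for a moment: to give an idea of how the CI pieces connect, here is a rough sketch of a step that pushes a freshly built image into LAVA for testing. LAVA exposes an XML-RPC API with a scheduler.submit_job() call; the host, user, token and job file here are placeholders, so check everything against the API of the LAVA version you actually run.

```python
#!/usr/bin/env python3
"""Rough sketch: submit a LAVA test job from CI, e.g. as a Jenkins
post-build step. Host, user, token and the job file name are
placeholders; the /RPC2 endpoint and scheduler.submit_job() are taken
from the LAVA documentation, so verify them for your LAVA version."""
import xmlrpc.client

LAVA_USER = "ci-bot"              # hypothetical LAVA account
LAVA_TOKEN = "s3cret-api-token"   # API token created in the LAVA web UI
LAVA_HOST = "lava.example.com"    # hypothetical LAVA instance

server = xmlrpc.client.ServerProxy(
    f"https://{LAVA_USER}:{LAVA_TOKEN}@{LAVA_HOST}/RPC2")

# Job definition: which board type, which image to deploy (the one the
# CI run just built) and which test suites to execute on it.
with open("boot-and-smoke-test.yaml") as f:
    job_definition = f.read()

job_id = server.scheduler.submit_job(job_definition)
print(f"submitted LAVA job {job_id}")
```

Jenkins then only needs to poll the job status, or receive a notification, to mark the build green or red overnight.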
In addition to the update mechanism on the device itself, you want a central system which updates all your devices. Then you can do what Google does with their deployments: at the beginning you push an update to only 10 systems, check that they all come back and work fine, then 100, then 1000. So if something has slipped through testing, you don't break all your customers' devices at the same time, and those systems do this automatically; I'll show a small sketch of that policy at the end of this part. You can also build something custom, depending on your requirements.

So, in summary: we've seen many approaches fail. We can't ignore the problem; people do, but it will bite us. Ad-hoc fixes for outdated systems work once or twice, but not for 10 or 15 years, because it becomes an unmaintainable mess. We can customize server distributions, but it's always custom and not something you can replicate easily. I hope I've convinced you that if you do it right, plan from the beginning to have such a process, and plan time for the regular updates, it's not that much work. Do the upstreaming, automate your processes, and establish a sustainable workflow in which people are scheduled to do that work every month and every year and have time for it. I think we have no excuses left for leaving systems in the field with outdated, broken software for years. Thank you for your attention. I hope I've inspired some of you; now, any questions or challenges?

So, the first part of the question was how to work with application teams that have problems with the platform changing underneath them. My response is that Linux is our hardware abstraction layer and platform. So when you start developing, choose APIs like GStreamer, OpenGL and so on, which are already stable, are well supported and won't suddenly change significantly, and get involved with those projects. Then you can push back once you have products deployed on those APIs and say: people, you can't change this under me, I depend on it. Those projects will listen to you, and those features, APIs and platforms will keep working. Maybe they will tell you that you need to invest some time, but that's a reasonable amount of work.

The other question, if I understood it right, is that some hardware vendors don't provide software releases based on currently maintained upstream releases. That's true, but I would say you don't want to trust them anyway. We had that with one customer who said: I'll go with the Freescale i.MX6; Freescale has a 4.1 BSP release, and Freescale has probably done enough testing that it will work somehow. Then they went and decided on a Wi-Fi module, and the Wi-Fi module vendor said: we have tested our software with 3.8. Now what do you do? One vendor says "I only support you on 3.8", the other says "I only support you on 4.1", and you're stuck. So you have to decide: you can do some backporting or forward porting, or you get the stuff into mainline. Then you get the testing from other people, you can work together with people who hit the same bugs, and those people will find bugs in your work before they affect your customers. So get that stuff into mainline. It's not that much work, especially if you build a product on something like the i.MX6, and if you don't have graphics it's getting easier as well. You don't have that many changes to make; all the basic stuff works. And if it's just a Wi-Fi module, then there's one specific part of your system that you need to port to mainline. You can do that yourself, or you can hire a consultant. That's the only way something like this will work. Otherwise, you're basically stuck on that version for 10 to 15 years, you know it will be outdated, and it only gets harder to update. At that point, you can decide whether you want something you can fix or whether you can live without it.
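Coming back to the staged rollout: here is a minimal sketch of such a policy, assuming a fleet-management backend that can push an update to a device and report whether it came back healthy. The push() and check_healthy() calls are placeholders for whatever your infrastructure provides, as are the wave sizes and the failure budget.

```python
"""Minimal sketch of a staged (canary) rollout policy. push() and
check_healthy() are placeholders for the fleet backend's API; the
wave sizes and failure budget are policy choices, not fixed rules."""
import math

WAVES = [10, 100, 1000]   # cumulative device counts per stage
FAILURE_BUDGET = 0.02     # halt if more than 2% of a wave stays unhealthy

def rollout(devices, update, push, check_healthy):
    """Push `update` to `devices` in growing waves, halting on failures."""
    done = 0
    for limit in WAVES + [len(devices)]:
        wave = devices[done:limit]
        if not wave:
            break
        for dev in wave:
            push(dev, update)  # e.g. trigger the device's A/B update
        failed = sum(1 for dev in wave if not check_healthy(dev))
        if failed > math.ceil(FAILURE_BUDGET * len(wave)):
            raise RuntimeError(
                f"halting rollout: {failed}/{len(wave)} devices unhealthy")
        done = limit
    return done
```

The exact numbers are policy decisions; the point is that the logic is simple enough to automate, so nobody has to run it by hand.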
More questions? Super long-term support, by whom? I don't know about that one exactly, but there's the LTSI initiative by the Linux Foundation. If you're interested in that, read the discussion on the kernel summit mailing list on that topic. I think Greg Kroah-Hartman said it best in his mail on that list, because one thing that happens is that people choose one of those kernels and then don't update; as I said, you get the worst of both worlds. The intention is good: work together, share a stable base with updates and the features required by all the partners. But what actually seems to happen is that there isn't enough manpower, so they can't keep up with their release schedule. Basically, we get again what happened with the Linux kernel around 2.4 and 2.6, when distributions kept running 2.4 because that was the stable kernel; they modified it, it got worse and worse, they couldn't maintain it anymore, and there was a hard break when switching to 2.6. We also have it with Android, where there's a large fork with many, many changes, and they have big problems maintaining that as well. So I'm not too hopeful that this will solve the problem. At least for systems where you can use mainline, and I think that's most of them, mainline is the more promising approach, because you rely on the larger community which runs Linux on all kinds of systems, not on some specific community which cannot keep up with all the testing and maintenance. It's not a complete solution, but still.

So, the question is how much work it actually is. Let me go back to those slides. Every year, updating such a system, if you don't have too many changes in it, use standard open source components, and your application is based on, say, Qt, GStreamer or Python, so without too much low-level customization like the mobile phone vendors do, I would expect something like two to four weeks. The main effort is testing; that's why I said automated testing. So two to four weeks, I would guess, depending on how well your automated tests work. And then every month something like three to five days, spread over the month, to look at that stuff. But this scales easily across multiple projects if they all use the same base. You can do that with something like Yocto: different layers, all running the same kernel and the same base layers with just some customizations on top, so all the testing effort you spend on one system also improves the quality of the other systems. So maybe a week per month. Incident response depends on what actually happens, but ideally, if it's just another OpenSSL release or something like that, it's probably less than a day: apply it to your BSPs, let the tests run overnight, and since it shouldn't break anything at that level, you can then release it.

The comment was that this will only work on well-supported CPU architectures, and that's right.
My suggestion there is to look at what is well supported now, because people will not throw SoCs like i.MX or OMAP out of the Linux kernel; so many people use and maintain them that they will keep working. And yes, we've had cases where people said, hey, this architecture isn't used by anyone anymore, let's remove it, and some were removed. But you can bring those back by reverting that commit, and if someone is interested in maintaining such an architecture, nobody in mainline is going to stop you. So if you're the last one in the community actually using that hardware, you're stuck with maintaining it yourself, but at least you get review and comments on your fixes for free from the mailing lists, which you don't get if you just do it for yourself, and those people will find bugs in the patches you submit even if the patches happen to work for your use case.

The question was how we convince the managers, customers and maybe non-technical people. I would say that in the long run, this is cheaper. They will say that's just my opinion, but we've seen this pattern repeat in the Linux community many, many times among the people who actually do the maintenance: done this way, it's not too much work. Some customers, it seems, need to fail once before they listen to you, but we probably have time to wait for them.

The question was whether there are papers or statistics on how much impact those vulnerabilities have. There probably are, but I don't have any handy; probably not from the Linux Foundation, but there is definitely research on what security vulnerabilities cost. Sorry, I don't have a reference ready.

The question was how to handle legal requirements for certification, where, for example in the medical or automotive space, the certification authorities are not accustomed to software changing. Years back, you had some microcontroller system to control something; you tested it, you released it, it was not connected to anything, and if it worked once, it would work for 10 years. So the process is adapted to testing one version, releasing it and not changing it, and getting changes certified is very, very expensive. I don't think that model is going to work anymore. Basically, the only recommendation there is to talk to those certification authorities and convince them that we need a process to update the software. We need to do it reliably, obviously, but we cannot stay on an old version which we know is broken, or which we know will be broken in five years. Staying put may satisfy the letter of the certification, but in practice it's not safer. So something there needs to change in the certification process; that's not solved.

The comment was that maybe we want to minimize change. I don't think that's correct, because when we minimize change, we actually move away from all the testing that is happening in the community, and we carry custom changes which nobody else is testing, so the risk is much higher that we introduce problems nobody else will see and fix for us. The focus must instead be on keeping the difference between the version released on the product and a maintained upstream version minimal, because we can expect that problems in the upstream version, at least critical ones, are found and fixed quickly.
And then we need to audit the changes we've made ourselves. There's a project going on at OSADL, the Open Source Automation Development Lab, on certifying Linux for ASIL; they are developing a process to get that through the certification authorities, which might be interesting to look at for those use cases. But it's not solved yet, obviously.

How much time do I have left? Do we still have time? Could you speak up? So the question was what my estimate covers: just rebasing my patches onto the newest Linux kernel, or also submitting some of the work upstream? That depends on the amount of changes you have. If you did a lot of work initially and have maybe just 10, 20, 40 patches left which are not too difficult, then you can do some additional mainlining within those two to four weeks. If you have a lot of complicated patches, you will spend that time on forward porting alone. So there's no general answer to that.

I have some suggestions for further talks which will happen here; the slides are online, so you can check them or take a picture. They cover related topics, updating, continuous integration, updating your kernel and so on, which are all things you will need to do if you want to work this way. Thank you very much.