Hello everyone, welcome. The 7 Year Leap. This is about what happens in a project to develop or maintain a product when the decision is made to update the kernel from what is by now a very old version, several major versions at once, to what was at the time the latest stable release of mainline.
So, my name is Ed Langley, I'm from Koblon Embedded Software Consultancy, and as I say, this talk is based on an actual project. When I said "real world" in the title, I don't just mean it really happened. I mean that in the real world, when you're developing a product, sometimes things don't go so well and devices just won't work. So I've tried to convey some of the actual challenges, because having nice code is great, but when you've got to ship, sometimes you have to make compromises to get it out.
So, for this case study: what were the issues that led to finally upgrading this old kernel? How was upper management sold on the idea? What kind of issues did I run into along the way? And if anyone else here is ever tasked with estimating or implementing a similar exercise, how long did it take in this case? It might be a useful reference for you.
So, if you're updating from 2.6 straight to 4.0-something, that kind of implies the product was running a pretty old kernel by that point. This is not the first time that the topic of dealing with old kernels has been covered at these kinds of conferences, so I've just referenced a few here. Will's talk about coping with an old kernel, fixing up a BSP that had an old kernel in it, and just maintaining software over the long term, and in that particular talk, making sure you have a ready upgrade mechanism in place, which implies you have a newer kernel to upgrade to in the field. So the only thing I would add to the topic of older kernels would be to ask: where are you on the kernel age spectrum, or at least, where is your project?
Because people can find themselves working for a long time on a product or project that has an old kernel in it, and it kind of ends up being their area of expertise. So, this is a scale of kernel age that people might work with. Right over there on the left, people creating the very newest version that hasn't been released yet. Then system-on-chip vendors might take a newish mainline, put it in a public repository, and add support for their chips or platforms. And then you've got new product developments. So if you get a Yocto BSP today, unless you do anything weird with your recipes, you're probably going to get a 4.9-ish kernel come out. And then, going right the way along here, you get to projects that have been in development for a long time or are in maintenance mode, maybe still adding a fix here and there or a feature. That's kind of where the example in this case study sits, about number six there. And this discussion is about moving that project right the way over to more like number three, at which point it's much easier to take any fixes that are helpful for other people and feed them right in at the left, i.e. upstream.
So, in this example, the project was a handset for specific users, running Android. The user doesn't know it runs Android; it just boots into a custom app specific to the purpose of that device. And to the customer's credit, they had used a hardware design that very closely followed the BeagleBoard-xM. Anyone remember that, from several years ago now? That's based on a Texas Instruments OMAP3, which got rebranded to DaVinci. The only major difference was that they were driving the display through what Texas Instruments calls the Remote Frame Buffer Interface, which is actually just MIPI DBI.
So, there were already a few problems that got solved before upgrading, just living with the older kernel.
And the major one was that the real-time clock alarm needed to wake up the handset after a fixed duration so it could do a GSM sync. And yeah, that took a series of commits backported from 3.something and a lot of fixing up, because it was quite heavily integrated into the platform code for that chip.
And then the project started running into issues that were not so easily solved. Well, that one wasn't easily solved either, but a series of issues that altogether was not an easy proposition to solve. The main required feature was battery life. The suspend current draw needed to be a third of what it was, and in order to get that, a lot of power management code from later kernels would have needed to be backported. Again, that's very heavily integrated into the platform code for the chip.
The SD card: after boot you take it out, you put it back in, nothing happens. At first I thought this sounded like just a pinmuxing issue on the card detect line. Run the function tracer, and I can see the interrupt firing and going off into the OMAP hwmod framework. If you've worked with an OMAP on an older kernel, there's a whole ream of platform code in there that deals with OMAPs and plugging all the on-chip peripherals together. And it soon transpired that that was, again, going to be a lot of very complex backporting to get working in 2.6.
And then the big problem was what became known amongst everyone as the suspend lock-up. So, very briefly: it goes into suspend, it's coming out of suspend, and it gets stuck somewhere. It's very intermittent, though I eventually realised how to reproduce it more reliably. Even just last week, rehearsing for this, I came up with some further ideas that I could go back and try, to solve it again.
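A soak loop of the kind described later in the talk (sleep for 10 seconds, wake, repeat for around 2,000 cycles) might look like the sketch below. It assumes the `rtcwake` utility (util-linux or BusyBox) is present on the target and that the RTC alarm can wake the board. The function name, the injectable `run` and `sleep` parameters, and the 10-second awake interval are illustrative assumptions, not the actual script used on the project.

```python
import subprocess
import time
from typing import Callable, Sequence


def _run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(list(cmd), check=True)


def soak_suspend(
    cycles: int = 2000,
    asleep_s: int = 10,
    awake_s: float = 10,
    run: Callable[[Sequence[str]], None] = _run_checked,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Suspend and resume repeatedly; return the number of completed cycles.

    If the device hangs on resume, this function never returns, so the
    cycle number is printed before each suspend and the serial log shows
    how far the run got.
    """
    completed = 0
    for cycle in range(1, cycles + 1):
        print(f"suspend cycle {cycle}/{cycles}", flush=True)
        # rtcwake arms the RTC alarm, suspends to RAM, and returns after resume.
        run(["rtcwake", "-m", "mem", "-s", str(asleep_s)])
        completed += 1
        # Stay awake for a while so drivers finish resuming before the next cycle.
        sleep(awake_s)
    return completed
```

On the target this would be started with something like `soak_suspend()` and left running overnight. Because `run` and `sleep` are injectable, the loop logic can also be exercised on a development machine without suspending anything.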
But as a consultant, you feel obliged to offer good advice to your clients, and at this point I said to them: I could keep spending more time investigating all these issues, but what are you going to have at the end of it? If we solve the lock-up, you're still going to have too high a current draw, and your handset battery life is too low. The upgrade had already been raised as a suggestion before, then kicked around and murmured about, and gone away again. So I suggested it.
And if you ever suggest anything to management at work, the typical kind of questions that come back are along the lines of: well, how long is this going to take, Ed? How much is it going to cost us? It's very difficult to estimate an entire kernel port. So the tactic in this case was to estimate how long it's going to take to get a new kernel running to the point where you can establish whether it has fixed the issues, and then we'll carry on with adding the rest of the peripheral support for your custom platform. That's how I got the buy-in in this example. So that's the lesson: if you want to sell the idea, you only need to provide the estimate, and the sell, for what you need to reach the understanding that this is a goer and we can carry on.
So if you are thinking about the same activity, depending on the system-on-chip that you have, getting to the point where you have a kernel running and can do some basic testing might not be so easy. In this case, there's a lot of platform support for the OMAP already upstreamed into mainline by now, and it has been settled there for some years.
So, the plan: just ignore Android for now. Build a basic root filesystem the easiest way possible. Use that to boot the kernel and do some command-line testing. Now, with a much newer kernel, you need a newer toolchain, but if you try to build an old Android source tree with a newer toolchain, you get all kinds of problems just pouring out.
So the technique I used for that later was a newer toolchain for the kernel and an older toolchain for Android, so it doesn't really matter.
Now, I was like, OK, I'm going to go get a kernel. I'll just Google: what's the best way to build a kernel for a BeagleBoard-xM? Well, not in so many words, but I got sidetracked by various GitHub projects: here's a Buildroot that's optimised for BeagleBoard, or here's a BeagleBoard kernel in somebody's repository. In hindsight, the best thing to do is just go and get the mainline kernel and establish which config file you need and which DTB you need. They're all there in the mainline tree, and it's much simpler if you just use those and get going on a mainline kernel.
So if you get a newer kernel, and I was talking about this with someone last night, you do need a newer U-Boot, right? Obviously for the DTB loading. You could get by without having to load a DTB, but why would you want to? It's much simpler to just upgrade them as a pair. The consideration I had early on was that I knew they would want to be able to insert an SD card and boot from it. When the on-chip bootloader recognises there's an SD card there... well, unless you open the case and press the user button, it will boot from NAND if there's a valid MLO in there, if you've ever worked with TI chips. So if the MLO sees an SD card, it will boot the U-Boot from that, so they need to be backwards and forwards compatible. That was just a case of making sure the newer U-Boot could be loaded with the same file name to the same memory address, and I made sure to establish that early on so it didn't come back to bite me later.
So after a time, I got a new kernel up and running on this handset, with a debug serial cable just hanging out of the thing. That gave me a command line, and I was quite impressed with how much just worked, actually, because they had followed the reference design for the BeagleBoard.
You could press the power button, it went into the PMIC, and the thing just went to sleep. I looked at the power supply and, glorious, the current draw from the battery was spot on the target figure, so it was like, ker-ching. The suspend lock-up looked like it had gone. The RTC alarm was still there, from after I had backported it to the past, and you could insert an SD card any time you liked. Brilliant, welcome to 2018.
So, as I say, I went through all those issues and checked them off. For the suspend lock-up, I just knocked up a script that goes to sleep for 10 seconds, wakes up, and repeats, and left that running all night. 2,000 cycles later, and having then tried all the methods that used to make it happen more regularly, it didn't happen any more. And I ran that every time I added something new to the build, just to make sure it didn't regress, in case it was actually a driver that was causing it on this platform.
So then we come to adding the peripherals back in, because at that moment it was just a device with a serial port. You need to be able to pick it up, have a display, a backlight driver, and so on. This is where the seven-year leap idea comes into it, because if you've been working on 2.6 for ages, it feels a bit like jumping into a time machine and arriving in the present, where everything's mostly the same but there are a few tricky differences that can catch you out. So have a good look around at what's there before you just start blindly forward-porting all your drivers in a haphazard manner.
This is an example. The backlight had been done by backporting the Pandora backlight driver to 2.6, renaming it, and changing the PWM channel. So I was in the process of forward-porting that, and after about 10 minutes I thought, isn't there a better way of doing this now than adding a whole new C file to change one value in there? And it turns out there is...
you just specify in your DTS that you want to use the generic pulse-width modulation backlight, and you give it a PWM instance, which is the one off the PMIC on here. Where you see PWM1, you just change that number for the channel you want. So the lesson learned is: don't rush to forward-port customisations from your old kernel. Have a good look around and make sure there's not a more elegant way to do it in the much newer version that you're going to.
On that topic, it's a really good idea to learn how to use device tree properly if you've been working on 2.6 for three years and you actually want to have a play with a modern kernel for once. I see a lot of people copying and pasting stuff out of similar DTSs, and it's a really good idea to understand how to go through the documentation and see exactly what attributes are available to you, and not just rely on, well, this looks like the right thing, but I don't really know how to customise it properly.
So that was an example of "don't blindly forward-port". This next one, it turns out the best solution was to just forward-port the changes. So the lesson there was to take everything on a case-by-case basis. Sometimes it's easier to just take the customisations for a platform and move them forward.
So, the display, and we come back to that RFBI display interface. This is where I'd said, yeah, I think we'll be adding the display and I'll get you the next build next week. After going to turn it on in menuconfig and not being able to find it, and then establishing that it was marked as broken and it didn't build, I was like, I'm not going to have that ready for the next build after all. So it was basically a case of working through fixing up the RFBI driver to work again in 4.14, because I think it had just fallen into disuse. Not many people use that display output on that platform.
And looking through the commit history, I did see that actually all the changes are now supposed to be going on in the Direct Rendering Manager, where a lot of the OMAP display drivers have been copied over to a new place. And that's where I should have been working on it. But again, in the real world, the best way to do this in a timely manner was just to fix it up where the driver had been left behind, as is. It would be nice to upstream that change, but in order to do that I will probably have to move it to the right place before it will get accepted, because they'll say, why are you changing it over there? That's old. There's no active development going on there.
And also, again, looking around the source tree to make sure you're not reinventing the wheel: I was like, well, I've got an HX83-something-something, which outputs to an ILI98-something. Oh, there are some drivers in the staging tree here for chips that sound like that. And looking through the datasheets, yes, some of those chips do have MIPI DBI as one of their interfaces. But all of those drivers are kind of intended for SPI decoders hanging off smaller micros, and the way they're all plumbed together in software is a world away from how an OMAP goes through its display subsystem to do its graphics. So I thought, okay, I don't feel like I'm reproducing anything major here. But the point is to just make sure there aren't already some drivers hiding somewhere in the tree that you could use.
So, yes, once I got it building and the driver was reporting it was happy, I was expecting something to come up on the display, and there was nothing. So I went through the usual troubleshooting approach. As you can see, I was really concerned about those clocks; I came back to those three times. To be fair, the second time I did find some of them weren't actually turned on.
At which point, yes, it was counting down the pixels remaining to be output to zero and raising an interrupt. Eventually I found it, looking through the schematics again, which is a popular hobby for someone who can't get their driver working. Now, I should note that on a real schematic they don't put it all on the same page. I'm sure you've seen this, right? You probably look at them every day as well. So when you've got a level shifter there and it's got two voltage rails coming out of it, you know, if you're busy, it's easy to assume they're just fixed power rails, right? And when you come back again for maybe the eighth time, you realise, oh, actually, they go into the PMIC and you need to turn them on. And you can see here some 2.6-era platform code, everyone's favourite kind of code, right? There it is: at boot time, that would turn on the regulators in the PMIC.
So the point is, a really good thing to do when you embark on forward-porting a long way in terms of kernel versions is to go through all the platform code that's custom to your device, audit it, and make sure it is all reproduced correctly in device tree source. It's like doing a code review: that's more efficient than writing the bad code and debugging it afterwards. It's better to make sure you've turned everything on properly than to wait for it to not work.
So then it came to putting Android back on. And, yeah, it's not quite as simple as taking the test root filesystem away, putting the Android images back, and it just boots. To start with, I hadn't done anything with the GPU yet, so I just told Android to render in software, which I expected would be a bit slow. And then when I did put the Android images back, they didn't just boot. You've got to unpack your RAM disk and keep adding more utilities in there, eventually putting a whole shell in there, so I could boot into the Android RAM disk and try to manually chivvy it into life.
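For the platform-code audit recommended above, one quick cross-check on the running target is to compare the rails the old board file switched on against what the new kernel reports in sysfs. The sketch below assumes the standard `/sys/class/regulator` layout, with `name` and `state` attributes per regulator. The function names are illustrative, and the rail names in the usage note are hypothetical placeholders, not the rails from this handset.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Standard sysfs location for the regulator class (assumed to be available).
SYSFS_REGULATORS = Path("/sys/class/regulator")


def read_regulator_states(root: Path = SYSFS_REGULATORS) -> Dict[str, str]:
    """Map each regulator's name to its reported state.

    Regulators that expose no 'state' attribute are recorded as "unknown".
    """
    states: Dict[str, str] = {}
    for entry in sorted(root.glob("regulator.*")):
        name_file = entry / "name"
        if not name_file.is_file():
            continue
        name = name_file.read_text().strip()
        state_file = entry / "state"
        if state_file.is_file():
            states[name] = state_file.read_text().strip()
        else:
            states[name] = "unknown"
    return states


def rails_not_enabled(expected: Iterable[str], states: Dict[str, str]) -> List[str]:
    """Return the expected rails that are missing or not reported as enabled."""
    return [rail for rail in expected if states.get(rail) != "enabled"]
```

With a list of rails taken from the old board file, for example `rails_not_enabled(["VAUX2", "VPLL2"], read_regulator_states())` (placeholder names), an empty result suggests every rail the 2.6 platform code turned on is also on under the new kernel. Anything listed is worth checking against the device tree source.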
And I slowly established, yeah, the kernel is missing various things that Android needs, obviously, in hindsight. So for the first two, that was just a case of forward-porting those as is from the latest kernel where they were present. They'd been taken out because Android had become better in later releases at just working on a more vanilla kernel.
Android Debug Bridge wasn't working, and this is an example of, again, don't blindly forward-port. I started by taking the latest available version of the Android composite USB gadget driver and trying to force that into the new kernel. It turned out it wasn't going to be an easy job. And I thought, well, how do the newer releases of Android manage without this driver? After reading up, I found that, actually, they use the standard USB gadget ConfigFS framework to create a composite USB device on the fly. So when did that get added to Android? It happened to be in the release that I was using. So it was just a simple case of telling it to do that instead of using the old driver, and everything was okay with a bit of system configuration. So again, take everything on a case-by-case basis and establish what the path of least resistance is. Sometimes it is easier to forward-port; other times it's better to do it a much newer way.
So it got to the point where the test team had an Android device that they could test with the same test cases that they were using on the old kernel, and the app team had something they could plug into Android Studio, load their app onto, and run it. And if you're wondering: if you upgrade an Android KitKat device to a 4.x kernel, it doesn't just print the kernel version there. There's a regex in some Java somewhere which doesn't work with that version, so I had to go and manually alter that, because I'll be damned if I'm going to update the kernel on this thing and it's not going to print it in my About Device screen.
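For reference, the ConfigFS route to ADB mentioned above comes down to creating a few directories and writing a few attributes under the USB gadget ConfigFS tree. The sketch below only illustrates that kernel interface; on a real device Android's own init configuration performs these steps, and the talk does not show the exact configuration used. It assumes ConfigFS is mounted (commonly at `/sys/kernel/config`, or `/config` on Android) and that ADB is exposed through a FunctionFS function named `ffs.adb`. The function name, gadget name, IDs and serial number are placeholders.

```python
from pathlib import Path
from typing import Optional, Union


def setup_adb_gadget(
    configfs_root: Union[str, Path],
    vendor_id: str,
    product_id: str,
    serial: str,
    udc: Optional[str] = None,
    name: str = "g1",
) -> Path:
    """Describe a USB gadget with one ADB (FunctionFS) function via ConfigFS.

    Returns the gadget directory. If 'udc' is given, the gadget is bound to
    that USB device controller as the final step.
    """
    gadget = Path(configfs_root) / "usb_gadget" / name
    strings = gadget / "strings" / "0x409"   # 0x409 = US English strings
    config = gadget / "configs" / "c.1"
    function = gadget / "functions" / "ffs.adb"

    # On real ConfigFS, creating the gadget directory populates its attributes.
    for directory in (strings, config, function):
        directory.mkdir(parents=True, exist_ok=True)

    (gadget / "idVendor").write_text(vendor_id)
    (gadget / "idProduct").write_text(product_id)
    (strings / "serialnumber").write_text(serial)

    # Linking the function into the configuration makes it part of the device.
    link = config / "ffs.adb"
    if not link.is_symlink():
        link.symlink_to(function)

    # Binding must come last; for FunctionFS the adb daemon has to have
    # written its descriptors to the mounted FunctionFS instance first.
    if udc is not None:
        (gadget / "UDC").write_text(udc)
    return gadget
```

Pointing `configfs_root` at a scratch directory makes it possible to inspect the resulting layout on a development machine. On a target it would be the ConfigFS mount point, and `udc` would be the controller name listed under `/sys/class/udc`.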
So the feedback from the test team initially, apart from "hooray, well done", was: it's a bit slow. And I was like, yeah, that'll be that GPU then. So I didn't quite get the break I was hoping for, like with the rest. So this is a graph from atrace, which is an Android version of system trace. It basically takes what comes out of the kernel function tracer and makes a nice chart for you. The solid bars at the bottom are the graphics taking all of the time before the next display update to fill the frame. And on that platform, at least, that would slow the whole system down, because everything seemed to be timed off the display update.
So I'll just summarise re-enabling the GPU support. There is a binary library and daemon in user space and a kernel glue module. Lots of versioning compatibility issues, and it's just a big pain. I think that's all anyone wants to know about binary-only GPUs in Linux, right?
Now, what makes a product a product and not just a board with a few drivers on it? It's all the stuff in between, right? So if you work at the product end of Linux on devices, you'll be familiar with system integration, which is getting it ready for a member of the general public to use, who doesn't want to plug a debug serial cable into the thing. So, yeah, I played around with the NAND to get that more optimally set up, and had some trouble with the GPIO numbering randomly changing every time you boot the device. And the ADB and the GPU took a few more goes around, two more goes each, to get them working properly, with Windows in the case of the ADB stuff.
So, as I said, for anyone who wants to, or who gets tasked with, doing something similar themselves, it might be useful to have some figures about how long this example took as a reference. I appreciate it might be hard to read those numbers, but the Gantt...
You can see from the Gantt chart there are 20 weeks shown there. The longest bar at the top was that RFBI driver, and the longest bits at the bottom were the GPU stuff once, and then again when the vsync interrupt wasn't firing properly. So I'll just leave these up for a few seconds. Now, in reality, because of the scheduling of this work, it was done on a part-time basis, so what I did was take the hours and re-plot them on that chart as if they were done as eight-hour workdays. So it was 20 weeks based on a 40-hour week. It was 94 days, well, about 800 hours of work in total.
So, yeah, just to summarise again the main lessons learned during this particular example. If you want to broach the subject of "come on, guys, let's update this kernel already", it's a much easier sell if you say, well, let's just establish that it's going to fix what we need first, and estimate on that basis, and then they can decide from there. Take each driver or piece of platform customisation on a case-by-case basis. Work out which things are easier to just forward-port, like the GPU drivers in the kernel in this case; what to leave behind, like that Android composite USB gadget and the backlight driver; and what to extend, like the new implementation of the backlight, as an example. Go through all your platform data at the beginning and plan out re-implementing it properly in your device tree.
Now, this is more of a personal lesson for me because, like I said, there are some changes that I would have liked to upstream out of this. It's much easier, and I'm sure many of you know this, to do it while you're doing the work. Don't let the project gather dust and then come back and think, oh, wouldn't it be great to upstream some stuff ahead of this conference? You've got to get it all out again, and you actually have to make some changes to make it acceptable upstream. So that's kind of planned for the future. And, yeah, every project has different challenges.
In this case, getting a new kernel on there was actually relatively easy, because a lot of the platform support was already in the tree. Your mileage might vary depending on your system-on-chip.
Okay, I think we're quite in time. Unexpectedly. So does anyone have any questions? Yes. It looks like I'm doing my own mic running. Come on, guys.
Thank you for the nice talk. I didn't get how you handled the Android-specific kernel features, like ashmem, while staying with the old Android user space. How did you handle that?
Oh, well, for ashmem, I literally just took the ashmem.c file and copied it into the new kernel tree. I think there were maybe a few little bits, five lines of changes, just to get it to work. Thankfully, that's quite a self-contained driver, so it really just worked as is across the kernel versions. So that was that one. When I rehearsed this talk, the feedback was, well, we're not really interested in Android, so just talk about generic kernel stuff. So I tried to gloss over the Android stuff a bit to save time. But come and talk to me afterwards if you want to hear more about running old Android on a new kernel, because I've seen a few mentions of it online, but I don't think many people are doing it. But the thing is, you can update the kernel on a device and, as I'm sure many of you know, it doesn't really impact system performance. If you update Android from KitKat to, say, a newer release, your device is going to slow to a crawl. It's just not going to happen. That's why I stuck with the older KitKat Android release.
All right, thank you. You anticipated all my questions.
Anyone else got any questions? No? Okay, well, if you just feel shy, come on up to the front and ask me, or send me an email. Get in touch. Okay, thank you very much.