All right, we probably better get started. So my name's Joshua Watt, and I'm here to talk to you today about sweetening your Yocto build times using Icecream. A little bit about myself: I've been working for Garmin for the past 10 years down in Kansas City, and we've been using Yocto to do embedded Linux for about the past four years. I took an early interest in making our builds faster, so we've been using Icecream for most of that time. There are my email addresses if you'd like to contact me.

Here's a brief outline of what I'm going to go over: what Icecream is, why you should use Icecream, how to use Icecream, how to maximize performance when you're using Icecream, and then things we can do in the future with Icecream.

First of all, what is Icecream? Icecream is a distributed compiler very similar to distcc. However, unlike distcc, it uses a single scheduler to dispatch jobs between the various nodes. The advantage of a centralized scheduler is that it can quickly make selections about which nodes should compile given jobs. It's also able to easily distribute the jobs across the entire cluster to prevent nodes from getting too overloaded. Additionally, it's able to factor in information about all of the nodes, such as CPU usage and memory usage, when making scheduling determinations. Cluster administration is also easier, because there's a centralized scheduler you can talk to instead of having to go out and find all the individual nodes yourself.

I'm going to give you a brief overview of how Icecream works. In this example, we have two nodes, node A and node B. Icecream itself has two components. There is the Icecream client compiler shim, shown here, that is invoked in place of GCC when you want to use Icecream to compile your source code. There is also the Icecream daemon, iceccd, that runs here, talks to the scheduler, and compiles source code on behalf of other clients. The Icecream client shim will directly invoke GCC in a number of cases. The first is when it determines that a compile should happen locally: it will simply pass the compile off to GCC wholesale, let it compile, and wait for the result. The second case in which it uses GCC is when it wants to preprocess the source code for remote compiling. It does this so that there's a single self-contained source file that it can send over to be compiled remotely. When the Icecream client shim determines that a compile should happen remotely, it will first talk to the daemon running on the local host and ask it to select a node for compiling. The daemon will in turn talk to the scheduler. The scheduler will look at all the nodes on the system, select one (in our example, node B), and report that back to the daemon. The daemon will in turn report that back to the shim, which will then connect directly to the node B daemon. If the node B daemon does not already have one, the client will send over a toolchain to be used when compiling source code on behalf of node A. Finally, it will send over the source code to be compiled using that toolchain and wait for the result to come back. One of the advantages of this setup is that the client shim only has to know how to talk to the daemons. It's not actually aware of the scheduler; it only speaks a protocol with the daemons.
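To make the shim idea concrete, here is a minimal sketch of how the client shim is typically placed in front of GCC outside of BitBake. The wrapper directory path varies by distribution, so the path shown is an assumption (Debian/Ubuntu layout):

    # Minimal sketch; /usr/lib/icecc/bin is a distro-dependent assumption.
    # Putting the icecc compiler wrappers first in PATH makes "gcc"/"g++"
    # resolve to the icecc shim, which forwards eligible jobs to the local iceccd.
    export PATH=/usr/lib/icecc/bin:$PATH

    # High parallelism pays off because most compiles run on remote nodes.
    make -j32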
It's also important to note that your compiles do not flow through your local daemon or through the scheduler; they go straight from the client shim to the remote daemon.

So why should you use Icecream? To try to answer this question, I've done some performance analysis using the bitbake command shown here. I analyzed the results using the buildstats tools from OE-Core, for example buildstats-diff, and then a couple of custom scripts that I wrote to do statistical analysis. The reason for splitting up the command like this is that the do_fetch and do_unpack tasks can be highly variable due to caching and your internet connection speed, so pre-running them like this and only analyzing the results from the second invocation of bitbake helps remove them as noise in the output. If you would like to try these tests on your own cluster, you can download the scripts at that GitHub link.

Here's the environment I used for testing: it's the cluster that we use every day at Garmin, with about 21 compile nodes and a total job capacity of about 184. And there's my test machine; it's not particularly spectacular.

This first chart is the buildstats-diff output showing the CPU time for the various tasks in a single example run. The CPU time is the amount of time that the task actually executed instructions on the CPU, meaning it does not include time the task spent waiting on I/O or suspended while other tasks ran. The "CPU time 2" column on the far right here shows the amount of time that each task took with Icecream enabled, and the "CPU time 1" column next to it shows the amount of time the task took with Icecream disabled. The absolute and relative diff columns show the difference between them, with negative numbers meaning that Icecream was faster. As one might expect, shipping off most of your compiles to be done remotely drastically reduces the amount of time spent on the local CPU; you can see that some of these tasks have up to 90% reductions in CPU time. The tasks that seem to benefit the most are those with a high ratio of compiling to preprocessing time. For example, the kernel, QEMU, and most C++ code spend far more time compiling than they do preprocessing.

Not everything gets a net benefit, though. As you can see down here at the bottom, there are some tasks that got longer with Icecream enabled. Some of these are do_configure tasks, where the slowdown is the result of do_configure executing a lot of small test programs; small test programs don't play to Icecream's strengths because they don't have a very high ratio of compiling to preprocessing time. Additionally, sometimes source code can't be compiled remotely, but Icecream doesn't realize this until after it's tried, so it has to give up and redo the compile locally, and that can also inflate those numbers. Overall, though, you can see down here at the bottom that there is a 40% reduction in overall CPU time, down from about eight and a half hours to about five. Note that this time is not the same as the total elapsed time, because tasks execute in parallel.

For a more rigorous analysis, I averaged the total CPU time over 15 builds and divided it up per task. Icecream directly affects four tasks: do_compile, do_compile_kernel_modules, do_configure, and do_install. All other tasks that were executed are in the "other" column, and the "overall" column shows the total CPU time for the build, which is basically the sum of the other five columns.
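The exact bitbake invocation was on the slide; as a rough reconstruction of the approach, assuming the standard OE-Core buildstats tooling and a placeholder image name, it looks something like this:

    # Sketch only; the image name and buildstats directory names are placeholders.
    # Per-task statistics are collected by adding this to conf/local.conf:
    #   INHERIT += "buildstats"

    # Pre-run the network-sensitive tasks so caching and connection speed
    # don't skew the timed run.
    bitbake --runall=fetch core-image-minimal

    # Timed run (repeated with Icecream enabled and with it disabled).
    bitbake core-image-minimal

    # Compare the buildstats from the two timed runs.
    buildstats-diff tmp/buildstats/<run-without-icecc> tmp/buildstats/<run-with-icecc>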
The difference between having Icecream disabled and having Icecream enabled was shown to be statistically significant for all of these changes, meaning we can be confident that the changes were actually caused by Icecream. To try to understand this better, I've normalized the results from the previous graph, showing the percent change for each of the tasks. This is just like the buildstats-diff relative diff column, where negative numbers mean that Icecream was faster. As you can see, all of the tasks accelerated by Icecream had at least some improvement. All other tasks in the system had a very minor 1% increase in total CPU time, which is actually good, because it matches the expectation that tasks not directly accelerated by Icecream should have no change, since Icecream isn't helping them. It's a good indication that we have correctly partitioned our results. Overall, there was a 44% reduction in CPU time.

We can do the same thing for the wall time. The wall time measures the total amount of time that the task executed, which means it includes the time the task spent waiting on I/O and the time it was suspended while other tasks ran. Due to non-determinism in the builds from parallelism and I/O delays, there's a lot more variance in these numbers from run to run. Generally, however, tasks that have a reduction in CPU time have a reduction in wall time. There are a few downsides down here at the bottom again. Some of these are due to the increase in CPU time we saw on the CPU time slide. Others are due to a bug in Icecream that I have not quite tracked down, where sometimes jobs get sent off to be compiled remotely and get lost somewhere, and then a really long timeout has to elapse before Icecream figures that out and recompiles locally. Overall, this build had a 30% reduction in wall time. This is less than the CPU time reduction, which is to be somewhat expected, since Icecream does have some I/O overhead.

It is interesting to see, however, that there are some tasks here not directly accelerated by Icecream that are faster, and substantially so. The question is, can we attribute this to Icecream? Does Icecream make the CPU-bound tasks that it accelerates I/O-bound enough that other tasks can run while they're waiting on their I/O? If that were the case, we would expect to see a reduction in wall time for the other tasks while their change in CPU time remained close to zero. And if you recall, the change in CPU time was very close to zero. So did the wall time change? To try to answer that question, we can again look at the average total wall time over those 15 builds. Again here, you can see that the do_compile, do_compile_kernel_modules, do_configure, and overall differences are statistically significant. However, there's too much variance in do_install and the other tasks, which means we can't say whether any difference in them was the result of Icecream.

I've again normalized the percent change. The first thing to note is that the error bars are much larger, because there is a lot more variance in the wall time. There's generally less improvement in wall time than in CPU time, because Icecream does have some I/O overhead, but there still does appear to be a net benefit. Because of the lack of significance for the other tasks, we can't really say whether their numbers are the result of Icecream; with more tests we might be able to say so decisively. We can also look at the average elapsed build time and average CPU usage over those 15 builds.
For the average elapsed build time, there was a reduction of approximately 1,100 seconds, a 20% change. The CPU usage had a 22% reduction. It's important to note that this is the straight difference between the two columns and not the percent change, because I think it's weird to take the percent change of something that's already a percentage. It's also important to note that the CPU usage is measured over the elapsed time, so the 22% reduction in CPU usage is actually a much larger reduction in total CPU cycles, because it's being measured over a shorter period of time. If you do some rough calculations with these numbers, you can estimate that there was about a 49% reduction in CPU cycles, which fits fairly closely with the 44% reduction in CPU time we saw earlier.

So why should you use Icecream? In our example, we saw a 20% reduction in build time. That isn't earth-shattering by any means, but as one of my coworkers says, 20% is 20%. More importantly, however, I find that it's a lot more pleasant to do builds when Icecream is enabled because of the reduction in overall CPU usage; my computer doesn't just lock up for hours on end when I'm doing a build. Additionally, many recipes do much better than the average, such as the kernel, and if you're doing iterative development, rebuilds of those individual recipes can be much faster. Finally, it's free, and OpenEmbedded does most of the hard work for you in setting up a toolchain.

So how do you use Icecream to accelerate your builds? To enable Icecream, you add the following two lines to your local.conf. The icecc class replaces GCC with the Icecream client shim for the do_configure, do_compile, do_compile_kernel_modules, and do_install tasks, as we saw earlier. ICECC_PARALLEL_MAKE controls how parallel builds are when Icecream is enabled; I generally find it's best to set it to three to four times the number of CPU cores you have. This variable is analogous to PARALLEL_MAKE when you're not using Icecream. It's obviously not quite this easy. You do have to have Icecream installed on your host, and you do have to be part of a properly configured cluster. There are also a couple of other caveats that I will address.

As a side note, enabling Icecream for your builds also enables it for your traditional SDK. When you install a traditional SDK that was built with Icecream enabled, it will try to detect Icecream on the host at install time. If it finds it, it will configure the SDK to use Icecream in place of GCC and automatically create a toolchain for it to use.

There are a few things you should be aware of if you try to combine Icecream and sstate. The first is that you should always get sstate working first. Sstate will give you a much better performance improvement than Icecream ever will, because sstate allows you to skip entire recipes and tasks instead of just accelerating a few tasks. The second thing to be aware of is that you can combine Icecream and sstate, but icecc.bbclass changes the task hashes for the tasks that it accelerates. As such, you can't share sstate between a host that has icecc.bbclass enabled and one that does not. The solution I found to this problem is to always inherit icecc.bbclass, in either your distro configuration or your local.conf, and then use the ICECC_DISABLED variable to control whether Icecream is actually used. The ICECC_DISABLED variable does not change the task hashes and therefore allows you to share sstate between all the hosts.
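As a concrete sketch of those local.conf lines and the sstate-friendly arrangement described above; the -j value is only an illustration, pick roughly three to four times your core count:

    # In conf/local.conf: enable the icecc class and raise parallelism.
    INHERIT += "icecc"
    ICECC_PARALLEL_MAKE = "-j 32"

    # Sstate-friendly variant: inherit the class everywhere (for example in the
    # distro configuration) and toggle it per host, since ICECC_DISABLED does not
    # change the task hashes.
    #   ICECC_DISABLED = "1"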
So how do you maximize the performance of your cluster? As you've seen, some recipes don't build well with Icecream, and some recipes don't actually build at all with Icecream. You can blacklist these recipes using the user package blacklist variable, ICECC_USER_PACKAGE_BL. There is a shared system blacklist in icecc.bbclass, but it doesn't have very much in it and it needs to be managed better, so for now you just need to add this in your local.conf to get everything to build (there's a sketch of what that looks like after this section). Additionally, in my example tests I didn't blacklist any recipes that weren't performing well, so one of the better ways to improve performance would be to blacklist some of those recipes that didn't build well.

Network performance is crucial to getting good results out of Icecream; you need a fast, low-latency network. In the cluster I ran our tests on, all of the nodes have gigabit links between each other and they're all on a single subnet. I don't recommend using anything less than 100 megabits, and even that's pretty slow when transferring the toolchains. I also don't recommend Wi-Fi, as latency and drops cause a lot of problems.

It's also important to try to keep up to date with the upstream version of Icecream. Newer versions of GCC generally tend to require newer versions of Icecream, and OpenEmbedded adopts new versions of GCC fairly quickly. One of the problems with this is that Icecream generally only releases about once a year, and it takes additional time for those releases to get into the latest version of your distro, if you're even updating to the latest version of your distro. Fortunately, however, the Icecream client shim uses a stable and backwards-compatible protocol to talk to the daemons. This means it can pretty much be updated independently of the daemons and it still works. At Garmin, we have a fairly ingenious method for doing this: we're doing all of our builds in a Docker container anyway, so we've included a patched client shim in that Docker container, and it talks to the host daemon running outside of the container. This has worked very well for us on a wide variety of distros.

I also highly recommend using a dedicated scheduler. You can set up your cluster such that all the nodes run the scheduler daemon and they'll automatically elect one from among themselves, but this means your scheduler can change or disappear without much notice, or can end up on someone's laptop on Wi-Fi, with bad results. It's fairly easy to install the scheduler on a box, throw it in the corner, and forget about it. We don't even have the scheduler participate as a compile node on our cluster, because we don't need the extra capacity. The scheduler also uses a backwards-compatible protocol to talk to the daemons, so it's fairly easy to keep a dedicated scheduler up to date with the latest version of Icecream.

You also need to be careful about who's on your cluster. Virtual machines in particular tend to be bad cluster citizens. They constantly overestimate how good they are at compiling, probably because they can't see the host CPU or memory usage, and they often look completely idle, which means the scheduler really likes to pick them. But it turns out they actually compile really slowly. At Garmin we either blacklist the virtual machines or, if I'm feeling generous that day, mark them as "no remote", which means they can send out compile jobs but not receive them.

Icecream has the ability to remotely preprocess source code for even more performance. This uses a feature of GCC called -fdirectives-only.
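A hedged example of the user blacklist mentioned above; the variable comes from icecc.bbclass, but the recipe names here are purely illustrative placeholders rather than the actual list from the slide:

    # In conf/local.conf: keep recipes that fail or perform poorly with Icecream local.
    # The recipe names below are placeholders, not a recommended list.
    ICECC_USER_PACKAGE_BL += "some-broken-recipe some-slow-recipe"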
-fdirectives-only preprocesses the file just enough to produce a single self-contained source file, meaning it expands #includes and a few other things, but it doesn't expand all the macros; that's done remotely. It does have some issues, though, and it doesn't work for all recipes, so we haven't been able to enable it in the general case in OE. If you're interested in looking at some of these issues, I have a link there.

So what's next with Icecream? I think there are a couple of things we could do to make the Icecream experience better. One would be to try to build the Icecream client shim in OE-Core itself, using an icecc-native recipe or something like that. This would have a few issues, the primary one being that icecc-native would have quite a few dependencies, and those dependencies themselves wouldn't be able to be built with Icecream. Icecream can also support Clang, but I didn't add support for it. It also works with ccache, but again, I didn't add support for that; the trick there is to make sure that ccache runs first, so that it can check its local cache before passing the compile off to Icecream. I think it would be really interesting to gather more data from other clusters and compare. For example, some of the improvements I saw to do_install were fairly marginal, and other people on other clusters may not see any improvement, or it may actually make do_install tasks worse, in which case maybe we shouldn't run Icecream for do_install. If you're really interested, you might be able to fix up some of the problems with GCC's -fdirectives-only support to enable remote preprocessing. And finally, as I said, Icecream is supported in the traditional SDK, but as far as I know it doesn't work in the extensible SDK, so you could add that if you're interested.

In conclusion, I've shown that Icecream is a distributed compiler and how it works. I've shown that you can use Icecream to accelerate your builds, what kind of acceleration you might expect, and how to measure any improvement. I've also shown you ways to contribute to the ongoing development of Icecream and OpenEmbedded. Additionally, I encourage you to try it for yourself; I think it would be really interesting if people would go try this on their own clusters and see what kind of performance improvements they can get.

Here's a list of useful links. That first one is the OpenEmbedded wiki page that has most of the setup instructions on how to enable Icecream that you've seen here. There are also some monitor programs there in the middle, if you're interested in monitoring the status of your Icecream cluster. And finally, there's the GitHub link for my tests. Special thanks to Garmin for allowing me to run an Icecream cluster at work and to give this talk; to the upstream Icecream developers, because I think Icecream's pretty awesome; to all the OpenEmbedded people who've helped me with Icecream; and to my wife, who helped me significantly with the statistics.

Any questions? Okay, how many nodes does it take to see a win? If I were going to guess, I'd say as few as three or four. I think it really depends on how many builds you want to do simultaneously. Just from my own builds, I've definitely seen that three or four extra nodes would be enough for me. If you have two or three simultaneous builds, you would need more nodes. But I think a lot of that also has to do with how parallel you set your builds to be. Any other questions from anyone? No, I didn't look. Okay, so the question was, did I look at it from a cost perspective?
I've got to repeat it for the microphone. Looking at it from a cost perspective, the cost of using Icecream versus getting a more powerful machine, I think is what you're asking. No, I didn't; the short answer is no. I mean, one of the advantages of Icecream, I think, is that you can just install it on a machine. What we do is we actually just have a script that everyone runs when they get their Linux machine; it sets up their machine and puts it on the cluster. And so when they're not using their machine, it compiles other people's stuff. So other than the cluster maintenance, there's not really a lot of cost associated with it. At least from my perspective, I definitely don't spend much time administering the cluster.

Right, so the question was, do we use it for our release software or just day-to-day? It depends on where that release build is done. We don't enable Icecream on our CI servers, for example, because it doesn't really make sense there. We have kind of an internal cloud-based service, and so we're just spinning up machines and tearing them down to do our builds. In that case, Icecream doesn't really make sense, because why would we spin up a bunch of machines just to run Icecream? The machines are either fully utilized doing the build or not running, right? So we don't do it for our CI, but sometimes we do releases on our desktop PCs, and then usually that does use Icecream. That's a good question.

Anyone else have questions? Yes? Yeah. I don't know if it's just because we have a reasonably sized cluster, but I've never noticed that really being a problem, I guess. The nodes do keep track of how utilized they are, and the scheduler won't pick them if they're being heavily utilized, at least as far as I've seen. So it might just be because we have so many nodes in the cluster versus people actually building at any one time. Also, the machines are a lot more idle than you might think they are; your desktop sits idle for more time than you might expect.

Any other questions? Yes? Does it, by default, run with lower I/O and CPU priority? I don't know the answer to that; I'm not actually sure. Like I said, I've never noticed it being slow because people were using my machine to compile, but it might do that, it might not. It's kind of hard to say. You could probably find out pretty easily.

Yes? Sorry, what? Yes, that node A toolchain is cached, and the scheduler will actually prefer nodes that already have your toolchain. So that actually brings up an interesting point. By default, the cache is quite small, about 50 megabytes, which will hold about three OpenEmbedded toolchains. So if you happen to run this on your own, you might notice this behavior where, if you have a large enough cluster, your compiles will always land on the same three machines, someone else's compiles will be happening on another three machines, and you'll just keep using those three machines. That's mostly because those machines are idle and already have your toolchain, so it makes more sense to reuse it than to transfer the toolchain again. You can increase the size of the cache, I think it's 50 megabytes by default, but that's unfortunately a per-node configuration item, and you can't just change it across the whole cluster.

Yes. Does it notice when different hosts have the same toolchain? So the toolchains are just given an arbitrary name, and that's how they're identified.
What we do in OpenEmbedded is generate a fairly unique name for the toolchain. It's actually very long, and it includes things like your GCC version and your hostname and things like that. So there probably could be some advantage there in trying to detect the same toolchain across multiple hosts, because you do generally have that, but it's not there yet. That would be a good thing to improve if you could.

Any other questions? Yes. What's the question? Yeah, it should be doable. I know you can do it; there's actually an explicit section in the Icecream documentation that describes how to use ccache with Icecream, but we don't really use ccache a lot. And I think if you look at icecc.bbclass, it actually explicitly sets ccache disabled equal to one, so it just explicitly turns ccache off, because getting them in the right order is the tricky problem. They're both basically trying to do the same thing, where they masquerade as GCC and then try to find the real GCC further on in the path, and you can get into these weird cases where they each find the other one in the path and just keep passing the compile off to each other, like, hey, compile this for me. They detect it and break the cycle, but that means your compile fails. So it's entirely doable, I'm sure, to get ccache working with Icecream; it just isn't supported right now.

Are there any other questions? All right, that's all I got.