All right. My name is Joshua Watt, and I'm going to talk to you today about software bills of materials and the supply chain with the Yocto Project. A little bit about myself: I've been working at Garmin as an embedded software engineer since 2009, and we've been using OpenEmbedded and the Yocto Project since 2016 to make embedded Linux products. I'm a member of the OpenEmbedded Technical Steering Committee, and there are all the various ways you can contact me if you would like to do that.

If you are unfamiliar with the Yocto Project and OpenEmbedded: OpenEmbedded is a community-driven project that provides the OpenEmbedded-Core layer and BitBake, which is the build system used to build primarily embedded systems, but all sorts of other things as well, as I'll talk about. The Yocto Project is a Linux Foundation project that maintains the Poky reference distribution and also provides the Autobuilder hardware used to run a whole bunch of QA tests to ensure that the project has high quality. They also manage a release schedule, provide funding for personnel, and provide a lot of excellent documentation for the project, which is awesome. I really like our documentation; you should go check it out. Here's a brief outline; I have a lot of slides, so I'm not going to stop here.

So what is the software supply chain, and why do I need one? The software supply chain is really about answering what's in the software that we're shipping. We ship a binary to someone, or use it ourselves, and we really want to know what's in that thing that we're using. We want to answer basic questions like: where did the software come from? What version is it? If we have licenses, we need to know whether we're complying with those licenses, be it GPL requirements or attribution requirements or things like that. We also want to know if that software has been tampered with, either maliciously or unintentionally, because we don't want to expose ourselves or our customers to unnecessary risk by shipping tampered software. And the same thing with exploits: if our software is vulnerable to exploits, we want to know that, to protect ourselves and our customers. Ultimately, the software supply chain tries to answer the question: can the deliverables that we're shipping be traced back to the code that generated them? At its core, that's the intent of the software supply chain.

So I'm going to talk about the OpenEmbedded build flow, which is how OpenEmbedded builds software, and how the way that it builds software has an inherent software supply chain to it. When people want to build things with OpenEmbedded, they primarily start out with three things. They have the source code that they want to build, which comes from Git repositories or tarballs or whatever it is. They have some metadata that we call recipes, which describes how that source code should be built. And then they have various policy information, which is things like: do I run systemd or SysVinit? Various configuration bits and bobs that make your thing do whatever it's supposed to do. They feed all of this into this magical tool called BitBake, and it spits out what we like to call a target image, which can be a whole bunch of different things that I'll talk about in a minute. You take that target image, you put it on your widget, and you profit, right? It's great.
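As a rough sketch of what that policy information looks like in practice, here are a couple of illustrative lines from a local.conf; the machine name depends on which layers you have (qemuarm64 comes from OpenEmbedded-Core, a Raspberry Pi machine would come from a BSP layer), and INIT_MANAGER is the knob for picking systemd versus SysVinit in recent releases:

```
# local.conf (illustrative values)
MACHINE = "qemuarm64"        # or e.g. a Raspberry Pi machine from a BSP layer
INIT_MANAGER = "systemd"     # or "sysvinit"
```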
And when we talk about a target image, there's actually a whole bunch of things that we classify as a quote-unquote target image that you might not normally think of as a target image. We do have the traditional embedded "flash this on an SD card and boot it on a Raspberry Pi" image, like you can see up there at the top. We can also generate microcontroller firmware that you could flash onto a microcontroller for whatever reason. But we can also generate a whole bunch of other stuff. We can generate images that you could put onto a hard drive and boot up a full PC like any other desktop Linux distro. We can generate virtual machine images, primarily for QEMU, but I actually learned just this week that we can generate virtual machine images that you can import as AMIs into AWS and run them there, if you want to do that. Another lesser-known thing that we can generate is OCI-compliant container images. So we can actually build container images that you can import into your favorite container runtime, be that Docker, Podman, CRI-O, whatever the flavor is this week, and run those.

We can also generate package feeds for the various package formats that we support, which currently are ipk, deb, and RPM. These allow you to publish package repositories just like you would for any desktop distribution. So if you've, say, flashed your image to your Raspberry Pi, you can then point it at the package repository that you've published and just install software using apt or DNF or whatever, like you would on a desktop distribution. There's also a whole bunch of more internal things that we can generate. The SDK is something you can ship to customers or use internally yourself; it provides compilers and tools to compile software against a given image, so you could compile an executable, copy it onto whatever your target is, and then run it. We also have the extensible SDK, which is a more advanced version of the SDK that I really don't have time to go into, and the buildtools tarball, which is really cool for supply chain reasons, and I'll talk about that at the very end of my presentation.

So, digging into how the build flow that BitBake provides works: we start with a couple of different things. Over there on the far left, we've got the host tools. These are the bare minimum set of tools that you need to build the project. This is going to be things like Python, because BitBake is written in Python, Git, and a host compiler, meaning the host GCC needed to compile stuff to run on your host; it does not need to be a cross-compiler. We also have some source code up along the top there, and we've got the recipe metadata that says how we're going to build that source code. The first thing that BitBake is going to do is take these host tools, ingest some source code, process some recipe metadata, and produce what we call the native tools and also the cross-compiler. So we're actually building the cross-compiler that we are going to use later on as part of the build, and we're also building what we call these native tools. A good example of a native tool might be the Google protobuf compiler: it's something that runs on your host that you use as part of a build to generate stuff that you need later on, actually on target. So you use the protobuf compiler to compile protobuf files to C code or C++ code or whatever it is.
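As a hedged sketch of the package-feed flow just described: PACKAGE_CLASSES and the package-index target are standard, while the image name, web server, and on-target steps are placeholders for this example:

```
# local.conf: choose the package format used for the feed
PACKAGE_CLASSES = "package_rpm"

# build the image, then build the repository metadata for the feed
$ bitbake core-image-minimal
$ bitbake package-index

# serve tmp/deploy/rpm/ over HTTP, point the target's dnf configuration at that
# URL, and install packages on the device just like on a desktop distribution
```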
Using these native tools and cross-compilers, we're then going to process yet more recipe metadata and ingest more source code, and this is going to produce the target packages. These are the things that are targeted to run on whatever your final target architecture is, be it ARM or x86 or RISC-V or whatever. The final step is that we have yet more recipe metadata that says what target packages you want to install on your final target image, so we've got a recipe that says: install these packages on this image and produce the final thing.

The way that BitBake tracks when things need to be rebuilt is a fairly sophisticated method of hashing. The way this works is that all of the inputs to a given recipe are hashed together to produce a single final hash that we call the task hash, and that hash is then used as an input to subsequent recipes and gets incorporated into their task hashes. So you get this chain of hashes all the way through the system. That hash includes all of the inputs and all of the recipe metadata itself that is used during that particular build step, so if any of that changes, the hash changes, and that signals to BitBake that that thing needs to be rebuilt. And because that hash changes, all of the downstream hashes will also change, so BitBake knows it needs to rebuild all of the downstream stuff that depends on it.

Just as an example: if the source code for the Google protobuf compiler changes, because you bumped the version or something like that, that's going to cause the hash of that source code to change. When that happens, all of the downstream hashes that depend on it are going to change, so BitBake will know: I need to rebuild the protobuf compiler, I need to rebuild all the recipes that use the protobuf compiler, and then I finally need to rebuild the target image at the end of the day.

Because of this hashing, we actually have really good traceability, i.e. a really good software supply chain, back to all the things that went into your image, because we can start from a target image and trace back through all these hashes to the individual components that went into it. And importantly, we can do this not only for the target packages, the actual cross-compiled things, but for a significant portion of the native tools and cross-compilers, because we're building them ourselves as part of the entire build process. Basically everything but the host tools: we have very good supply chain tracking on all of these things just inherently, due to the hashing mechanism that we use to track dependencies in general.
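Here is a conceptual sketch of that hash chaining in Python; it is an illustration of the idea, not BitBake's actual signature implementation:

```python
# Conceptual illustration of chained task hashes (not BitBake's real code).
import hashlib

def task_hash(recipe_metadata: bytes, source: bytes, dep_hashes):
    """Hash a task's own inputs together with the task hashes it depends on."""
    h = hashlib.sha256()
    h.update(recipe_metadata)            # the recipe instructions themselves
    h.update(source)                     # the source code being built
    for dep in sorted(dep_hashes):       # task hashes of upstream recipes
        h.update(dep.encode())
    return h.hexdigest()

# If the protobuf compiler's sources change, its hash changes, and every
# downstream hash that incorporates it changes too, so BitBake rebuilds
# the compiler, the recipes that use it, and finally the target image.
protoc_hash = task_hash(b"protobuf recipe", b"protobuf 3.x sources", [])
app_hash    = task_hash(b"app recipe",      b"app sources",          [protoc_hash])
image_hash  = task_hash(b"image recipe",    b"",                     [app_hash])
```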
All right, software bills of materials. The way that I like to describe a software bill of materials is that it's the nutrition information for your software. If you've been to any talk here, you've probably seen something like this already, but basically, we've all seen these ingredient labels on food: they're a standardized way that we can quickly see what we're putting into our bodies, the basic ingredients that are in there, and various information about them. An SBOM is kind of the same thing: a standardized encoding that allows us to easily exchange information about what's in our software and know what's going on with it. There are multiple different SBOM formats out there: there's SPDX and CycloneDX, and another one that I always forget because I can't remember it, that might be used to describe the same actual software supply chain. So you can think of the SBOM as the way that you encode the information in whatever your software supply chain is; at least that's the way I like to think about it.

The important stuff that's in an SBOM is what software components we have in our system and what the relationships between them are. This is a really popular graphic that's used to describe this, and it can show you the various properties of what's going on and how things are related. I really like this graphic because I think it fits really well with what we do: you can pretty easily imagine that we have a recipe, this is Carol's compression engine, and then that is ingested by another recipe as a dependency, that's Bob's browser, and we're tracking all those dependencies and all that information just like that.

So what do we have in our recipes that we might want to include in an SBOM? Well, it turns out we actually have a lot of stuff. Our recipes are fairly comprehensive in how they describe software, so we have a lot of information we can include. We've got the stuff that you would expect to be there: the versions of the software that we're building, and the source code URLs we downloaded it from, which we have to have because we download it as part of the build process. We also have pretty advanced license tracking, because for a long time the project has had tools to help people do license compliance, either GPL compliance or whatever it is, or just to know in general what licenses they have on the things they're producing; so we've had that for a long time. We also have all the build-time dependencies, which we obviously have to have in order to correctly build the software, and similarly the run-time dependencies: we have mechanisms for automatically determining some run-time dependencies, and we have to manually annotate the others, and they have to be correct or the software wouldn't run on the final target. We also do a lot of CVE tracking: we track the CVE metrics for the recipes that we have, if CVEs come out we will patch them, and we have tools that make it fairly easy to figure out whether a given recipe is vulnerable to CVEs and things like that. We obviously know all the source files; again, we downloaded the source code, we extracted it, we did whatever with it, so we have a pretty good idea of what those are. And also the files that end up in the individual packages that we're producing: we put them there, so we know what's in them.
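To make that concrete, here is a heavily trimmed, hypothetical recipe showing where that metadata lives; the variable names are the standard BitBake ones, but every value here is made up:

```
# example-widget_1.2.3.bb (hypothetical recipe, values are placeholders)
SUMMARY = "Example widget library"
HOMEPAGE = "https://example.com/widget"
LICENSE = "MIT"
LIC_FILES_CHKSUM = "file://LICENSE;md5=<checksum>"

# where the source comes from; the version is carried in the recipe name as PV
SRC_URI = "https://example.com/widget-${PV}.tar.gz"

DEPENDS = "zlib"               # build-time dependencies
RDEPENDS:${PN} = "bash"        # run-time dependencies (some are detected automatically)
CVE_PRODUCT = "example_widget" # name used when matching against CVE databases
```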
Something that's a little different about all of this information is that we're very authoritative about it: we know for sure what these things are, because we generated them. We're generating them from first principles, which is a little different from what you may have seen from some other SBOM-related tools that scan completed images. Not to say that that's wrong; it's a different way of doing it. We're not scanning things after the fact and trying to heuristically figure out what they are; we're saying this is what it is because, basically, we said so: this is what we did. And our current policy is that we won't include information in the SPDX if it didn't come from that first-principles approach, so we're not planning on adding heuristic scanning tools or anything like that, because it's not really our place. We're just saying what we did in our SBOM output, and that's a little different from what you might see in other SBOMs.

Generating SBOMs is very easy. We can generate SBOMs in SPDX JSON format; we chose SPDX because it's the ISO standard and also a Linux Foundation project, so it just kind of made sense, and we do JSON because basically all of our build system is written in Python, and JSON with Python is very easy. You initialize your build environment just like you would for any other build, you add a single INHERIT += "create-spdx" line to your local.conf file, you BitBake your image, and it will generate SPDX for everything that's in that image and all of the native tools that helped produce that image. So, very easy. Basically, what this ends up doing is adding an extra step during the build process at various points: when the build does a certain thing, it will then generate an SPDX document that describes what happened during that step, and then at the very end we take all of those SPDX documents and put them in a big tarball, so that you have one file to work with instead of having to deal with many files.
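A minimal sketch of that setup; the image name is just an example:

```
$ . oe-init-build-env           # initialize the build environment as usual

# add one line to conf/local.conf:
INHERIT += "create-spdx"

$ bitbake core-image-minimal    # the SPDX output ends up in the deploy directory
```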
So what features do we have in the SPDX? Most of these are just a repeat of what's in our recipe metadata; like I said, we have pretty extensive metadata, so it's fairly straightforward to translate that to whatever SPDX field it happens to map to. There are a couple of things worth highlighting. We populate the declared license field based on the license metadata that we have, and if a license is not one of the known SPDX license identifiers, it will actually include the entire license text; that's super helpful, it makes your SPDX quite large, but at least you still have the license. Again, we can do home page, URL, CVEs; we know all of that. We can list all the source files, with their checksums, that we downloaded and used to do the builds, which is very helpful. We scan the source files for SPDX-License-Identifier tags and include that information for all the source files that we include in the SBOM. Similarly with the packages: we know all the files in the packages, with their checksums. We can also record which package files were generated from which sources using the debug data, and this is something that's extremely powerful. Basically, we look at the debug data, which we always generate when we compile a binary unless someone tries really hard to turn it off, and we trace the source files referenced in the debug data back to the original recipes that generated them, which might not be the recipe that you're currently building. Where this becomes super powerful is when you're including static libraries, because it allows us to trace binaries that include static libraries back to the recipes that originally produced those static libraries. Traditionally that's very difficult: knowing whether your binary is using a static library in the first place, and then also tracing it back to the original source. It's not perfect, using the debug data, but it's pretty good. Again, we also have all the build-time dependencies and the run-time dependencies, and we can also generate a source code archive if you want to do additional analysis of the source code with Fossology or something like that.

OK, so what can we generate SPDX for? The short answer is basically anything we can build, we can generate some amount of SPDX for, because it just kind of turns on and happens. All of your on-target C, C++, Fortran, your traditional languages if you want to call them that: we can very easily generate lots of SPDX for those. Again, all those native tools, we can generate SPDX for. We can also, importantly, generate SPDX for the Linux kernel; as far as I know we are one of the few projects that can generate meaningful SPDX for the Linux kernel. Some work went into that, and it's really awesome that we can do it. Target images: again, anything we classify as a target image, we can produce SPDX for. SDKs. Container images, and that's a good one: if you want to know how to build a container image where you have SPDX from the build of that container, you can try building it with the Yocto Project, so that's really cool. And the same with VM images. I have Rust and Go marked as under construction, mostly because for the core thing that you're trying to build in Rust or Go, what the recipe is actually written for, we have really good SPDX; it's basically the same as most other recipes. The part that's missing is that Rust and Go have their own package managers, so getting the SPDX for the Cargo crates that we pull down is a little more difficult, and I don't think that's working; I didn't actually try it, I should have, so I don't know if it's quite working yet. And the same with Go: it pulls in Go modules, the Go stuff; I'm not a Rust or Go person, so if you are and you want to figure out how to get this working so that we can actually get the SPDX from those crates and modules, that would be awesome, and I would be super excited about it.

There are a couple of configuration knobs you can set to control the amount or style of SPDX output you get. SPDX_INCLUDE_SOURCES is a knob you can turn on and off; this is what actually includes the list of all the source files, their relationships, and the license information found in them in the generated SPDX documents. It's off by default because the output is just huge: in the test that I ran, the root filesystem I generated was 20 megabytes uncompressed, and the SPDX, compressed, was 23 megabytes. So, more SPDX documentation than the actual thing you produced when you turn this on; yeah, it's big. SPDX_ARCHIVE_SOURCES is the thing that gives you the tarball with all the sources that you can pass to Fossology or whatever, and SPDX_ARCHIVE_PACKAGED will give you the packaged files, if you want to do additional analysis on them for some reason. Those are both off by default, just because they take time. And the one that I added just last week, because I got tired of looking at the single line of JSON output we produce to keep the output small, is SPDX_PRETTY; that will give you nice newlines and indentation if you're manually looking through these things, because that can get a little tedious after a while.
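Collected in one place, those knobs look roughly like this in local.conf; the variable names are the ones in the create-spdx class as I understand it, so double-check them against your release:

```
INHERIT += "create-spdx"
SPDX_INCLUDE_SOURCES = "1"     # list source files and their license info (large output)
SPDX_ARCHIVE_SOURCES = "1"     # also produce a source tarball, e.g. for Fossology
SPDX_ARCHIVE_PACKAGED = "1"    # archive the packaged files as well
SPDX_PRETTY = "1"              # indent the JSON for human reading
```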
You can publish your SPDX results on the internet; I'm not going to have time to talk about that at all, but it's there, and I will publish my slides after this.

All right, so what's actually in our SPDX? When you do a build with the create-spdx class enabled, you get some output in your deploy directory. If you're not familiar with Yocto output, don't worry about this too much; it'll make sense. In typical fashion, these are actually symlinks to the timestamped versions of the files; I'm just going to pretend they're the actual files, because it's easier. There are three files that we really care about here. The first file is the SPDX file for the image itself; this is the SPDX file that says "this is what went into this image", and basically what it has is a whole bunch of external document references to each package that got installed in that image. The second file is the one that you're really most interested in: this is the compressed tarball that contains all the SPDX documents for the image. It starts with that top-level image SPDX document I was just talking about, then recursively follows all of the external document references it finds, starting with that file and pulling in those documents, and then the references in those documents, until it runs out of things to pull into the tarball. That is the compendium of all the SPDX documents that are relevant for this image. And this last file is a JSON index file that we create; this isn't an SPDX thing, it's just something we came up with. SPDX documents are referenced by their document namespace, which is a UUID-ish type of thing, so when you reference one document from another, you reference it by that. This file lets you easily map a document namespace to a file name, because our file names aren't named by the document namespace, just for reasons. You can use this if you're trying to traverse the large number of documents that we have, to find the file names you're looking for (there's a small sketch of reading this index below).

If we do a listing of this SPDX archive, we'll see the files that are in it, and we can dig into that and see what's in here. So this is, again, that top-level image SPDX file that you saw in the previous directory, and this is the index file, which is always the last thing in there. The interesting files are these ones now: the util-linux-lsblk and util-linux-unshare .spdx.json files. These are the SPDX files that describe the packages we produced; they include the binaries and the files and stuff that got installed onto the image that was produced, and they have the checksums and the file listing and all of that. Due to some quirks in the way things are generated, we actually have a separate document that describes the runtime relationships between packages, and I'll get into that a little bit more later; that's in this file. These files have SPDX relationships that describe how packages runtime-depend on each other, so you can use them to track through the runtime dependencies and figure out what libraries are being included and whatnot at runtime. And then the final SPDX document is for the actual recipe itself.
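Here is the small sketch mentioned above for turning that index into a namespace-to-filename map, assuming the archive has already been extracted to a directory; the JSON key names ("documents", "documentNamespace", "filename") are my recollection of the index format, so check them against your own index.json:

```python
import json
from pathlib import Path

def load_namespace_map(extracted_spdx_dir):
    """Map SPDX document namespaces to file names inside the extracted archive."""
    index = json.loads((Path(extracted_spdx_dir) / "index.json").read_text())
    return {doc["documentNamespace"]: doc["filename"] for doc in index["documents"]}

# Example: resolve the file backing an external document reference.
# ns_map = load_namespace_map("./spdx-archive")
# print(ns_map["<document namespace from an externalDocumentRef>"])
```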
So one recipe might produce several packages; in this case, the util-linux recipe produces the lsblk and unshare packages separately, but they both then have a relationship back to the original recipe. The recipe basically describes the source code: it's got all your CVE tracking in it, it lists all the source files if you've got that source file listing enabled, and it's got all the licensing information in it. It's basically the things that are common to all the packages that got produced. So you can think of it like this: the packages describe what's on the root filesystem, and the recipe describes how those things were built, more or less.

All right, so as you might have guessed, we have a lot of relationships in our SPDX documents, and specifically a lot of external document reference relationships. This is a chart of what that looks like. If we start up here in the upper right corner, that's, again, our top-level image SPDX, and we can track down here to what it contains: the packages that got installed in that image. Those packages, in turn, contain the individual package files that are in them. And if we go down from there, we can see that these have a "generated from" relationship to the recipe that produced them, and that recipe has a "contains" relationship on its source code. This is also how we track the build-time dependencies between recipes: build-time dependencies are a property of recipes, whereas run-time dependencies are a property of packages. If we go back up here to our package SPDX, we can see we'll also get these "generated from" relationships to other recipes, which weren't the ones that produced us; that's the tracking through the debug source for static libraries and things like that.

And then if we move up here, we've got these runtime documents. As it turns out, package managers don't require run-time dependencies to be an acyclic graph; it's perfectly acceptable to have cyclic run-time dependencies, it just means you have to install all of those packages together in one big lump, which is fine. However, when you reference SPDX documents, you reference them by their document namespace and their checksum, which means once you've written an SPDX document, you can't modify it. That's really good for supply chain tracking, but it's really bad when you can have circular run-time dependencies, because it means you can't put the run-time dependencies in the package documents: there's not a single node you can start at in the graph that gives you a directed acyclic graph you can walk to write out all of your dependencies. So because of that, we actually have to write these run-time dependencies in a separate document, after the package SPDX documents being referenced have all been written and finalized, so that we can reference their checksums. It's sort of a quirk of the way SPDX currently works. So we have these run-time dependency documents, and they basically just define the run-time dependencies between two package SPDX documents; they're really not that exciting, they just have a bunch of "runtime dependency of" relationships in them.
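As an illustration of walking those relationships, here is a small sketch that prints the relationships in one of these JSON documents; it assumes the standard SPDX 2.x JSON field names (spdxElementId, relationshipType, relatedSpdxElement), and the file name in the usage comment is hypothetical:

```python
import json

def print_relationships(spdx_json_path, wanted_type=None):
    """List relationships (e.g. GENERATED_FROM, RUNTIME_DEPENDENCY_OF) in one SPDX document."""
    with open(spdx_json_path) as f:
        doc = json.load(f)
    for rel in doc.get("relationships", []):
        if wanted_type is None or rel["relationshipType"] == wanted_type:
            print(rel["spdxElementId"], rel["relationshipType"], rel["relatedSpdxElement"])

# e.g. show which recipes a package was generated from (including static-library sources):
# print_relationships("util-linux-lsblk.spdx.json", "GENERATED_FROM")
```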
So, future things that we'd like to improve. Some of this looks a little weird, and I won't say that it isn't. I think we're one of the first projects that has really generated this level of SPDX output, and that's really good, but it means we've definitely found some edge cases in the SPDX spec. We're working very closely with the SPDX project on that, and it's good for both of us: it's good that we have a use case that's producing these documents and finding those edge cases, and it's good for us to help them find those use cases, because they'll be able to come up with a spec that helps us too. So this is all really good; I'm super excited about it. We'd really like to be able to pull in SPDX and SBOMs from other upstream source code, particularly projects using REUSE or something like that; we'd like to be able to pull in their SBOMs. I don't think we would necessarily replace what we're producing with that, but we could definitely generate external document references that point to it and pull it into that entire archive, and I think that would work really well, because then you can see exactly: upstream said this, we said this. It's just one more link in that software supply chain, which would be awesome. We'd also definitely like to include more SPDX fields. We have a whole bunch of information about how we built the source code that we aren't currently describing at all: the CFLAGS, the linker flags, all of those things we could include; we could even include the shell script we actually ran to do the build, crazy stuff like that, all really useful for supply chain reasons.

OK, so moving on to more supply-chain-related things, we've got reproducible builds, and I'm going to talk about how the project can help you have reproducible builds in the things that you build. So why do we need reproducible builds? There are a bunch of reasons. In order to be able to resist attack, we need to know whether something is worth looking at to see if it's been compromised, and the best way to do that is to have it build reproducibly, because then it's very easy to say, "oh, this changed and it shouldn't have." If you want to be able to trust your compiler, there are techniques for doing that, like diverse double-compiling, but in order to do that you need your builds to be reproducible. There are also quality assurance reasons: we want to be able to reproduce on our desk the things that we're shipping to our customers, so that we don't get weird race conditions that only show up in the field, and things like that. Also, smaller binary differences mean smaller updates if you're doing delta updates, and increased development speed: if something hasn't changed, it doesn't need to be rebuilt, and that would be awesome. I highly recommend you go check out reproducible-builds.org; they have a whole lot of information on reproducible builds, and it's awesome. Just from a project standpoint, we really like the idea of reproducible builds, because when we create these task hashes, what we're really saying is that we're expecting this output from the recipe, and ideally there should be a one-to-one correspondence between the task hash that we create and the binary output that we generate. So from an intrinsic project perspective, we really like reproducible builds for that reason too.

So how are we ensuring that builds are reproducible? The Yocto Autobuilder does regular tests for regressions in reproducible builds, and this is really awesome; you can see the results at that link if you would like. We're currently testing about 11,000 target packages. We're not currently checking the native builds for reproducibility directly; we're sort of indirectly testing them by ensuring that they don't change the target output. And we're doing this across the three different package formats that we support: ipk, deb, and RPM. We also do these builds across multiple build hosts, so we'll do builds on Fedora, Ubuntu, CentOS, and Debian and then compare the results between them and see if they change; this ensures that even if you switch build hosts, you'll still get the same output. And we have tooling in place to automatically produce diffoscope HTML output if we find something that isn't reproducible, which makes some things very easy to debug. I'm not going to say diagnosing and debugging reproducibility problems is easy, but it can help with some of them.

If you want reproducible builds, you do need to test your own. It's great that upstream is doing reproducibility testing and making sure that packages are generally reproducible, but if you're not testing your actual images, you don't know for sure whether they're reproducible or not. Fortunately, it's actually quite easy to test this: you can write a three-line Python file like this in your own layer, basically just replacing "my image" with whatever image you want to test for reproducibility, and then you can run this command down here, and that will run the reproducibility test. Go get lunch, or go to bed; it takes a really long time.
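A hedged sketch of what that test file and command look like; the layer path, module name, and image name are placeholders, and the images attribute should be checked against the ReproducibleTests class shipped with your release:

```python
# meta-mylayer/lib/oeqa/selftest/cases/reproducible_myimage.py (illustrative path)
from oeqa.selftest.cases.reproducible import ReproducibleTests

class MyImageReproducibleTests(ReproducibleTests):
    images = ['my-image']   # the image that gets built twice and compared

# then, from an initialized build directory:
#   $ oe-selftest -r reproducible_myimage
```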
All right, the buildtools tarball. This thing is exciting: it gives you SBOMs all the way down. This is that buildtools tarball that I was talking about. There are actually two different buildtools tarballs: there's the buildtools tarball, which is a minimal set of tools that we use to paper over some of the host differences between builds, and then there's a bigger version called the buildtools extended tarball, which basically includes every host tool that you need to build. The buildtools tarball and the buildtools extended tarball are SDKs at the end of the day, they're just SDKs, but they're designed to replace your host tools. So if you use the buildtools extended tarball, which is the one I'm going to talk about now, you can use it to replace all of your host tools, and what that means is that all the things I've just talked about, reproducibility and SBOMs and all of that, can now apply to that last little bit of host tools that aren't otherwise covered. If you do this, you can now trace your target image supply chain all the way back through your target packages to your native tools and cross-compilers, and even into your host tools. If you want to get really crazy, you could have an air-gapped build host where you go through the, I won't call it not painful, effort of building the buildtools extended tarball on that air-gapped host. Your CI and developer systems could then use that tarball to build all of their stuff, and when they build a final target image, you would be able to trace it all the way back to the air-gapped system in your supply chain. So, super powerful, really awesome if you want that really deep supply chain.
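A rough sketch of that flow; buildtools-extended-tarball is the recipe name in OpenEmbedded-Core, but the installer and environment-setup file names vary by architecture and release, so treat the paths here as illustrative:

```
$ bitbake buildtools-extended-tarball
# produces a self-extracting installer under tmp/deploy/sdk/, named something like:
$ ./tmp/deploy/sdk/x86_64-buildtools-extended-nativesdk-standalone-*.sh -d ~/buildtools
# on the machines that should use it in place of their host tools:
$ source ~/buildtools/environment-setup-*
```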
All right, special thanks: to Saul Wold, I did a lot of the generic SPDX generation support and he did the work that was necessary to make the Linux kernel have SPDX support, which is really awesome; to Ross Burton, who did a bunch of licensing work to make all our licenses use SPDX identifiers, which is awesome; to Andres, who did the SDK support, which is the thing that allows the buildtools tarball to do what it does; and to Richard Purdie, who just does a whole bunch of stuff for the project. And lots of other people contributed; I probably missed a bunch of people.

If you'd like to get involved with what we're doing, we are on IRC and Matrix (I think it's Matrix; there's a bridge), and you can find us at these channels. There's also a weekly technical meeting that you can attend, where we call in and talk about technical project stuff, and a weekly bug triage meeting where we go over all the bugs that have been submitted in the last week and things like that. There's also an OpenEmbedded Happy Hour on the last Wednesday of every month; the next one is next Wednesday. And there is the Yocto Project Summit, which is twice yearly; I'm sorry I couldn't get a link for that, they didn't really have a generic link yet.

All right, questions.

Oh, how do we generate SPDX for dynamically loaded kernel modules? So I believe the way that would work is that they're basically treated the same way: they're going to have a package that gets installed on the root filesystem, so you're going to get a package SPDX for them, and then you'll be able to trace that back to the recipe, and that recipe should include all the source code for that kernel module. A lot of this stuff operates very similarly; our recipes do different things in different steps, but they're all doing very similar things. That's how I suspect it will work; I haven't actually tested it. Any other questions? How much time have we got? Two minutes, OK.

Sorry, can you say that again? Oh, how do you take all of the SPDX output that we generate and distill it for your company lawyer? Like I said, we're a fairly early provider of this level of information, so I'm positive the tooling will catch up with us. Oh, Kate's got... oh, OK, so apparently you can translate the JSON to spreadsheets, which I hear lawyers like, so you can do that. It is a lot of output, and I think we're a little bit ahead of the tooling at this point, but I'm positive it'll catch up. And like I said, we're working very closely with the SPDX project on some of the tooling, I think.

Oh, thank you. Yes, the question was: when was this released? I know it's definitely in 4.0. I should have checked on the timeline; this actually went in a while ago, so it's probably... I'm sorry, say that again? Yeah, the past release, so 3.2... I don't remember. I don't know if the previous LTS, Dunfell 3.1, has it; I'd have to check, but if it doesn't, the release after it does. It's also not really that complicated, it's just one file, so it's potentially something that could be easily backported: if it's not in whatever release you need, you might just be able to copy the bbclass, honestly; there's really not that much to it. And I see we're done, so if there are any other questions, feel free to come talk to me. I could talk about SPDX for probably hours.