Thank you for coming at the end of your day. I know it takes effort at this time. Today I'm going to talk about our experience implementing a software bill of materials system internally. One thing we learned that was very important was to build a core pillar first, something we call the software parts catalog, and to build it well; everything else followed from there. If that piece wasn't built right, the rest of the system could have been at risk.
I don't know if you're familiar with the famous elephant parable. Six blind people, that's how it originally went, came upon an elephant without knowing what it was. Each one touched a different part of the elephant, and they all came to slightly different conclusions about what it was. Well, I sometimes feel like the elephant is an SBOM. Working with a lot of different teams within our company, they all have slightly different requirements about what they want in an SBOM. We know there are different activities going on: clearly security is driving a lot of it today, and licensing started it. The position I'm in is that we have to be able to support all these different teams and their different requirements around what an SBOM is.
I'm going to start off briefly with some of the evolution we've gone through with SBOMs. A lot of it started with licensing, and that was what drove us at first.
Back in 2006, I had to generate my first list of open source components in a Linux distribution. Then SPDX came along, and we adopted it because it solved a lot of our problems with sharing the data. But even along the way, we got hit with the export compliance issue and were forced to generate a bill of materials for that. Basically, a lot of these activities have something in common: come up with a list, then find out what data and metadata you need on top of it. Producing that list is very important, and it's the number one thing we focus on: producing a high-quality list of stuff.
From there, obviously, what's driving it today is security. I think that got going with the Heartbleed issue, and then SolarWinds didn't help. Well, it helped a lot; it accelerated everything, right? But what we're also seeing is that the functional safety space is going to need a good-quality list of open source too. Our company has products that are functional-safety certified. If you don't know what that means, it simply means that if your software is controlling a piece of hardware like a robot, you don't want it to maim or kill people, okay? Or if you're getting into an elevator, you don't want people to get hurt or die, right? So your software has to be functionally safe. I'm not saying that's driving anything right now, but it's definitely on the horizon. The point is that we're seeing a lot of different needs around this.
Another evolutionary thread happened concurrently. In the beginning, I remember we required of all our developers that when you grabbed a binary off the internet, you must also grab the source from which it was built, right? We didn't want them to have a binary in-house and not be able to reconstruct it. So in my eyes, that was really the first need for a bill of materials with respect to the kind of software it was.
Clearly, what took off after that was the notion of packages and apps, and that's largely what people think of as a part or a component. But we're seeing a lot of push toward treating other things as parts or components, such as a Linux runtime. Not only do we have a bill of materials for a Linux runtime, but we have products that actually ship with an instance of a Linux runtime, and that becomes a single line item in the product's bill of materials. So it's not just a package; an entire product is one line item, even though the Linux bill of materials behind it is very extensive. We're being forced to start producing a BOM in that context.
Clearly, containers have accelerated things as well. A lot of people, again, are really focused on packages, but we definitely have a need to treat the container as a single part, and I'll give you an example of that. Then there's this notion of collections of containers: I have two products that have north of 100 containers in them. We ship all of that, and obviously containers are comprised of packages and packages are comprised of files, and we need to be able to model all of it.
Okay, now there is obviously a lot of focus on creating bills of materials based on the formats, and that makes a lot of sense. We need to be able to exchange things seamlessly, without friction. But one thing we don't have a lot of discussion about is how we get there. The format is usually the output, the final step. Before that, we need to construct an internal data structure for the product or thing that we want to model and provide a list of stuff for. But do we ever have the discussion about what exactly that data structure should be? And clearly, if you have all these different disciplines driving the requirements, the requirements are going to be slightly different for each discipline.
Well, naturally everybody's probably producing some kind of what I'm going to refer to as an SBOM portal, some way of driving your SBOM generation. I'm working backwards into how I see this problem evolving. A portal could mean many things to many people within an organization, however we think of it, right? But basically it's the tool that helps you generate stuff, a central place to go. Before that, though, we realized that we needed an internal component catalog database in which every single component that ever ends up in any product is somehow registered. And again, those components aren't necessarily always packages. It could be an entire Linux runtime that gets put into that catalog as one single component.
So the main point of this talk is to focus on that piece right there. After all our experience going through this process, developing and redeveloping this kind of system, I felt we needed to step back and get that part right before we could move forward. That's what I'm going to focus on here today.
For the starting point, one of the things we really want to think about is the definition. What is a component? Do we have a formal definition? I know a lot of community groups are talking about this, and there are different ways of thinking about it. We wanted to build a data model that can support all the different ways of thinking about it. Identity is clearly a big challenge for all of us when it comes to identifying a component, whether it's a package or whatever you have. Storing and retrieving that data obviously has to be seamless. And finally, we are seeing a need for a lot of metadata. We see that in the SPDX development, where the introduction of profiles is really important.
So I'm going to draw a quick analogy here using IMDb, because that's what I think this catalog is for us. Is there anybody here who is not familiar with IMDb?
It's that single source of truth for most people for their movies. Say I do a search for The Godfather. There are several different versions of The Godfather, just like you have several different versions of a part. You have The Godfather 1, 2 and 3. Then you click on the one you want, your component, and up comes a whole list of very rich information about that particular movie. Clearly you're going to find the directors, the writers, the stars; you scroll down and you get all the information about the actors and the awards they won; you can get the actual storyline. I even remember that as a parent I would check the parental guide before I let my son watch a movie. It's a rich place, a central place, a single source of truth.
When we build our SBOMs, I think about having that single source of truth in-house. To continue the analogy, you might want to create a whole set of movie lists, your favorite comedies, your favorite documentaries; each of those is like a release. Those lists that you build, the whole list-management part of it, is the SBOM portal. But you need a central database to start with, and you can work from there.
Okay. So let's jump in real quick and see a demo. Suppose I type in "busy" and it does a search. It comes up with busybox; let's choose this component, which is like choosing a movie, right? And then you come up with, okay, this is not as sexy as IMDb, granted. But the idea is that you have all this information about the part at your fingertips. Now keep in mind we're going to talk about parts of different types, but here you have an archive, which is the most common type we've been talking about. You have some other basic information like licensing and file counts, and we're going to talk about why that's really important. Obviously a description, and then we have profiles.
Profiles are very analogous to what we're seeing in SPDX. In fact, a lot of the data stored in these profiles can be taken and used to help produce profiles in SPDX. And you can add any number of profiles. I'll talk about how this is really a hybrid database of both SQL and non-SQL data; the latter is more like documents. But we'll get to that later. So that's just the basic introduction to the way things work.
In this presentation, I'm going to present the case for why a parts catalog is an important thing to have. It's designed to be independent of all the other pieces you might have in your BOM system. As I mentioned briefly, we'll talk about definition, identity, storage and retrieval, and metadata. I'll present the data model that we use. It's always going to be an evolving thing, but the data model is there to help us represent all the different kinds of parts you might have, whether it's a package, a container, a collection of containers or an entire product. Then I'll discuss the notion of a catalog versus the SBOM portal versus your actual SBOM. Finally, I'll talk about the code and how we implemented it, and then give a summary.
Really quick, just to sync on concepts. This is an assumption: most software, whether you're talking about an application, a library, a container or an entire device runtime, has this kind of composition, right? You grab a lot of open source, you have your proprietary software, and then you pepper in some third-party commercial software. But then we also have the notion of containers, which creates a situation where the types of things you have to accommodate grow exponentially. So this is what we've seen.
Another trend we've seen over the last few years is the number of disclosures we're receiving from our teams. Back in 2022, we had about 120,000 components.
That means all our product lines were submitting those disclosures to us about what they think is in their products. Okay?
All right, now I'm going to switch to defining exactly what I mean by a software part. I've gone out of my way to use the word "part" because I don't want it to be confused with what we traditionally think of as a component. And it's a nice analogy. I always think of the file as the most atomic unit; you might think of it as a simple screw or bolt. Then we have slightly more complicated things like a library, which is a collection of these smaller parts, right? We typically represent that as a package. And again, you can have multiple applications and multiple libraries in a package, and then you start to get something more like a small sub-assembly. Then we have the notion of a Linux runtime, whose bill of materials, if you ever look at one, can be quite extensive. Then obviously we have the notion of a container that we want to represent as well, and then the entire product. And again, at times we need to represent an entire product because, even though it's complex, it becomes a single line item in another bill of materials, which we can then drill down on.
Now, let me give you a quick demo of that. What I'm going to pull up in our system is based on a real internal solution we have, called Over-The-Air Updates software. You can imagine that with cars these days, updates can be pushed to your car while it's sitting in your driveway. That's the over-the-air aspect. It doesn't matter exactly what the application is, but I want you to understand that there are basically two components to it, a client and a server. Now, if you look at this particular part, it has the type "logical". What that means is that we decided we're going to make this new product called OTA system.
That's logical, but it's comprised of two containers. If I go to the subparts, you can see a client and a server. Those are actually container images. Again, the top level is logical, but if I click on one of these, you'll see it's a container image within our database. So it's not just a package. The image itself consists of only one file, the image. But a container is comprised of a lot of other pieces. So if I drill down and select the client source, it comes up as a container source. And if I ask for the subparts of that, I see a lot of open source packages which comprise that container. If I open one up, you'll see it's actually an archive. So what we're saying here is that I can have a container be a component, as well as the archives making up the container, as well as the container image, all the way back up to the logical component. Okay? I realize most people are not going to have that need today. But I believe that in two or three years we're really going to have it, particularly when so many products right now are made up of containers.
Okay. I want to briefly touch on identity. I know there's a lot of discussion going on, and we're not the experts in this area. We needed a way to have an identity within our own system. There are a lot of good solutions already. These are the solutions that people use today, and we think they're good ones, right?
Yes. Well, first of all, we've built up a lot of history from building the system over and over again over many years, right? We've taken all the knowledge we've gained about what works and what doesn't. So about a year ago we sat down with all that experience and said, all right, let's lay these requirements out, and we put it together, and about a year later here we are.
Okay. I'm not going to claim that we solved this problem.
We just needed a way to uniquely identify something when it's in our system. The good news is that all these ways of identifying components today are very popular, and we use them within our own system; we keep track of them. But we also needed a way to identify a package uniquely, and I'll show you why in a second. The solution we used is based on something we borrowed from SPDX, the package verification code. I'm not sure how commonly that's used, but we found it extremely useful in our context.
The algorithm is really simple. We assume, first of all, that parts can always be broken down to files. If you have a container, you can get the source; from the source you can get the packages; and from the packages you can get the files. The same goes for any package and so forth. So that's an important assumption. Then, say we take a package, take busybox. We gather all its files. Again, this is an algorithm taken from SPDX. We compute the SHA-256 of every file, put the hashes into a sorted list, and then take the SHA-256 of that list. Now we have a unique identifier, and no matter what the package is called, whatever the name is, it doesn't matter. We can determine its identity based on its contents, and that's the important thing. We're not depending on an external source, although external sources are useful too, and there are good ways of doing that. But we found that we needed an identifier that didn't change if the world outside us changed.
Now, I know that Software Heritage also has a unique identifier, and they're trying to do something similar by being very intrinsic; it's a kind of intrinsic identifier. We'll use theirs if it becomes more popular.
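The steps just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the catalog's actual code: the function name `file_verification_code` is made up, the files are passed in as an in-memory mapping, and the sorted hex digests are assumed to be concatenated without a separator before the final hash. Note that SPDX defines its own package verification code over SHA-1 digests; the sketch follows the SHA-256 variant described in the talk.

```python
import hashlib


def file_verification_code(files: dict[str, bytes]) -> str:
    """Return a content-based identifier for a part (hypothetical helper).

    `files` maps a path inside the part to that file's bytes. Paths and the
    part's name are deliberately ignored, so only content decides identity.
    """
    # 1. Hash every file on its own.
    digests = [hashlib.sha256(data).hexdigest() for data in files.values()]
    # 2. Sort the digests so the result does not depend on file order.
    digests.sort()
    # 3. Hash the sorted list (assumption: digests joined with no separator).
    return hashlib.sha256("".join(digests).encode("ascii")).hexdigest()


# Two uploads of the same content under different names and directory layouts.
upload_a = {
    "busybox-1.36/applets.c": b"int main(void) { return 0; }\n",
    "busybox-1.36/README": b"BusyBox\n",
}
upload_b = {
    "bb-client/README": b"BusyBox\n",
    "bb-client/applets.c": b"int main(void) { return 0; }\n",
}
# The same upload with one file modified.
patched = {**upload_a, "busybox-1.36/README": b"BusyBox (patched)\n"}

same = file_verification_code(upload_a) == file_verification_code(upload_b)
changed = file_verification_code(upload_a) != file_verification_code(patched)
```

Because names and paths never enter the hash, renaming an archive leaves the code unchanged, while touching a single byte of any file produces a different code.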
But all the other identifiers that are being used today we will represent in the catalog as well, so you can look up anything by its respective identifier. For example, if we had a purl stored with the part, you could look up the part based on the purl. Okay. We're not doing that yet, but we can. Yeah.
So let me quickly show you how that works. I'm going to click on SSH. One thing you'll notice is the file verification code that was computed. Now, this is just the last 10 digits; it's an abbreviation. You can copy and paste it here. If you really wanted to see it, I could show you. It's not pretty; I can just paste it here. It's essentially a very long hex string, but ultimately that is what's used. What's interesting is that if I click here, you'll notice a bunch of packages come up, archives. These archives were uploaded to our system by different people, by different teams. They had slightly different names. Interestingly, some were called client, some were called server, and some weren't called anything. It's very common for us to end up with the same content in an archive given a dozen different names. And if someone hacked the name, it doesn't matter. As long as you're looking at the content, the files, the file verification code will always be the same. You'll notice here that the SHAs of the archives are all different as well, and so are the names themselves. So that's really important for us.
Okay. Finally, something I've already mentioned briefly, and it's a very common thing we're seeing, especially coming out of the SPDX group, is this notion of creating profiles for different disciplines. What we want is not only to support a whole eclectic set of parts, but for every part to have a collection of profiles, and you can define your own profiles if you want to. Again, you have licensing, you have security, you have crypto.
Crypto really came out of the export world, because you had to understand the crypto in a given package, say, in order to know how to classify it for export purposes. Quality is simply bugs as opposed to vulnerabilities. And the build profile is something we had to produce because one of our clients asked us for it; I know they're getting certain regulations from certain governments. Obviously, there's a whole set of things that we can support in the future by simply creating a new profile. So we can extend what we store in the database.
I didn't really show you this when I went to the busybox part, so let me go back to home and search "busy" again. Licensing is really straightforward, and again, these fields can be added. This is just a JSON document, okay? You can put arbitrary fields in. It doesn't matter. Here we had two different analyses done: one through automation, and one by a human to validate it, and we can keep both. You have copyrights and the actual license text. You have a notice, in case you want to create a notice file. You have security vulnerabilities; this is something we just recently added, and we're probably going to extend how comprehensive it can be. We'll probably have VEX documents as well.
Yes? What we're presenting here is what I refer to as the top-level license. It's a good point. We also store every single file in every archive, and every single file has its own licensing information. So if you want a more comprehensive list of all licenses found, for example when you produce an SPDX file, you have that information. As I said, in the database every part points to every one of its files, and each file has its own profiles too. I'll talk about that with the data model.
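To make the idea of per-discipline profiles stored as JSON documents concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the record layout, the field names (`top_level_license`, `analyses`, `algorithms`) and the helper `set_profile` are invented, and the values are sample data rather than output from the real catalog.

```python
import json

# Hypothetical part record; field names and values are illustrative only.
part = {
    "name": "busybox",
    "type": "archive",
    "profiles": {},
}


def set_profile(record: dict, key: str, document: dict) -> None:
    """Attach or replace one profile document on a part (or file) record."""
    # Round-trip through JSON so only JSON-serializable documents are accepted.
    record["profiles"][key] = json.loads(json.dumps(document))


# Licensing profile that keeps both the automated and the human analysis.
set_profile(part, "licensing", {
    "top_level_license": "GPL-2.0-only",
    "analyses": [
        {"method": "automation", "result": "GPL-2.0-only"},
        {"method": "human", "result": "GPL-2.0-only"},
    ],
    "copyrights": ["Copyright (C) example holder"],
})

# A new discipline needs no schema change: it is just another document.
set_profile(part, "crypto", {"algorithms": ["AES", "SHA-256"]})

profile_names = sorted(part["profiles"])
```

The same helper would work unchanged on a file-level record, which is how a file and the parts that contain it could each carry their own profiles.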
We did that on purpose, because it's important for the atomic unit to carry as much information as possible. If one file is used across multiple packages, or different versions of the same package, which is common, then when you update that file record it propagates to everything.
Okay, next I just want to go over a very quick workflow. Everybody has their own workflow. This is not the workflow, it's just a workflow. I want to show you one from within our company so you can see how the catalog fits into that context. Okay? So, really quick: developers are responsible for disclosing, and eventually that disclosure has to reach the customer. As they start to disclose, we start to construct the bill of materials. We also have an API so they can batch-submit hundreds of disclosures at one time if they want. Most importantly, we then have a composition analysis go on, where they look at the integrity of the BOM to make sure the quality is high. Keep in mind, again, that we want the highest-quality list possible, and then all the metadata will follow. If you miss something, it won't be in your list, and it doesn't matter how good the metadata is. Then an analyst, depending on what tasks they're working on, reaches out to different team members and works through the bill of materials to construct a set of reports. Those reports go out to the release team. The release team bundles them up, and they get sent out to our customer.
The most important thing I want to point out is that when something comes in our front door, through the API I was mentioning, that component gets stored in the catalog if it's the first time it has ever been seen. But if it's been seen before, the file verification code will tell us that. It'll just say, oh, here's another archive that maps to that file verification code.
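The "first time seen versus seen before" behaviour at the front door can be sketched as a catalog keyed by file verification code. This is a hedged illustration only: the class and method names (`Catalog`, `ingest`), the archive names and the short code `"ab12"` are hypothetical stand-ins, and a real system would compute the code from the uploaded files rather than take it as an argument.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    fvc: str                                          # content-based identifier
    archives: set[str] = field(default_factory=set)   # every uploaded name seen


class Catalog:
    """In-memory stand-in for the parts catalog (illustrative only)."""

    def __init__(self) -> None:
        self._parts: dict[str, Part] = {}

    def ingest(self, archive_name: str, fvc: str) -> tuple[Part, bool]:
        """Register an upload; return the part and whether it is new."""
        is_new = fvc not in self._parts
        # Reuse the existing record when the content has been seen before.
        part = self._parts.setdefault(fvc, Part(fvc))
        part.archives.add(archive_name)
        return part, is_new

    def __len__(self) -> int:
        return len(self._parts)


catalog = Catalog()
_, first_is_new = catalog.ingest("openssh-client.tar.gz", "ab12")
part, second_is_new = catalog.ingest("ssh-server-src.tgz", "ab12")
```

After both uploads the catalog holds one part record with two archive names attached, so anything learned about that part later is shared by every release that points at it.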
There's one record for the part for all those archives. Then the part gets into the SBOM portal and is assigned to a release. I'll tell you how the portal differs from the catalog in a second. The other thing, which I don't want to go into in great detail, is what I call the source forensics analysis engine. That's where we can perform all kinds of analysis, and it populates the data on the part in the catalog. Again, remember, if one release points to that part, they all benefit. That part actually belongs to a variety of different releases, and that's natural, right? So again, it's only going to be one record in the database, but we know it's peppered throughout all our releases. And then, obviously, out of that you get a bunch of reports.
I want to draw this distinction between the parts catalog and the SBOM portal. It may be natural for some to want to mix the two together, but we felt it was really important to keep them separate. How do they differ? In the most basic sense, the catalog contains what I call intrinsic data, data that is independent of any release. There's nothing in it about a release or about usage, because keep in mind that all these releases are pointing to it. It can't know about a release. It shouldn't know about a release. But the release knows about it, right? I call this data intrinsic because it's factual, true for that component regardless of how it's used. This is the data that goes into the catalog, and it may get extended over time as long as it doesn't violate the rule that it's independent of a release.
The SBOM portal, however, whatever that means to you, is to me where you store extrinsic, release-dependent data. That clearly includes whether you modified it or patched it; I want to keep that separate. Keeping track of the relationships between all the parts and components within a release is complicated.
So that's why that kind of portal is going to be useful. Clearly, whether it's linked to other things like proprietary code or not, right? Dependencies. The locations where it's used within the product. We'll have three different instances of zlib used in a very complicated product. What happens if there's a problem with zlib, that particular version of it? You want to know where it is, in all the different places. A lot of times people want to keep zlib as one instance in the list. We want to know that there are three instances, and we want to know every location within the product. Obviously, the number of instances will be kept. Who disclosed it. The toolchain used; you can start to record that information. The files used in the build as well. And dynamic dependencies: these are simply things that get pulled in during the build, as opposed to things that are hardwired in. I just want to make it clear that we maintain a really important separation between these two sets of data, and it was really important for us to maintain that as part of our architecture.
Okay. What I'm highlighting here is that the parts catalog clearly has a potential role for other things in your organization, independent of the SBOM portal, right? For example, the build pipeline may want access to a component and its data. The security team might have its own set of tools that want access to the catalog. So it shouldn't be hard-wired, sewn tightly into the SBOM portal, because this is a universal, central source of truth within your organization.
Okay. Now, one thing I want to point out: although I showed you a UI for the catalog, that's not the primary way of accessing it. The main way of accessing it is through GraphQL, a very well-designed technology for querying data, instead of a REST API. You can access data very easily, and most of the activity on the database happens through the GraphQL API. We have a CLI as well.
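The separation between intrinsic catalog data and extrinsic, release-dependent data discussed above can be sketched as two record types, with the reference pointing only from the release side to the catalog. The class names, fields, release names and the short code `"c0ffee"` are hypothetical, chosen just to mirror the zlib example; this is not the portal's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogPart:
    """Intrinsic data: true for the part no matter which release uses it."""
    fvc: str
    name: str
    version: str


@dataclass(frozen=True)
class Usage:
    """Extrinsic data: one instance of a part inside one release."""
    release: str
    part_fvc: str        # the release points at the catalog, never the reverse
    location: str
    disclosed_by: str
    patched: bool = False


zlib = CatalogPart(fvc="c0ffee", name="zlib", version="1.2.13")

# Release-side records: three instances in one release, one in another.
usages = [
    Usage("product-5.0", "c0ffee", "rootfs/usr/lib", "team-a"),
    Usage("product-5.0", "c0ffee", "containers/client", "team-b"),
    Usage("product-5.0", "c0ffee", "containers/server", "team-b", patched=True),
    Usage("product-4.2", "c0ffee", "rootfs/usr/lib", "team-a"),
]


def locations(release: str, fvc: str) -> list[str]:
    """Every place a given part appears within one release."""
    return [u.location for u in usages
            if u.release == release and u.part_fvc == fvc]


affected = locations("product-5.0", zlib.fvc)
```

`CatalogPart` carries no release, location or patch information, so it stays valid for every release that references it, while the question "where are all three instances of zlib?" is answered entirely from the release-side records.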
And as I mentioned, we have a UI. Okay. Finally, I'll talk about the core technologies we used, and I'll start with the data model. Someone was asking about the file level. I think he left, but we actually do store everything. Every single file is stored for a component, and we store profile information and everything. We treat every file as a first-class citizen, because if we learn something about that file, we can store it there, and everything that inherits that file benefits from it. Okay. The thing to understand here is that the file is the most atomic unit, but at the part level we also have profiles. This data model allows you to support, as I said, an eclectic set of parts by simply setting the type field. When I say we support multiple aliases, I mean there are a lot of different ways people want to refer to a package, and we want to make sure we can capture all those aliases. We can store those as well. Those aren't necessarily identifiers; they're just nicknames for things, right? We see identifiers or locators as a special class of stuff. And finally, again, profiles exist at both the file level and the part level.
To implement this, we chose Postgres. It's a fantastic database. The one benefit I'll call out here, due to time, is that we chose it largely because it supports structured SQL data, which is how a lot of the part data is stored, but we can also store JSON objects, or JSON documents if you want to call them that. That gives us the profiles, and we can extend them very easily. It's very dynamic that way, and I thought this was a huge benefit for us.
Okay. I'll go through these quickly, because you can read the slides on some of these benefits and why we chose them. Go is clearly growing in popularity. It's a fantastic backend language. There's the concurrency support; it's basically a better C or C++, highly readable, strongly and statically type-checked and so forth.
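The hybrid storage idea described above, structured columns for part data plus a JSON document for profiles, together with the type field and subpart drill-down, can be sketched with Python's built-in `sqlite3` module. SQLite is swapped in here purely so the example runs anywhere; the talk's system uses Postgres, where the profile column would typically be a JSON or JSONB type. The table names, column names, type strings and short identifiers are all assumptions made for illustration.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE part (
        fvc      TEXT PRIMARY KEY,            -- content-based identifier
        name     TEXT NOT NULL,
        type     TEXT NOT NULL,               -- e.g. logical, archive
        profiles TEXT NOT NULL DEFAULT '{}'   -- JSON document per part
    );
    CREATE TABLE subpart (
        parent TEXT NOT NULL REFERENCES part(fvc),
        child  TEXT NOT NULL REFERENCES part(fvc)
    );
""")

# Hypothetical sample data loosely following the OTA demo hierarchy.
parts = [
    ("p1", "ota-system", "logical", {}),
    ("p2", "ota-client", "container-image", {}),
    ("p3", "ota-client-source", "container-source", {}),
    ("p4", "busybox", "archive",
     {"licensing": {"top_level_license": "GPL-2.0-only"}}),
]
db.executemany("INSERT INTO part VALUES (?, ?, ?, ?)",
               [(f, n, t, json.dumps(p)) for f, n, t, p in parts])
db.executemany("INSERT INTO subpart VALUES (?, ?)",
               [("p1", "p2"), ("p2", "p3"), ("p3", "p4")])

# Drill down from the logical product to everything beneath it.
rows = db.execute("""
    WITH RECURSIVE tree(fvc) AS (
        SELECT child FROM subpart WHERE parent = ?
        UNION
        SELECT s.child FROM subpart s JOIN tree t ON s.parent = t.fvc
    )
    SELECT p.name, p.type, p.profiles FROM part p JOIN tree USING (fvc)
    ORDER BY p.name
""", ("p1",)).fetchall()

names = [name for name, _type, _profiles in rows]
busybox_profiles = json.loads(next(r[2] for r in rows if r[0] == "busybox"))
```

The relational columns handle identity, type and hierarchy, while the profile document can gain new fields or whole new disciplines without a schema migration.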
GraphQL has been a blessing, because if we had to do this through a REST API, accessing data would be much more rigid. REST APIs have their benefits. I'll say one drawback of GraphQL is that if you want to change data, it's not as good; it's a little more awkward to call and use. But if you're mostly accessing data and you want to create endpoints, GraphQL is a fantastic way of creating queries and accessing the data over the Internet.
Okay. Vue.js. I know a lot of people like React and Angular. Those are fantastic UI frameworks. We looked at those, and we looked at Vue. Vue is the up-and-coming one. I think it learned a lot from Angular and React. It's clean, it's elegant and it's simpler. It performs really well. I don't know if you're familiar with UI stuff; if you're not, that's not your thing. But this is one of the big three, the youngest one on the block, and we found it to be really mature for our needs and a really good choice. Now, if you look at market share, you'd see React is at 40%, Angular is at 23 and Vue is only at 19. I would say Angular has kind of stagnated and Vue is on the way up.
Okay. Finally, we put this out. We took this internal thing and said, all right, let's give it out and see if we can grow it with a community. We just opened it up. It's under Apache 2. You can go to the GitHub link if you want. Things to think about when it comes to contributing are, you know, the basic stuff that we need to do better at. Identity and access management. Whipping up a quick and dirty SBOM portal, so someone can have a free one built on top of it. We have an internal one; I'm not sure that one can easily be given out, but people can build their own and share it. Adding additional standard profiles as they come along.
One of the things we want to do is allow people to easily create user-defined profiles, with the database doing a little schema type checking. Extending the API. Advanced part searching is going to be really important. And extending the CLI.
So in summary, as I mentioned in the beginning, we are preparing for this onslaught of requests for SBOMs that are going to be so varied, and we're going to be slammed if we don't do something now. What we did was redesign how we store components and then start building the system out towards the formats, as opposed to starting with the formats and going backwards. Okay? We recommend starting there. It doesn't have to be this catalog. You can create your own catalog, but think about it and get that piece right. I highly encourage you to keep the separation between intrinsic data and extrinsic data. That's one thing we've learned that has been a real benefit. It's important to have a single source of truth in your company, a place where every single thing must go. Even if everybody else wants to replicate it somewhere else, who cares? But you need a place where all the components that go into your products can be stored, so that you can manage them much more easily and build much more scalable solutions. As I explained throughout the presentation, we support this notion of varying things that can be parts, and finally we talked about the metadata and the technologies used. Okay. With that, I'll take questions.
Yes? You mentioned the different profiles, including a crypto profile. I will come to steal that from you for the export compliance working group.
Sure. I mean, it's just a list of the cryptography algorithms found within a part. Yeah. Mm-hmm. Sure. And yes. I don't know. Any other questions?
I also found that terminology is really difficult to align on.
In the comparison you had between extrinsic and intrinsic data, you had hardwired dependencies versus dynamic dependencies. Can you explain that a little bit?
Sure. When you build a piece of software, you can pull libraries into your code base and fix them there. They're not going to change on you, right? Those are considered hardwired. But if you have something that's determined dynamically at build time and it can change, especially if you use the word "latest" and you don't know what you're going to get, that's dynamic, right? So you would only store in the catalog the things you absolutely know are absolute truths. If you know that a particular third-party component was put in a subdirectory, I think in Go they call it a vendor directory, and it's embedded inside your component, then it's intrinsic to the component. It can't change; it's factually what it is, so it would be considered hardwired. But if you're determining the dependency dynamically at build time, and that can change every time you do a build, that would be dynamic. And that would be extrinsic.
So ranges of major versions or ranges of minor versions would be...
Yes. By the way, we consider every version...
Independent.
Independent. Different. They're different parts.
Okay, thank you.
Any other questions? Okay, I want to thank you guys for joining me on this late panel. Thank you.