Okay, thank you very much for coming to what is more or less the last session in this room today. We're going to be talking about how to take an SBOM and use it for AI applications, because there are a few differences we need to account for. The SBOM today is quite good, but when you build an AI application there are some additional fields we need to add. We've been surveying experts in both the SBOM area and the AI area to make sure we have coverage. We'd love for you to come and join our group; we'll talk about that at the end of the talk as well.

So today: I'm Karen Bennett, and this is Kate. I think everybody knows Kate from SBOMs. As for my background, I used to be part of Red Hat, so I'm very familiar with open source, but for the last 10 years I've been building AI applications. Kate and I have worked together in the past, and I approached her to say: hey, your SBOMs are great, but they don't really cover all the things I would like to capture when documenting our applications. This talk is the result of that.

The world is changing: AI is pretty mainstream now and coming online in a lot of applications, so it's fully in your supply chain. I know there have been a lot of good talks here about cybersecurity in the supply chain.

So why does it matter? AI is growing big time. When I started out in the AI space, people were doing it mostly in research and prototyping, but now it's mainstream. The big companies are using it. I work with a lot of banks up in Canada, and they're using it for cybersecurity and fraud detection.
I also work with companies like Amazon on recommendation engines, but my primary focus is self-driving cars. I work with three of the five manufacturers: Ford, GM, and Mercedes-Benz. People keep saying it's not here yet, but it's coming pretty fast. And one of the problems, especially if you're combining open source parts with proprietary parts, is that having an SBOM, or some way to document what you've got, is going to become very important for our AI applications.

According to some of the statistics out there, and I've heard a lot about cybersecurity in the last couple of days, 96% of respondents in one cybersecurity survey are using AI and machine learning today. So it's already integrated into these applications, and we need to figure out how to document them. And with that, I'll pass it over to Kate.

Okay, so I'm just going to give a quick intro; everyone roughly raised their hands before the session started about knowing SBOMs. What I mean when I talk about an SBOM is a formal record containing the details of, and supply chain relationships between, the components used in building software. Components could be source files, they could be libraries, they could be packages, they could be full distros. All of these are valid components, and they could be proprietary or they could be free; this is all part of our software supply chain today. Some of the work that happened at the NTIA was to define a basic minimal supply chain view. There are things that are known and things that are unknown, and being very clear and explicit about them matters: do I depend on this? Do I include it? Do I know everything that's coming in, or are there things I just don't know? Being able to articulate that is at the heart of what a minimum SBOM is today.
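As an illustrative sketch of that "minimum SBOM" idea, the check below captures the spirit of "what do I know and what don't I know" for one component. The field names are our own shorthand for the NTIA minimum elements, not an official schema.

```python
# Illustrative sketch: check a component record against the NTIA
# "minimum elements" for an SBOM. The field names are shorthand,
# not an official schema.
NTIA_MINIMUM = {
    "supplier", "component_name", "version",
    "unique_id", "dependency_relationships", "author", "timestamp",
}

def missing_minimum_fields(component: dict) -> set:
    """Return which NTIA minimum elements are absent or empty."""
    return {f for f in NTIA_MINIMUM if not component.get(f)}

record = {
    "supplier": "Example Corp",
    "component_name": "libfoo",
    "version": "1.2.3",
    "unique_id": "pkg:generic/libfoo@1.2.3",
    "dependency_relationships": ["libbar"],
    "author": "sbom-tool",
    "timestamp": "2023-02-05T10:00:00Z",
}
print(missing_minimum_fields(record))  # prints set() when all elements are present
```

The point of a gap set rather than a pass/fail flag is exactly the "known unknowns" idea above: you can state explicitly which pieces you do not know.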
So we're finding that transparency is going to be key to improving both software supply chain security and AI explainability, and we need to be able to document it effectively. Last year the Linux Foundation did a survey, and two of the main recommendations that came out of it were: first, make sure you've got a vulnerability reporting system so you can track your vulnerabilities; and second, use SBOMs. The survey was global in nature, and this is the advice that came back on how we need to move forward. As part of it, we asked who is using SBOMs today: 47% are using them now. I'll talk a little later about the challenge with the definition of what an SBOM is, because people are still talking past each other a bit. And based on that survey, 78% said they'd be using them this year. So it's coming fast, and SBOMs will be there. However, there's a missing piece: the AI aspects of this data that we have to be able to document. So applying the SBOM to AI applications and training is one of our challenges.

Yeah. For those who have written AI applications, you're probably already aware of this. The SBOM today is very good for the components in the colored circles on the slide. But an AI application needs to be trained, and the data is really key to whether your application makes good predictions. So we're going to need a way to say: how did you clean the data? Did you take the empty cells out of the spreadsheet? Did you get rid of duplicate records? I heard a couple of people asking questions about single identity. If you can clean these things up, your AI application's accuracy will improve hugely.

The next point is about data again. Some percentage of the data you document is used for training purposes only.
Then you have data that you use for testing, and then you have the live data out in the world. So there are three levels that our SBOMs are going to have to be able to document.

The other point is the training of the model. Before you can go to production, you need to train your model with data; the next slide shows this better. For the most part, I've written supervised machine learning apps to date, although we're starting to experiment with unsupervised learning, where the computer is actually learning by itself: it no longer needs a human telling it what data to look at. This will be a challenge for SBOMs, in that we're going to have to figure out how to document the data an AI application is actually using.

The other thing with data is the whole idea of consent. I'm sure folks have seen the many articles along the lines of "they didn't have consent to use my data." Right now in the world you have different views between what China considers consent, what the EU does, and what North America does. Think of surveillance video or facial recognition: people are collecting your data, and you don't even get the opportunity to say whether you gave consent for that use. How we document that in an SBOM will need to be figured out.

Then the other couple of bullets. With an AI application, the data is constantly changing. As I said, in the self-driving cars we actually have sensors within the cab of the car monitoring your biometrics, your face, your breathing, the temperature of the car. This is constantly changing, so there's a feedback loop: you had data that you started the training with, but it shifts over time. They call it, as I'm sure many of you know, data drift.
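The cleaning and splitting steps just described can be sketched in a few lines. This is a minimal illustration with made-up sensor records, not the speakers' pipeline: drop rows with empty cells, drop duplicates, then carve out the training and testing subsets an SBOM would document.

```python
import random

# Illustrative sketch: the cleaning steps described in the talk
# (drop records with empty cells, drop duplicates), then split the
# cleaned data into training / testing subsets.
def clean(records):
    seen, cleaned = set(), []
    for rec in records:
        if any(v in (None, "") for v in rec.values()):
            continue  # drop rows with empty cells
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # drop duplicate records
        seen.add(key)
        cleaned.append(rec)
    return cleaned

def split(records, train_fraction=0.8, seed=42):
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]  # (train, test)

raw = [
    {"speed": 40, "temp": 21}, {"speed": 40, "temp": 21},  # duplicate
    {"speed": 55, "temp": None},                            # empty cell
    {"speed": 60, "temp": 19}, {"speed": 30, "temp": 23},
    {"speed": 45, "temp": 20},
]
train, test = split(clean(raw))
print(len(clean(raw)), len(train), len(test))  # prints: 4 3 1
```

The third level, live data, only exists once the system is deployed, which is part of why the deployed SBOM discussed later matters so much for AI.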
So at what point do you flag the data drift and document it? Again, it's a new challenge for SBOMs. And the last one: models need to be tested. In the application we've been building, you deploy your model along with the data it uses, but when the data changes, we have a mechanism that can actually change the model itself. So the code is constantly live. That's another concept I don't think SBOMs today take into consideration. Just some things to think about.

[Audience question.] Let me just repeat that, because my instructions are to do so for the folks joining virtually. The question is: the EU has criteria and definitions for models, with standards for high-risk, low-risk, and critical-risk; would that be part of the SBOM? As it turns out, and we're going to show you, that was one field that was flagged as something we'll have to include, especially because the EU has already made good inroads in that direction. Once you have to flag whether your app is high-risk or low-risk according to some definition, it would make sense to start with the EU definitions. So yes. And as you're probably aware, facial recognition is critical-risk; there are some people in the world who think those types of applications should be banned. But at a minimum, you need to be documenting that you're a high-risk application.

[Another audience question, largely inaudible: about a system that keeps learning after you buy the product.] Great. Next time I'm going to hand the mic to the person next to me. That's a great question. You're right that reinforcement learning is constantly learning.
In my opinion, SBOMs have to document the ingredients of your product, but something also has to tell you the impact of your application, whether you're high-risk or low-risk. I personally would like to see it included in the SBOM, so there's only one place you have to go to understand. It might be, as Kate will talk about in a minute, that there's the build SBOM and then the deployed SBOM, so there could be different fields depending on which phase of development you're in.

So, I work pretty heavily with IEEE on a lot of their AI ethics standards, as well as with ISO, and I work with the Chinese AI group, trying to understand the rules we should be applying when building an AI application. Interestingly enough, I got tired of counting, but there are over 2,000 standards out there. I don't know how anybody is following all of these. And, as I've heard Kate repeat in a couple of talks, they're not free; the reason I joined IEEE was so I could get access to them. The beauty I've seen with Linux and with the SBOM is that it's an ISO standard, and they have figured out how to work the system so it can be free. Longer term, if we really want developers to follow these standards, we're going to have to figure out how to open up this information so that people truly understand what the rules are. And interestingly, the IEEE P7000 group that I'm heavily involved with, which covers AI ethics, transparency, bias, and emotional data, is putting in a rule that you have to document the ingredients of your product. So that requirement is coming. A lot of these standards are still in development; they haven't hit the street quite yet, but I would encourage you to come and join some of these groups to help mold them.
The other thing I've seen, because I go into different companies to help out, is that everybody has their own way of documenting: there are datasheets, there are scorecards, there are fact sheets. What the SBOM has the opportunity to do for an AI application is unify this, so that we're all comparing apples to apples, instead of "oh, on a datasheet I specified this." Having been a developer: if I can't have just one place to go document, I probably won't document for you; I'd let somebody else document the product itself.

With the lifecycle of a machine learning app, there are more phases to how development is done, and the differences are primarily around the data. That monitoring, once you've deployed, is something AI developers need to account for. In fact, I've been talking to a number of the tools groups downstairs, trying to figure out how we can more automatically monitor when data crosses some threshold and flag it to a developer. Personally, I don't think it can be fully automated without a human yet, but at least have a human involved, or have an alert of some sort go out when it crosses some boundary.

And for those who build AI, this comes from an IBM paper, but you have the data, the model, and then there's actually code associated with an AI application. It's the mixture of all three components. So when we go through the SBOM examples, you'll see that we need at least one package associated with each of these areas. For those who have built AI applications, you're probably very familiar: you have your data, you have your Python code, and, for the most part, we use a percentage of the data to train our model.
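The threshold-based monitoring just described, an alert when live data drifts past some boundary, can be sketched very simply. This is our own simplification for illustration, not a production drift monitor or anything the speakers' tools do.

```python
import statistics

# Illustrative drift check: compare the mean of live data against the
# training baseline and alert a human when it crosses a threshold.
def drift_alert(training_values, live_values, threshold=2.0):
    """Flag drift when the live mean sits more than `threshold`
    training standard deviations away from the training mean."""
    mean = statistics.mean(training_values)
    stdev = statistics.stdev(training_values)
    live_mean = statistics.mean(live_values)
    z = abs(live_mean - mean) / stdev
    return z > threshold  # True -> notify a developer for review

baseline = [20.0, 21.0, 19.5, 20.5, 20.0, 21.5]  # e.g. cabin temperature
print(drift_alert(baseline, [20.2, 20.8, 19.9]))  # similar data: False
print(drift_alert(baseline, [35.0, 36.0, 34.5]))  # shifted data: True
```

Note that the alert only routes the decision to a human, matching the point above that this cannot yet be fully automated.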
So, say, 20% of the facial recognition data from the cab of the car is actually used to train the model, and then you come out with the objects that are created to become the product itself.

Okay, so again, standards. I talked a bit about this, but the standards associated with AI right now come in groupings: what's ethical, and what are the rules where you pay a penalty or go to jail. There isn't so much on that second side yet. I think everybody in this room is familiar with GDPR and its fines; that's definitely the law. But the standards space gets even more complicated. You have your global standards, like the SBOM. Then you have your regional ones, where Europe is quite different from, say, Canada, where I'm from; our data privacy laws and guidelines are quite different from those in the States. And then you have what I call domain-level standards. Being in the car industry, you have to obey the traffic rules: if a red light says stop, that self-driving car has to stop. It actually has rules it has to abide by. So figuring out which standards you have to adhere to and comply with is going to get murkier as we produce products in this area. The SBOM will be a great aid for many of us as developers to figure out which standards you are following; I can look at your SBOM and say, hey, you're not following GDPR and you're going to ship in Europe. So that is another field we're proposing: you document which standards you're following.

Okay, so one of the things we're going to go through is an open source project we picked off the internet. MIT has a simulator that models self-driving cars, and we used the Waymo datasets that are out there, as an example of how we might document an SBOM.
One of the exercises we're doing with the working group right now, and Kate will go through it, is that we have a list of fields we think are good enough to be the minimum viable documentation you would do as part of an SBOM for an AI app, and then a longer list of fields we think you should document but that are optional. We'll go through examples of both. At a top level, with the MIT simulator, I wasn't actually familiar with it, because the car manufacturers mostly have their own proprietary systems. One thing that will be debatable is whether you state what type of algorithm the model uses; decision tree regression is what the MIT simulator uses. Some folks think that's proprietary, but the good news with SBOMs is that you have NOASSERTION: if I don't want to document it, I don't have to. On the flip side, for internal documentation I probably would want to document it, so that my new hires are at least familiar with all the ingredients of the product. And then with Waymo, we'll talk about it in a minute, but they have two data files, so we're going to propose that there's an SBOM package, and then you can go down to the file level to know the specifics of what's in each file. The licensing for Waymo, though, is something I have never seen before, so we'll talk about that in a minute.

So, talking about adding this type of AI information to our SBOMs: one of the key things to keep in mind is that there's a software lifecycle and there's a data lifecycle. Within the software lifecycle, we have different types of SBOMs occurring out in the wild, and people talk past each other all the time. By being explicit about the type of SBOM, we can tease apart some of the misunderstanding and make the conversations more productive.
In the security space, for a large part, they're talking about a build SBOM, where they're basically looking at the dependencies that have been built together. But there is also a source SBOM, which covers the sources, and the build SBOM depends on the source SBOM to some extent, because you need to know your sources, what made it into your binary, and what's in your final executable image. Also out there is the analyzed SBOM, from binary analysis, where someone is handed a random blob from a third party and tries to figure out: are there risks in using it? What are its dependencies? It's not necessarily the same sort of thing that would come out of a build SBOM; you don't have that information, so you use heuristics to try to find it. A build SBOM therefore carries a lot more confidence than an analyzed one, for instance. And then, quite frankly, there's the deployed SBOM: what configuration have you deployed your system with? When we're talking about those loops of learning with the data, that's all going to happen, and potentially be documented, in a deployed SBOM, which may refer back to build SBOMs and to data that has changed over time. These are the working definitions we're trying to get clarified with some consensus in the industry. I think it's important to understand which part of the software lifecycle an SBOM comes from and what the expectations around it are, because there's different tooling in all these places as well. That's a factor, and there will have to be more tooling when we start adding in the data and AI learning sides of it. Most of that, I think, is going to land in the deployed SBOM, and that's probably the least well-defined part of this lifecycle today.
So we have an opportunity to build data in right from the start there too. When we talk about building up the model, we'll have a source package, and this is effectively a source SBOM: there are source files, they may depend on libraries, they may be linked. A source SBOM for the Linux kernel would be all the files in the Linux kernel. Here, the model image would be generated from some set of sources. As you build the model into an executable, you'll have a build SBOM that records what it was generated from and what it depends on from runtime libraries; some may be static, some dynamic. Then, as you build up an AI application, you've got your source SBOM for the application and your executable, which was generated from those sources and may depend on that trained model to really run. So these different types of SBOMs have potentially different types of information associated with them, but the structures all work. If you're building up your model, there's some source behind that model as well. And when you deploy an application, the application is generated from the sources, depends on your trained model, and has a runtime dependency on the runtime data coming into it.

SPDX today can represent most of this. But the trained model and the training data are areas where we need to extend things a little. We have an original model, and we have the notion of training data and validation data going into the trained model, but we don't have the relationships for them. So there will be relationships introduced to let us express these concepts and start to articulate them. Looking at examples will let us figure out what a reasonable set to work from is.
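As a sketch of how these relationships might look in SPDX 2.x tag-value form: the first four relationship types exist in the released spec, while the last one is hypothetical, a placeholder for the kind of training-data relationship the group is discussing, and the element names are made up for illustration.

```text
# Illustrative fragment; package definitions omitted.
Relationship: SPDXRef-TrainedModel GENERATED_FROM SPDXRef-ModelSources
Relationship: SPDXRef-AIApp GENERATED_FROM SPDXRef-AIAppSources
Relationship: SPDXRef-AIApp DEPENDS_ON SPDXRef-TrainedModel
Relationship: SPDXRef-RuntimeData RUNTIME_DEPENDENCY_OF SPDXRef-AIApp

# Hypothetical, not in the released spec:
Relationship: SPDXRef-TrainedModel TRAINED_ON SPDXRef-TrainingDataset
```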
What we're doing in the SPDX AI/ML group is basically a survey of existing sources of documentation, in Karen's sense. Gopi, who's here with us, has been doing a lot of surveying of the industry standards, and we've been pulling this together in our weekly meetings, trying to understand what the common denominators and the variances are. We've taken what these documentation formats are looking for, mapped them to existing SPDX fields, and then looked at where the gaps are: a basic gap analysis exercise. Then we're trying to figure out what's required initially, what's optional, and what we want in our first set going out.

Some of the gaps we've been seeing on the AI/ML and model side, which Karen talked about, include standards compliance. I've already seen Arm automatically document standards in SPDX using the existing extension point, effectively, so they want standards too; that's very likely to go in. Then there are risk assessments, autonomy, limitations, type of model, domain, information about training, information about the application, hyperparameters, metrics, decision thresholds, whether the model is explainable, and energy consumption: all areas where we really don't have a good match in SPDX right now. So we're defining fields to capture this type of information. Not all of it will be mandatory; a lot of it will be optional, just so that if you want to capture it somewhere, you have a place to capture it and share it with someone.

Then, on the dataset side: there are effectively two profiles, an AI/ML one for an application and a dataset profile. With the dataset, there are things like: how was it collected? What's its intended use? The size? What are the noise levels?
What are the things captured in these other documentation mechanisms today? Again, standards are important here. Known biases. Is there sensitive personal information? After a lot of discussion over a couple of meetings, we called that out explicitly. Has any anonymization been used? Are there confidentiality considerations to think about? Suppliers, errata, and the dataset update mechanisms: how do we take a dataset and iterate on it to update it? This is the type of information we need to start working through examples with, and then decide what goes into the first issue and which of these fields we pull into a dataset profile as options.

So what's going on right now: the SPDX community has been working on the 3.0 model for the specification. We've got a formal model underneath it, and we're starting to work on the serializations. You'll see on the bottom here there are various working groups. The legal working group has been focusing on updating the licensing considerations and the fields available for licensing. The defects group is working on security information. The build profile group is working on what information and evidence we want to capture from the build stages of the cycle; that's important for supply chains, because the build tools are part of the supply chain as well. The Japanese teams are working very much on the usage profile: what's important to capture when software is used? When does it go out of service? When does it expire? We actually introduced a few more fields in 2.3 last month.
Basically build date, release date, and valid-until date, so that we can start to capture end of life and risk in policies, and actually use that information for managing the software in our systems and have indicators. Certain open source projects, for instance, will say the community will support a release until some period of time, but after that, who's supporting it? If you hit the end of life from the community's perspective, things may change: you may want to update, you may have to regenerate, but at least you have something in your system to force an inspection or a look. The OpenChain Japan team really wanted those fields, so we brought them in early, and we'll be using them extensively in the AI profile as well.

Then we have our AI working group, which is actually turning into two profiles, including a dataset profile. Gopi is going to be talking about OpenDataology tomorrow, and we're hoping to use some of that work to pilot this, along with, obviously, the AI applications. If there are things people want to represent and don't know how, we can start discussions on other profiles as well, but this is our scope right now. What's ready when 3.0 is ready will go in with the core, and the rest will go into 3.1. So we won't block on these profiles, but we will be working our thoughts in that direction with an attempt to intersect. I'll turn it back over to Karen to talk about Waymo.

Okay, so the Waymo dataset. I documented it manually, because I'm not an SPDX person yet; I think it's a great tool, and I was downstairs talking with lots of people about it. As you document your dataset, these are the key fields, and it's much like what you do today for your software packages, so it's not really too different. The only thing that is different sort of starts from down here.
Although, before I go there: Kate mentioned Gopi is doing a session tomorrow, but one of the things we've realized is that the licenses for data are not that consistent. You have companies coming out with their own special licenses, and you have to go and read them; basically, their dataset cannot be used for commercial use. It brings me back to the days when we were trying to define the LGPL, and it's just like, no, open source means it can be used by everybody, in my opinion. But at least they are flagging the terms of that dataset up front. What Gopi is going to try to do is put together a common list of the data licenses that are out there, so that maybe people will look at those before they define their own.

These are some of the new fields we're talking about, especially in the area of emotional or empathetic data: how you calibrated the sensors, which sensors you used. All of this type of thing needs to come in with your dataset, longer term, to document it. We'll go to the next one, because we are running out of time.
So, if you're interested in getting involved: this is a newly formed group; I think we've been going for less than six months. What we're doing, and Gopi has been the lead on this, is that if you're not able to attend the working group meetings, we're running surveys where we ask: do we have all the fields that make sense to you, or are there gaps? So feel free to participate via the survey, or come to our weekly meeting on Wednesdays. The stage we're at right now is taking use cases and running them through. We have a number of people doing this, and one of the groups is Hugging Face, because they do a lot of open source. Go through the exercise; hopefully we'll get tools coming out soon that include all these fields, but do the exercise: do we have all the fields, and so on. The more use cases the better. So come join us.

Let me also do a little sell on the next slide. As I said, I'm associated with IEEE as well; that's my night job. We got some money from IEEE and the EU Commission, because one of the problems with AI out there, especially in the standards area, is that there's no common definition of what all these words mean. So they've funded a small group of us to do literature searches of the standards in this space and to catalogue all the definitions that are out there, with the hope that longer term we can consolidate and unify on definitions; before we can automate a lot of this, we all need to be talking the same language. This website is focused on what a developer needs to know and do, and the rules for building an AI application. It's quite different if you're regulating or auditing; I myself audit a lot of AI applications. And sort of the claim to fame: I don't know how many folks know
Clearview AI, but a lot of police forces are using it to recognize whether you're a criminal or not: you're crossing the border and it flags you. The first thing I asked was, what's your accuracy? And they told me 6%, and I went, what? Being a software developer, I was like, no, no, that can't be, and they were like, trust us, it's good enough, it gets most of the criminals off the street. Wow. This is a new world we're coming across, with lower standards than software developers are used to for the information you have. So that's where the regulation section comes in: really looking at the landscape and how these applications are policed. With that, we only have a few seconds for questions, sorry.

Q: Is one of the goals of this binary reproducibility of your deployed machine learning? A: Very much so.

Q: What tooling is out there for generating AI-specific SBOMs, since those are a bit different from the others? A: Right now there's not much, because we don't have these fields in the specifications and standards yet. The first step is to agree on the fields; once they're there and agreed on, we can start to encourage tooling to be produced. There's some prototype work going on, but nothing I would point anyone at today, and anyone who wants to contribute tools and help with the prototyping work is more than welcome. Any other questions? Okay, go for it.

Q: You showed several SBOMs for AI; if they need to be represented in a single document, is there a way one can easily, in a machine-readable way, understand which part belongs to which? A: Yes. The SBOM format, at least in SPDX, has a lot of checks for authentication and the ability to link from one document to another; elements in one document can refer to elements in another. Ivana from VMware was talking about her tool; so you've got the tool
that's doing it, and that's as close as it gets right now. It's going to come down to the sources. Like I say, I think it's easy enough to merge them; it's splitting them apart that's going to be the challenge, along with figuring out the logical partitions. However, software is an ecosystem play, right? And you want that granularity of ecosystem information, because it's going to change, potentially at different rates at different times. So having one SBOM for the top level all the time may not be effective. Being able to pull things together when you need to, and, quite frankly, filtering the SBOM in terms of what you export: these types of tools will, I think, be emerging into the software ecosystem, because they'll be needed. I was talking with William Bartholomew at Microsoft, and the Windows SBOM is, I think, already 30 gigs. It's a huge number. So cutting down the size, making it effective, figuring out what's important, how many levels of dependencies you want to track down and represent consistently, whether you want everything or not: there are going to be different purposes, and we're going to need to adjust the tooling for working with these types of artifacts. But you'll get a lot more knowledge out of these artifacts than you would by trying to go into the sources directly and working it that way.

And thank you very much; I got the stop signal.