So today we're going to talk about GitBOM. Now, originally this talk was going to be given by Ed Warnicke and Aeva Black, but Ed got iced in in Austin, and Aeva came down with an illness. So we went down the Rolodex of the other OmniBOR people who are present, and I was top of the list. So please excuse me: the slide deck is not one that I wrote, but I will do my best to give the spirit of the talk. I did write a significant portion of two of the six implementations of OmniBOR, so I can definitely answer questions on the technical side. Please raise your hand and interrupt me if you have questions along the way, because it's better to get the questions out as we go rather than let them sit and linger; there will be other people with the same questions. But it's up to you how you want to deal with that.

So first, let's talk about the timeline. We had the Colonial Pipeline hack, which occurred in May 2021; I'm pretty sure everyone here is familiar with it. That ended up causing the cybersecurity executive order to be put out by the Biden administration. But if we go back in time, we can see that much of the work that led up to OmniBOR was the creation of SPDX, which was launched in February 2010. And before that, some of that work came from an early industrial SBOM effort at Cisco, where Cisco had a requirement to know exactly where all of their stuff was coming from, so they could work out where the hardware came from, what software patches went into their switches and routers, and so on. Mistakes were made, but at the same time, much was learned.

So again, back to the timeline. Colonial Pipeline hack, May 6. Look at the window between May 6 and May 12: the administration had already been working on the executive order and was waiting for the opportune time to release it. And this was huge, so the timing was perfect. They launched it and got the support. The first OmniBOR talk, well, at the time it was called GitBOM, occurred May 28. We were actually doing a lot of work before this as well to lead up to it, talking amongst ourselves in various private conversations. We did the community launch in February of last year, so it's slightly under a year old. When we announced it, we had some pretty good outreach that day, and we were pretty happy with that. And the interest has definitely kept up.

One of the things that we did, and Aeva Black was absolutely fantastic with this, was spend six months going around and speaking with various people. She would listen to what people had to say about the space, usually over a cup of tea or coffee or something similar, hence the baby Yoda. And she found that there were issues around how SBOMs were done: how do we actually correlate them to the things that we're working on? The reality is that at small scale it looked OK, but at large scale, when you actually start to scale it out, it looked a little more like this, like trying to trace where everything was. And the general response to that was horror, or frustration, or a mixture of both. And this is all to answer a simple question: am I safe? Is what I am doing safe?
But in order to answer the question of whether you're safe or not, you also have to look at the simplicity of the thing you're doing, because you cannot answer that question if things are too complex. And if things are too complex, there's no way you're going to get everyone on board. Many of us on the project have significant distributed systems experience, so we aim for things that are simple, to keep them reliable, to keep them performant, and so you can actually reason about the security.

I want to make a point here. There's a great talk from, I think, about 15 years ago at Strange Loop by Rich Hickey, who created the Clojure language. The talk was not about Clojure; it was about simple versus easy. Just because something is simple doesn't mean it's easy, and just because something is easy doesn't mean it's simple. Very often you can get lulled into the idea that something is easy, and then work out later, when something breaks, that you now have this massive rat's nest to deal with. We wanted to avoid that. Yes, there may be some areas where the simplicity is at odds with ease of use, but we wanted to make sure that what we were doing is something that can be reasoned about.

And there are different ways to look at the complexity of something. On this slide we have a very simple example of a line graph. You think, OK, I can use two coordinates, x and y, to identify where something is. But if you change your perspective, you actually only need one number: angle yourself along the line, and a single number identifies where a point is. So part of what we were looking at is, how do we change our perspective a little bit in the software supply chain, so that we can see if there's some way to simplify the problem a little further? We started with something that was a bit more complex, and we kept making it smaller and smaller: what actually gets to the heart of what we want to do?

So there are three simple questions that we came across. The first one is: what is the identity of something? By identity, I don't mean the URI or URL of the package, or the human name we give it; what is the canonical identity of what a thing is? Dependencies: what is in the artifact? Metadata: things that are not in the artifact and are not its identity, but are something extra that we've learned about it. So we have to have the ability to attach metadata to whatever it is that we're doing. A really good example of metadata is the license of a project. Who compiled it is metadata. What your image scanner said about something is metadata.

So jumping into identity, we ask: what is a software artifact? An artifact in this scenario is any software object that is of interest. One thing that all software artifacts up to now have in common, and who knows, maybe in the far future this will change, is that they're represented as arrays of bytes. It doesn't matter whether it's source code or object files, JAR files, Python source or compiled .pyc files, Debian packages or RPMs, OCI images: they're all represented as an array of bytes.
So we say that two artifacts are equivalent if and only if the byte array of one is equal to the byte array of the other. This gives us a sense of what a unique artifact ID could be. Of course, it would be terrible to stick the whole application in an SBOM, so we have to do something a little smarter. In general, there are three properties we're looking for. It's canonical: independent parties presented with equivalent artifacts can derive the same identity. It's unique: two non-equivalent artifacts have separate and distinct identities. And it's immutable: once you have the identity of something, it's not going to shift into something else.

Some identity non-solutions that we ran into: you'll often see, and this is actually prevalent in many SBOMs, the idea that the file name is the identity of the thing. Well, the file name is actually metadata. It's not the identity, because the file name can change. I can rename foo.c to bar.c, or I can change what's in foo.c to something else, and now the identity should change. If I move it to a different directory, it changes. Or what if I take that file and stick the entire contents into a database somewhere, where there's no file name at all? So file names aren't quite there. They're good for locating things, which is not the same as defining the identity of something. The same problem exists with URLs. A purl is a little better than just a URL, but it still has the same set of problems: you're looking at the location of something, which is hugely valuable, but it is not the identity of an artifact.

Then there are the minimum elements of an SBOM, which I am actually a fan of. I know a lot of people rail against them, but the reason I'm a fan is that they're very simple. They're something companies can do today that moves the needle, and it sets them up so they can do a little bit more next time, and a little bit more next year, and so on. It gets them to a baseline we can raise over time. That being said, it's not an identity; it doesn't actually provide you one. Especially when you look at kernel versions. Say I have kernel version 5.17.3. I compile it; you compile the same version. We're actually coming up with two different pieces of software, because there may be around 3,000 files that actually get used, and depending on how I configure it and how you configure it, you're going to get two different outcomes. And that's assuming the builds are deterministic as well. So ignore determinism, pretend it's all deterministic, and you still have this big problem, which makes us all sad.

So it turns out we already have an interesting solution for identity. If you look at Git, Git computes an object ID to be the identity, using an algorithm we call the gitoid. A gitoid takes the contents and generates a 20-byte, that is, 160-bit, hash, which is the size of a SHA-1. There is work to move it to SHA-256. I know that violates the immutability property we were talking about before; we actually have a solution for that that we've included in the spec, but we'll get into that later.
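To make the gitoid concrete: it's the same construction Git uses for a blob object ID, a hash over a short "blob" prefix plus the content's length, then the bytes themselves. Here's a minimal Python sketch; the construction itself is how Git documents blob hashing, and the only thing I'm adding is treating the hex digest as the artifact ID.

```python
import hashlib

def gitoid(data: bytes, algo: str = "sha1") -> str:
    """Gitoid of a byte array: hash over "blob <size>", a NUL byte, then
    the content. This is exactly how Git derives a blob's object ID;
    OmniBOR reuses it (SHA-1 today, with SHA-256 as the successor)."""
    h = hashlib.new(algo)
    h.update(b"blob %d\x00" % len(data))
    h.update(data)
    return h.hexdigest()

# Matches `git hash-object` on the same bytes:
assert gitoid(b"hello world\n") == "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"
```

Notice how this delivers the three properties: anyone hashing the same bytes derives the same ID (canonical), different bytes give different IDs (unique), and the ID depends on nothing but the bytes, not names or locations, so it can't drift (immutable).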
In short, there's a solution for that in the spec, which is basically that you have a gitoid, and that gitoid ends up in a file system, which happens to be the Git object repository itself, and everything inside of it is identified by its gitoid.

So what we did with OmniBOR, to really dive into this, is set up identity like this: you have an artifact, and you have an OmniBOR document, and the OmniBOR document says that a specific file, identified by its gitoid, went into the creation of a particular object. So you have the identity of that object and the identities of the things that went into that object. That's the example at the lowest level right here: you have the gitoid of each file, those all get wrapped up, and whatever those things came together to produce ends up with its own gitoid that is unique to itself. That's then represented as part of the dependency graph. The same thing happens as you go further up, where one OmniBOR document is consumed by another OmniBOR document as you get more hierarchy: you say that a particular blob is represented by the OmniBOR gitoid of that particular thing. It gives us a graph. I often call the BOM a bag of cryptographic receipts, because at that point what you have is a tree, or really a DAG, of the identities of the things you used to build that particular system.

So Git is a Merkle tree masquerading as a VCS, a version control system. Does everyone know what a Merkle tree is? Anyone who doesn't? Okay, so a Merkle tree is a special kind of tree built out of hashes: you have a tree with several subtrees underneath it, you take the hashes of those subtrees, and those subtrees themselves have leaf elements which also have their own hashes. The result is that if you change a file, it only changes the hash of that file and of its parents going up to the root. You don't change most of the DAG; the only changes you see in the representation are the changed leaf and its parents up to the root. It's the minimum amount of change you can make; the amount that changes grows logarithmically with the quantity of data you have. So it's very efficient. It's the reason I used to get up in the morning, go to work, hit svn update, go have breakfast, and come back to watch it finish. That was all ruined by Git, because I'd do my git pull, and five seconds later it was done; I can't eat my breakfast that fast. So Merkle trees are the magic behind it, and Git uses the gitoid as the identifier.

So now that we have this dependency graph, we want to ask the question: what is in the artifact? In this scenario, we have the identity of a C file, maybe some headers, and those go through a compiler, which generates an object file. We want to capture that information, and the same on the other side: now you have two object files, and those get linked together to create an executable. It's actually more complex than that, because you also have shared objects that may already be present on the system.
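To make recording one of those compile steps concrete, here's a rough Python sketch. The idea is straight from what I just described: the gitoids of everything the compiler read go, sorted, into a small document, and that document's own gitoid identifies the object file's whole input tree. The file contents are stand-ins, and the exact header and line syntax here are my approximation; the real wire format is what the spec defines.

```python
import hashlib

def gitoid(data: bytes) -> str:
    # SHA-1 over "blob <size>", a NUL byte, then the content (Git's blob ID).
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

# Stand-ins for the inputs to one compile step: a C file and a header.
foo_c  = b"int main(void) { return 0; }\n"
util_h = b"#define VERSION 3\n"

# The OmniBOR document for the resulting object file lists its inputs'
# gitoids in lexical order, so independent builders who saw the same
# bytes produce a byte-identical document. (Syntax approximate.)
entries = sorted("blob " + gitoid(b) for b in (foo_c, util_h))
doc = "gitoid:blob:sha1\n" + "".join(e + "\n" for e in entries)

# The document's own gitoid pins down the whole input tree: change any
# input byte anywhere below, and this ID changes too, Merkle-tree style.
object_file_input_id = gitoid(doc.encode())
```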
So when you run an application, it's not just the executable image; it's also all of the additional dependencies in the environment that get pulled in through your operating system, and those have their own set of things. Each of these can be represented by an OmniBOR document, and from the one at the top you can then generate something that gives you a little more information. So it's not just for things that are static, at rest; you can also use it to represent what happens when you run. It also works in other environments. In Java, everything is linked dynamically; there's no concept of statically compiling things once you cross class-file boundaries. So the same pattern works for Java and other systems.

So we took this and generalized it: this executable goes back to what you saw before with OmniBOR; we generalized that particular path. In short, we're able to use those gitoids as artifact IDs to represent the whole set of dependencies that come along with an artifact. Going back to this, just to repeat: this particular artifact has two entries. You see a blob with its gitoid attached, and then you have artifact three, which has its own OmniBOR document attached as well, all in lexical order. That way you get a canonicalized view of it. You're not guessing about what order you saw things in; everything is explicitly in lexical order.

So this brings us to the question of metadata: what is known about an artifact? There's a great comment from Jeff, another OmniBOR community member and contributor: an SBOM is a format for organizing metadata that describes the makeup of software artifacts. In other words, OmniBOR is specifically about the identity of things; it's not about how you store the metadata. But we still have this metadata to deal with. So we have dependencies, all the stuff in orange, which is within OmniBOR, and we have all the purple stuff on the side, which is metadata. The pattern we're looking at is that the metadata can use the gitoids in the orange boxes to decorate the tree. You'll have separate databases: one of them might be, I ran an image scanner; another might be, I ran something that checks licenses; a third might be who compiled it and gave it to me. So you have this metadata that sits off to the side. And that metadata can be dynamic: I run an image scanner today, and I run it again six months from now; the output is probably going to change, unless it's a very simple program. In short, we want to keep this separate, but use the OmniBOR graph as the identifier to work out what the metadata applies to.

Getting back to the example from Jeff: the OmniBOR stuff is on the left, and the SBOM stuff is on the right. We've been working with the various SBOM vendors; for example, in SPDX 2.3, I believe, they added support for gitoids, which gives the ability to reference the OmniBOR graph. That was a direct collaboration we had with them. So in that scenario, we're looking at what is known, in terms of metadata, but it's not enough to know what is known. It's also important to know how it is known.
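One more sketch before we get to trust. The decoration pattern I just described is really nothing more than side tables keyed by gitoid, sitting next to the OmniBOR graph; every name and value below is made up purely for illustration.

```python
# Hypothetical side tables: the OmniBOR graph holds only identities,
# and each metadata source decorates those identities from the outside.
scan_results = {  # from an image scanner; re-running later may change this
    "3b18e512dba79e4c8300dd08aeb37f8e728b8dad":
        {"scanner": "example-scanner", "run": "2023-02-01", "findings": []},
}
license_info = {  # from a license checker, a completely separate database
    "3b18e512dba79e4c8300dd08aeb37f8e728b8dad": "Apache-2.0",
}

def describe(oid: str) -> dict:
    """Join everything currently known about one artifact identity."""
    return {
        "gitoid": oid,
        "scan": scan_results.get(oid),
        "license": license_info.get(oid),
    }
```

The point of the shape is that each table can live in a different place, be owned by a different party, and change on its own schedule, while the gitoid stays the stable join key.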
And more specifically, if you can understand how something is known, the question then becomes: can you trust that information? This particular section was written by Aeva, and Aeva is a major voice on trust, so I'll paraphrase as best I can, since this was her part of the slides. To paraphrase: when you look at trust, you have to look at what it is you're trusting. Are you trusting the source code that's going into it, or what something says about itself? Are you trusting it because I gave it to you? Are you trusting it because my signature is on it? Where do we really root that trust? Signatures are a great example, because something could be signed, but how do you know what was behind that signature, what was actually signed, what was in it? Signatures give you an important piece of information, but they're also very limited in the total quantity of information they can give you. Which brings us back to our sad current state. By the way, the first time this image was used, the contrast was way off, and it looked like a film noir version of Kermit. It was fantastic.

So when you ask the question of what trust is, this can actually help us answer it. Dorothy Denning wrote about the Orange Book back in the early 90s. The Orange Book said you can build a system that you can categorically trust: you can trust it because trust is a property of how it was built. And Denning came forward and said that trust is not a property, but rather an assessment based on experience, a declaration made by an observer, not a property of what is being observed. This is super important, because just because I give you something and say, oh, I used all these secure development practices, so you can trust it, you can't actually trust it simply because I followed a set of patterns. What you have to do is observe what's around it. So part of the idea is: how do we give you that metadata? How do we make sure we can link everything together and pull that information from the places that are best able to give it to you?

Mike Bursell also wrote a book on trust in computer systems, which Aeva was an editor on. Trust is time dependent: I trust my image scanner today, but am I going to trust its output a year from now? Absolutely not. Trust is asymmetrical: the best way I can describe this is that the relationship you have with your bank is not the same relationship your bank has with you. And trust is contextual: when you go into your bank, you trust them for a different set of services than, say, a doctor you're seeing for surgery. The same goes for computer systems: trust is time dependent, asymmetrical, and contextual.

People often put their trust in the build tools. But look at what a build tool actually does: a build tool transforms your inputs into some output. You can scan the inputs; this is where you scan the source code. But that doesn't necessarily explain the output. It gives you a direction for what the output can be, and it can definitely constrain it, but you don't know exactly what it is. So how about scanning the output?
So how many people here can say what kind of pie this is? Is it an apple pie, or is it pears? For me, if it has coconut in it, it's probably going to have a flavor I won't like. I've got a friend who, if it's an apple pie, will need an EpiPen. So knowing what's inside of it is super important. The same goes for software: you have to know what's inside of it. A scanner gives you some information, but it's not going to be absolutely perfect. So what you want to do is drive toward something that's a little more accurate in terms of what we can trust. And we can actually trust the build tools, because the build tools have the information: they have what went into the build. By build tools, I mean the compiler itself, not Jenkins, not Maven, not the Makefile. I specifically mean the tool that is actually doing the transformation. So we can trust the build tools to give us more accurate information. Not to say it's 100% perfect, but it is the best place to find that information and at least get some form of attestation out of it: what C files or Go files went in, what headers, what environment variables, what intermediates, and so on.

So the goal of OmniBOR is to take that information from the build tools themselves and create that artifact DAG, so that we can then decorate the DAG with information. We build that DAG and embed an ID for everything that's there, in a language-heterogeneous environment, regardless of the packaging formats. And the key here is that it has to get to a point where there's no developer effort. You run go build, you run gcc, you run cargo build; this is stuff that should just work. It should just give you that information, and you can start decorating the trees.

And that gets us toward answering: am I safe? Part of it is that when you look at what actually went into something, let's say deep down, several dependencies in, you have this Log4j that was injected into your system. You have that information available, because you have the metadata attached to it. You can say these gitoids are known to have that vulnerability, based on when the patch went in and the version where it was first found. So you can look and see whether or not those files were included as part of the build. And in the future, say six months or a year from now, when another vulnerability comes out, the ability to tag those particular gitoids as unsafe is something that would be of huge use to infrastructure and consumers.

So in short, we have multiple implementations under construction. Actually, this is an older deck; we have two others now. We have Go, we have Rust, we have work going on in LLVM and bomsh, and there's also work going on in GCC. We've been working with the compiler communities to get this stuff in, with the expectation that if you're using those build tools, you'll be able to use those annotations to generate the data. I'll make sure when this gets published to publish the new version with the additional two links on there. And with that, you're also welcome to join and get involved. So, well, first I want to thank you all for your time, and I don't know how much time we have left for questions.
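Just to make that Log4j-style query concrete before we go to questions: if you keep an index from each OmniBOR document's gitoid to its entries, the check is a simple walk of the DAG. The index shape here is a hypothetical in-memory stand-in, not an API from any of the implementations.

```python
def is_affected(doc_oid, manifests, bad_gitoids, seen=None):
    """Return True if any known-bad gitoid appears anywhere in the
    input DAG rooted at the OmniBOR document `doc_oid`.

    `manifests` maps a document gitoid to (input_gitoid, sub_doc_or_None)
    pairs; `bad_gitoids` is the set tagged as vulnerable, for example
    the affected Log4j class files.
    """
    seen = set() if seen is None else seen
    if doc_oid in seen:          # shared subtrees: visit each document once
        return False
    seen.add(doc_oid)
    for input_oid, sub_doc in manifests.get(doc_oid, []):
        if input_oid in bad_gitoids:
            return True
        if sub_doc and is_affected(sub_doc, manifests, bad_gitoids, seen):
            return True
    return False
```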
So does anyone know if we have time for questions? Seven minutes. Cool, so are there any questions?

Okay, that's a good question: what about non-compiled languages? So for non-compiled languages, we can take the OmniBOR documents, or the gitoids of the files themselves, put them together to say what a package is, and analyze that. So we still get some of the benefits: I have a file with a known vulnerability that I maybe pulled from another project, or a dependency I compile in, and I'm able to tell that. But things like Python and Ruby and similar are a little bit more difficult, and you have to work them in. We've not worked with those particular communities just yet, because we've been focusing primarily on the compiled languages, but we definitely see ways to tie in. Maybe when you load a Python file or a Ruby file, you could check its gitoid against a list and decide whether or not you want to accept it. That is work that has not been done yet, so it's an area of interest; we'd definitely appreciate some help there.

The next question is whether we intend to capture the gitoids of the build tools themselves. We're having some discussion on this, because the build tools do affect the output, and whether you want to capture that or not is sort of up in the air. There is integration we've been looking at with a project called in-toto, which captures what those build tools are. And I have an in-toto implementation that I've been working with to capture the build tools as gitoids, so we can get that information out.

So the question is: how do we establish trust in the information that we captured? First, in terms of the inputs and outputs: again, this is not designed to solve all aspects of the problem; we're trying to solve a very specific problem, and we need help from other tools to capture the rest. I mentioned in-toto specifically because many of the in-toto systems are designed so that you can capture not only what process was run, in other words, I ran this transformer or preprocessor, I ran this compiler, I ran this test, and this is the group that built it; that gets signed and stuck into maybe a Sigstore or something similar. But there's also the question of whether you can trust the tool or not. There are a couple of things here. One of them is that when you're building out your CI/CD system, you do want to keep it under control so that only a limited set of people have access to it. But simultaneously, some of the in-toto tools, one of the ones I work on is a project called Witness, have the ability to hook in and analyze what files the compiler opened, and capture metadata on some of that, or on other system calls, like making a connection out to a network somewhere. So there are things there that help with some of that observability. But at the end of the day, if it's about whether you should run a particular piece of software or not, you still have to root that trust in something, and that something is: do you trust the agent that built it, and do you trust the agent to capture that information effectively? So you still have that issue of build trust. It's not designed to solve all the issues up and down the space, but it is designed to give you additional information so you can make better decisions.
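That load-time check for interpreted languages isn't much code, by the way. Here's the shape of it, with the caveat from the answer above: this hook is hypothetical, and nothing like it exists in the tooling yet.

```python
import hashlib

def gitoid(data: bytes) -> str:
    # SHA-1 over "blob <size>", a NUL byte, then the content (Git's blob ID).
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

def check_before_load(path: str, approved: set) -> None:
    """Hypothetical gate for a Python/Ruby-style loader: compute the
    file's gitoid and refuse to load it unless it's on an approved list."""
    with open(path, "rb") as f:
        oid = gitoid(f.read())
    if oid not in approved:
        raise RuntimeError(f"refusing to load {path}: gitoid {oid} not approved")
```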
So are there any other questions? Fantastic. Well, thank you all for your time, and I'll be around, so if you have any questions you want to ask me in person while I'm here, I'd be more than happy to answer.