So I'm going to talk about GitBOM, and the title is Repurposed Purpose because we are using existing technology for a slightly new purpose, and I'll explain why. To start, I might not look like or be your sort of average security expert in this space. I've been doing open source for 20-ish years across a bunch of different projects, from hardware up to the cloud and orchestration, but not security specifically. But I have been studying a lot of the concepts and the history here, and I want to start with a little history lesson for folks who are newer to the space or newer to software and open source.
Going back to 1984, in a wonderful Turing lecture after a Supreme Court case on hackers, Ken Thompson pointed out that you really can't trust code if you didn't write it yourself, and you probably shouldn't trust code anybody else wrote. And that goes not just for software in our clouds, but all the way down to the firmware in our systems: you can't really know what's in it if you didn't build it yourself. Since then, so much of the work we've been doing, and all of the work in cloud native security and supply chain security, is focused on solving that question.
A decade or so later, Dorothy Denning pointed out that trust is a declaration made by an observer. It is not a property inherent in the thing observed. So when we are signing things or creating attestations with SGX or things like that, this is transitive trust. Someone else is making an observation and then a declaration, and we are trusting their observation because we trust that person. The trust is not an inherent property of the thing.
So with all of that long history in mind, jumping into the specifics: about a year ago, I stepped squarely into supply chain security and tried to take a beginner's mind approach. Just listen, learn, see what's out there, understand what architecture other people have been building in this space.
And if you feel lost, like this is just a tangled web and there's so much stuff going on, well, you're really not alone. A lot of people feel that way. And at the heart of this tapestry of projects, what we're all trying to answer is the question: am I safe? Can I download this open source package? Can I use this thing in a way that does not compromise my safety? And, well, if it's that complex, sometimes complex models can be made simpler just by changing our perspective. Just like in physics: you look at it from a different angle, you reduce the number of variables, and it gets easier to solve.
So I began asking a few simple questions as I looked at this domain, this crazy tapestry of stuff. Three simple questions, really, is what it boils down to. Identity: what is the software artifact? Dependency: what's in it? All the way down, all the fractal layers of dependencies across different languages and ecosystems and package formats. What's at the bottom of that? And what's all in between? And are those things vulnerable? Do they make me not safe? And metadata: what else is known about it?
So, digging in, an artifact in this definition is any software object of interest. And all artifacts are representable as an array of bytes, because computers. This could be a source code file. It could be in Python. It could be in Java. It could be pre-compiled. It could be a shared object file. It could be an RPM or Debian package. These are all software artifacts. And they're all represented in computers as an array of bytes. Therefore, two artifacts are equivalent only if the byte array representations of those artifacts are equivalent. It should therefore be possible to identify these with some sort of a unique ID, like a hash.
So artifact IDs should have these three characteristics: they are canonical, unique, and immutable. Canonical, by which I mean independent parties presented with the same byte array can derive the same identity.
They are unique: non-equivalent artifacts have different identities. And they are immutable: with an identified artifact, well, if you change the bytes, you change the identity.
With those three properties in mind, what are some non-solutions to software identity? Turns out a file name is not unique, canonical, or immutable, because two people can have the same file in different directories or just rename the file. So that doesn't work. The URL or the PURL also doesn't work, because you can change what's on the other end of that or use multiple URLs to refer to the same bytes. And late last year, the NTIA in the US issued a set of requirements called the minimum elements of an SBOM, a software bill of materials. Those minimum elements are component name, version, and supplier. Also not unique, doesn't work. So that's kind of sad.
What does work? It turns out Git already solved this problem. Most of us use Git, and most source code's already in Git. Under the hood, Git computes an object ID for every artifact stored in the Git repository. That function, as we know, we all love it on GitHub. We can search for that hash and find an object. So why not recycle that?
If you don't know, and most people use Git and never look under the hood, it's actually a Merkle tree masquerading as a version control system. And a Merkle tree is just a tree data structure where a leaf node is labeled with the cryptographic hash of the data, and every non-leaf node is labeled with the hash of the nodes underneath it. This has some really fun properties, like if you and I have huge storage arrays of data, we can just compare the roots of the Merkle trees and we know it's the same data. We don't have to actually compare all the data. So that's useful, which is why Git works so well.
So with GitBOM, we're proposing to just reuse that existing tech rather than inventing new tech. It's great, it's worked, we all rely on it for so much stuff.
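To make the object ID concrete, here is a minimal sketch in Python. The `git_blob_oid` function follows how Git derives a blob's ID in a SHA-1 repository (the same value `git hash-object` prints): it hashes a short header plus the raw bytes. The `merkle_label` helper is a hypothetical illustration of the Merkle idea only, labeling a parent with the hash of its children's IDs. It is not the format Git uses for tree objects, and it is not the GitBOM document format.

```python
import hashlib
from typing import Iterable


def git_blob_oid(data: bytes) -> str:
    """Git object ID of a blob in a SHA-1 repository.

    Git hashes the header "blob <size>\\0" followed by the raw bytes.
    """
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def merkle_label(child_ids: Iterable[str]) -> str:
    """Hypothetical parent label: hash of the sorted child IDs.

    Illustration only; real Git trees and GitBOM documents have
    their own formats.
    """
    listing = "".join(cid + "\n" for cid in sorted(child_ids))
    return git_blob_oid(listing.encode("ascii"))


# Two leaf artifacts (placeholder contents) and the label of their parent.
source_a = git_blob_oid(b"int main(void) { return 0; }\n")
source_b = git_blob_oid(b"#define VERSION 1\n")
root = merkle_label([source_a, source_b])

# Changing one byte in one leaf changes that leaf's ID and the root label.
patched = git_blob_oid(b"#define VERSION 2\n")
root_after_change = merkle_label([source_a, patched])

print(source_a, source_b, root, root_after_change, sep="\n")
```

The three properties from earlier fall out of this: anyone hashing the same bytes gets the same ID, different bytes get different IDs, and comparing two root labels is enough to tell whether two whole trees match.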
We should just use the Git OID to identify our software artifacts.
Second simplifying question, and I hope I don't run over time too much. Dependency: what's in it? From source code to executable, the nice thing about Merkle trees is you can link them together, even across languages, and we can abstract that or generalize it to just a tree.
Metadata: what else is known about it? All the other stuff. So we've been working with the SPDX community to make sure that metadata in SPDX SBOMs can refer to objects by their Git OID, and you can refer to a leaf node or the end artifact or any intermediate artifact. It all works. So you have your SBOMs for metadata and your GitBOM for the artifact tree.
Why is this so cool? Because when there's a CVE like Log4j, if you associate it with the software ID, the Git OID that has those three properties, you can then find it in any tree of software anywhere, with different SBOMs. I might give you a JVM and someone else might give you a JVM, and they both have Log4j somewhere deep in that tree. GitBOM would help you find that.
So GitBOM is a tool, a minimalistic scheme, specifically for build tools, to build a compact artifact dependency graph that can track every source code file from the very beginning to the very end, and embed an identifier for that unique tree in the artifact that was produced. So the Docker image manifest could have a little identifier. You can then look up that identifier and find the Merkle tree of everything that was used to build that Docker image, for example. And we can do this in a language-heterogeneous way across everything, and, really cool part, with zero developer effort. Instead of having to adopt a new CI system or learn a new tool for generating SBOMs, our approach is to focus on a small bounded set of projects that already have large communities and often funding behind them.
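The Log4j lookup described above can be sketched as a search over an artifact dependency graph keyed by Git OID. This is a minimal sketch under stated assumptions: the `graph` dictionary (artifact OID mapped to the OIDs of its inputs), the `find_path` helper, and all the byte strings are hypothetical placeholders, not GitBOM's actual on-disk format or tooling.

```python
import hashlib
from typing import Dict, List, Optional


def git_blob_oid(data: bytes) -> str:
    """Git object ID of a blob in a SHA-1 repository."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def find_path(graph: Dict[str, List[str]], root: str,
              target: str) -> Optional[List[str]]:
    """Depth-first search from root; returns one path of OIDs to target."""
    stack = [[root]]
    seen = set()
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == target:
            return path
        if node in seen:
            continue
        seen.add(node)
        for child in graph.get(node, []):
            stack.append(path + [child])
    return None


# Placeholder artifacts: the same vulnerable bytes end up in two
# different vendors' builds, so they share one OID.
log4j = git_blob_oid(b"placeholder bytes standing in for a log4j jar")
other = git_blob_oid(b"placeholder bytes for an unrelated library")
lib_a = git_blob_oid(b"placeholder: vendor A logging wrapper")
jvm_a = git_blob_oid(b"placeholder: vendor A JVM image")
jvm_b = git_blob_oid(b"placeholder: vendor B JVM image")
clean = git_blob_oid(b"placeholder: an image without log4j")

# Hypothetical graph: each artifact OID maps to the OIDs it was built from.
graph = {
    jvm_a: [lib_a, other],
    lib_a: [log4j],
    jvm_b: [log4j, other],
    clean: [other],
}

for name, root in [("jvm_a", jvm_a), ("jvm_b", jvm_b), ("clean", clean)]:
    path = find_path(graph, root, log4j)
    print(name, "contains the flagged artifact" if path else "is clear")
```

Because the ID depends only on the bytes, the same flagged OID matches in both vendors' trees regardless of what name or version either vendor recorded.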
The language ecosystems, like Python and Java and Go and Rust, and the build tools, like GCC and LLVM: add a small change there and everybody in the world benefits, and this happens automatically.
So we already have four proofs of concept running for this. We've done it for LLVM, GCC, and LD. We've added some stuff in Go. This isn't upstream yet. These are just our proofs of concept on GitHub. You can grab them from those URLs. You should definitely trust my QR codes. And the fourth one, bomsh, is a really cool set of Python and bash scripts that use strace, and in theory it should be able to instrument any build process. Now, it's kind of slower because it's running from outside. We'd suggest people run this inside the build tool instead of observing it from outside because, well, the build tool is the source of truth, but here's a POC you could use. And bomsh already includes a way to cross-reference the Git OIDs of all your files against CVE databases. Now, most CVE databases today don't have this metadata, but you can build one yourself, and if you have one, you can cross-reference. So that's really cool.
And it's an open community. You're all welcome to join. Get involved. We'd specifically love more language ecosystems and build tools to come join our meetings and make sure that the spec we're building right now includes everything you would need for your ecosystem, so that this works for everybody. Thanks so much for listening. You can find me on Twitter, Slack, et cetera.
This was really informative. This was really interesting. We have time for one question, maybe. If anyone has any question.
How do you think you're going to use it for searching for artifacts?
I can't hear you, sorry.
How are you going to use it for searching for artifacts?
The question, I think, is how do you use it for searching for artifacts?
If a build process is instrumented with this, the produced GitBOM trees would then need to be uploaded somewhere, whether it's GitHub or some sort of a universal shared file system. We actually have a proof of concept right now using global FS for this. And then you can search that.
Cool. Thanks again.
Yeah, thanks so much.