The timekeeper is in the back of the room, and we have 40 minutes, so let's get going. Welcome to our talk, repurposing Git for supply chain security. This is Ed. I'm Aeva. Hello.

A couple of housekeeping things. Both Aeva and I do really well with questions as we go, so if something comes up, if a question occurs to you, it has very likely occurred to someone else, and it's very productive to actually ask it. We find that super fun, because it's much more interesting to interact with the audience: you ask, we answer, and we can probably answer most questions pretty quickly.

Second thing: we will probably ask you questions over the course of the talk. We're not usually going to ask for verbal answers, although that could happen; there may be a quiz. Usually we'll ask you to raise your hand, just to get a sense of where all of you are as an audience, because that impacts what we choose to present.

Both of us have a scientific background, so what do good scientists do in this situation? Occasionally make physics jokes, yes. We test our theory. We calibrate. So, a quick calibration: how many of you can raise your hand? You are an excellent audience; we usually only get about 80 or 90 percent for that one. It's really tragic, the incapability of hand raising that usually happens.

So, a little bit of history first. How many of you remember the Colonial Pipeline hack? Awesome, good with the hand raising. When that happened, we wound up with the cybersecurity executive order from the Biden administration. Is that also familiar to folks? Fabulous. For a lot of people, that's where supply chain SBOMs first bubbled up into their consciousness, but there's been a lot going on for a long time.
For example, SPDX, which provides an SBOM format, goes all the way back to February of 2010, and they've been diligently working in this problem space since 2008. That's also when this problem landed in my lap at Cisco, where I built out an industrialized SBOM process, meaning tens of thousands of software releases shipping every year for which we had to have a complete notion of what was in them, both for open source license compliance and also for third-party security.

Mistakes were made, there's no question about that. Any time you move into a new field that early, you're going to make mistakes, and that absolutely happened there. But in the process I learned a hell of a lot about the problem space, which is always important, because if you remember the mistakes that have been made, you can try to avoid them in the future.

Which flashes us forward to the Colonial Pipeline hack. Again, that was May of 2021. Very quickly you got the cybersecurity executive order. Around that time, a whole bunch of friends of mine in the SPDX community, which I hadn't really been participating in for about a decade at that point, reached out and said: you remember some of your crazier ideas? Now may be the time. So in late May the first GitBOM talk occurred and presented the crazy idea that you're going to hear today, and people liked it.

After a whole lot of background conversations with many, many people in what I would call the pseudo-public, meaning all the slides were available and everyone knew it was going on, but we hadn't really formed a community or announced anything, we figured: okay, we've finally got a website up, we finally went and got a Twitter account, we should probably announce the public community, start holding regular community meetings, and start actually working on a spec and a bunch of proofs of concept. And that went well.
You can't call it public, or even really real, until it's on Twitter. That's true, it's not real otherwise. In the first couple of weeks we got about 82,000 impressions, about 19,000 people visited the profile, and about 5,600 people visited the website from all over the world. It was a pretty successful launch for a Twitter account that had never tweeted before, for a tiny open source project that didn't really have much yet besides a draft of a spec, some words we copied out of the spec, and a white paper.

So again, that went well. Let's see if this clicker works. Can I go back? Yes. Can I go forward? Wow. Okay, this is where I step in front of the picture. For those who can't see it, it's baby Yoda sipping some tea.

I stepped into the supply chain space completely new to supply chains. I've been working in open source for a long time, twenty-some years, on everything from hardware security and firmware to cloud infrastructure to databases, but I had not really focused on supply chains. I didn't know SPDX from CycloneDX when I started. Great, I have a new thing to learn. I tried taking a beginner's approach and just learning from everybody else. I spent about six months going from community to community, reading docs, talking to people, finding out what's going on. And it looked like that.

Is this what it looked like to everyone else as well? How many of you saw it this way when you first started looking at supply chain security in open source? Yeah. Even as someone who's worked everywhere from the Linux kernel to hypervisors on up, I could trace a supply chain in one community, or for one project, but for the whole domain I didn't have an answer, and no one I spoke to had an answer either. This picture probably won't come through either.
It's baby Yoda looking very overwhelmed. This deck is going to be a little more challenging with the bad contrast. Anyway, if you feel like the baby Yoda here, you're not alone. Even today, a lot of folks are still really overwhelmed by the number of different tools trying to fill the space, and don't see how they all fit together. So let's try to simplify that.

What everyone wants to know is: am I safe? Underneath all of the tools and the technology, we all want to know: can I use this piece of software? Does it contain a vulnerability? Does it contain malicious code? What happens with its next update?

I've been around long enough to become slightly opinionated about the world. I have the sort of mind that thrives on abstraction, so the world becomes very complicated, and one of the things I learned the hard way, having again made many mistakes, is that you want to focus on simplicity. Simple things are reliable, they work. Simple things are performant, and simple things are secure. The problem is that humans tend to want to make things very complex. Some problems genuinely are complex, but often we make problems more complex by how we choose to think about them.

Take a really simple example of what I mean. If I have this linear picture, which probably looks familiar to all of us from middle school, and I've got a bunch of points, I might look at it and say: ah, the model is complex enough that I need two parameters to describe every one of the points, because each has some place in x and some place in y. But that's not actually the complexity of the system we're looking at; we've made it that complex.
If you just tilt your head to the side a little bit, you realize how simple it is. You can often simplify a problem space by making a perspective change, and very often that's a productive way to approach these things.

So when I look at the supply chain problem space from the perspective of building open source software communities and providing stewardship for them over time, three things stand out. First, what is this software artifact? Is it a piece of source code, a binary someone else gave me, a Docker image? That is the identity of the software, the thing we have to be able to see, understand, and represent. Second, what are its dependencies, what's in it? And third, what metadata is associated with it, what else do we know about the software? That's really everything else that isn't its identity or its dependency graph. I'll dive into each of these for a few minutes. These are my slides, I think. Of course. Yeah.

An artifact, in our terminology, is really any software object of interest. It could be, if you can see the tiny writing there in dim gray, a source code file, an object file, a .so shared object file, a .dll on Windows. It could be a jar file, a class file, a .deb or .rpm package. They all have one thing in common: software artifacts can be represented as an array of bytes, because that's how they're stored on disk. So any two artifacts are equivalent if and only if their byte array representations are equivalent. Based on this, it should be possible to represent each one with a unique ID, and that ID should have the following three characteristics.
This, I think, is one of the hearts of GitBOM. The ID needs to be canonical: any two independent parties derive the same identity when presented with equivalent artifacts. It needs to be unique: non-equivalent artifacts have different IDs. And it needs to be immutable: if you change the artifact, you inherently change the ID as well.

A bunch of things that are talked about today are not identity. The file name: I can move a file between directories and it's still the same file, or I can change the content of the file but keep the name the same and it's a different file. So file name and location don't work. The same is true for URLs: if I fetch something from a URL today and you fetch it tomorrow, or from a different IP address in a different country, you may get different content. What about purls? Yep, that's what I just meant. And then, sorry Allan, this one's for you: the minimum elements of an SBOM.

The Linux kernel gets built by a lot of companies. If I build one and Ed builds one, and we tell you these are both 5.17.3, we might call them something different; I call it the Linux kernel and he calls it the kernel. They're from different suppliers. Do you know if the same source code files were used? You don't. There are about 50,000 files in the kernel tree, and any given build probably uses about 3,000 of them, but you don't know which ones. When someone gives you a binary, you have no way of knowing that today, and that's the fundamental problem here.

Along with simplicity, I've also learned not to reinvent the wheel, and especially not to reinvent the wheel, only this time triangular, because you have strong stability requirements. This identity problem is actually already solved, very nicely, for source code, by Git. How many of you use Git? How many of you understand Git? How many of you understand the Merkle tree underneath Git? Okay, good.
So, for those of you who are less familiar: Git assigns a Git object ID to every file that you check in. If source code files are artifacts, then for source code files we already have an identity that, while it's not used for every source code file in the universe, is used for the vast majority of them. Most of these things are already identified, and indexed as well. And it's very, very simple: the Git object ID, computed across the contents, which are just the byte array of the file, gives you a 20-byte hash. You can go and see exactly what the identity is. It's the ID used in my local repo, it's used up on GitHub, it's used today, and it will be the same a thousand years from now; it doesn't change.

Speed up a tiny bit? Oh, I think it's you, but I can do it. So, effectively, Git is not really a source code management system. Git is actually an object store that uses Merkle trees, masquerading as a source code management system. A Merkle tree is just a structure where you've captured all the leaf nodes with an identity that's unique, immutable, and canonical, and every non-leaf node gathers together its children in such a way that you can't misrepresent them. This is one of the powerful things about Git: if I give you an ID for a commit or a tree, I can't give you something that doesn't match it and get away with it. You can verify all of the descendants, all of the dependencies, in that graph.
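The Git object ID computation just described is small enough to sketch directly. This is a minimal illustration of the standard Git blob hashing scheme, SHA-1 over a "blob <length>" header, a NUL byte, and the raw content; it is not GitBOM's own code:

```python
import hashlib

def gitoid_blob(data: bytes) -> str:
    """Git object ID of a blob: SHA-1 over the header
    "blob <content-length>\\0" followed by the raw bytes."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# Canonical: everyone derives the same ID from the same bytes.
# The empty blob's OID is well known to anyone who has poked at Git.
print(gitoid_blob(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

This matches what `git hash-object` computes for a file, which is exactly why an identifier derived this way lines up with what's already indexed in every Git repository.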
Which is why, by the way, Git history is both ephemeral and immutable at the same time.

So this is one of the fundamental things that GitBOM maintains, and it's why GitBOM is GitBOM: your best bet is to use Git OIDs as your identifier for software artifacts. They're a suitable identifier, and they happen to be an extremely powerful identifier, because the vast majority of things that matter are already indexed this way.

Which brings us around to dependencies: what went into this artifact, the dependency graph of the artifact. Here's a really simple example. You've got a bunch of source code. Something transforms that source code, in the case of C a compiler, into object files. Then a linker links them together into an executable. You can then think about a running executable, something running on your system. In the case of C it's very often true that you need to know not just the executable that you're running, but also which .so files, with their attendant trees, were linked into it, because otherwise a vulnerability can sneak in that is not a vulnerability of your executable by itself, but is a vulnerability in combination with that shared object. This works across all languages: in Java you compile .java files to classes and load them with class loaders into running executables. So you might look at this and say: what if we just generalize? That's what GitBOM suggests: whatever language you're using, whatever artifacts you're dealing with, generalize this and start talking about the artifact dependency graph. Now, this is a directed acyclic graph.
We will sometimes improperly call it a tree, but it's just a graph, and we use Git OIDs as the artifact IDs in this graph. Very simple: every language has Git libraries that can compute the gitoid, and if you don't have one, it takes about 15 minutes to write one in most languages that I've tried. Super easy.

Then you might say: okay, how do we capture the graph-ness of it? We've identified the nodes in the graph; how do we capture the relationships? What GitBOM suggests is that for each artifact that has some inputs, we simply capture what we call the GitBOM document, which expresses the relationships. In the case of artifacts 2 and 3, they only have leaf nodes as children, so that looks like "blob", a space, and the gitoid of artifact 4, as a newline-terminated record, then "blob", a space, and the gitoid of artifact 5, as another newline-terminated record. You put them in lexical order, so the document is canonical; everyone gets the same one every time. You do the same for artifact 3. Then for artifact 1, and this is the Merkle-tree-ness of it all, you simply start the same way, but because artifacts 2 and 3 both have children, you include a bom stanza with the Git object ID of their GitBOM documents. That means you've captured the Merkle tree in the process. And no, the projector did that, not me; there we go. This is, by the way, almost the same as, and very heavily inspired by, the file system structure of Git. It's not precisely the same, because Git cares about things like file modes that are uninteresting to us, but it's heavily inspired by Git.

Metadata, the third factor, is what else is known about the artifact. I'll quote one of our colleagues: an SBOM is just a format for organizing metadata that describes the makeup of a thing. It's really everything else you know about it.
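The GitBOM document construction described a moment ago can be sketched in a few lines. This is a rough illustration based on the record shapes named in the talk, a "blob <gitoid>" line per leaf input plus a "bom <gitoid>" stanza for an input that has its own GitBOM document; consult the actual GitBOM spec for the normative format:

```python
import hashlib

def gitoid_blob(data: bytes) -> str:
    # Git blob OID: SHA-1 over "blob <len>\0" plus the raw content.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

def gitbom_doc(inputs):
    """Build a GitBOM document for one artifact.

    `inputs` is a list of (artifact_gitoid, child_doc_gitoid_or_None)
    pairs. Records are newline-terminated and sorted lexically, so any
    two independent parties produce byte-identical documents."""
    records = []
    for art_oid, child_doc_oid in inputs:
        if child_doc_oid is None:            # leaf input, e.g. a source file
            records.append(f"blob {art_oid}\n")
        else:                                # input with its own GitBOM doc
            records.append(f"blob {art_oid} bom {child_doc_oid}\n")
    doc = "".join(sorted(records)).encode()
    return doc, gitoid_blob(doc)             # the doc and its own gitoid
```

Because the records are sorted and the document is itself hashed as a blob, artifact 1's document pins down artifacts 2 and 3 and, transitively, every leaf beneath them; that is the Merkle-tree property.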
Looking back at that graph, any artifact can have metadata associated with it. A source code file could have a license associated with it. A binary could have a supplier, contract terms, a price, or a CVE. Those are all things that can be, and are, represented in SBOMs today. So we've been working with SBOM communities like SPDX to create a way to interrelate these. GitBOM is not an SBOM; it complements them. As of SPDX 2.3, the current draft, which has been approved and merged and is waiting for the new spec update to be issued, you'll be able to reference a GitBOM document directly from your SPDX document, using the gitoid, and say this artifact has all this extra information. So you can think of these as two different domains in however you're tracking this. What was I going to say on this one? I think I just said it; I was running ahead of myself on the slide, sorry about that.

So if this is what is known about a piece of software, you might ask yourself: how is it known, and how can you trust that knowledge? Have you ever thought about this, show of hands? Have you ever thought about how you know the things you know about software? Not that many hands. So, all of you who didn't put your hand up: if I just give you a binary and tell you what's in it, are you going to run it? If I give you a piece of software and it's signed, are you just going to run it because it's signed? Good answer. And yet a lot of people are focused on that as the answer right now.
A signature is only as good as the chain of trust behind it; it could still have been compromised beforehand. All that signing the software has done is prove to you, when you verify it, that it hasn't been changed since it was signed. If you then verify the signature through some other channel, you might also know who signed it, but not what they signed, or that what's in it is the thing you think it is.

Which gets down to this question: what is trust itself? What is trust in computing? It turns out this is a really complex question, and I'm going to touch on it a little bit. I actually edited a book last year that was just published, called Trust in Computer Systems and the Cloud, if you want to go down the rabbit hole of contemplating how we trust our devices and how we trust each other through our devices. What does it even mean to say I trust my computer when I type in a password or go pay a bill? Who am I trusting when I trust my computer and my bank's systems? It's a fun book. It's about this thick, and it's super dry; don't read it. I mean, please do, it's good. I don't make any money from it. There's no easy answer to this.

Going back to the 90s, Dorothy Denning said this in refutation of what was then the prevailing wisdom from a government-issued book, the Orange Book, which held that a computer system could be built in such a way as to make it arbitrarily trustworthy. Dorothy Denning said no, that's not true. Trust is not a property of a thing; it's an assessment of the thing, based on experience, made by some observer, at some point in time, in some context. A device might be trustworthy here, but the same device is not trustworthy running in space. It might be trustworthy for my bank, but not for some other function. So whether it's an SBOM or a binary scanning tool that you use as you're ingesting software, downloading a package from the internet, these are our mechanisms
These are our mechanisms To either look at someone else's declaration that they trusted it or decide for yourself that you want to trust it By combining them, maybe we get a little bit more of a sense of trust But trust is always Time-dependent asymmetrical and contextual and that's sort of the hypothesis or the main thesis rather of the book Last year just because you trusted a piece of software. I gave you today. Does that mean you should trust it a year from now? If I if you give me a piece of software and I choose to trust it does that mean you should trust software I'm gonna give you back in return. No, it's asymmetrical and same thing again. It's contextual so Built tools. Oh built tools. That's you. Yes indeed So Context is a really interesting part of trust Um, and so when you look at those tools part of the problem you have with build tools is that build tools transform their inputs and They transform their inputs in fundamentally destructive ways meaning you're losing information In the course of compiling your software And So, you know when you look at this you're like, okay, so the bill tools transformed their input So how do we actually figure out what happened there? so you could try And that will tell you that it contains apples and cinnamon and flour and eggs and app and other apples so This would be sort of scanning the inputs on the desk. So if you how many of you build code How many of you ever built a piece of code where not every filing your repo gets built into it? That should be all of your hands by the way So knowing what's actually there as a potential input or even what's it? You know No, we haven't We'll get there. You're a great straight man, but give us a second Surely this will save us take your output point to scanner at it. So we're gonna turn the entire audience Into a scanning tool What kind of pie is that? Shout it out shout it out So we have a play-doh pie. We have a This is not a pie. Are you a Taoist sir? 
It's the image of a pie, good. Other ideas about what kind of pie it is, as these scanning tools? Pumpkin. Blueberry. So now you understand the plight of the poor person trying to run scanning tools on the output. Even if you take a little cut of the pie and look inside and go, oh, I see apple slices, if someone has, let's say, a peach allergy or a strawberry allergy, are you sure you want to eat it without asking first? It could be a peach pie. So post-hoc scanning, while it can be somewhat helpful if there's nothing else available to you, literally can't tell you what's in the pie. We just demonstrated that with the audience, and I think everybody has realized this; post-hoc scanning has been going on for at least 15 years in my experience.

So if we can't trust pre-hoc scanning or post-hoc scanning to be perfect, they'll get us some things but not all the way there, what can you trust? The answer is: you can trust the build tools. The build tools are the things that built it. If I'm the build tool, I opened the bloody files that went into the thing. Now, maybe I'm going to lie to you, that could happen, but generally speaking the build tools are at least in a position to know the answer. Maybe something else attaches to them and injects things into them? Oh no. You can go down that rabbit hole very far if you need to. But of all the available choices, the build tools are really your best choice for providing the canonical information about the inputs along the way.

And please note, when I say build tools, I don't mean build orchestrators, things like make. We also don't mean things like Jenkins, the CI system that's running the build process.
That is not what we mean. What I mean here is things like GCC, LLVM, your linker, the Rust compiler, the Go compiler, the Python runtime itself, since it compiles as it loads dynamically, or, if you're putting a container together, docker build itself. These are the pieces that, at least in principle, can be trusted, because they're in the right context at the right time.

And so we finally reach GitBOM. It is a minimalistic scheme for build tools, in the sense of build tools we just defined, to build a compact artifact dependency graph, and by the way, I'm pretty proud that I reused DAG from Git to create an ADG, that tracks every source code file incorporated into each artifact built along the entire supply chain; to embed an ID for that artifact dependency graph itself in the artifact that is built from it; and to do this in a way that is language independent, that can work in a language-heterogeneous environment like open source at scale, across packaging formats, and, most importantly, with zero developer effort. One of the biggest challenges here is getting this adopted by open source projects, 99 percent of which have roughly one developer and no budget. Doing this in a way that can enable artifact resolution across packaging formats and languages whose projects are all volunteer-run and have no budgets, that's our goal: so people can answer the question, am I safe?

So let's fast forward. Imagine this has been integrated into the open source build tools that everyone's using, five years from now, and you get a Docker image. It has a fingerprint, and you can look that fingerprint up somewhere in a public location and see the whole artifact dependency graph. You can pull its SBOM, you have the GitBOM tree for it, and when the next Log4j happens, it's buried down there in some source file, in red, that you probably can't see on the slide.
Sorry. But you can identify it, because the fingerprint is there in the tree. So the value proposition that I believe we are really enabling is for the blue teams, the incident response teams of the world, who are consuming open source packages today built by other people. They download them, do a basic scan on them, and load them into their internal mirror. This gives them additional signal to identify packages with vulnerabilities discovered between the time they were built, staged, and launched into production. That additional signal resolution enables those teams, who are often under high stress, short deadlines, and low budgets, to find the issues faster and remediate them better, with no imposed cost on the broad community, just a cost on a couple of tens of large projects, like the build tools. I see a hand popping up.

I did skip that step, very astute. Well, here's the trick. Take, for example, Log4j. If I remember correctly, there's a small number of versions of one of the JNDI source files, JndiManager.java or something like that, in Log4j that are actually the source of the vulnerability. Now, in the magical five-years-in-the-future world, it becomes extremely valuable to report those gitoids with the CVE, right? That's the longer-term answer: report them, because they're a higher-precision report of the vulnerability, not in place of, but in addition to, any other information we have. It's like what we were saying about SBOMs being a description of things: if I give you the GPS coordinates of my house, it's also useful for me to tell you that it's a yellow house with blue trim, right?
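To make that "find the bad file in the tree" step concrete, here is a hypothetical sketch: given a store of GitBOM documents keyed by their own gitoids, walk the artifact dependency graph and ask whether a known-vulnerable blob gitoid appears anywhere below an artifact. The store layout and the record shapes, "blob <gitoid>" optionally followed by "bom <doc-gitoid>", are illustrative assumptions from this talk, not the project's actual tooling:

```python
def adg_contains(store, doc_oid, target_oid, seen=None):
    """Search an artifact dependency graph for a given blob gitoid.

    `store` maps a GitBOM document's own gitoid to its bytes; each
    record is "blob <gitoid>", optionally extended with "bom <doc-oid>"
    when that input has its own GitBOM document."""
    if seen is None:
        seen = set()
    if doc_oid in seen:          # it's a DAG: don't revisit shared nodes
        return False
    seen.add(doc_oid)
    for record in store[doc_oid].decode().splitlines():
        parts = record.split()   # ["blob", oid] or ["blob", oid, "bom", doc]
        if parts[1] == target_oid:
            return True
        if len(parts) == 4 and adg_contains(store, parts[3], target_oid, seen):
            return True
    return False
```

When the next Log4j-style CVE ships with gitoids attached, a blue team only needs this kind of walk over the graphs of the images in their internal mirror to know which ones contain the vulnerable file.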
So in the CVE report, I want to tell you the GPS coordinates of the vulnerability in the source code, and also tell you that it's this package, these versions, et cetera. Now, as we get from A to B, there's a bunch of tooling being done as proof-of-concept work in the GitBOM community, where folks have tools that can be pointed at the Git repo for Log4j and spit out those gitoids for you. So the mechanics of discovering those gitoids, even from where we are right now, are fairly straightforward. Not perfect, but enough to be extraordinarily useful.

I think part of your question was about the correlation: who publishes that correlation? Well, when we know there's a CVE and it came from these JNDI files, or it came from this version of an OpenSSL .h file, that file has a gitoid. Whether it's MITRE or some company that then publishes the correlation doesn't matter to us; it can be done by anybody. You could stick it in the project itself in some way; I'd probably want to centralize it somewhere, at least in a couple of places. Does that answer your question?

Okay, well, we have the same thing today, so I'll repeat the question. If you trace down a vulnerability to a particular function, and you realize that function was in the now-known-bad state for multiple file versions and multiple releases of the package, how do you do the correlation? The same way we do with CVEs today, where a CVE says it exists in version x.1, x.2, x.3, it's in this range; we just add gitoids and say it also exists in these gitoids.

It's actually a good question, because it's not going to be obvious to everyone. But, for example, how many of you have ever run git log on a file name? You can actually ask Git, in the blink of an eye, for the log information for a particular file. Any other questions?
Our last slide here just has some links to come learn more about the PoCs, so we'll stay up here and bring on the questions. I see a hand, I think David's hand, and then you, and then you.

Right, so I'll repeat the question: how does GitBOM intersect with the reproducible builds effort? What David didn't ask is: do we need reproducible builds for GitBOM to work? And what David did ask is: if we use GitBOM by embedding the gitoid of the build tree itself, do we prevent reproducible builds from being reproducible, in environments where there is no other functional change or difference in the resulting artifact, but some non-functional header files got read in, resulting in a different gitoid embedded in it, thus making the build not reproducible? We're thinking about both of these. I don't have a good answer yet.

Well, part of the thinking process, and please note this is not an answer, because as David mentioned this is an ongoing dialogue, part of how I've been turning the problem over in my head is: how do you define the reproducibility of a build?
Right, and one way to say a build is reproducible is: if I give you a set of inputs, I will always get exactly the same output. And then you get into the question of equivalence of the output. You may have noticed we made a very specific choice about artifact equivalence, and that choice was: the byte arrays are the same. Now, this is not the only available choice for equivalence, but as far as we can tell it's the only one you can reliably apply generically, because different languages have entirely different behaviors for how their semantics work out, and it's also extremely simple. I will throw a wrench in our own works, though: byte obfuscation techniques can result in functionally identical binaries with different byte structures, and that's a lovely way for malware to sneak past scanners. We're going to have to figure out how to deal with that in this space as well.

Well, we had a couple of other hands come up. David, there are only about two minutes left, so that's debatable, discussable. I believe your hand was up, and then I saw a hand back there if we have time.

Doesn't matter; we don't depend on Git. We're just using the Git hash function, the gitoid function, as the identifier; it could be stored anywhere. We use the gitoid hash function and we draw tremendous inspiration from Git. We actually advocate for using the gitoid hash function in a lot of places. For a lot of folks we've talked to about interoperability of gitoids with other things in the ecosystem, the answer really comes down to: hey, if you'll store the gitoid of the thing, then we're done.
We have interoperability; it doesn't matter what version control system you use.

So do SBOMs do that? Well, let me actually repeat the question first, because this is a brilliant question, and one we just didn't have time for in this presentation, so I'd love to talk to anybody and everybody, but especially you, after months of debating and discussing this point. The question was: what if there's other metadata that I care about that I'd like to know, and in particular, what if there's other metadata the build tool knows that I would like to know about?

The fundamental thing, from our perspective, and we've actually talked about this, is that it's perfectly fine for your build tool to write out that metadata. We usually talk about putting it in a metadata directory, build info or debug info, whatever you call it, with the incredible number of things people are interested in, and the way you connect them is that the metadata points into the artifact tree. As a very simple case that almost every build tool wants immediately: I'd really like to know, as a matter of metadata, the file name and how it associates to the gitoid of the artifact, so when I'm debugging what the hell just happened, I can see what's going on. You don't want to put that in the artifact tree, for all the reasons we talked about: it's essentially ephemeral data relative to the artifact dependency graph. It may also be too detailed for your SBOM. But it's super useful for users, and you can have that metadata generated by your build tools as part of your build, then pick it up and use it.

In the same rich discussion of this space in the community, I think that in the same way the SPDX community is now using the gitoid as an artifact external reference ID, build tools could do the same, and then you'd also have compatibility between different build tools for their debug info,
for their debug symbol files, if they all referenced into the artifact dependency graph in the same way.

I think we're at time. So thank you all so much. We're definitely at time, and I suspect we can wander out into the hallway and find some place if folks want to talk. Also, find us at any point subsequently; we're very friendly, and we'd be delighted to talk about this stuff, because the more eyeballs we get on this, the better. Thank you.