All right, so we're going to be talking about an artifact knowledge graph that we built out, which we're calling GUAC. Just a little introduction to myself: I'm Michael Lieberman. I'm CTO and co-founder of Kusari, a supply chain security startup, and I do a lot of work with the CNCF. I'm a Security TAG lead, I helped co-lead the CNCF Secure Software Factory reference architecture, and I'm a maintainer on FRSCA, an OpenSSF project.

And I'm Mihai. I joined Google four years ago, and this year I joined the Google Open Source Security team and started working on GUAC and other projects.

Today we first want to shout out that this tool we're going to be showing off is an industry collaboration: it started off as a collaboration between Google, Kusari, Purdue University, and Citi.

All right, so what's the problem we're trying to solve? There's all this stuff happening in your software supply chain. You keep hearing: I need to be generating SLSA attestations, I need to be consuming SLSA, I need to be generating SBOMs, I need to make sure everybody else has SBOMs, I need to be analyzing those SBOMs, and all that great stuff. But one of the big problems people keep bringing up is: how do I know if something has an SBOM? How do I know if it's been consumed already? If there is an SBOM out there, where should I be looking for it? Some of these things are stored in places like Sigstore's Rekor, some in REST APIs, random buckets, OCI registries, and so on. And usually when you pull down a package, you might be pulling down a SLSA attestation for that particular package, but what about its dependencies?
You're not pulling that information down. And this is similar to what we saw with, for example, Log4j, where the compromised versions of Log4j were really deep in your supply chain. You thought, no, I'm not using Log4j, but it turns out you're relying on some library, which relies on some library, which relies on some library that does use a compromised version. So when you go two or three layers in, as you can see there in the red, it's: I don't know, am I pulling this stuff in? Does it even exist? I have no idea.

Okay, so now let's talk about where this tool fits. These are the layers of supply chain security. First, you want to have a trust foundation: stuff like Sigstore, where you're saying, I'm making sure things get signed, these are the rules around which identities I trust, secure timestamping, all that good stuff. Once you have that foundation, you want to start attesting to the things you're building, and you want to be able to consume those attestations: SBOMs, SLSA attestations, VEX, and all sorts of other in-toto attestations. Then we have aggregation and synthesis: you want to be able to consume all that data and analyze it, so you can go into the next piece, which is figuring out insight and then generating policy that can make decisions based on the data you've aggregated and synthesized. GUAC fits in that green box there, aggregation and synthesis.

So what exactly is GUAC? It's a backronym: Graph for Understanding Artifact Composition. Naming is hard. But it is a knowledge graph of software metadata to answer security and supply chain questions.
The idea here is, as I mentioned, you have all this data: how do I actually ingest it and start asking broader questions? So let's talk a little bit about how it works. At a very high level, we're consuming all sorts of records: things from Sigstore that are stored in Rekor, or in OCI registries, or buckets, or wherever; SBOMs generated by various tools; vulnerability data from various streams. We're then correlating all that information and putting it into something that can be queried by users.

Now, what does this look like? A few slides ago I showed you what was implicitly there in your supply chain: all these artifacts, maybe some documents here and there, and it's not very clear where all that stuff lives. What GUAC allows us to do is actually build out a graph with those relationships, so that we have records inside a database saying: this identity didn't just sign this SBOM, it also signed this SLSA attestation, because we ingested all of those documents. We also now have a bunch of additional metadata about the actual artifacts. We're including stuff like VEX, and who might have signed off on that VEX, and we plan to include all sorts of other metadata as well.
So it's a record of those ingested documents. We also parse those documents, allowing us to generate the relationships between them: a graph that makes it easier to query for information about what depends on what, and who attested to what.

We also plan to include temporal data, like assertions on vulnerability scans. One of the big things is: yes, we scanned that artifact, but you scanned it a year ago. Have you scanned it since then? As it turns out, there have been new heuristics or new data feeds since, so while the artifact wasn't known to be vulnerable a year ago by certain scanners, today it would be detected as vulnerable.

And then we also plan for all of this to be a data source that you can use for policy, to make decisions on things like whether something gets gated into production, and all that good stuff.

Now I'm going to pass it over to Mihai to show you what this looks like currently.

So we have a local instance of GUAC where we've ingested several documents, and we have several nodes already in the Neo4j database. We're going to use the Neo4j graphical interface for queries at the moment, but in the future we plan to create an API that gives you graph results without opening the Neo4j interface. So right now I'll type Neo4j queries in here just to see what we ingested from SLSA attestations, SPDX documents, and so on, but in a future release we will have proper query APIs.

Okay, so in our graph we have artifact nodes.
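To make the correlation step concrete, here is a minimal sketch of the idea: heterogeneous documents (an SBOM, a SLSA attestation, a signature record) each contribute edges to one graph, which can then answer questions no single document answers. This is not GUAC's actual data model; the node names, predicates, and document shapes below are invented for illustration.

```python
# Toy sketch of GUAC-style aggregation: ingest heterogeneous supply chain
# documents and correlate them into a single graph. Node and edge names
# here are invented; GUAC's real data model differs.
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        # adjacency: (subject, predicate) -> set of objects
        self.edges = defaultdict(set)

    def add(self, subject, predicate, obj):
        self.edges[(subject, predicate)].add(obj)

    def objects(self, subject, predicate):
        return self.edges[(subject, predicate)]

def ingest_sbom(graph, sbom):
    # An SBOM tells us what a package depends on.
    for dep in sbom["dependencies"]:
        graph.add(sbom["package"], "depends_on", dep)

def ingest_slsa(graph, attestation):
    # A SLSA provenance attestation tells us what built an artifact.
    graph.add(attestation["subject"], "built_by", attestation["builder"])

def ingest_signature(graph, sig):
    # A signature record (e.g. from Rekor) links an identity to a document.
    graph.add(sig["document"], "signed_by", sig["identity"])

g = KnowledgeGraph()
ingest_sbom(g, {"package": "pkg:oci/app@sha256:aaa",
                "dependencies": ["pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1"]})
ingest_slsa(g, {"subject": "pkg:oci/app@sha256:aaa",
                "builder": "https://builder.example/tekton"})
ingest_signature(g, {"document": "pkg:oci/app@sha256:aaa",
                     "identity": "release@example.com"})

# A correlated question: who signed this artifact, what built it,
# and what does it pull in?
print(g.objects("pkg:oci/app@sha256:aaa", "signed_by"))
print(g.objects("pkg:oci/app@sha256:aaa", "depends_on"))
```

The point of the sketch is only the shape of the problem: each ingestor understands one document type, but all of them write into the same graph.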
We have package nodes. This one is a Scorecards node, so it's an attestation from OpenSSF Scorecards. This one, I think, is the metadata, and this one is the builder: when you're using SLSA, it's the SLSA builder, and so on. And there are all of the relationships between them. All of this is detected based on what we parse in documents: if we parse a SLSA attestation, we detect that there is this builder, so we create a builder node type, there is this artifact, and so on. Everything is created based on the documents that we ingest.

Okay, let's see what we have in here. First, just a simple Cypher query to see how many nodes we have. We ingested around 70 documents, and in total we have 8,000 nodes across all of the types we saw before.

Now we can go and explore what we have. I'm not going to type the queries again, so let's first explore a Kubernetes cluster. Actually, let me begin with a simple package node query and return just that, and because I don't want to return all of them,
I'll return just 10. The Neo4j interface here offers multiple options to look at the results: as a table, as JSON, or as a graph. We'll probably use the graph representation most of the time because it's easier for visualization, but in some cases the table is better suited.

For these 10 nodes, we have all of the attributes: if I click on a node, I have all of its attributes on the right, and I can then query on them. One that is important is the purl, the package URL, which I can use later to query and zoom into just one of the nodes. And then there are these tags, so that we can identify which artifacts are binaries and which are Docker container artifacts, and so on.

So now let's explore a Kubernetes container. I'm expanding this query: I'm taking a package where the purl contains the Kubernetes controller-manager, and I'm returning that node. It should be... I think I deleted something. Yeah, I deleted one character. Okay, so this one has several nodes, and I can zoom all of them in the graph. We'll pick one of these versions; each one of them has a different version.
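Since the demo keys off the purl attribute, it may help to see what is inside a package URL. Here is a hand-rolled parser for the common `pkg:type/namespace/name@version` shape of the purl spec; qualifiers and subpaths are not handled, and real code would use a dedicated purl library.

```python
# Minimal purl (package URL) parser covering the common
# pkg:type/namespace/name@version shape. Qualifiers ("?...") and
# subpaths ("#...") are not handled; use a real purl library in practice.
def parse_purl(purl):
    assert purl.startswith("pkg:"), "purls start with the pkg: scheme"
    rest = purl[len("pkg:"):]
    rest, _, version = rest.partition("@")
    parts = rest.split("/")
    ptype, name = parts[0], parts[-1]
    namespace = "/".join(parts[1:-1]) or None
    return {"type": ptype, "namespace": namespace,
            "name": name, "version": version or None}

# e.g. an OCI image and a Go module (illustrative purls, not from the demo data)
print(parse_purl("pkg:oci/kube-controller-manager@v1.24.6"))
print(parse_purl("pkg:golang/k8s.io/kubernetes@v1.24.6"))
```

Note that the hash is optional in a purl, which is exactly the interoperability problem discussed later in the talk: two purls can claim to be the same package without a digest to prove it.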
This is 1.24.4, this is 1.24.5, and so on. We can pick 1.24.6, and now we have just one single node, and we want to ask queries about it. One way is to select the node and expand all of its relationships, but that definitely doesn't work here, because there are so many dependencies on this container. Another way is to run the queries ourselves. I'll copy this one, where I'm focusing on relationships of type "depends on" or "contains" (a container depending on another one, or an artifact depending on another one), between one and five hops, and returning all of those paths. This is still going to be large, so I can limit it further: I want the artifact tags to be "binary". And now I have something I can look at.

So I have this package, the kube-controller-manager container, and it has two dependencies. One of them is the kube-scheduler... sorry, one of them is the go-runner, I'm looking at the name. That's one binary, and the other one is the kube-controller-manager binary. Now I want to see which of these binaries has attestations on them. I can expand this query to also look for metadata or attestations on them; let me copy this bigger query. Now, out of all of those nodes (I can also rotate them around to make them look better), I have this attestation that refers to this node, if I scroll to the name, the kube-controller-manager. But the other dependency, the go-runner, doesn't have an attestation in this container. So a policy that says "I only trust stuff that has attestations" would flag this container and say: this doesn't follow the policy, rejected.

Okay, the next thing we can ask is to look at Docker containers. I can look at packages that are of type container and from Debian. I have just one single node in this case in our instance of GUAC, and from this one I want to see which dependencies are shared with other containers.
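The path query in the demo is roughly Cypher's variable-length pattern, something like `-[:DependsOn|Contains*1..5]->`. In plain Python, the same hop-limited traversal is a breadth-first search with a depth cap. The toy edges below are invented for illustration, loosely echoing the demo's containers.

```python
# Hop-limited dependency traversal, mirroring the demo's
# "depends-on or contains, between one and five hops" query.
# The toy edge data is invented for illustration.
from collections import deque

EDGES = {
    "kube-controller-manager-image": ["go-runner", "kube-controller-manager-bin"],
    "go-runner": ["base-image"],
    "base-image": ["libc"],
}

def reachable(start, max_hops):
    """Return {node: hop_count} for every node within max_hops of start."""
    seen, queue = {}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # depth cap: do not expand further
        for dep in EDGES.get(node, []):
            if dep not in seen or hops + 1 < seen[dep]:
                seen[dep] = hops + 1
                queue.append((dep, hops + 1))
    return seen

print(reachable("kube-controller-manager-image", 1))  # direct deps only
print(reachable("kube-controller-manager-image", 5))  # transitive, up to 5 hops
```

This is also why the Log4j scenario from the start of the talk is hard by hand: the interesting nodes only show up at hop three or four, which a one-level lookup never reaches.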
So I have another, larger query here, where I'm looking at this Debian node, what its dependencies are, and then which other packages also depend on those, and then I order by the number of shared dependencies. So I have several containers that have a lot of dependencies in common, like 1,500 or so, but I also have smaller images that have only one single dependency in common. So with GUAC it's easy to determine which images are slim and which are bigger ones, and so on.

Finally, another query we might want is to look for Log4j. Let's see how many packages we've ingested that depend on Log4j. I go back to the graph view; it seems I have only three packages that contain "log4j" in their name. And I can look to see which containers contain Log4j; I'm going to take this query. In my case I have two artifacts that contain Log4j. One of them has been generated from an SPDX document for an image that we know was vulnerable; the other has been generated from an SPDX statement that says the image has been fixed. However, if you look at the graph, you see that both of them depend on the same Log4j packages. So the fix actually didn't contain a fix for Log4j; it contained a fix for some other vulnerability. And if I replace this Log4j with Apache Commons, that should be commons-text, I think... one second, I'll get to that. Okay, so in this case I have two containers, each depending on another container, and one is fixed. So the "fixed" SPDX was actually fixing the Apache Commons issue, not Log4j. So it's very easy with GUAC to identify what got fixed and what didn't. You can also write bigger queries to identify what the plan to fix is, which dependencies you need to update after a vulnerable dependency. And now, back to you.

So yeah, just to highlight:
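The shared-dependency ranking from the demo comes down to set intersections. Here is a sketch with made-up container names and dependency sets; the real query runs over the graph, but the logic is the same.

```python
# Sketch of the "shared dependencies" question from the demo: given each
# container's dependency set, rank other containers by overlap with a
# target. Container names and dependency sets are made up for illustration.
DEPS = {
    "debian-base": {"libc", "openssl", "zlib", "bash"},
    "app-image":   {"libc", "openssl", "zlib", "log4j-core"},
    "slim-image":  {"libc"},
}

def shared_with(target):
    overlap = {
        name: len(DEPS[target] & deps)          # set intersection size
        for name, deps in DEPS.items() if name != target
    }
    # largest overlap first, like the ORDER BY in the demo's query
    return sorted(overlap.items(), key=lambda kv: -kv[1])

print(shared_with("debian-base"))
```

A big overlap suggests two images share a common base; an overlap of one or two suggests a slim, mostly independent image, which is the distinction the demo draws.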
We showed a few things off for GUAC there, and as Mihai mentioned, it was only 70 or so documents that we had ingested for that specific demo. But we have been testing this out by ingesting tens of thousands of documents related to tens of thousands of different artifacts, packages, images, and so on. It's a little hard to demo that, just given how big it is, but we wanted to show that we are looking at that scale and seeing a lot of interesting relationships between packages, between artifacts, between attestations for those things, and so on.

The other nice thing coming out of what we're seeing is this: we found cases with a lack of information about certain artifacts, but then you start to ingest other documents, and all of a sudden those documents have some additional information. If you trust, say, the person producing that document, you can ingest it, and immediately you get a much better understanding of your supply chain.

All right, so what are some of the challenges we're having as we build this out? Once again, this thing is very much pre-alpha at this point. We're doing a lot of work on it, but it is going to take a while for it to mature. But while building this out, what are some of the challenges?
One of the big ones is data quality. Without a lot of different attestations from a lot of different sources, and good, high-quality documents, GUAC is going to be quite empty. If folks are not signing their containers, not generating SBOMs, not generating SLSA and other in-toto attestations, there's not much to ingest.

Another thing we noticed is that a lot of document generation doesn't actually follow the specifications. A lot of tools that generate SBOMs get maybe 90% of the way to the spec, but that leads to a problem: if two different tools each produce an incorrect SPDX document or an incorrect SLSA attestation, it's almost impossible for GUAC to ingest them, because then we'd have to support tool A's take on SPDX, and tool B's, and so on, and that's very, very difficult. That's one big challenge we're dealing with right now.

Another big challenge is the quantity and completeness of the metadata. We see it in a lot of documents, and we get it, this stuff is still relatively new, relatively nascent, but a lot of folks are including the bare minimum in their SLSA attestation or the bare minimum in their SBOM, and the less data that's there, the less valuable it is overall. And generally, given how new a lot of this is, there are some artifacts and packages that are generating attestations and SBOMs and all that good stuff, but that needs to increase drastically, because without it there's not going to be a lot of data to analyze your supply chain with.

Another big issue is interoperability. Something we noticed is there are
lots of different software identifiers, and that also includes different hash algorithms. If one document refers to a hash by SHA-256 and another document refers to that same artifact using SHA-1, they're not going to line up unless you have some understanding that those are the same literal thing, and that obviously causes a lot of challenges. And then there's stuff like purl: the hash is optional in a purl, so there's a lot of "these two things are claiming to be the same package, but are they actually the same package?" It can be very difficult to figure out.

And then, obviously, scalability. The ecosystem is very large, and we're looking at how we can make it easy to ingest lots of different artifacts, especially in the open-source space: which ones make the most sense, and how do we make sure that if we begin to ingest hundreds of thousands of artifacts, this whole thing doesn't fall over?

So now we can talk a little bit about next steps. The big thing for us is we really want to work with the community more. We want to ingest a lot more document and metadata types, as well as other sorts of interesting metadata. This is stuff like VEX documents: we want to be able to ingest those and attach them to claims of a vulnerability on a particular artifact, so that you can go back and query for them.
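One way to think about the identifier problem is an equivalence table: any digest of an artifact, under any algorithm, should resolve to the same canonical node. The sketch below is one possible approach, not GUAC's implementation; the digest values are fabricated placeholders.

```python
# Sketch of identifier reconciliation: an equivalence table so that any
# known digest (sha1, sha256, ...) of an artifact resolves to the same
# canonical node. Digest values below are fabricated placeholders.
class ArtifactIndex:
    def __init__(self):
        self.alias_to_id = {}   # ("sha256", "aaa111") -> canonical id
        self.next_id = 0

    def observe(self, digests):
        """digests: {algorithm: value} pairs seen together in one document."""
        # reuse an existing canonical id if any digest is already known
        known = [self.alias_to_id[d] for d in digests.items()
                 if d in self.alias_to_id]
        cid = known[0] if known else self.next_id
        if not known:
            self.next_id += 1
        for d in digests.items():
            self.alias_to_id[d] = cid
        return cid

idx = ArtifactIndex()
# An SBOM lists the artifact by sha256 and sha1 together...
a = idx.observe({"sha256": "aaa111", "sha1": "bbb222"})
# ...and a later attestation refers to it only by sha1: same canonical node.
b = idx.observe({"sha1": "bbb222"})
print(a == b)
```

A real implementation would also need to merge canonical ids when a later document bridges two digest sets that were first seen separately, which is where most of the difficulty lives.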
I know you know, I can go in As opposed to needing to go out To some sort of Vex, you know Figure out where that Vex lives you can sort of query something like guac to see if there is a Vex for that We're also looking to integrate with sort of vulnerability streams like OSV and all those sorts of things to sort of figure out, you know To pull in additional information At a given time We're also looking to sort of integrate with some other stuff like git bomb and gitoid so that we can have you know The git object ID is a pretty useful sort of software identifier. And so we're looking to sort of integrate with that We also want to help harden existing types So we want to work with the community in you know on on salsa and I'm a steering committee member of salsa but and You know, we also have a bunch of folks who are maintainers on some of the SPS bomb Specifications as well. We're looking to kind of figure out like what can we do to make it easier to make sure that all of these? Identifiers can help, you know, we can have that interoperability between them Yeah, and along those lines. Yeah, we want to rally around those software identifiers and make sure that there's some way Even if there's mappings or something that between them. We want to see if there's ways we can kind of make that easier We also want to create a lot of new Relationship types like certification. 
Did an organization or an auditor actually certify this particular application or this particular package? We also want to be able to pull in things like build inputs, whether it's a build-info file or information from a SLSA attestation, so that for open-source code we can say: okay, this is how it was claimed to have been built, and then you'd be able to pull that information out of GUAC and potentially run that build yourself.

Some short-to-medium-term goals for us: one, we want to run, sponsored by Google, a public service for querying a subset of open-source packages. The idea is that this would run as a public service, similar to something like Sigstore. We wouldn't ingest every open-source package, because that's just not going to work, but we can ingest a lot of the top open-source packages, and related information about them, into a public service that can then be queried, obviously with some level of restrictions, because without rate limits that'll fall over.

Another thing we're looking to create (and we're not attached to the name "chips", it's just cute for this presentation) is a little set of utilities and tool plugins that we plan to build for things like VS Code, and plugins for package managers, that can query GUAC for some of that information and either give warnings, or be used in conjunction with policy to say: wait, we discovered all this really weird information inside GUAC, maybe you don't want to install that.

And to give a very quick look, here's a screenshot of something that's still under heavy development: we have an SBOM, the tool parses that SBOM, and then you can click in and it'll automatically query GUAC for additional information. We plan to do the same for
things like requirements.txt files, go.mod files, and so on, where you'll be able to see: okay, these are all the packages we discovered; we can click in there and look at all the related information, whether there's anything interesting, any known vulnerabilities, whether it matches the local policy, and so on.

And finally, a call to action. We're on GitHub at github.com/guacsec/guac; you can also just take a snapshot of the QR code. There are not many contributors yet, only 11 or so, I believe, and we obviously want more contributors to the project.

Just a couple of call-outs to some prior art: a lot of this goes back to "Reflections on Trusting Trust"; Jacques Chester had a really good article on requirements for a universal asset graph; and a lot of the graph pieces were inspired by Nix and NixOS.

Now we're opening up for questions.

We also have chips and guac that we will give to the first five people that ask. Obviously the supply chain of the guac is not in GUAC, so it's not attested.

Thanks for this talk. I was just wondering: what is GUAC, actually? I saw it's a concept, it's a schema, maybe it's a bunch of ingestion tools to build up the graph. You said it's pre-alpha, and there's a plan for a service that you can query. So what is GUAC? Is it the idea of using a graph? Will it be the database that you build up centrally? What's your vision for this?
Yeah, so primarily it'll be the database, plus the APIs that wrap around that database, and then Google plans to run it as a public service as well. But in the same way that, for Sigstore, you could run your own CA if you wanted to, you could run your own GUAC. We expect that folks are not going to be ingesting their private software into the public GUAC, and there might be a bunch of extra data you want to pull in that the public service isn't ingesting. So that's really what it is: essentially the APIs that make it very easy to both ingest documents and query them.

So I could run my own, and maybe import your prebuilt database dump, and then expand it with my own internal data?

Yes, we have a plan for the future to be able to query the public instance with some starting points, like: I'm interested in this package, this package, and this package. You get the entire subgraph generated from those, and you can import it into your own instance through the same API.

All right, so this is a really interesting talk, and I think this is really needed in the space. There are a couple of problems you have with things that have rich schemas and description types. There's the advantage of having a schema, where a tool like GUAC understands what a lot of the SBOM metadata means. But there's also a problem. Take the example you pointed out with the SHA hashes, but slightly different: one person representing a 256-bit SHA hash writes "sha2", another person writes "sha256", another person says "sha2-256". Have you seen or heard of any kind of
normalization of schemas in this space that would make your life easier?

We have a little bit of normalization inside the parser part of GUAC, but we will still need to normalize to handle multiple cases, so it's still work in progress.

Yeah, and I believe there's some effort on the in-toto side on what's being called a predicate dictionary, which could hopefully help out with a little bit of that. But a lot of work also needs to be done in the community to help build that unified data model. I was actually in a conversation yesterday with some folks from Docker who were saying exactly the same thing: it's great that there are all these specifications, all these schemas, but without some way to help standardize the data model, you end up in a situation where it's really difficult, even when one spec calls it "name" and another says "package" but they're both the name of the package. It becomes a mess.

I think there are also some efforts in the package URL community to try and standardize that a little bit. Still work in progress, but the hope is for it to be a universal identifier.

I know you just did, so I'll get you some later.

Great talk, thanks. What is the query language, either now or in the future plans? And is it something that can be scriptable? I'm imagining, for example, a job in a CI pipeline that uses API calls to GUAC and makes decisions based on that. Is that something on the roadmap?
Yeah, so the plan is probably something like GraphQL for the front-facing user queries. We do plan to make it flexible, especially for folks who are running it internally. As a public service, we'll probably have to restrict what types of queries can be run, because we don't want queries so enormous that they'd kill the service, but we do plan to make it flexible.

Hi, thank you very much for the talk. It's, agreed, much needed. Have you considered using ontologies, or things like SHACL, the shapes constraint language, to help with some of this? Instead of normalizing to one schema for all things, building up an ontological understanding of what the meaning of these things is, and what other names they could be referred to by? It seems like we could take some things from the semantic web work that's been happening over the last good long while.

So I can take a bit of that. If we use the ontologies directly, then we'll have too many edges between nodes, and that would overcomplicate the query part: answering the queries would become too slow. So it's better to encode the ontology representation inside the parser. When we parse a document, we encode the extra information, like "SHA-256" and "SHA-2" mean the same thing, so we are always mapping them to the same metadata.

Yeah, and I think a big challenge we're looking at, which is part of that predicate dictionary and some of the other things, is that we're trying to figure out ways to ingest new document types that still match the data model, without having to recompile the GUAC code and the API every single time. We could have a way of saying: we pulled down some sort of schema that has that ontological mapping to what internally happens inside GUAC.
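The CI-gating pattern from the earlier question can be sketched end to end. Note the heavy caveats: the GraphQL query shape, field names, and response format below are all invented placeholders (GUAC's real API did not exist yet at the time of this talk); only the pattern, query-then-fail-the-build, is the point. The canned response mirrors the demo, where go-runner lacked an attestation and kube-controller-manager had one.

```python
# Sketch of a CI policy gate over a GUAC-like GraphQL response. The query
# shape, field names, and response are invented placeholders; GUAC's real
# API will differ. Only the gating pattern is the point.
import json

QUERY = """
query ($purl: String!) {
  package(purl: $purl) { dependencies { purl attestations { type } } }
}
"""

def check_policy(response: dict) -> list:
    """Return the purls of dependencies lacking any attestation."""
    deps = response["data"]["package"]["dependencies"]
    return [d["purl"] for d in deps if not d["attestations"]]

# In CI you would POST {"query": QUERY, "variables": {...}} to the service;
# here we use a canned response echoing the talk's demo (versions invented).
canned = json.loads("""
{"data": {"package": {"dependencies": [
  {"purl": "pkg:oci/go-runner@v2.3.1", "attestations": []},
  {"purl": "pkg:oci/kube-controller-manager@v1.24.6",
   "attestations": [{"type": "SLSA"}]}
]}}}
""")

violations = check_policy(canned)
if violations:
    print("policy violation, unattested dependencies:", violations)
    # in a real pipeline: sys.exit(1) to fail the build
```

This is the "I only trust stuff that has attestations" policy from the demo, expressed as a script a pipeline could run on every build.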
Yeah, that's something we're very interested in doing.

All right, we have time for one more question. Anyone on this side of the room?

Why did you name it something we can't actually find on Google?

Because you can eat it. Yeah, naming's hard. Also, there are t-shirts in the back, please take some. And if anybody has any other questions, feel free to tap me on the shoulder. All right, thank you so much.