Hi, everyone. I'm Ian Molloy, here with my colleague Jiang Zhang. We're from IBM Research, and we're excited to talk to you about a project we've been working on for a while called the Code Genome, which is about fingerprinting code to build trustworthy SBOMs. One thing we've probably all done is open a terminal, type foo install bar, where foo could be anything from pip to apt to yum, and install software without really thinking about whether we can trust that software or where it came from. A lot of the time the software might be signed with some certificates, so you have some form of authenticity, and it might list its dependencies, but typically that's just so it can install and run correctly. We really don't know whether we can trust it. I think we're going to be at least the third people today to mention Ken Thompson's seminal paper, Reflections on Trusting Trust; the fifth? Okay, so we've missed some. Everyone knows it, and yes, we have a screenshot with the year, something that came up in Brian's talk this morning: 1984. You can't trust any code that you did not totally create yourself. So really it comes down to this: if you're going to install software, how do you trust it, and can you simply rely on the certificates?

This becomes important when it comes to supply chain attacks, and there have been a lot of them. SolarWinds is probably one of the biggest, where a compromised developer certificate was used to insert malicious backdoors into the code. There have been dependency confusion attacks, where someone publishes a package with a very similar name, a kind of typosquatting, or exploits the difference between public and private repositories: you might have your own internal repository and want it to take precedence, but that doesn't always happen. There can simply be vulnerabilities in the package managers or the repositories, and they can be compromised. There's protestware, where a developer turns malicious and changes their own code; there's an example of an npm package that wipes files. I remember historically there were Chrome extensions that were sold to someone else and then turned malicious as well. And the latest attack we had to add to our slide is 3CX, which has the distinction of being a supply chain attack that was itself caused by another supply chain attack.

There are a bunch of ways the industry has attempted to protect the supply chain. If we look at the standard pipeline, you develop your code, then build it, compile it, release it, and finally deploy it on your end systems. Each of those is a key place where you can add checks and build trust into the system, and there have been a number of projects that try to do this. You can add security and vulnerability checkers on the development side. You can build integrity into the build system with things like the SLSA project. You can sign code and create SBOMs so that the final user knows they can trust it.
But we know this is not a perfect solution, and there are different vulnerabilities and exploits that can still happen. Again, looking at SolarWinds, there were the leaked certificates. SLSA requires, for example, two-person review of any commit, which doesn't help with the proverbial one person in Kansas who's running a project. And it doesn't address how you handle things like legacy code or build environments that you might not have access to anymore: older systems, ten-year-old environments that might still be around. Can anyone actually get access to an older compiler, for example? It also doesn't handle the whole legacy code problem. We don't live in a clean-slate world; we have lots of code that's already deployed, and we need to go back and figure out where that code is running and where it came from. Earlier, I think Brian referred to this as forensics. Typically when I hear that term I think of post-mortem, and I'd actually rather view this as pre-mortem: can we use this tool to find out where software is deployed and where it came from before things potentially go bad?

So what we're proposing is what we call software fingerprints, a way to gain assurance that code is what we think it is. The easiest way to express what we mean is with a few examples. Imagine a standard Git tree with a single version, and now I compile it for two different operating systems, a bunch of different distributions and releases. They might use different compilers, different optimizations, maybe different patches. What we'd really want is for those two builds to come up with the same fingerprint. We know the hashes aren't going to be the same, so the signatures won't match. We also tried ssdeep, which you'd think would have some of these properties, but it doesn't, and we'll illustrate that a little later. We want the fingerprint to indicate that this is semantically the same code. Similarly, we'd want different versions to have similar fingerprints: if you have two versions of a package that differ by, say, a small patch, you'd want to identify that they are very similar, and exactly where the two pieces of code deviate, and then be able to look into that. We also want stability across architectures: you can have x86_64, PowerPC, ARM, and we want to be able to assure that across all these architectures it is the same binary in the sense that it's doing the same task, even though it was compiled differently. And finally, we want this to be very robust so we can run it against all of our legacy systems.

This is something the team has been working on for probably close to ten years, some of it during their PhD work. Some of the original work I'd like to mention is ReDeBug, Jiang's work on finding vulnerabilities in code that might have been copied and pasted into other codebases: you patch one copy but not the other, and having a fingerprint that can detect all these variants would be incredibly useful. SigMal is a seminal work on looking at variants of malware.
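As a toy illustration of the "similar versions should have similar fingerprints" property described above, here is a small standard-library-only sketch (it has nothing to do with the project's actual gene): a one-byte patch flips a cryptographic hash completely, while a similarity measure still reports the two versions as close and shows where they deviate.

```python
import difflib
import hashlib

# Two "versions" of the same code, differing by a small patch (illustrative strings only).
v1 = b"read(); check_bounds(); copy(); respond();"
v2 = b"read(); copy(); respond();"   # bounds check dropped, everything else identical

# Exact hashes are all-or-nothing: any change produces a completely different digest.
print("sha256 v1:", hashlib.sha256(v1).hexdigest()[:16])
print("sha256 v2:", hashlib.sha256(v2).hexdigest()[:16])

# A similarity measure degrades gracefully and can point at where the versions diverge.
matcher = difflib.SequenceMatcher(None, v1, v2)
print("similarity:", round(matcher.ratio() * 100), "%")
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, v1[i1:i2], "->", v2[j1:j2])
```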
And we used some of the original ideas from the SigMal paper to create our fingerprint. In 2017 we started looking at Docker image vulnerability scanners and realized that, at least the ones we looked at, they all had a fairly fatal flaw: they relied on the metadata inside the containers. As a result, we couldn't trust the results we were getting out of them, because we want to treat these images as potentially compromised, potentially malicious, and we honestly don't want to trust what they state is inside them. Fast forward, because the project got off to a bit of a rocky start, to 2021, when the gift that keeps on giving, Log4j, reared its ugly head. IBM, like every other organization, had to scramble to figure out where Log4j was installed, which versions, and remediate them. And what we found is that most Log4j scanners had several deficiencies: they typically looked for packages with a very specific name. Do you have log4j.jar with a specific version? Because of how a lot of software is packaged, Maven jar-with-dependencies where everything gets put into one archive, this would completely fail. Lots of other third-party packages embed it inside, and the scanners were missing all those instances. So we dusted off some of our old ideas, created the Code Genome project, announced it at the Linux Foundation Member Summit, and today we want to give you a peek at what we've been doing since then and how it's shaping up.

So how does the Code Genome actually work? How do we compute fingerprints? What we want, again, are semantically meaningful fingerprints. What you see here is four different versions of code that all compute the same function. They differ in how they might do inline assembly, they might apply some light obfuscation, they might add some additional functions or routines, but at the end of the day they're doing the exact same computation. You can see the function, you can see the machine code, but what we really want is for them to come up with the exact same representation. What we do is canonicalize that into a single gene, like so. We take the original source code and either compile it and then lift it to an intermediate representation, or go straight to an IR, and we canonicalize that: we apply different optimizations and mutations to come up with a single representative form of that code. Then we apply the equivalent of a fuzzy hashing function to get an embedding, and that embedding becomes the gene. In the Code Genome project we use LLVM IR, and we can do this at multiple levels of granularity.

I like to say that software distribution is just a turducken: we have archives filled with archives filled with other archives, and we have to continuously unpack them. So when we get a new file to analyze, a Debian package, an RPM, a Docker image, whatever it might be, we start to recursively unpack it and find all the files inside until we find something that's actually an executable. From there, we can compute a gene at different granularities, whether that's the entire file, a segment like the text segment or the data segment, or each individual function.
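As a rough illustration of that pipeline, and not the project's actual implementation, here is a minimal sketch: it compiles a small C file to LLVM IR with clang, runs it through opt to get a more canonical form, strips away names that vary between builds, and hashes the result. It assumes clang and opt are on the PATH, and the opt pass syntax shown is for the newer pass manager and varies by LLVM version; the real system lifts binaries to IR rather than compiling source, and produces a fuzzy embedding rather than a cryptographic hash.

```python
import hashlib
import re
import subprocess
import tempfile
from pathlib import Path

C_SOURCE = "int add3(int a, int b, int c) { return a + b + c; }\n"

def source_to_gene(c_code: str) -> str:
    """Toy 'gene': compile -> LLVM IR -> canonicalize -> normalize -> hash."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, "f.c")
        src.write_text(c_code)
        raw_ir = Path(tmp, "f.ll")
        canon_ir = Path(tmp, "f.canon.ll")
        # Emit un-optimized LLVM IR from source (the real pipeline lifts binaries instead).
        subprocess.run(
            ["clang", "-S", "-emit-llvm", "-O0", "-Xclang", "-disable-O0-optnone",
             str(src), "-o", str(raw_ir)], check=True)
        # Canonicalize with a standard optimization pipeline (flag syntax differs across LLVM versions).
        subprocess.run(
            ["opt", "-S", "-passes=default<O2>", str(raw_ir), "-o", str(canon_ir)], check=True)
        ir = canon_ir.read_text()
    # Strip things that differ between builds: SSA value names, metadata ids, comment lines.
    ir = re.sub(r"%[\w.]+", "%v", ir)
    ir = re.sub(r"![\w.]+", "!m", ir)
    ir = "\n".join(line for line in ir.splitlines() if line and not line.startswith(";"))
    # A real gene would be a fuzzy embedding; a plain hash only shows that the canonical form is stable.
    return hashlib.sha256(ir.encode()).hexdigest()

if __name__ == "__main__":
    print(source_to_gene(C_SOURCE))
```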
And as we do this, we start to build up a graph of these different representations and make connections: we have a package that has an archive inside it, which has files, which have functions, which have genes, and so on. Then we can make connections the other way: which other functions share those same genes, which files were they in, which packages and archives were they in? We can use this to make a connection for something new and unknown, if we have some ground truth information, and figure out what it is, where it came from, and track that back in time. At the end of the day we produce what we hope is a very large, very complete knowledge graph of all the software, all the code, all the functions, where they came from and how they're used.

Once we have this, we can hopefully do a lot of interesting and useful things. One is vulnerabilities: if you have a new vulnerability, you can identify which function was impacted and where it is, what other packages are using it, what other code is using it, whether it was copied and pasted, and so on. If you have some unknown code, you might be able to classify it: what is this code actually doing? Is it network code, crypto code, display code, compression code, things along those lines? And again, where is it located? We can use this for unknown package identification; in this example it's wget, and we can identify that. And if we have threat intelligence, we can say not only is this potentially a vulnerability, this was malicious code: code we saw in a piece of malware, injected into a third-party application and distributed, and where else are we seeing it? So we can hopefully find repeat attacks and other supply chain attacks. With that, I'm going to hand it over to Jiang, who's going to run through some of our core use cases and results.

Okay, thank you, Ian. Of course we are at a supply chain security conference, so of course we'll talk about SBOMs. What is the relationship between the Code Genome project and SBOMs? We think there is an opportunity to use the genome to validate and verify SBOMs. The problem we're trying to tackle is this: the SBOM is a great format and a great initiative for understanding and describing the dependencies and components of a specific piece of software provided by a vendor. But the issue is, can you trust the SBOM? Maybe the developer isn't familiar with all the SBOM formats and specifications and makes a mistake. Or they don't have full knowledge of the dependencies, or the dependencies' dependencies, and miss some of them, so the SBOM is incomplete. Or in some cases, especially with commercial software, as you may have seen in other talks and in the open source community, it's not uncommon for people to use GPL code inside a commercial product, and then there's a strong motivation to hide the dependency because they don't want to get into legal trouble. So how do we verify that the SBOM you get from a vendor is really complete and correct? That's the problem we're trying to tackle with the software genome project. And how do we do that? We first went out and looked at some of the tools available out there.
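To make the graph Ian described a little more concrete, here is a minimal, purely illustrative sketch (all package, file, function, and gene names are made up): forward edges link packages to files to functions to genes, and a reverse index lets us ask which known packages share a gene with an unknown binary.

```python
from collections import defaultdict

# Forward edges: package -> files -> functions -> genes (toy data, hypothetical values).
graph = {
    "wget-1.21-1.deb": {
        "usr/bin/wget": {
            "retrieve_url": "gene:a13f",
            "parse_headers": "gene:9c02",
        },
    },
    "curl-8.5-2.rpm": {
        "usr/bin/curl": {
            "parse_headers": "gene:9c02",   # the same gene appears in two packages
            "tls_handshake": "gene:77b1",
        },
    },
}

# Reverse index: gene -> (package, file, function) occurrences.
gene_index = defaultdict(list)
for pkg, files in graph.items():
    for path, funcs in files.items():
        for func, gene in funcs.items():
            gene_index[gene].append((pkg, path, func))

def identify(unknown_genes):
    """Given genes extracted from an unknown binary, report which known packages share them."""
    hits = defaultdict(int)
    for g in unknown_genes:
        for pkg, _path, _func in gene_index.get(g, []):
            hits[pkg] += 1
    return sorted(hits.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    # An "unknown" binary whose functions happen to carry two known genes and one unseen gene.
    print(identify(["gene:a13f", "gene:9c02", "gene:ffff"]))
```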
And of course, the SBOM specifications have evolved over time, with different formats, SPDX, CycloneDX, all the communities around them, and there are lots of great tools out there. This slide is not meant to criticize those tools; it's trying to understand where the potential gaps might be, because as security researchers we need to think about where the gaps are that might be exploited by an attacker if we don't handle them carefully. What we noticed is that most SBOM generation tools rely on metadata. Metadata is great information to bootstrap from; it tells you about the components. The potential issue is that it can be easily manipulated. For example, on the left-hand side, you build from a quite simple Dockerfile: you spin up Ubuntu and install wget, and in the end you get an SBOM that includes wget. On the right-hand side, you see there's a move command, which is technically removing the dpkg database from Ubuntu. As you know, whenever apt-get does something it updates the dpkg database, but if you just remove the database you end up with an empty one, and now there is no metadata, so there's no match for wget. Even though the image has wget, or maybe some other program, we can easily hide its presence, or even manipulate it.

Also on the right-hand side, we looked into Go binaries; that happened to come out of a collaboration with some friends. The interesting thing with Go, as you know, is that it takes the code and its dependencies and statically compiles everything into one giant executable. The nice thing is that there is a section called build info that contains all the dependencies that were linked in when you built that specific output, which is a great source for collecting dependency information. The problem is that, as you can imagine, you could hex edit it, just edit it, and suddenly it reports a different package name. So metadata, we can use it, but we still need to verify whether it's actually the case or not. That's the problem we're trying to address.

Recently you may also have seen the same chart multiple times in other talks: CISA describes different types of SBOMs, depending on which phase of the software development lifecycle they're produced in. Most of the time we talk about the design, source, or build SBOM, because that's where you can get the exact dependencies the software relies on; that's a great place to get it. But the problem is that with commercial software, as an end user or a developer relying on it, you don't necessarily have access to that phase. That means we have to rely on other methods, like an analyzed SBOM or a deployed SBOM, meaning an SBOM derived from the actual binary artifact you get from the vendor, a way to inspect it and get the SBOM out of it. That's the benefit we're trying to get from the software genome: we are not relying on the actual source code.
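Here is a minimal sketch of the gap just described, assuming a Debian-based root filesystem: it compares the packages declared in the dpkg status database against executables actually present on disk. If the status file has been emptied or removed, a purely metadata-driven SBOM comes back empty even though the wget binary is still sitting there. The paths and the crude "is executable" check are illustrative only, not how the Code Genome pipeline works.

```python
import os
from pathlib import Path

DPKG_STATUS = Path("/var/lib/dpkg/status")   # Debian/Ubuntu package metadata
BIN_DIRS = ["/usr/bin", "/bin"]

def declared_packages():
    """Package names according to metadata; empty if the database was removed."""
    if not DPKG_STATUS.exists():
        return set()
    pkgs = set()
    for line in DPKG_STATUS.read_text(errors="replace").splitlines():
        if line.startswith("Package: "):
            pkgs.add(line.split(": ", 1)[1])
    return pkgs

def executables_on_disk():
    """Executables actually present, regardless of what the metadata claims."""
    found = set()
    for d in BIN_DIRS:
        for root, _dirs, files in os.walk(d):
            for f in files:
                p = os.path.join(root, f)
                if os.access(p, os.X_OK) and not os.path.islink(p):
                    found.add(f)
    return found

if __name__ == "__main__":
    meta = declared_packages()
    disk = executables_on_disk()
    print("metadata says wget installed:", "wget" in meta)
    print("wget binary actually on disk:", "wget" in disk)
```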
So again, we analyze the binary code itself and try to get the SBOM out of it, so that we can support, for example, legacy code where we may have lost access to the actual source code, or a build environment that's maybe ten years old and hard to replicate. It's not easy to rebuild that process, but we still want to analyze the artifact and get an SBOM out of it.

When you look at SBOMs for binaries, we noticed there are multiple levels of complexity, multiple levels of the problem we may need to tackle. I already talked about metadata: metadata ships with some software, it's nice-to-have information, and we can use it; that's case zero. The next case is equivalent packages, for example a known RPM or Debian package: if the hash matches, of course we can just use it. A more complex case is the individual file: for example, someone grabs files from different packages and creates a new package. The package itself now has no matching knowledge, because the package hash is different, but we still want to say something about each individual file, so we can report which packages it might contain inside. That's file-level information. Then we can go one level further down: each file may contain multiple functions. The reason we're interested in functions is that, as Ian mentioned, there are multiple levels of granularity at which we can inspect software, and a function is naturally a self-contained unit that represents some computation, by definition. So we focus on function-level granularity: even if a specific file has changed from a hash perspective, we can tell how much each individual function really changed from version one to version two. Is it changing one specific function, is it addressing some vulnerability? We can report at that level of granularity and match on it. Of course, current SBOMs handle granularity at the file level, so we're discussing with other people what the right level of granularity is and how we want to represent this extra information. That's case three. And if we go down to an even more complex case, it's about not matching exactly at the function or gene level, but having some degree of similarity, maybe 80% similarity. How are we going to express that? An SBOM says which package at which version, but it doesn't carry a confidence level. How to capture this is still up for discussion, but these are the different levels of complexity we're trying to address. As you can tell, we don't have a complete solution yet, but this is the beginning of how we want to address it, moving from left to right, so that we can provide more detailed, more trustworthy information to vendors, developers, and users, and they can make informed decisions about a package.

So that's the presentation part. Now let's move on to some demos; we're going to show three, and I think we have enough time to go through them. The first demo is about SBOM generation.
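As an aside before the demo, the matching tiers just described can be sketched as a small decision function. Everything here is hypothetical toy data, not the project's actual logic; it only shows the idea of falling back from metadata, to package hash, to file hashes, to exact function genes, and finally to fuzzy gene similarity.

```python
def match_level(candidate, knowledge_base):
    """
    Toy classifier for the matching tiers described above (all data hypothetical):
      0 = trusted metadata present, 1 = whole-package hash match, 2 = per-file hash match,
      3 = exact function-gene match, 4 = only fuzzy gene similarity.
    """
    if candidate.get("metadata"):
        return 0, "metadata"
    if candidate["package_hash"] in knowledge_base["package_hashes"]:
        return 1, "package hash"
    file_hits = candidate["file_hashes"] & knowledge_base["file_hashes"]
    if file_hits:
        return 2, f"{len(file_hits)} known file(s)"
    gene_hits = candidate["function_genes"] & knowledge_base["function_genes"]
    if gene_hits:
        return 3, f"{len(gene_hits)} exact gene match(es)"
    # Tier 4: fall back to similarity; a trivial set-overlap ratio stands in for gene distance here.
    best = max(
        (len(candidate["function_genes"] & genes) / max(len(genes), 1)
         for genes in knowledge_base["per_package_genes"].values()),
        default=0.0,
    )
    return 4, f"best fuzzy similarity {best:.0%}"

if __name__ == "__main__":
    kb = {
        "package_hashes": {"sha256:aaa"},
        "file_hashes": {"sha256:bbb"},
        "function_genes": {"gene:9c02"},
        "per_package_genes": {"wget": {"gene:9c02", "gene:a13f"}},
    }
    unknown = {
        "metadata": None,
        "package_hash": "sha256:zzz",
        "file_hashes": {"sha256:yyy"},
        "function_genes": {"gene:a13f"},
    }
    print(match_level(unknown, kb))
```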
I'm going to show what an unknown RPM package contains. On the left-hand side you see the file tree: inside there are multiple files, multiple executables, packaged as an RPM. When you use the existing tools, this is a brand-new package with no metadata, so all of the tools we tried couldn't generate any SBOM for it, because it's completely new. We submitted it to our Code Genome service and got an SBOM back, which I'll show on the next chart, and then we can show how this generated SBOM can be used by other SBOM analysis tools. Let me play the demo.

This is the UI of the Code Genome; the team has been working on improving the usability so that it's intuitive for a lot of people. From the UI we upload a file, so we select this unknown RPM, whose file tree I just showed, and upload it. It starts processing behind the scenes, and several minutes later it comes back with the analysis results: the typical information first, file name, file hash, file type, file size. Obviously, whenever it's an RPM or some other package file, we have to inspect the inside; as Ian mentioned, software packaging is kind of a turducken, so we have to go deep and see what files are actually inside. Then, from all the executables we identify, we generate the CycloneDX SBOM on the right-hand side, as you can see. This is the full SBOM; I'm not going to scroll through all of it, but we can easily download it and then ingest it into other tools, which I'll show in the later part of the video. In the middle of the screen we show a summary view, a table of the components we identified in this unknown RPM, along with their versions, licenses, and package URLs, so we can present the component information. That's quite useful for characterizing an unknown RPM. One extra thing we developed in the UI is that for a given job we can highlight which files were processed successfully and which failed, which helps with debugging, because there can be many different types of file and each brings different challenges.

Now I'm moving to the Dependency-Track UI. If you don't know Dependency-Track, it's an open source project that can be used to ingest and visualize SBOMs, and it connects to a lot of open source intelligence: if you have multiple projects and you ingest an SBOM for each of them, it will also connect to sources like CVE and the NVD, or package metadata from registries like npm. It automatically connects that external intelligence and highlights where the vulnerabilities are. Since we just downloaded this SBOM, I've already created a project for the unknown package, and I'm going to upload the new SBOM we just generated. If the format were not correct it would be rejected, but of course we have a correctly formatted CycloneDX SBOM, so it ingests it and highlights which components are inside this package and what the SBOM contains.
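For reference, a minimal CycloneDX document of the kind being uploaded here might look like the following sketch. The component name, version, license, and purl are illustrative only, not the demo's actual output.

```python
import json

# A minimal, hand-rolled CycloneDX 1.4 JSON BOM with one illustrative component.
bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.4",
    "version": 1,
    "components": [
        {
            "type": "library",
            "name": "wget",
            "version": "1.21.3",
            "purl": "pkg:rpm/fedora/wget@1.21.3",
            "licenses": [{"license": {"id": "GPL-3.0-or-later"}}],
        }
    ],
}

if __name__ == "__main__":
    # JSON in this shape is what a tool like Dependency-Track expects to ingest.
    print(json.dumps(bom, indent=2))
```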
It then shows the version and license, and in this case it highlights one vulnerability, but the disclaimer is that for demo purposes we ingested artificial vulnerability intelligence, just to show that if we have open source intelligence we can connect it; it does not represent a real vulnerability in this software. So that's demo one: generating an SBOM from a given binary or RPM package, and then using that SBOM for analysis in other tools.

Now let me move on to demo two, which is about reproducible builds. Earlier today there was a talk from Red Hat about reproducible builds, and of course it's a really hard problem, because there are multiple factors that affect the final binary. Reproducible builds, in simple terms, means that given the same source code you get the same binary, so that it can be used to verify the compilation process was not compromised. It's a really great concept for verifying there was no compromise during compilation, and we think the Code Genome could be one way to approach this problem: if it's the same source code, then regardless of the compilation environment or setup, we should generate the same gene. That's one way we can try to tackle the reproducible build problem.

Here I'm showing a screenshot comparing two binaries built from the same source code, from binutils, in this case the elfedit utility, and we compile it with different options: the one on the left-hand side is compiled with O2 and the one on the right-hand side with O3. As you can see, the gene similarity is close to 100, and in this case it is 100. That means these two files, even though compiled with different options, result in the same gene, so we can say these two binaries look different and were compiled differently, but they actually come from the same source code. A more challenging case: this is coreutils, touch is the binary, and the path shows the source it came from. We compile it for two different architectures, x86 64-bit on the left-hand side and ARM 64-bit on the right-hand side. Of course the binaries are different, but even though the architectures are different we get almost the same gene, not identical but very similar, so we can tell these actually come from the same source.
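To see why the gene is needed at all here, a minimal sketch, assuming gcc is installed and with the ssdeep Python bindings optional: it builds the same trivial C file at O2 and O3 and compares the results. The SHA-256 digests will generally differ once the code is non-trivial, and a byte-level fuzzy hash like ssdeep typically scores low, whereas the demo's gene similarity for this kind of pair is at or near 100.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path

C_SOURCE = (
    "int main(int argc, char **argv) {\n"
    "  long s = 0;\n"
    "  for (long i = 0; i < argc * 100000L; i++) s += i * i;\n"
    "  return (int)(s & 0xff);\n"
    "}\n"
)

def build(opt_flag: str, workdir: Path) -> bytes:
    """Compile the toy source with the given optimization flag and return the binary's bytes."""
    src = workdir / "t.c"
    src.write_text(C_SOURCE)
    out = workdir / ("t" + opt_flag)
    subprocess.run(["gcc", opt_flag, str(src), "-o", str(out)], check=True)
    return out.read_bytes()

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        b2 = build("-O2", Path(tmp))
        b3 = build("-O3", Path(tmp))
    print("sha256 -O2:", hashlib.sha256(b2).hexdigest()[:16])
    print("sha256 -O3:", hashlib.sha256(b3).hexdigest()[:16])
    try:
        import ssdeep  # byte-level context-triggered piecewise hashing
        print("ssdeep score:", ssdeep.compare(ssdeep.hash(b2), ssdeep.hash(b3)))
    except ImportError:
        print("ssdeep bindings not installed; skipping fuzzy-hash comparison")
```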
At this point you may be curious how well the gene performs across all these challenging cases, because we're talking about different platforms, architectures, compilers, and compiler options; there are so many variations we need to handle. This is a really challenging problem, and if you work on binary analysis you know it's a well-known problem going back decades. We are not claiming we have 100% accuracy, but we are working hard to get there, and this chart shows where we stand and where we're going. First let me describe the dataset we used to generate this graph. We used coreutils, which as you know has 105 different programs in it, and we compiled it; actually we used the dataset from BinKit, citation number four, which compiles with many different options: five different architectures (x86 32-bit and 64-bit, ARM 64-bit, MIPS, and so on), nine different compilers (four versions of Clang and five versions of GCC), and four different optimization options. Multiply that out and you get 180 different combinations, meaning for each program we have 180 different binaries compiled with different combinations of options. These binaries all come from the same source code but end up in very different forms, for different architectures. Once we had all these binaries, we randomly selected 250,000 positive function pairs and 250,000 negative pairs, and then measured the accuracy of different genes.

Why are we talking about different genes? Because we don't think a single gene will be the best for everything. If you think about file hashes, there are multiple: MD5, SHA-1, SHA-256. Likewise, we think there can be multiple genes that represent a program, and different genes might perform better for different use cases, so we're exploring multiple genes. In this chart we are comparing different types of fuzzy hashing. The first one, in blue, is ssdeep; as you know, this is context-triggered piecewise hashing, mainly used for text comparison, and here we're feeding it the raw bytes. Of course that's really challenging for ssdeep and probably not the intended way to use it, but we tested it because ssdeep is one of the most well-known fuzzy hashes. It came out close to 50, meaning almost random guessing; it's really hard to get good quality from byte-level fuzzy hashing with ssdeep. Then we have gene version zero, which Ian mentioned in the early part of the presentation: one of the team members who works on malware analysis developed a tool called SigMal, which uses an image-level representation of the byte code. That turns out to be more robust than ssdeep, but it still doesn't give a really good sense of quality. We also used FunctionSimSearch, an open source project from Google Project Zero that uses control flow analysis. So instead of applying fuzzy hashing to raw bytes, the idea is to bring in more program analysis to understand the software first. That's the idea behind FunctionSimSearch: starting from the control flow graph, which is basically a graph representation of how
the program executes from beginning to end, they do structural-level fuzzy hashing and matching. We measured it on the same dataset, and of course, moving from simple image processing to program analysis, the accuracy goes up a bit. But we still want to improve, because we still have that 100% goal. So what we did, as I mentioned in the earlier part of the demo, is use the IR, the intermediate representation, where we can apply architecture-agnostic optimizations to handle many different variations. The result bumps up to almost 90%, which is quite a good improvement for simple fuzzy hashing that doesn't require any training. But as you can tell, we are not at the 100% level yet, so there is still room to improve the program analysis. At the same time, in today's talks you've probably heard about all the AI and foundation model development; another direction we're looking into is whether there's a better way to embed the knowledge we have, and that's the direction we're considering for further accuracy boosts. The team is currently working on that. As you can tell, this is still research and a work in progress, but I think with the current progress it's already good enough to tell you about differences.

The last demo I'm going to show is about Heartbleed. As you may remember, there was the Heartbleed vulnerability, and we grabbed the two versions of the code, before and after the patch, and compiled them: on the left-hand side is before the patch, on the right-hand side after the patch. I'm not sure whether you can see it, but both programs result in almost 5,000 genes, meaning close to 5,000 functions, and we are able to compare them and tell that only these five, six, seven functions differ between the two. So given two consecutive versions of a binary, we can tell what changed, and then we can inspect whether that change was expected or not. For example, the developer knows this is a security patch and these are the changes they intended, so we can compare at the binary level and see whether the change was actually made in the right functions. If not, something may have happened during the compilation process, and someone needs to take a deep dive into what went wrong. In this case, one disclaimer: the actual function change made for Heartbleed was only two functions, the third and fifth ones here, but we see more because some of the functions rely on a global variable, and that resulted in additional changes between the two binaries. As I mentioned, this is a work-in-progress project; we're trying to figure out how to minimize the variation caused by the compilation itself, and that's something we're improving over time. But it's already good enough to show and highlight the differences between two versions, so we can pinpoint where to look. So this is the last chart, on the current
status of the project. We currently support different types of binaries, ELF Linux binaries, Windows PE files, and macOS binaries, and six different architectures. In terms of packages, we support the well-known Debian and RPM packages and a whole lot of different archive formats, basically peeling the onion to get to the actual file content. We built the whole platform in a cloud-native manner so that we can handle many different files at scale; otherwise it's really hard to scale up and handle that many files. We're also building the knowledge graph behind the scenes, which will ultimately be the knowledge we draw on for each binary. We've already shown two different versions of the genome, and we're improving the next version of the gene to get a more accurate representation. Of course, we're also populating more data into the knowledge graph so it can tell us more about software. As a next step, we're trying to integrate this into the build process, so that during the build we can verify whether this is the right package, the right thing to include, and tell when something went wrong. Another thing that brings a different level of complexity is Docker images. Images are basically just tar files, so we can easily handle them, but the issue is that a Docker image is almost an entire OS, which means it contains a huge number of files; think of an Ubuntu or Debian base image. That brings a different level of scaling problems, and that's something we're working on. We're also planning a limited launch of the service so it can benefit the community: people can upload their files, see what kind of SBOM they get, and compare different versions of a file. And one request we have for the community is that we want to hear feedback: is this useful, is this something you'd be interested in joining, maybe to help improve the gene quality or feed in more data? Maybe there are more interesting use cases you face in your day-to-day development. We want to hear your insights and feedback. So that's the end of the presentation, and we're happy to take any questions.

The question was how to map from a binary back to the original source code. At the moment we ingest trusted sources of data, like the Debian and Ubuntu repositories; that's where we get the actual binaries, their genes, and the associated metadata. And if we have the source code, of course, we can build it, generate the genes, and capture that information in the knowledge graph as a link saying this source code was used to generate this binary. Those kinds of relationships are captured in our knowledge graph.

I was going to ask whether you could give some insight into the performance aspects of this analysis: how long does it take to analyze, say, a 10-megabyte JAR or something like that? That's a really good question, and to be honest we don't have a really good number to give yet, because this is a research project with back-and-forth changes and improvements here and
there. As a data point, the unknown RPM we just processed was almost 200 kilobytes as an RPM, though of course it unpacks to more; I forget how many files and what total size. In the end it was fully processed through the pipeline, generating the result, within about four minutes. Do you want to say something? Yeah, we're good. You can imagine, we made an explicit comment about scalability and decreasing cost; there were a few versions of this where the bill turned into an Amazon meme, and we've gone through and made a lot of improvements and optimizations there. The other interesting aspect is the long-tail distribution of file sizes. Standard RPMs, a couple of megabytes, we can handle; as they become larger, it becomes more difficult to do the decompilation, to lift it back up into an IR. So we have to figure out where that inflection point is and what the average is, and I don't think we can give you an "it's two minutes per megabyte" figure. I keep asking the team what the dollars per gigabyte of processing is, so it's a little tricky to answer; we might need to amortize by functions and do a regression. That's still something we're working on, but good question.

One thing I'm curious about is what kind of audience you're imagining for this. Who is going to use it: distributors verifying things before distribution, somebody before them, auditors? And also, who should be feeding the database of genomes so that it can be trusted to be truthful? You can answer. All right. Thanks, Jeff, great question. So we have a couple of models we're considering. One that we think makes sense is standing up a public service, backed by a very large graph database that contains all this information, and as the question suggests, we would have to seed that with some ground truth information. How would people use it? There could be any number of ways. For us, research and threat intelligence; anyone who wants to verify a binary or a dependency before they ingest it, so DevOps shops could use it. We've thought about integrating it into your package manager so it does a check; that might be a little heavy, but we're looking at these different use cases. So if people have ideas and say, hey, something like this is great and here's how I would use it, we'd love to hear that.

Okay, before I answer, sorry, let me add to the question. Of course, supply chain is one of the problems we want to tackle, but at the same time, as Ian presented before, there are several use cases. Validation of packages is one thing, but eventually what we're trying to do is understand software: identify unknown packages, figure out what a package is doing; that's the SBOM-related part. Threat intelligence would be the other interesting angle that comes up when you discuss this with people.
So you have the knowledge graph, and of course we're feeding it benign software, all the goodware, but what if we also feed it malware, maybe some state-actor malware? Then we can make links about where a given package or file came from: we can trace back that this file was used in many different campaigns, or that it shares components with benign software. Another interesting case is when a vulnerability is found in one package but the same vulnerable code exists in many other undocumented packages; we want to expand the knowledge graph to find those other vulnerable instances. So there could be many different interesting use cases, and again, we're open to hearing feedback about potential use cases or challenges, so we can adapt and use this in a better way.

My question is, is this something that could potentially be used to enhance SBOMs where metadata is missing, by cross-verifying or cross-checking somehow? Is that a use case? Yes, you can imagine you get an SBOM that says, hey, here's a package, it contains these four things, trust me, I signed it. Yeah, Ken Thompson, I'm sorry, I don't really trust other people. I want to be able to verify it; I want to rip it apart, figure out what's inside, and confirm that myself. I want to know that multiple compilers are producing something that is semantically the same.

Okay, and is this open sourced? Not yet; that's something being discussed at the moment. Also regarding the open source aspect, one thing the team is discussing is what the best way to benefit the community is, because we believe this is going to help open source security and could provide much more interesting intelligence about software. But as you can tell, it requires a huge knowledge graph, and that means someone or some organization needs to stand it up and maintain its quality, which is going to be a huge task. So we're opening up the discussion about what the best model is to benefit the community, and what the proper way is to build and extend the quality of this project, things like the gene quality measurements we presented. With the genes we can do more interesting mapping and correlation, but that also requires the knowledge graph, and how to build and query it in a scalable manner is an interesting question. Open sourcing is one way, but we also need to figure out how to properly maintain the entire service, because a single person probably can't afford to stand up such a huge knowledge graph. So there needs to be some form of centralized community effort to support it.

I've got a clarifying question and then a follow-up. In the results where you had the comparison of the genes, was that comparison gene equivalency or the five-tiered thing? I'm curious how much of it was gene equivalency versus similarity. To answer this:
we picked 250,000 positive function pairs and 250,000 negative function pairs, generated the different genes, and checked whether they matched or not. If a pair is positive and the genes match 100%, that's correct; if the pair is negative and the genes don't match, that's also correct. That's how we measured the accuracy. So this is equivalency, right? Yeah, equivalency, cool. And for the similarity function, is it something that's easy to index, or do you have to go through each one and do the comparison? Okay, so I guess there are multiple avenues. One is exact matching of the gene, which is what we tested here, to measure the real quality we can get out of it. The other interesting angle is the most-similar case: it may not be a 100% match, maybe 99%, so it's more like a nearest neighbor search. The knowledge graph is a good way to explore and find matches, but you can't explore the entire knowledge graph for every single query to find the nearest neighbor. So for that we're using Milvus as a vector database to index the genes, which helps with finding the nearest neighbor: given the vector representation of a gene, we can find the closest one in a scalable manner. That's something we're using on top of the knowledge graph. Oh, cool. I was asking because I imagine that if you could evaluate the genome, or individual functions, and have them be packaged with the delivery of the binary release, stuff like that, then anyone could do it themselves and construct parts of the graph and evaluate it as well, so you don't necessarily have to centralize it. That's a good comment, yeah.

Can you go back to the graph slide? Because I'm puzzled about what happens with the first analysis of a gene that comes from a package that has a backdoor. You don't have any reference before, and the first one you get is the backdoored one; how can you actually tell that this package is backdoored? Yeah, so that relies on the quality of the knowledge graph. If it's something we've never seen before, of course we cannot tell exactly what this code is, but we can tell what it is similar to among known genes. If you're looking for an exact match and it's completely new code, yes, we cannot tell. That would also imply it's a tactic you've never seen, that no piece of malware has done a task similar to that one. There tends to be some similarity between different ways of producing the same malicious function, so we rely on that, and the hope is that as the knowledge graph in the database grows and gets denser, the likelihood that we wouldn't have similar matches, and be able to track them back to what they do and where they came from, will go down over time.

Can you go back to the embedding slide, I think two slides before this one? Which one? More, this one. The embedding function, is it semantic? For example, if I have a function that shells out to curl and another one that shells out to wget, will the embeddings be similar? For this embedding part we're using a colleague's previous work called SigMal. We take the machine code, lift it to IR, and canonicalize it with a bunch of LLVM IR passes to optimize it.
The goal is to minimize the differences at the program analysis level. Once we have the bitcode, we represent it as a big vector, essentially an image, and embed that into a vector representation so we can do the comparison. As you can tell, it's a fairly simple transformation. This is a part where we're exploring many different types of embedding to improve robustness: semantics-preserving embeddings, more complex approaches, better projections. As you can see, on the left-hand side we're trying to bring in knowledge from the program analysis and software engineering communities, and on the right-hand side we're trying to leverage the AI community for embedding methods. In this case we have the bitcode and the image representation, but there could be control flow, data flow, graph representations; the IR itself is kind of a language, so there are multiple other models we could use. That's the parallel development we're currently exploring. Do you want to add anything?

One thing to add: when we look at this, it is a pipeline, and there are key areas where we can see improvements, key places where tomorrow I'd want to rip a part out, redo it, and get the fourth-generation genome. We know there are issues with lifting to IR with different tools, and people have evaded that, so we've looked at different tools to make that more robust. The canonicalization, as Jiang said, leverages the compiler community's optimization work, and there's probably a lot more we can do there to get better canonical forms. The actual embedding itself is another area where we have some very concrete ideas. And then there's one thing we do on top of that, which is realigning or re-grounding the embeddings. Even with the existing approach, which has given us a surprising amount of lift given how simple it is, we were able to re-ground it with small numbers of samples, and that gave a three-to-four point bump in AUC from a very simple realignment, which was a little surprising. And as Jiang mentioned, there are different subclasses of this: same architecture and compiler but only changing the bitness from 32 to 64, for example. For some of those we got more lift, for others less, so we're pinpointing the places where we can make optimizations. Yeah, that sounds good, thank you. I was thinking of something like Word2Vec for the Code Genome. Yes, very much so. Everyone is jumping on the foundation model bandwagon, and IBM is no different there. The issue with things like Word2Vec or Asm2Vec, and what you pass them, is that they miss this middle layer. We considered that originally, but it misses all these different optimizations you could apply; it didn't directly address some of the issues. Did you have a follow-up? We've got time for, I think, five to seven minutes before we're at the top of the hour. Any final questions?

So have you tried any tests on code where the compilation itself alters the code, but it's still the same source?
Let's say with OpenSSL, when you need to do export control you take away parts of the code, and the resulting binary comes without certain plugins or other parts. And then, for example, when it's ported to China, it needs to drop some algorithms. So it's basically the same code, same architecture, but with different parts built out of the compilation. We haven't done that, but I imagine it would be interesting to do. We can probably do it fairly soon, and it would be interesting to look at FIPS versus non-FIPS, that type of thing. What we would expect is that the functions that don't match are the ones we'd identify: this is what's been ripped out, this is what's been added, this is what's been changed. Where you make the switch to call a function that is no longer there, that would change by some small amount; the functions you've removed obviously aren't going to have any matches. So I would imagine it would look something like that, but we have not done that test yet.

Looks like it's time for closing remarks. I didn't have any closing remarks, but I know there are a couple of people; I'll go to the final slide. I will say this: if people are interested and want to pressure us into open sourcing it, if you have a compelling reason to do that, there are a few of us here, Jeff is here, who I think has been encouraging us in the past. We need that compelling reason to do it; I think it's there, but we'd love to talk to people about how we might be able to use this and collaborate on it. And then there's that question: is it a big service? It's going to be expensive to run, so how do we do that? Is it open source, is it some distributed architecture? So we're very interested in collaborating and talking. Cool, thank you.

All right. Well, just before you applaud, I'll say I did hear VMware ask if this was going out as open source, so that's one request. If anyone else wants to come up and talk to me about their interest in seeing this go to open source, that would be fine. But okay, there's a couple more. I want to recognize the hard work that these two and some others in IBM Research, as well as other projects in the past, have laid as a foundation for this. So as you do a little round of applause, I just want to add my thanks to Jiang and Ian. Great job, you guys, thank you.