Hi, welcome. This is the second-to-last talk of the whole conference, so I appreciate everyone coming to hear us speak about where your source code is vulnerable, and matching CVEs, Common Vulnerabilities and Exposures, with your source code. My name is David Barrett, and I'm presenting on behalf of the Canvas Labs team some work we've been doing to help track down vulnerabilities, and to do something more general, which is called source-code-to-source-code matching.

I'll open today with a little bit of the motivation for how we ended up looking into this problem. With modern software, you're getting code from all over the place now. A very large percentage of the software deployed in industry these days has open source components in it; often more than half the code in a typical deployed system is open source. And given these dependencies on open source software, even though the code itself is free and you're free to use it as you see fit, it introduces some costs. Some of those costs involve tracking dependencies, especially if you're in a situation where you are not pushing code back to the upstream base: you might take a snapshot of some piece of open source into your code, and that code subsequently gets modified upstream, sometimes as a result of vulnerabilities being discovered in the system. So if you're using open source and you're not keeping the code up to date, or you're trying to keep it up to date, you'll be pulling code from all kinds of different repositories, and unless you have very good engineering processes to track the dependencies,
it can become a challenge to find out about these vulnerabilities and keep up with how to fix them. So that's what we're trying to address. We're also trying to deal with situations where you have legacy systems, embedded systems are a good example, systems whose software exists over decade-long time horizons. Engineers come and go, and the new engineers aren't keeping up with what's in the old code base, so you're trying to figure out where the heck this code came from and where it is vulnerable.

At Canvas Labs we looked at this problem and came up with kind of an interesting solution. The idea is that you'd like to take the vulnerabilities that are known about in the open source repositories and communities, and find a way to create a database that contains an archive of that open source software. This database can be maintained in real time, updated accordingly, and keeps track of all the open source software your code could have come from. It forms the universe of software your code may have come from, and you can match your code against that software base, finding matches based on the code itself.

All right, so one part of the problem is that you create an open source database containing, essentially, an index of all the code that exists in the open source repositories your code might be coming from. The other part is actually a little bit counterintuitive. It turns out there's a vulnerability database, the NVD, the National Vulnerability Database, and in that database they include a description of where the vulnerabilities are, and in that description they include package names and the versions
of the affected, vulnerable code. Surprisingly, it turns out those descriptions are a bit difficult to map to the actual code that contains the vulnerabilities. Oftentimes you'll have human beings go through: they read the vulnerability database, they read the description, and they have to find the code that corresponds to the exact version numbers and the exact packages, and the locations of that code, so they can find the fixes for it. With humans this is fairly easy, because if the description is good enough for a human to understand, you can just go find it. This all works great until you have so many vulnerabilities, changing so rapidly, that you can't possibly keep track of them as a human being, much less with an entire team of engineers trying to do that.

So what we at Canvas Labs asked was: can we automate this process of connecting your code with the vulnerabilities that are already known? There are two pieces to that. One is to find the versions that are affected in the National Vulnerability Database. Then, secondly, you take those versions and package names and connect them to the corresponding code, and then connect that corresponding code to wherever your code matches. And you want to be able to do that even if you've made changes to the code. For example, if you cut and paste some code from Stack Overflow into your system and then add some lines or make some modifications, it is free software after all, so you're free to make modifications. Why should it be that if you make those modifications, you can no longer detect the code matches and where it came from?
So that's what we were trying to do, and we built a system to do it. What I'm going to do now is describe the first piece of this problem, which is finding the package names and the version numbers in the NVD from their English-language descriptions. So we're dealing with a natural language processing problem, a well-known problem in machine learning. I'm going to present some examples of why it might be a little bit challenging for machines to recognize these.

First, this slide is basically a diagram of our system, the solution I just presented in diagram form. The two pieces are the National Vulnerability Database on the bottom, and on the top is GitHub, which is where we actually got the source code. The left-hand side of the slide shows where we pre-processed this large archive of code and the vulnerabilities from the National Vulnerability Database, and put them into two databases. One is called a software library, and it contains essentially an index of all the source code that you have. It's a very unique index: it's based on the code itself, not on the metadata that's used to describe the code. So we're actually doing code-to-code matching. The vulnerability database merely contains an index of the vulnerabilities, the CVEs. Now we can essentially overlap and connect these two things together, to connect your source code on the right to the corresponding list of vulnerabilities and the fixes for those vulnerabilities.

All right, so the first part, as I described, is mapping the NVD descriptions to package names and versions, and I'll give about four or five examples to show
how progressively harder that becomes. The first thing is, we'll start with a very simple vulnerability. We have a description section, which appears at the bottom; this illustrates a particular vulnerability, labeled here with this number. This is a retype of the description that actually appeared, and what our software system doing the natural language processing is strictly in search of is the versions of the affected software and the package names that are affected, so that it can then connect this package name and these version numbers to the corresponding code in GitHub, or any source code repository for that matter; you can have a proprietary repository of your own code and subsystems as well. The blue here indicates the interesting language that's relevant for the purposes of the problem.

What we did was take a machine learning engine, tokenize the text, use that as a feature space for the engine, and use it to label each token, the words on the left-hand side in blue, with its corresponding meaning. So in red you'll see a product name, abbreviated PN, and that's Bugzilla; we tag Bugzilla as a product name that we're going to use. Then the version number actually appears in a kind of subtle way here: we have a version-range end, and then we have the "and equals". We're basically going to construct a mathematical query that we can then use to find the other affected versions. The result of this, in schematic form, is that we've converted the English-language text to a machine-readable expression at the top. The colors here correspond to the types of the tokens that were used: the package name appears in blue, the ranges appear in green. And the corresponding schematic diagram shows the package name at the arrow and
the affected versions in red, and the other versions that are not relevant for the purposes of the vulnerability above.

Then we have another example, another vulnerability and its description. All right, this gets a little bit harder: now the tokens are separated across a sea of English text, with a kind of chasm between them. We go ahead and label the tokens: we have a package name, a range, and a version range. Notice that the 3.0.x corresponds to a range in itself, even though it's just a single token. And then we have an "and earlier", and the resulting diagram shows the affected versions. Now notice that here we actually had to do an inference: because of the "and earlier" clause, we had to infer that there must be some beginning number for the range, so we've assumed a 1.0 there.

In the next example we have a slightly different situation. You'll notice everything in the text is blue, and we do the tags. Does anybody notice anything unusual about this particular example, as far as version numbers go? There are actually no version numbers. The machine is not an oracle; all it can do is take the information that's given to it. So we just infer what the information implies, and show that all we have here is a set of product names, and we have to assume that all those products are affected by that text. I think, hopefully, that was the intent of the description.

The last example I have is a rather gnarly one. For a human being trying to read this, well, I wouldn't want to be the poor guy who has to go through hundreds of these and figure out, with my job depending on it, whether the code has this vulnerability or not, right?
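The label-then-infer steps from the last two examples can be sketched as follows. This is a minimal, rule-based stand-in for the BERT-based tagger mentioned later in the talk, not the actual implementation: the tag names, the product list, the example description, and the 1.0 floor assumed for an open-ended "and earlier" range are all illustrative.

```python
import re

# Toy stand-in for the NVD-description tagger: label tokens as
# product names (PN) or version tokens, then turn them into a
# concrete version interval. All names here are illustrative.
KNOWN_PRODUCTS = {"bugzilla", "examplelib"}
VERSION = re.compile(r"^\d+(\.\d+)*(\.x)?$")
ASSUMED_FLOOR = (1, 0)   # the inferred range start from the talk

def label_tokens(description):
    """Return (token, tag) pairs: PN, VERSION, or O (other)."""
    out = []
    for tok in description.replace(",", " ").split():
        if tok.lower() in KNOWN_PRODUCTS:
            out.append((tok, "PN"))
        elif VERSION.match(tok):
            out.append((tok, "VERSION"))
        else:
            out.append((tok, "O"))
    return out

def parse_version(v):
    """'3.0.x' -> (3, 0): numeric parts only, wildcard dropped."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def affected_range(labels, and_earlier):
    """Build the (start, end) interval the description implies."""
    end = max(parse_version(t) for t, tag in labels if tag == "VERSION")
    start = ASSUMED_FLOOR if and_earlier else end
    return start, end

def is_affected(candidate, start, end):
    v = parse_version(candidate)
    # Compare only as many components as the range end specifies,
    # so 3.0.5 still falls inside a "3.0.x" range.
    return start <= v and v[:len(end)] <= end

labels = label_tokens("ExampleLib 3.0.x and earlier allows remote attackers ...")
start, end = affected_range(labels, and_earlier=True)
print(is_affected("2.4.1", start, end), is_affected("3.1.0", start, end))
```

Running this prints `True False`: 2.4.1 falls inside the inferred [1.0, 3.0.x] interval, while 3.1.0 falls outside it.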
Fortunately the machine comes along and helps me do that. It goes ahead and extracts the versions and the product names and all that, and thankfully we end up with this simple expression that allows us to find the affected versions. And you'll notice, I don't know how they found all these vulnerabilities; I'm actually a little suspicious this may not even be all of them. I'd be kind of curious: does anybody know about this vulnerability? Does it look familiar? Has anybody in the room seen that one before? I don't know. We pulled these examples out because they were illustrative of the complexity of the technology that's actually needed to parse this; we were trying to pick some representative examples.

All right, the last one here is one we couldn't make heads or tails out of. Is anybody able to make any sense of this? I was completely dumbfounded as to how a machine, or a human, could actually do it. I'd be very curious. The short answer to all this is that it would be really nice if, when these CVEs are created, people could write language descriptions that are easier to understand than this. We would encourage that to happen, and we're hopeful it can be done. In the absence of that, this one is a vulnerability that we probably wouldn't be able to detect, so that illustrates the bounds of the technology.

So now we've created a database of these vulnerabilities, and from that database we have a list of package names and all their versions. We can go to GitHub, get all the code that corresponds to them, and create a big code-base index. Now we're going to talk about how we do the code-to-code match, from the vulnerable code that we've identified to your code. We start with the vulnerability and its fix. So here's a
typical example of a vulnerability: line 90, in red, shows an example of a fairly well-known vulnerability. Does anybody recognize that one offhand? All right, we'll talk about that in a minute. This red line was the vulnerability in place, and it turns out that in the guts of GitHub, somewhere in the commit history, there was eventually a fix that was tagged with the CVE ID. That allows us to backpropagate from the fixes to the code itself, and not only show you the code that was vulnerable, but also show you its fix once we find it. In this example the package was Linux, we've identified a specific file, verifier.c, and the function involved was check_alu_op.

All right, so here's an example of that code that was actually vulnerable. I've retranscribed it onto a white background, and what I've done is color the parts of the source code with information that indicates their role in the source code, what their purpose actually is. What we're going to do is take this source code and, instead of just assuming it's a bag of words, like you might traditionally do with a natural-language-processing machine learning engine, we're going to use the structure of the code as a compiler would see it. We use that to generate an abstract syntax tree that represents the code in an abstract way, one that allows a search to find that code even if you make changes to it, such as inserting lines or changing variable names. So you still have the freedom to do what you do with source code, make changes, and we can still detect the match.

Now, if you watch carefully as we make this next transition, you're going to see this code get replaced by an abstraction of the code. I've color-coded the colors here to correspond to
how the symbols that appeared in the original code get replaced by a token type that represents what that code would look like to a compiler. So in particular, the name of the function was check_alu_op, and, let me go back and look quickly, mark_reg_known was the function being called; we're going to abstract that away as a function call, and we do the same thing for the types, the local variables, and the parameters of the function. The end result is that we replace all the text that's specific to this exact instance of the code with a representation that we can then match other code against.

So for example, in your version of the code you might have changed the variable name insn to your own variable name, for whatever purposes make the engineering work easier for you and your team to understand. Our system will still recognize that this code matches the previous code, because of the abstractions I just described. And the end result is that we can show you the vulnerability, which is still marked in red, and then we can actually show you how to fix it, by tying your code back to the database that we have. So we used this technology.
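The abstraction idea just described can be sketched in a few lines. The talk's example is C code from the Linux kernel; this sketch uses Python's own `ast` module instead, purely to show the principle: every name is replaced by a role token, so two functions that differ only in naming produce the same abstract representation. It is not the actual (C-based) implementation.

```python
import ast

def fingerprint(source: str) -> str:
    """Abstract a function's AST: names become role tokens, so
    renaming variables, parameters, or the function itself leaves
    the fingerprint unchanged. A sketch of the talk's idea, not
    the real system."""
    tree = ast.parse(source)

    class Abstract(ast.NodeTransformer):
        def visit_FunctionDef(self, node):
            node.name = "FUNC"       # function name -> role token
            self.generic_visit(node)
            return node

        def visit_Name(self, node):
            node.id = "VAR"          # local variables -> role token
            return node

        def visit_arg(self, node):
            node.arg = "PARAM"       # parameters -> role token
            return node

    return ast.dump(Abstract().visit(tree))

a = fingerprint("def check_alu_op(insn):\n    x = insn + 1\n    return x\n")
b = fingerprint("def my_check(op):\n    tmp = op + 1\n    return tmp\n")
print(a == b)   # True: same structure, different names
```

A renamed copy collides with the original on purpose, while a structurally different function does not; that is exactly the freedom-to-modify matching the talk describes.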
We built an actual system to do it. We have an API, we can integrate it, and we actually run this system; we have a demo that we showcase at our booth, and we'll be happy to give it to you. I'm going to show you a little snapshot of what that demo looks like.

The way our demo is structured, you can take your own source code, from anywhere you got it, cut and paste it into this box in the demo, and run it with a process command. Our system then finds everywhere this code appears in the repository we've indexed in the database. For example, we've indexed all of Android and the Linux kernel, and we've indexed quite a bit of Java code as well, so now we can find everywhere in that code base that this piece of code appears.

It turns out we matched that code against 29 instances of code in our database repository. Of those 29, three also appeared in the NVD, and we've presented here each of the three matches that appeared in the big National Vulnerability Database. Two of them were part of OpenSSL and one was part of Android; these are kind of important subsystems that appear in the repository. We see that the file name in all of them is t1_lib.c, and the affected functions were three different functions: decrypt ticket, process ticket, and process heartbeat.

All right, so now I'm going to show you the one in the lower right-hand corner. We can drill down, click the Show button in the lower right-hand corner, and actually see the vulnerability itself and its fix: the vulnerability in red on the left-hand side and the fix for it on the right-hand side.
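The demo's lookup step can be sketched as a content-keyed index. This is a toy: the real system keys on the AST abstraction described earlier, whereas here a crude regex that collapses every identifier stands in for it, and the one indexed entry merely echoes the t1_lib.c example from the slide; it is not real repository data.

```python
import hashlib
import re

# Toy sketch of the demo's lookup: pre-processing indexes every
# function under a content-based key; a pasted snippet is matched
# by computing the same key. The identifier-collapsing regex is a
# crude stand-in for the AST abstraction, and the indexed entry
# below is illustrative, not actual repository data.

def key(source: str) -> str:
    # Replace every identifier with a placeholder, then hash, so
    # renamed copies of a function collide on purpose.
    normalized = re.sub(r"\b[A-Za-z_]\w*\b", "ID", source)
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode()).hexdigest()

index = {}  # key -> list of (package, file, function)

def index_function(package, path, name, source):
    index.setdefault(key(source), []).append((package, path, name))

# Pre-processing pass over the "repository":
index_function("openssl", "ssl/t1_lib.c", "process_heartbeat",
               "memcpy(bp, pl, payload);")

# A renamed copy of the same line still matches:
hits = index.get(key("memcpy(dst, data, n);"), [])
print(hits)
```

The build side is the slow, run-once part; the lookup itself is a single hash and dictionary probe, which is why the demo answers in well under a second.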
And this is where the first piece, identifying the versions and the package names, comes in: we can also identify exactly which specific versions in the repository were affected by this vulnerability, and then we can show you which line numbers matched and which version numbers matched. This turns out to be fairly important, because one of the things we want to do is make the inferences we draw explainable. We found that no automated system is perfect at representing things and finding stuff; you need a way for a human being to go back and audit these results and see for themselves whether each one is actually valid, before making a definitive decision about whether a match is something to worry about in their own code base. So we wanted to make sure we provided enough information that human beings can make the ultimate decisions about how trustworthy and reliable the system actually is.

So, this is an open source summit, and we've actually created a GitHub repository for some of the code that does some of these features; we've open-sourced pieces of this. We're using a machine learning model called BERT to do the natural language recognition for the version numbers and things like that, and we decided to release that, to make it available to help solve this versions problem in the community. We have several different subsystems, and we invite you to come and take a look at them.

So that concludes the talk. Does anybody have any questions, or would like to ask me about the subsystem? Could you take the mic, Peter, maybe? This talk is being recorded, so if anybody wants to see it later, they can.

Q: Is what you showed us
completely able to be run based on what you have in open source? Or is that just a part of the system that you demoed, and it's not runnable standalone?

Yeah, the entire system is not open source. We have portions of the system that we're open-sourcing, and we're planning to open-source more of it than we have so far. The technology I'm describing for the version-number enumeration and those things, we're planning to make that open source, as well as a substantial fraction of the database, so that people can use it and deploy it themselves.

Q: And which part, what would be the limitation without the proprietary components? What would someone be able to do without the proprietary components?

Peter, would you like to take that question? Peter is our CEO; he founded the company and implemented the code that does all this.

[Peter] Sure. The question was which parts are open source. The deep scanning, the source-code-matching part, is proprietary. The dependency checking, and the parsing that finds the versions and so on,
even how we train the model, and the data to train that model, are open-sourced.

I'm running the mic, so please, the one in the back.

Q: One of the problems we've seen trying to do CVE matching within our product is false positives in the database, where it says, for example, versions 1.0 and later are vulnerable, and then no one ever goes back and updates it when the patch made it in. I'm curious if you've done anything to try to help deal with false positives like that, identifying code as vulnerable when it really isn't.

We have not, as far as I know. We basically don't have anything in our system to infer what the writer of the CVE intended, or changes to the CVE that have not been documented in some sort of repository somewhere. To the extent that someone has documented that history, or improvements to the CVE, we could certainly incorporate it into our system. Right now I think we're using the NVD, but our system is agnostic as to where it gets that data. So for example, if your company had knowledge discovered from your own reviews, you could certainly incorporate that into the system very easily.

Q: Thanks. Piggybacking on the previous question: one of the common concerns about CVEs in general is the amount of false positives. What's the impact of that on the success of this? In other words, if the upstream database is not accurate, or is a lagging indicator, is this making it more accurate? Do you have a sense of what
the benefit still is, given that the upstream problems are, you know, not in your control?

Yeah, I can actually speak to a fairly non-obvious aspect of that. When I was working at my previous employer, I was a strong advocate of integrating machine learning and discovery technology in with the human side of the organization, the engineers involved, the developers, the legal teams, whoever that is. What we found in practice, when we implemented a prototype that tightly integrated the results from your automated subsystem into a very easy-to-use GUI, is that it allowed the reviewers to effortlessly make their knowledge available to the machine in the future. So for any organization that's fighting that battle, I would strongly suspect it would help to implement, in their review system, a GUI that lets a human reviewer take a result and say "yes, this is a false positive", or add false negatives, though usually it happens with false positives. It's very easy for humans to say "this is wrong", and if you have a button that says "this is wrong" and you feed that back into a database, we found it converges extremely rapidly, usually within two or three iterations with the machine learning algorithms, to find and substantially reduce the false positive rates.

[Peter] I'm going to make another comment on top of that. We find that these tools, in order to be really effective,
well, this is more of a vision, but my experience from that perspective is that AI should be used in a way that really helps you, not bothers you. And in order to do that, the experience of using it should include how it evolves over time, rather than just marking "this is wrong" and discarding it. There has to be some type of feedback loop, kind of like semi-supervised learning, I'm being a little technical here, that builds on top of the knowledge and actually updates as you collect more data. That's the way we are approaching this problem.

Yeah, we've actually deployed systems that have that tight user interface integration in them, and what we found is that once people use it, they never want to go back. The other thing we found is that people often feel like machine learning and automation are being used to replace human talent. That's not at all the case. What it's being used for is to replace the tedium, the things a human finds tedious and error-prone, by letting the machine take all of that over, so the humans can focus on what they're good at. And once you allow people to focus like that, we found that these false positive rates tend to fall very dramatically, very quickly.

I can give you an example. If we have this interface and we've found a positive or something like that, all it involves is simply adding a button so the human reviewer can say "this is a true positive" or "this is a false positive". If you just have a simple button that says that... it's not there yet, and it needs to be in every tool that allows you to look at databases and review results from database queries. So I'm a strong advocate of making all of your database query engines interactive in that sense, and
it's a very trivial change. It's so invisible that humans don't even notice it, but once they see it... We've actually added a single button, "this is right" or "this is wrong", and then the second button we add is "rerun the machine scanning engine and update the database with the knowledge I've just given you with the last five mouse clicks", right?

[Peter] Yeah, so what we did in practice was use the model to label things first, and then that helps us go to the next round. That's been the semi-supervised learning method.

So in our case, we actually had one of our interns, who's here in the front row, laboriously label thousands of different examples of these things, and he actually provided the examples for this talk. He had to do it as big batch-job sessions, for hours, right? If you instead incorporate that into the GUI that's displaying the results, and just allow the people using your system to label things for you as they go, they don't even notice the work. And they're highly motivated to actually click the button and train the machine, because then they don't have to keep seeing the bad results. So their tedium goes down every time they click a mouse button. It's a very powerful concept.

Q: In terms of the fix, actually: once you show the vulnerability and its fix, is there an easy way to apply that to your source code?

Excellent question. We actually had a legal subsystem in place at my previous employer, and it turned out they had done just that: they had the machine automatically make the updates to their database based on what the machine inferred. What we found in practice is that, yes, that works very well at the beginning. But as you get more and more human reviews, and you have machine reviews mixed in with human reviews, if you haven't labeled whether the human or the machine
did it, very quickly you discover that your data becomes very unreliable. So I'm actually a strong proponent of not having total automation insert these changes. What we found works much better in practice is to group the results you get, these vulnerabilities and the fixes, in a way that allows you to put them in folders, so to speak, grouping the same types of vulnerabilities with the same fixes. You put a human reviewer on there, and they say "yes, this is one I want to make the update for", and all they have is a button that says "make it so", and that automatically incorporates the revision. Then it presents them a menu of the other ones it thinks are the same; the human clicks a few of those and says "yes, make it so". Now, if you make the user interface very clean and easy, where you can select and say "make all of these so", kind of like sorting mail into your mail folders, and you let the humans make the authoritative decision about whether it's accurate or not, that allows you to have an audit trail. You can timestamp it, you can say who made the decisions, and over the years what you find is that the system becomes much more maintainable and robust, because you know every decision that was made, how it was made, and how it got there. So I'm not an advocate of complete automation of that; I think you really do have to leave the human being in the loop. There's a chain of responsibility that I think needs to be there.

Q: I was just curious if you could speak to the performance of the system, and the hardware requirements to run this for a given-size project.

[Peter] So, obviously the AST building will take time, and then we catalog every version of the... are you talking about user input, when you look up, or when we build the database?
Q: Sorry about that. A customer or user wants to scan their source code; how long does that take?

[Peter] On hardware requirements, we use very old machines, like a few-years-old server machines, and because of the way the data is structured, lookup is very fast; it just takes time to do the build. An input file needs to go through a series of steps, the function-signature extraction, which means, you know, there's a lexer and a parser and so on. So it's very fast, and looking up against the data is very fast, and the problem can be parallelized. Overall, in our demo, which contains all of the Linux kernel and many of the very popular open source C projects, it takes definitely less than a second to do one file lookup, the one with the 29 functions in the example that you saw. And it's running on, like, a three-year-old machine that's quite small.

To answer the question a bit more broadly: the stuff on the left-hand side, those six boxes on the left, is a pre-processing step that's done once on a very large code base. Peter says that process takes a while, but you pre-compute it once, and then the part that runs against your code runs incredibly fast. Our system is designed to run against large code bases; this demo where you just cut and paste one function is just for the demo. We've designed the system to run at scale, so if you have your own Android instance, for example, something of that size, you can run that comparison. Have you run a test on how long something that size would take, Peter?
[Peter] I don't have that number in my head, but we did index all of Android and then looked up against it. The indexing of the whole thing takes a while, though not terribly long, and looking up was trivial. I think it's a function of the horsepower you throw at it. Does that answer your question?

You don't need, like, 10,000 Amazon instances to run the system. We're running this instance on less than a hundred cores to do all this stuff, and we actually do the demo on a laptop.

Q: So for that "your source code" box, do you point your entire repository at it? For example, when you say Android, if I'm using...

Yeah, whatever source code you want to scan. What we typically do is we have plug-ins that integrate with your build process, and you just integrate that with your continuous integration process, and the scan takes place over whatever subset of the code you want, whether it's proprietary, open source, or wherever it came from.

Any further questions? Well, thank you very much. I'm really glad you could come today. Thank you.