Hey everyone, this is VulnerableCode. VulnerableCode exists because we don't believe a vulnerability database should be just about vulnerabilities. First of all, I am your co-host, Ritik. I'm very much interested in Linux and information security, as well as in the project VulnerableCode. I have been working on VulnerableCode for over a year now, first via Google Summer of Code and now independently. These are my details; you can contact me on Twitter or by email.

Hey everyone, my name is Tushar Goyal. I am a co-maintainer of VulnerableCode, FetchCode, univers and Package URL, and I'm currently working as a software engineer at NXP. I have been mentoring at Google Summer of Code and was also a previous participant there. You can contact me via the email ID you can see right here.

So let's go over the agenda for today's talk. We'll talk about the state of vulnerability databases, including the open source ones and the closed source ones: how do they search, and how do we search? Well, we don't believe a vulnerability database should be just about vulnerabilities, so there is a hint right there for you. We'll talk about a better approach, talk about VulnerableCode and how we developed it, the ideas behind it, and our future plans for the project.

Let's begin with the state of vulnerability databases. During our research we have seen a lot of databases with advisories for packages that do not exist. The package just doesn't exist, yet there are advisories for it: we are told that package XYZ is affected by vulnerability ABC, but there is no reference to be found for that package XYZ. Sometimes the package XYZ does exist, but the supposedly vulnerable version does not exist at all.

We also have a problem of crying wolf. A package will be treated as vulnerable even if its only vulnerability lies inside one of its dependencies. A vulnerability scanner should not flag that entire package as a vulnerable entity; it needs to pinpoint which particular dependency is vulnerable. That is important because package developers want to improve their package, and that can only be done if you can detect vulnerabilities in the dependencies themselves.

Some version ranges do not even agree with themselves: sometimes an advisory says the affected version is greater than two and less than one. That does not make sense; no version can satisfy such a range. There is also a lot of noise in advisories. For example, you will find an advisory that says every version after 1.3 is vulnerable, but that is not really the case. What happened is that the vulnerability was identified at the time the advisory was published, so it must have been fixed in some version after that, but sometimes we are left with open-ended vulnerable ranges, and that does not help at all and creates a lot of noise. There are also cases where we have a vulnerability on record, but it is not at all easy to find out which exact package is vulnerable to that particular vulnerability or bug.
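To make that open-ended range problem concrete, here is a small Python sketch with made-up versions and release dates: a naive reading of "every version from 1.3 on is vulnerable" keeps flagging versions released long after the advisory, while a time-aware check only claims the versions that already existed when the advisory was published. The data and the simplistic version comparison are purely illustrative.

```python
from datetime import date

# Made-up advisory: "every version from 1.3 on is vulnerable", published mid-2019.
advisory = {"affected": ">= 1.3", "published": date(2019, 6, 1)}

# Made-up release history for the package (version -> release date).
releases = {
    "1.2": date(2018, 11, 2),
    "1.3": date(2019, 3, 15),
    "1.4": date(2019, 9, 30),  # released after the advisory, most likely carries the fix
}

def naive_is_vulnerable(version: str) -> bool:
    # Naive reading of ">= 1.3": any later version matches, forever.
    # (Simplistic dotted-number comparison, good enough for this sketch.)
    return tuple(int(p) for p in version.split(".")) >= (1, 3)

def time_aware_is_vulnerable(version: str) -> bool:
    # Only claim versions that already existed when the advisory was published.
    return naive_is_vulnerable(version) and releases[version] <= advisory["published"]

for version in releases:
    print(version, naive_is_vulnerable(version), time_aware_is_vulnerable(version))
# 1.4 is flagged by the naive check but not by the time-aware one.
```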
This brings us to a very interesting problem; they call it the telephone game problem. It is a game played in various parts of the world where you start with one person who hears something, conveys whatever they heard to the next person, and so on and so forth. A similar thing happens with vulnerability data. There is a lot of reliance on automated tools, we have a lot of bad data, and everyone at every step makes something up with the data they have. So database A bases its content on database B, and if the initial database was not correct to begin with, all the databases get corrupted down the line. Even further, they sometimes alter the data: they convert version ranges from actual numbers and operators into English sentences, which sometimes just does not make sense or is not even possible to interpret. You end up with as many version ranges as there are database interpretations, and this spoils the entire data. Over time this turns into a telephone game where, at the end, the message the upstream wanted to convey has totally changed: you don't get the exact version ranges, and you are unable to pinpoint which packages and which versions are affected by a given vulnerability.

All in all, what we find is that upstream has the better data. If we want to trust anyone, we have to trust the upstream that published the vulnerability, and that's how we tackle this problem. This picture can be interpreted in many ways, but the idea is that the best data is upstream: we have to tackle whatever problems come along our way, but we have to reach that sweet upstream data that lies up there.

Moving on, we have databases which are proprietary, and we have no idea what secret sauce they use to get their vulnerability data. It is very painful not to know how the data you are consuming came about. If it's FOSS code, the vulnerability data should be open as well. Everyone should know how the vulnerability data was aggregated, everyone should know the steps involved, and there should not be any secret sauce. It should be as transparent as possible.

And we do have something great here: GHSA, OSV and GitLab are all publishing open vulnerability data, and that is a giant step towards making vulnerability data open. Package URL, which we have talked about in a different presentation on this channel, has been getting traction, and it gives us one consistent identifier for the affected, vulnerable or fixed packages. Package URL is also used by OSV and Sonatype OSS Index, and we are getting common formats for interoperability.

So how do we search? We are package first. We have a package foo at version one, and we want to know what vulnerabilities are associated with that package. So we check: what are the vulnerabilities, what severities are attached to them, and which version has a fix for each vulnerability? It is a very rare case that you start from a vulnerability and want to find which packages are associated with it. So the better approach is package first: you don't look up vulnerabilities to find packages, you start from packages and look up their vulnerabilities.
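As a concrete example of a package-first lookup by purl, here is a small Python sketch using the packageurl-python library. The API base URL, query parameter and response field names are assumptions for illustration only; check the VulnerableCode API documentation for the real ones.

```python
# Requires: pip install packageurl-python requests
from packageurl import PackageURL
import requests

# Package first: start from the package you actually use, expressed as a purl.
purl = PackageURL(type="pypi", name="django", version="3.2.1")
print(purl.to_string())  # pkg:pypi/django@3.2.1

# Illustrative lookup against a VulnerableCode-style API; the base URL and
# the "purl" query parameter are assumptions made for this sketch.
API = "https://public.vulnerablecode.io/api/packages"
response = requests.get(API, params={"purl": purl.to_string()}, timeout=30)
response.raise_for_status()

for package in response.json().get("results", []):
    # Assumed response shape: each package lists the vulnerabilities that
    # affect it and the ones it fixes.
    print(
        package.get("purl"),
        "affected by:", len(package.get("affected_by_vulnerabilities", [])),
        "fixes:", len(package.get("fixing_vulnerabilities", [])),
    )
```

A lookup like this keeps the package as the starting point: you never need to know a CVE ID in advance to learn whether the package you ship is affected.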
So why VulnerableCode? We want to be accurate and correct. Vulnerabilities are important, very important, but code is more important, and you need to be package first. There is currently no free software vulnerability database that is, number one, open; most of the solutions we find are proprietary or closed source. Number two, it should be comprehensive, covering most ecosystems. It should be curated by expert humans, it should be validated, and most importantly, it should be working towards correctness.

So, the VulnerableCode solution. We leverage all the tools that report Package URLs, and we also support CPEs. The tools we have currently are ScanCode.io, ScanCode Toolkit, ORT, Tern and a bunch of others, as you can read on the screen. So you can query by purl; as Ritik and Philippe have talked about purls, I think you have plenty of context on what package URLs are and how you can reuse them. We support queries by purl, and open data and open source tools are always better: you know what is going on behind them, and you are not left wondering about the consequences of the closed source ones. And eventually we will have expert review and curation of the data.

We talked about a solution, VulnerableCode, which in our opinion is a robust one. We talk about packages, not vulnerabilities first, and that serves very well. But how do we create such a database? We use data directly from upstream, the source that provides us with the highest-confidence data, which is our point of truth. We employ a confidence-based system; that means we don't trust all data equally. If the vulnerability data comes from the maintainers, we can trust it almost blindly, because it is their own project and they know it best. But if it comes from a third party, there is a problem: we cannot give it 100% confidence, there could be discrepancies. What we can do is aggregate and correlate as many data sources as possible, and then process them so that the confidence of a particular package-vulnerability relationship is as high as it can be. Then we mine the graph and come up with new relations between packages and vulnerabilities, relations that were not at all apparent in the first place. This eventually helps us reach correctness with very high confidence, together with a review system.

So how do we aggregate and correlate many data sources? Of course, we collect as many sources as possible. That includes OSV, GitHub, GitLab and all the open sources out there. We have a common data model in which we can cross-reference and create graphs. We have trackers that are specific to certain projects, like the Apache and OpenSSL trackers and changelogs. We can even track distributions, such as Debian, Ubuntu, et cetera. We can have application package trackers, and so on and so forth. So we get data in one common format from all of these sources, and then we run a multi-level refinement.

First we import, which puts the data into the advisory staging area. That is just an area, that's what we call it inside our code base, which consists only of the upstream data. Now, advisory data can be true or false, low confidence, or there may be no usable data at all. We don't worry too much about the staging area, but there is some structured data in there. When we move to the next step, the improve step, we take all the advisory data and convert it into a relational database of vulnerabilities, packages, the relationships between them, the confidence, et cetera. We keep the original advisory data, the raw data, alongside the relationships, so that we can always get back to the root of every relationship. At the same time, we have very specific improvers that can cross-check data sources, invalidate data sources, assign confidences to different types of relationships, and even update version ranges if need be.
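Here is a minimal sketch of that import-then-improve flow. The class names, sources and confidence numbers are made up for illustration and are not the actual VulnerableCode models; the point is just that the raw advisory is kept verbatim while the improve step derives confidence-weighted package-vulnerability relationships from it.

```python
from dataclasses import dataclass, field

@dataclass
class Advisory:
    # Raw upstream data, kept verbatim in the "staging area".
    source: str                    # e.g. a maintainer tracker, "osv", "ghsa"
    aliases: list[str]             # e.g. ["CVE-2021-1234"]
    affected_purls: list[str]
    fixed_purls: list[str]
    raw: dict = field(default_factory=dict)  # original payload, never discarded

@dataclass
class Relationship:
    # The improved, relational form: one package <-> vulnerability link.
    purl: str
    vulnerability: str
    kind: str           # "affected_by" or "fixed_by"
    confidence: int     # 0-100, depends on who said it
    advisory: Advisory  # back-pointer to the raw data it came from

# Assumed per-source confidence: upstream maintainers rank highest.
SOURCE_CONFIDENCE = {"maintainer": 100, "osv": 80, "third-party-blog": 40}

def improve(advisories: list[Advisory]) -> list[Relationship]:
    relationships = []
    for adv in advisories:
        confidence = SOURCE_CONFIDENCE.get(adv.source, 20)
        vuln = adv.aliases[0] if adv.aliases else "unknown"
        for purl in adv.affected_purls:
            relationships.append(Relationship(purl, vuln, "affected_by", confidence, adv))
        for purl in adv.fixed_purls:
            relationships.append(Relationship(purl, vuln, "fixed_by", confidence, adv))
    return relationships
```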
By this multi-level refinement, we can simply grab all the sources out there, convert them to a very simple common format, and then improve upon them. Slowly and steadily, we get a final output that carries a confidence and gives us a very concrete vulnerability-package relationship.

As Ritik presented the current state of open source vulnerability databases, we can now talk about the issues. First, non-existent packages: some packages do not exist anywhere. And when I say packages, the package may exist, but some versions do not exist anywhere. What's the solution for that? We look up upstream registries and repositories and check whether the package URL and the versions are correct and really do exist upstream.

Next, data quality. Some vulnerability sources cannot be trusted; as you have seen, a package or version may not exist but is still reported by some of the sources, so you can't trust them all. What do we do here? We assign confidence levels. Confidence levels ensure that we keep all the data, but we can mark it with a lower confidence level if the data quality is not so good. So we do not blindly trust others: we discount those data sources. But more importantly, we do not fully trust ourselves either: we may discount our own automated inferences when we are not sure about them.

Now, incorrect and missing versions. As was pointed out earlier when talking about open source vulnerability databases, sometimes it's reported that every version of a package above 1.0 is vulnerable, but that was true only under the conditions at publication time: all the versions above 1.0 that existed then were vulnerable, but more versions get published after that, and it can't be inferred that those are also vulnerable. The solution is to store the original range, resolve the range, and time travel. Yes, time travel. Let me explain all of these. We store the original ranges, however complex, as intervals, we resolve those ranges using our library, univers, and the improvers can time travel: we can check whether a package version was within the vulnerable range at the time the advisory was published.

The next issue is duplicated data. Because we are aggregating multiple data sources, and because duplicates exist in the wild, a lot of vulnerabilities can be duplicated, and that leads to a lot of noise, which is the worst enemy a vulnerability database can have. As noise increases, you lose trust in the database, you ignore the data it provides, and you just carry on with something else. What we can do is introduce aliases. For every vulnerability there is a set of identifiers that are common across all of its copies; for example, that could be the CVE ID. We are also proposing our own ID, the VulnerableCode ID. It can also be ecosystem specific: different ecosystems have different types of IDs, but together they form a set of aliases that we can use to combine the data. After combining all of the data, we create a VulnerableCode-ID-to-alias relationship. A vulnerability may have many aliases, but it has exactly one VulnerableCode ID, which also helps in finding the relationship between one alias and another. So if you give me CVE-1234, I can reference it through the VulnerableCode ID and find a different alias, or maybe a different advisory, associated with the same vulnerability. It won't be duplicated, and we will get all the data in one place without any repetition.
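As a small illustration of the alias idea, here is a sketch that groups advisories sharing any alias under one VulnerableCode-style ID. The advisories, the alias strings and the "VCID-…" format are made up for this sketch; a real implementation would also merge groups when a later advisory bridges two previously separate IDs.

```python
from itertools import count

# Made-up advisories from different sources; some describe the same vulnerability.
advisories = [
    {"source": "nvd",    "aliases": {"CVE-2021-1234"}},
    {"source": "github", "aliases": {"CVE-2021-1234", "GHSA-xxxx-yyyy-zzzz"}},
    {"source": "pypa",   "aliases": {"GHSA-xxxx-yyyy-zzzz", "PYSEC-2021-99"}},
    {"source": "nvd",    "aliases": {"CVE-2021-9999"}},  # an unrelated vulnerability
]

next_id = count(1)
alias_to_vcid: dict[str, str] = {}
groups: dict[str, set[str]] = {}

for adv in advisories:
    # Reuse an existing ID if any alias was already seen, otherwise mint a new one.
    known = [alias_to_vcid[a] for a in adv["aliases"] if a in alias_to_vcid]
    vcid = known[0] if known else f"VCID-{next(next_id):04d}"
    # (If `known` held more than one distinct ID, a real system would merge them.)
    groups.setdefault(vcid, set()).update(adv["aliases"])
    for alias in adv["aliases"]:
        alias_to_vcid[alias] = vcid

for vcid, aliases in groups.items():
    print(vcid, sorted(aliases))
# The first three advisories collapse into one ID; the last one gets its own.
```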
In the end, we reconcile everything with the help of the improvers discussed in the earlier steps, where we improve upon our collected data and merge all the duplicated data into one single relationship.

These are not the only issues we face; there are others. All the data sources we encounter are somewhat unstructured, messy and sometimes even incomplete. There has been a lot of effort on this front, and a lot of organizations have started publishing very structured and very clean vulnerability advisories, but that is not the case everywhere. There are vulnerability advisories published simply as human-readable text: it is English, it follows English grammar, and it is completely alien to computers. There are a lot of problems here, but the solution is to integrate all of the data sources and cross-reference them. One data source can help identify how badly formed or unstructured another data source carrying the same alias is. In the deduplication example we talked about aliases, and those can even help us structure the unstructured upstream data, and that can give us a very clean vulnerability database.

Another issue is that not every piece of data is relevant. Old vulnerabilities in Windows 95 or Windows ME do not help us in any way in the current era. That could be interesting for a few people, but in general it draws little interest, and we are not very interested in commercial software anyway, because VulnerableCode is very much focused on open source data. We want to make it very transparent how we work, without a secret sauce, and we want to incorporate as many data sources as possible. We want to keep licensing in mind and do not want to jump into the realm of commercial software. That stops us from being the fully universal vulnerability database, but we can let go of some of that: let go of the past data and move forward with the future. We will have a few holes in our database, but I guess we can live with those, because in the future we need to pave the way for the new software, not the old.

Hello, this is Philippe Ombredanne here. I'm just jumping in for one last topic, which is our future plans. I'm the lead maintainer on VulnerableCode. The first thing we're looking at is adding, all the time, as many primary data sources as possible and going upstream, because that is where we find the better information which has not been reworded and transformed; remember the telephone game we talked about earlier on.

The second thing is adding the actual commits that either fix or introduce vulnerabilities. These are really useful because they help track whether the code you use actually has these vulnerabilities, including in derived packages. Say there is a vulnerability in zlib and it is present in a Linux distro or vendored in a Node package: how can you know exactly which version you have? If you know the exact commit, then you can get to the exact bit of code that effectively introduces or fixes the vulnerability, and check whether you still have that code or not.

In the same line of thought, we are planning to add YARA rules. YARA is a tool to match patterns in source code and binaries, and once we have commits, we can effectively build rules that detect the presence of a given commit in code, which would enable much finer-grained detection of actually vulnerable code.
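To give an idea of what such a rule could look like, here is a sketch using the yara-python bindings. The rule, the code pattern and the scanned snippet are entirely made up; real rules would be derived from the actual fixing or introducing commits.

```python
# Requires: pip install yara-python
import yara

RULE = r"""
rule fix_for_hypothetical_cve_not_applied
{
    strings:
        // Vulnerable pattern that the (hypothetical) fix commit replaced
        $unchecked_copy = "strcpy(dest, user_input)"
    condition:
        $unchecked_copy
}
"""

rules = yara.compile(source=RULE)

# Scan a source snippet (or a binary) for the vulnerable pattern.
vulnerable_snippet = b'void handle(char *user_input) { strcpy(dest, user_input); }'
matches = rules.match(data=vulnerable_snippet)
print([m.rule for m in matches])  # non-empty => the fix is probably not applied
```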
Beyond this, and that's really important, it's sad but we have to have a human expert review system, and we're going to build a UI for that. It's not entirely sad; it was expected. When we started this project I was expecting there would be higher data quality that we could depend on. In practice, as we said earlier on, we've observed a lot of problems and issues in data quality, and there is really no way around having a human review these vulnerabilities, how they apply, and which packages they apply to.

Last but not least: humans, but also, as much as possible, machine learning and AI. One specific application we have experimented with, and which looks promising, is spotting inconsistencies and discrepancies. Another application is doing some level of natural language understanding and parsing of security advisories. So you could take something that says "this vulnerability applies to this package from version two to version five" and transform that into an actual version range; that could be one of the applications there. And you know, machine learning is great, but good old heuristics go a very long way in many cases. As an example, Ritik spent quite a bit of time discussing with the engineers and maintainers upstream to better understand how they create and form their security advisories. These are the authors and maintainers, the folks who write the code of NGINX, so they know everything, because they are the upstream of NGINX. The rules and the approach they take to version NGINX and to express vulnerable ranges are a bit peculiar; there are a lot of things that cannot be guessed and that you cannot just infer by looking at the data. There is a bunch of heuristics we can gather this way, the good old way, and also by observing the data, finding patterns and being able to fix these, which is very efficient.

Last but not least, we have started a project that we call VulnTotal, as in vulnerability total. The goal is to eventually work with other providers of vulnerability data, willingly if possible, or, if we cannot get their agreement and the data is public, we will try to work something out, such that we can compare all the vulnerability databases. Think about the tool called VirusTotal, which originally ran virus scans on one file against many different virus scanners and has since been acquired by Google. Now think about VirusTotal for vulnerabilities; I think that can be a very powerful thing. It helps in two ways. It helps everyone, because you can then find out whether you are really vulnerable to given vulnerabilities, given a version and a package as input; that's one thing. The second thing is that it will highlight which database is the best. We hope we'll come out first, but we don't know, and in the end it doesn't matter, because one thing that is really important here, which we may not have talked much about, is that the reason we are doing all this as free and open source code and free and open data is that we think security is like oxygen: it has to be open and free. You cannot put a tax on oxygen. That's why we want to make this the very best. We may not have the very best tool and data, but at least we are contributing to making this a better and more secure place. That's it, thank you very much, and bye for now.
And thank you, Philippe, for letting everyone know about our future plans. If you are interested, I would urge you to register for our upcoming webinar on July 21st, and to read our blog posts; they will help you get a lot more context. If you want to help us, you can do so in kind or in cash. You can use our tools and let us know how we can make them better, and you can join the conversation on Gitter: we are very active on Gitter, we chat on Gitter, and you will get responses from us very quickly. And if you want to help us in cash, you can donate at the following link here. So this is the end of our presentation, and thank you, on behalf of Philippe and me, for joining us here. Yeah, signing off.