So I think we're gonna begin in a few minutes. We'll give it two or three minutes or so. Just make sure that you join the joern.io Discord server. The links for those are already on the NSEC Workshop channel and they're pinned. So if you go to the pinned messages on the NSEC Workshop channel, you're gonna see the links to the joern.io server and especially the Query Corner channel, where all the questions and answers will be handled. At the same time, we have the workshops repo. So go to the repo, you're gonna see these slides that you're seeing right now, and proceed to slides two and three, which explain how to set up your system, along with some resources that you would need and your machine requirements. So those are some things that I just wanted to say beforehand. We'll give it a couple more minutes and then we are gonna begin. Yeah, I just posted the Discord link in the Zoom chat, so you can just follow that link and it'll take you to the particular Discord channel that we're using. Suchakra, did you post a link to the CPG, the pre-generated VLC CPG? I didn't post a link. Actually, I can post it right now; people may be needing that as well. Good idea. Just give me a second. So I'm gonna post this. There's a file, it's called vlc.bin. It's a big file, and if your machines are not so powerful, I recommend that you download this file and keep it ready, because we are gonna use it later. So I think two more minutes. And I'm gonna repeat the instructions that I gave, because I see a lot of people are still joining in. The instructions are as follows: just go to the NSEC Workshop channel. There is a pinned message which explains how to join the joern.io Discord server. All the questions and answers are gonna be handled there. We have a GitHub repo attached to that, and please clone that repo. The NSEC workshop is there.
All the contents of the NSEC workshop that you would be requiring are also there. These links have also been pasted by Vickie in the Zoom chat as well. So that's the way to join the server. And if you have any questions — I don't know how many people are there, or if it's okay for everyone else — you can probably just unmute yourself, be mindful of everyone else, and ask questions. Or you can ask questions directly in the chat itself. The channel that we'll be using on the Joern server is the Query Corner channel. It's okay if you just wanna use the general channel, that will also be fine. I mean, just ask your questions, but try to stick to Joern-type queries in the Query Corner channel. I see that many people have already joined there. It's gonna be fun; I hope you enjoy this. And in the slides — when you download the slides by cloning this workshop repo here — just look at slides number two and three. They explain how to set up your system, install Joern, and then download the VLC source code, and just keep it ready. That's basically what we would need. Also, please make sure that you have at least five to seven GB of free RAM. So close as many browser tabs as you can and keep an eye out for a good amount of RAM, because when you deal with graphs — and we are gonna dig a little deeper there — you will require a lot of RAM. Okay, so Suchakra, wanna get started? Yes, let's get started. I'll also start my video if someone wants to see me. Okay. So let's begin from the first slide. I am Suchakra. I'm a staff scientist at ShiftLeft, and we have with me Vickie Li. Hey Vickie. Hello. Yes, we can hear you. Hi, okay. Hi everyone, I'm Vickie. Okay, so what we're gonna do here is we are gonna build a static code analyzer. We are gonna use a tool called Joern as a foundational tool to build that static code analyzer. And before we go all the way towards the end, we will begin with the very basics.
And the first part of the workshop will be a lot of theory. Some of you may enjoy it; some of you may just want to leisurely sit and watch it. And when we are done with the theory, it sets up a base for us to understand what we are actually doing. I hope you enjoy it. So again, just a reminder — this is the third time: prep for the workshop using slides two and three. To get these slides themselves, they are in this repo called workshops. You can go to the 2021 NSEC folder and download the whole slide deck. You would be requiring these slides to run the workshop in parallel, so I just recommend that you download them and keep them handy on the side. All the commands that we are gonna give can be copy-pasted directly from these slides; you can run them and they will work. Another thing that we would require, of course, is this tool called Joern. And this tool can be obtained like this: you simply get the install script and run it. You can use sudo, or you may just do a direct Joern install without sudo, and it will install in a local directory and you will work from there. With sudo it installs system-wide, which becomes very easy to manage. Then we are gonna use VLC 3.0.12 to go through our workshop. This is the base source code we are gonna use, and we'll find interesting things inside this source code, I believe. A recap of the machine requirements: please make sure you have five to seven GB of free RAM, and at least four CPUs are recommended. If you don't have four, it's fine — it's just gonna take more time when you work with the graph. Use OpenJDK 1.8. I believe these requirements were already sent to you yesterday, but still, if you have time, just keep working on this while this whole theory session goes on for 15 minutes or so. Some important links: the Joern docs are here, the queries are there, and the community Discord channel. Again, I see many of you have already joined.
Most of the discussions are gonna happen in the community channel regarding this workshop, and in the Query Corner channel. So this community group is here. Okay, moving on. So I'm Suchakra, I'm a staff scientist at ShiftLeft. I did my PhD at Polytechnique Montréal, and this is something about me. You can catch me on Twitter at @tuxology, and this is my email ID. And... Hi everyone, my name is Vickie and I'm the developer evangelist here at ShiftLeft. You can follow me on Twitter at @vickieli7. I was a developer — a web developer by trade — and now I'm a penetration tester and bug hunter. And it's good to be here. Yeah, it's good to be here, and we have had a relationship with NSEC for quite some time. I mean, I used to live in Montreal, so it has been a conference very close to my heart. So I'm really excited to do this. Okay, a little bit of pep talk before we begin, just to set the stage. So why are you probably here? You may have the following questions. Either you just want to go into the depths of code and understand what computer programs are and how programming languages work, or you may know some bad coding practices and you would want to mass-detect them in large code bases. So you know that you have a function that should not be called under this context, and instead of manually going through all of that, you would want to just mass-detect it in the very big code bases that you have. It's a specific ask, but it's a very repetitive ask. And then: the static analysis tools which are there in the market — how do they work? Just to understand them. And can you create your own custom tools, because they are not fulfilling what you want to do? The kind of people who would probably also be here are those who have worked somehow with interactive debuggers.
You have used GDB or rr, or used some SAST tools, or sometimes you just hunt bugs by going through GitHub or browsing code inside the IDE. So those are the questions you may be having, and we'll try to answer them as we go ahead in the workshop. So what are you going to learn today? What you'll become by the end of the day is Hacker Man. Jokes aside, you're going to gain the ability to find vulnerabilities in big code bases — that's why we have chosen VLC. We'll do some interactive code analysis and interactive exploration of code. We'll convert your manual steps — the vulnerability-hunting ideas that you have in your mind — into automated steps, and we'll do that automated analysis. We'll stop the reliance on vendor SAST tools. If you imagine that you'll press a button and get thousands of results and have to go through them and they're all true positives — that's not how it works. We'll try to make sure that you don't rely on them and that you're proficient enough to build your own tools: understand how your code is structured, what the external libraries are, what the internal libraries are, et cetera. And probably you'll also gain some proficiency in Scala, because our tool, Joern, is built on Scala. Something more about interactive code analysis: Fabian, my colleague, used to give this line to me. He used to say, "Each program is its own universe, and hacking is about exploring, documenting and exploiting its rules." So it's not just about pressing a button, getting some results, and moving on. It's a lot about understanding the program. It's like a big universe, and there are connections in that universe of code. And it's about exploring, documenting and exploiting the rules that have been set inside that code base. Many tools do that. When you're writing code, you use debuggers, you use AddressSanitizer, Valgrind, et cetera. We are used to them and they give out results to you.
So what Joern tries to do is flip this approach: you have the code base in front of you and you ask it questions. You ask very normal questions: can you tell me where this method has been used, and is there an if statement before this, and is all of this in this file, which has been called by this method, et cetera? These complex networks of questions that you may have — you may be able to solve them with a tool like Joern. So it's like the play-pause debugging that you are used to with dynamic analysis — working with GDB, you run something, pause, and debug that way — but now think of it as moving to static analysis. So now it's inside static analysis. What we require for this are some fundamentals of programming languages, and that's what we're gonna cover right now. Just ten minutes — it's not gonna be something super in-depth. You can just chill and relax and see what's happening. Small fundamentals. So what even is code? We may have statements like y = x + 50. What happens is that a computer needs to understand these statements. So what it does in the first stage is tokenize them. That stage is called lexical analysis. You tokenize each and every component and make it a little bit more descriptive. Then you do some syntactic and semantic analysis. The compiler will help you do this, where it takes all these tokens that it has generated and starts understanding them. So an equals token here becomes an assignment operator. Then it starts laying the tokens out in a tree-like structure, which is called an abstract syntax tree. And now it becomes easy for a machine, because what a machine would do is start traversing this graph and say: okay, y equals — there is a plus operator — x plus 50.
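To make those two stages concrete, here is a rough sketch using Python's own standard library as a stand-in for a compiler front end. This is Python rather than C, purely for illustration of the tokenize-then-parse idea described above:

```python
import ast
import io
import tokenize

src = "y = x + 50"

# Stage 1: lexical analysis -- the source string becomes a stream of tokens.
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(src).readline)
    if tok.string.strip()  # drop the empty NEWLINE/ENDMARKER tokens
]
print(tokens)  # ['y', '=', 'x', '+', '50']

# Stage 2: syntactic analysis -- the tokens become an abstract syntax tree.
tree = ast.parse(src)
assign = tree.body[0]                  # the assignment node: y = ...
print(type(assign).__name__)           # Assign
print(type(assign.value).__name__)     # BinOp  (the subtree for x + 50)
print(type(assign.value.op).__name__)  # Add    (the '+' operator node)
```

Walking `assign.value` top-down is exactly the "reduce the tree" traversal being described: the machine sees an assignment whose right-hand side is an Add of `x` and `50`.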
So if you want to understand what this is doing, you could reduce this tree, and suddenly you have meaning attached to these simple tokens. It's a tough ask: we are writing in some sort of broken-English-like language, and then we want to convert it into something which a chip can understand. So this is the intermediary thing which happens when your code is there and needs to be executed. Now, what you may have is not just a simple statement — it may be part of a function, like function func. So we will enhance this tree a little further; we call it an enhanced AST. Now we can attach a few more nodes to this graph and say: okay, this is a function, and there's a declaration, which is this assignment operator, and this is how you would basically understand it. So it encapsulates this part inside a function — it's a way to do that. At the same time, we also identify where data is. So these blue things: okay, there's data in x, there's data stored in y, data stored in x again. So we have some information; we attach this information to the graph, and suddenly we have an enhanced AST. Next, this function may not be so simple. It won't just be the statement encapsulated inside a function func, okay? There might be conditions inside it: if this, then make a call, else do something else. In that case this graph starts becoming a little more complex, because now there are conditions inside it, so we need to understand control flow as well. We can then denote a control flow graph saying: there's a function, a declaration, then there's a conditional statement; if it evaluates true, you go here; if it evaluates false, you go here. So you again make another graph, which is called a control flow graph, to represent the control semantics inside your code. So the control is represented here.
The AST represents a lot of information combined, including part of the control and part of the data, but it's basically a tree representation. Another thing that you have along with this is the program dependence graph, or data flow, which defines how the data is gonna flow. Which means, like, the value of z in this statement is dependent on the value of y, which is dependent on the value of x. So we have these three constructs — keep them in mind: we have the abstract syntax tree, we have the control flow graph, and we have the program dependence graph. Eventually, in the normal case, what compilers would do is translate this into other intermediate languages, do optimization and register allocation, and then produce machine code which directly executes on the machine. But what we do, and what we are gonna care about, is this representation. So we are gonna take pieces of code, convert them into a graphical representation which we call a code property graph — which is a mix of all three representations — and we just stop there. So we now have a database containing information about the code. It's like half a compiler — think about it like that; I keep on making this statement. It's like half a compiler, not a complete compiler, and we use the information from that half-compiler to do deeper analysis on code bases, rather than working down at the string level. If you were working there, you have strings, and if you wanna make sense of what's happening inside a big chunk of string, you'll just be grepping through it and trying to make sense of it. But now we have graphs, so we have relationships between the entities. Okay, moving on. This is all very low level. In real cases, you'll have code which is a little more high level, and there are things like classes, member variables, annotations. When languages become more complex, the constructs also become higher level.
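The data-flow idea — z depends on y, which depends on x — can be made concrete with a toy def-use walk. This is a simplified sketch of the concept, not Joern's actual algorithm, and the statement texts are made up for illustration:

```python
# Toy program-dependence sketch: each statement records the variable it
# defines and the variables it uses.
stmts = [
    ("x = source()", "x", []),
    ("y = x + 1",    "y", ["x"]),
    ("z = y * 2",    "z", ["y"]),
]

def depends_on(var):
    """Walk backwards, transitively collecting every variable whose
    value can influence `var` -- the 'z <- y <- x' dependence chain."""
    deps = set()
    for _, defined, used in reversed(stmts):
        if defined == var or defined in deps:
            deps.update(used)
    return sorted(deps)

print(depends_on("z"))  # ['x', 'y']
```

A real program dependence graph stores these def-use edges explicitly, so asking "what can reach z?" is a graph walk rather than a rescan of the statements.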
So obviously these statements that you were seeing here are always there, but at the same time you now have these high-level constructs too, so we need those to be represented on the graph that we had generated as well. Those graphs for low-level individual statements again start becoming big pieces of graphs. Then we take all this information about methods — this method definition is here, this is a return, this is a package name — and we put it on top of that graph as well. So we're constantly making a graph and then enhancing it with information about the code, and we keep on doing this. This helps give us information about what's happening at a higher level in the code as well; we already saw the lower-level stuff. At the higher level, you could say that there's a class which defines a method foo, which calls a method bar, and that class inherits from another class, and the bar method is defined in some other class, which inherits from yet another class. So there are a lot of complex relationships inside the code: hierarchies of classes, hierarchies of types, how methods are called between classes, between different elements, even files. We have all these relationships there. And then there are relationships about how data is flowing. So, for example, the x data is passed to this specific function call, which is actually inside this class and which is defined in a logger class. We have a lot of information inside the code like this, and when we want to make sense of what's happening in your code, you would want to ask these kinds of deeper questions. This is a key component in security analysis: you want to understand the whole code base in a very holistic manner, so that you can ask intelligent questions of the code. So think of it like Google Maps, or any mapping software.
It has a map of your whole city. Sitting here, you do not know what's inside that city, but the mapping system gives you the opportunity to ask a question: find me the most optimized route which goes from here to there through these specific streets. You can ask these questions — find me a way to go to the grocery store while stopping at this place, and tell me how many red lights there are. What we want as security analysts, and especially static analysts, is to ask these kinds of questions of this big corpus of code. We want to move away from grepping and just doing simple tree analysis and stuff like that. We want the holistic approach, where the whole code base is represented. As developers, we always think in graphs while we are coding, and we should also think in graphs while we are debugging. So we want the graph to be the base component, where the whole code is represented in the graph and you can ask questions of it. Which brings us to this tool called Joern. It's pronounced "yern", not "Joern" or "Jo-er" or whatever. But we don't care — you know, as Shakespeare said, what's in a name? So let's move on. Joern is a framework for understanding code, so that you can gain insights about your code and build tools for debugging and security. It's a framework, and we'll use this framework to build our tools. What it allows you to do is take in pieces of code, do some parsing on them, and convert them into this big graphical representation called the code property graph. You should search on YouTube, search on Google, for what this code property graph is. It's something which my colleague invented, and it's a representation of all three graphs tied together so that your queries become very efficient. That's what we have.
So Joern allows you to convert your code into the code property graph and at the same time gives you a shell — a very interactive shell, like GDB or any other kind of shell — such that you can ask a query of the code and you get insights, you get bugs and zero-days or whatnot. And this actually works, because we have found multiple bugs and zero-days with it. And, I don't know, maybe after today, in the evening, you're also gonna find some of them. So again: it's a whole framework. It allows you to ask queries on an interactive shell. You iterate quickly on those queries and gain insights. Then you convert those queries into a recipe that you can run across large code bases, so you automate what you have done — it's not just a one-off. And then you take that and integrate it into your own tools, or you put it in your pipeline, CI/CD, or whatever you want to do with that query. Think about bash scripts: you have bash commands, you can just run them once and they work, or you turn them into scripts and then package those scripts into a bigger tool. With bash, you may or may not be able to build complex tooling around it, but with Joern you can: you have nice APIs, you have a whole framework, and you can work through that. Okay, so before we go ahead, we'll take a three-minute break, maybe two minutes. If you have any questions right now, please ask them in the Query Corner or in the chat here, and Vickie and I will be able to help you. Yeah, actually, Suchakra, can you go back two slides? Yes, sure. I was wondering if you can explain something for us. The code property graph is basically a combination of the three different types of graph representations of code, right? Yes. Could you explain the decision process behind the merging of the graphs? Okay, yeah, that's super interesting. Thanks for bringing this up — some people may have this question as well.
So what happens is, with a graph like an AST — I'll go back to this slide — just having an AST, you may be able to ask: can you tell me what's on the right-hand side of this plus operator? Can you tell me what x is being added to? A question like that you may be able to solve by going through the graph; it tells you that. You may even be able to understand that you are part of a function, okay? But consider a question like this: tell me all the locations in code where malloc is used and where the parameter of the malloc has an arithmetic operator in it; is it part of an if statement, and does that if statement have this condition inside it; and at the same time, are they all called from these methods, and is the data at the parameter of that method attacker-controlled, or controlled by some external entity or some external library calling it? These kinds of navigations — the question I just asked — require not just browsing the AST, but also understanding the control flow, and also understanding the data flow. So a joint graph representation allows you to ask these complex questions in just one go. You don't have to rely on doing a separate data flow analysis and a separate control flow analysis; you can do all three together. So the questions become richer. They become more human questions, I would say. And rather than writing something to navigate the graph from scratch, you can convert your human kind of question very easily into this. Right, because I think that code analysis based on primitive graph representations of code, like the AST, already exists. So I was just wondering how this process is different. Yes — there are many connections in the graph. As I was explaining, for a statement like a call to strcpy or a call to malloc, we would know who is calling that thing.
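To see why one merged graph turns that long malloc question into a single query, here is a toy model where call-graph, control-flow, and data-flow facts live on the same node, so one chained filter can touch all three. The node layout and field names here are hypothetical and much simpler than Joern's real schema:

```python
from dataclasses import dataclass

@dataclass
class Call:
    callee: str       # name of the function being called
    caller: str       # method containing this call site
    args: list        # textual arguments at the call site
    inside_if: bool   # control-flow fact: is the call guarded by an if?
    tainted: bool     # data-flow fact: does external data reach an argument?

calls = [
    Call("malloc", "parse_header", ["len + 1"],     True,  True),
    Call("malloc", "init_pool",    ["4096"],        False, False),
    Call("strcpy", "parse_header", ["dst", "src"],  True,  True),
]

# One chained query over AST, control-flow, AND data-flow facts together:
# malloc calls with arithmetic in an argument, inside an if, with tainted data.
hits = [c for c in calls
        if c.callee == "malloc"
        and any(op in a for a in c.args for op in "+-*")
        and c.inside_if
        and c.tainted]
print([c.caller for c in hits])  # ['parse_header']
```

With three separate graphs you would have to run three analyses and join their results yourself; with the merged representation, the join already exists as edges between the nodes.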
So that's a control flow representation. But at the same time, we also know what the data inside that malloc is: what its argument was and where it is coming from. These linkages are all there in this single graph, and it allows you to explore those linkages rather than just going through the AST and trying to do regexes over the AST. Instead of that, we have a complete representation. Right. Okay, so I think let's move to the first module — we are already progressing quite well. Let's do a quick start. I'll bring this thing up here, and we have our Joern session here. The first thing that we will do is get this simple piece of code, just to get into the mood. So we'll get a small piece of code. It is called thttpd — it's a small HTTP daemon. It might contain a lot of bugs. It's written in C, at least that's what I know. I have used it before for demoing some performance analysis things, but this is the first time I'm doing some code analysis with it. So I have this code with me, in the thttpd directory, like this. What we'll do is start Joern — and if anybody has a problem with the installation, or you want to take a pause or something, just ask questions in Query Corner or in the Zoom chat, and we are here to help. You start Joern by just typing joern once the installation is done, and it's gonna start. At the same time, I want to keep an eye on how much memory I have. I think I'm good for now. So Joern begins like this, and it gives you a nice shell. The first thing that we'll do in the shell is import this piece of code that we have. To do that, we just run importCode and give it the path. You have to give an absolute path for this; since I've started Joern from inside this directory, you can also give a relative path from the same directory. So I do something like this, and it runs.
It goes through the whole code base, understands that it's a C code base, and starts generating the graph for it. It will go through all the nodes and edges of the whole graph and generate a CPG for you. You will find some warnings during the analysis, because we are fuzzy parsing: we might miss some types, member variables, or methods, because with large code bases, C is not written in a standard format all the time. So we might miss a few of them, but more or less all the information that is required for doing a proper analysis is encoded. Once it's done, you will get something like this: the graph is there, and you can now begin using this object called cpg. So cpg is the object that we have inside the shell now. It's populated with this graph, and we can query the graph through it. The first thing that we'll run is cpg.method.name(".*handle.*").name.l — I'll just copy-paste this, and you can do the same thing. By the way, you can hit Ctrl-L and it clears the screen — many of the usual shell shortcuts work — and there's tab completion. If you hit tab once, it will try to complete; you might see a lot of suggestions, but don't get inundated by that. It's fine, you know, just do one tab completion. So what we are doing here: we are using cpg, which denotes the graph. We ask the graph to give us all the methods. Then we tell the graph: give us all the methods which have handle in their name — that's the regex pattern here — and then give me a list of the names of all those methods that you have found. So think of it as traversing the graph, reducing at each dot here. If you are well versed in Scala, you might understand exactly what's happening here and what we are getting. But if you're not, don't worry about it.
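If Scala is unfamiliar, one way to internalize the chaining is this rough Python stand-in over plain lists. It is a toy: the real cpg object is a lazy graph traversal, not a list, and the method names and caller map below are invented for illustration:

```python
import re

# Hypothetical data standing in for a tiny program's CPG.
methods = ["handle_request", "handle_error", "send_response", "main", "strcpy"]
callers = {"strcpy": ["handle_request", "send_response"]}

# Mimics cpg.method.name(".*handle.*").name.l --
# take all methods, keep those matching the regex, list their names.
handle_methods = [m for m in methods if re.search("handle", m)]
print(handle_methods)  # ['handle_request', 'handle_error']

# Mimics cpg.method.name("strcpy").caller.name.l --
# find the strcpy method, step to its callers, list their names.
print(callers.get("strcpy", []))  # ['handle_request', 'send_response']
```

Each dot in the real query is one traversal step: the output of one step is the input of the next, and `.l` at the end materializes the result as a list.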
Just think of each of these steps as returning you something, which you then pass on to the next. So the graph gives you all the methods, then you filter them by this function called name, then you take the methods which have been filtered, get the names of all of them, and list them — .l is just for listing. So the moment you run this, it goes through the graph, finds all the methods which have handle in their name, and lists them. Now you can do slightly more complex stuff, where you look for the strcpy methods inside your graph, then find the callers, and then get the names of all those callers. So: give me all the strcpy methods inside the graph, then find the callers of all of them, then give me all their names. You run that, and these are all the methods which are calling strcpy. Make sense so far? No response yet. Again, just a reminder: you can unmute yourself and say yes or no. It does. Okay, thanks. Yeah, so let's move on. So that was something super simple. What we will do now is go a little deeper and import a large graph. This is gonna take time, because VLC is a large codebase, so if you start importing it, it's gonna take a lot of time. At this point, I'll remind you: you can either generate the graph yourself, or you can actually import the pre-generated graph itself. I have the graph already generated for you; you can import that. I'm gonna put the link for the graph in the Zoom chat. It's already in the Zoom chat. Okay, it's already in the Zoom chat — again, appreciate it. Yeah, it's the Google Drive link. It's like a 320 MB graph. I had asked you to download this. So if you want, you can begin generating the graph. If you do not want to generate the graph, you can use importCpg instead of importCode.
So what I will do is check my workspace. Okay, I think I have the VLC graph loaded. By running workspace, it tells you what you have been using in the workspace. We just did importCode of the small project, so it shows that you have a base graph ready for it, and that it is open right now. Okay, so since I already had the VLC graph, what I'll do is run open and specify the name of the graph that I already had on my system. On your system, what you would be doing is something like this: you will run importCpg with the path to vlc.bin. The moment you do that, you are gonna reach almost the same point where I am, which means the graph has been loaded. You're not generating it right now — you have loaded the pre-generated one. If you want to generate it, you again do the same thing: importCode, and then the generation of the graph runs, as you are seeing in the slides. Once it's done — if you are doing this part, meaning importing the code directly and not opening the pre-built graph as I have explained — it will take time. It takes at least 10 minutes on my machine to generate the whole graph. So we'll pause for a few minutes here. If you have any questions, or you're still generating the graph, just let us know. And if you have other questions, let us know as well; we will try to help and solve them. Right — it's very important to get this step done before we move on in the workshop, because every command that we will run later depends on this step. Yeah, so we'll wait for everyone to at least have the graph ready with them, loaded to the stage where it says: okay, you have the graph loaded, it's ready now. Let us know if you have any questions or you run into any errors; we can troubleshoot it together. Yeah.
And there are people in the Query Corner on Discord — developers of Joern as well. All of those folks are there, so you can ask questions there too, and they will be willing to help you. While it's compiling — if you don't mind, I have a question regarding Joern and the CPG as a whole. I know that you support multiple languages, and they're all compiled, from what I understand, to the same CPG. So is the CPG pretty much language-agnostic, and how does it compile everything — all the different structures — into one sort of normalized format? Okay, thanks a lot, very interesting question. So yes, the CPG is language-independent. We have nodes inside the graph — I will actually try to bring up the structure of the CPG. Everything is open source; I'll give you links to the specification of the code property graph as well. Just give me a second, I'm trying to bring up that page. Code property graph. Okay. So yeah, the CPG is something like this, and the whole specification, as well as an implementation of the graph, is already open-sourced, and you can go through it. As you can see — for first-time users, we recommend building and understanding Joern first — there are all the structures: the methods, the call sites, the data types (which are type declarations). All of these nodes are already defined, and they have information inside them. There's one example here: a type declaration, like foo. So there's a foo type, and it is linked through an AST edge. It says: okay, there's a member variable — so it's of the member kind, and it contains some information about the member. This type declaration also contains a method, and the method has a name, a full name, a location, a line number, et cetera. So the graph contains all these nodes connected through edges. Makes sense. In the same way, we have here a call — a node which defines a call site.
So it says that this method is being called at this specific call site. There's a name for this method. And this is then associated through the full name with the method node as well. And this method node is the definition of the method, where the method is. It will say, okay, this method is inside this file, it has this specific text, et cetera, et cetera. And it is basically being called at this call site. So this way we link all the calls to the methods, to other identifiers and to all the other nodes which are there, as we will go through. So for different languages, the question may also be like this, okay? For example, in Java we can say we have classes. How do you map those? At a very base level, they are type declarations, so we map them to type declarations. So to get classes in Java, you would do cpg.typeDecl to get type declarations. And for C, we will have the type structure, so you would do cpg.typ. We will have types like int and all those specific types that you have designated. Structures can also be mapped to something like that. So we have all of these nodes and all of these edges which have been defined, and they all map to different programming languages. In JS, for example, if you have anonymous methods, which is a kind of construct not in other languages, they will just be methods. Their name will be just like an arrow function, or their name will just say anonymous. But when you get the full name, it's gonna say that it's the first anonymous function in this file. So this way we try to map all different languages into the same structure. Okay, so Mansi was asking, while getting the graph ready, what languages are supported. So for Joern, okay, some people are still getting set up, okay, fine, yeah, sorry. So for Joern, we have C, which is supported by default. It is at a decently advanced stage. I would say it's fuzzy C parsing which is being done for C.
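The language mapping described above can be sketched with the open-source Joern query steps (the regexes here are only illustrations):

```scala
// Java classes and C structs both become TYPE_DECL nodes:
cpg.typeDecl.name.l

// Concrete types as used in the code (int, vlc_log_t, ...):
cpg.typ.name.l

// Functions and methods, whatever the source language calls them:
cpg.method.fullName.l

// Call sites, with .method stepping to the enclosing method:
cpg.call.name("memcpy").method.name.l
```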
We also have Java support now with Joern, and both of them support data flow. Data flow is a very difficult thing to do when we try to do graph analysis, and we have decent data-flow analysis for both of these languages right now. We have a commercial variant of this which supports other languages as well; we won't discuss it today. So for Joern, we have most of these things ready, meaning C and Java. So C, C++, Java. There is very initial binary support as well. I don't wanna spoil anything right now, it's gonna come in soon. It's via Ghidra. And then we are also coming up with PHP support. It's all a community effort, it's all open source, so people are generating this for different languages now, using this as a graph structure. So, okay, yeah. Maybe not 10 minutes for importCode, yeah, okay. So assuming that everyone has the graph ready at least, can I assume that and move on? Vicky, what do you think? Yeah, Suchakra, do you wanna answer the question in the Zoom chat first? Okay. There's a question: how well does it deal with calls that are polymorphic, and calls in dynamic languages? Okay, so yeah, that's a good question. Static analysis can just do so much. For example, if you're generating dynamic calls, it may just not tell you whether this dynamic function is there or not. For polymorphism, yes, you would see method nodes with similar names. They will have some different full names and you would still be able to get those separate nodes. So I think it would work. I think Niko is also there on the call, so if you have some expert answers, Niko, if you wanna give them, you can. I don't know if he's here or not. He's in Berlin, I don't wanna disturb him. I don't think he's in the Zoom. Oh, you never know with Niko, he might be there under some pseudonym. Uh-huh, yeah, that makes sense. So has everyone gotten their graph yet, loaded in Joern? I'm actually having just a little bit of trouble.
Could someone give the exact incantation? Do I have to be in VLC to open it, or can I be in its parent directory? Oh, you can be in its parent directory. Okay, all right. I don't know if I've just installed Joern incorrectly. I'm just having weird import problems. I don't know. Okay, can you send a screenshot or something to the Discord and probably we'll try to go through that? Yeah, will do. The Discord, I'm not seeing a Query Corner channel. Where is that? It's called Query Corner. I don't see Query Corner or anything like that. I'm in the Discord right now. I see conferences, workshops, training. Okay, okay, so this is a separate server. The link to that server, I think Vicky has the link, yeah. We just have a separate Joern channel. There we go, thank you very much. Sure. All right, I will sort my stuff out as I go. Please don't wait for me. I read some of the slides last night, so perhaps I don't feel the need to wait. Okay, thanks a lot. So I believe we should proceed. And even if you are able to do half of this or it takes some time, don't worry about it. Do it at your own pace. All the instructions are there. They are tested, so this will definitely work. And if there are questions, the channel is always open. You can come into the Joern Discord and ask questions, and we'll be able to answer them. Okay, so moving on, we have the VLC CPG ready. There are some tools in the CPG by default. In due course of time, when you become advanced enough after a few months or something, you may be able to write these tools also. These are basically what we call overlays. If you run these tools, they enhance the graph a little bit. So what we'll do here is run one basic enhancement over the graph. It's called ossdataflow, and we will run this: run.ossdataflow. What it's gonna do is basic data-flow analysis over the graph. Okay, so my graph already had that. For you, it may take some time when you try to run this.
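The overlay step just described, as run in the session:

```scala
// Run the open-source data-flow overlay; it adds data-flow edges
// and tags interesting sources and sinks on the existing graph:
run.ossdataflow

// Persist the enhanced graph to disk so the work is not repeated:
save
```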
So it will do some enhancement over the graph, where it tries to connect various variables and relationships between the variables in terms of data flow. So where is the data coming from for this variable x, and then going all the way up and all the way down. And at the same time, identifying interesting sources, interesting sinks, and then tagging them and saying that, okay, this might be an interesting source, we should look at it. This might be another interesting source. This might be a sink and we should tag it. So it adds some small tags to them, and at the same time does a proper data-flow analysis over the graph. Before this stage, it's just a basic graph. There is AST in it, there is very little data-flow tagging done, but we have this open-source data flow which links all of the variables and the data flowing inside those variables properly. So we run this, and then please do save. Once you do save, it saves the graph to disk so that you don't have to spend another 10, 20 minutes doing this. So save is important. Please do the save. Once all of this is done, your graph is inside the memory, and it might be a lot: it's already taking like three, four gigs, I see. If you do the save, you won't have to do it again. So just make sure that you save the graph. Okay, moving on, we will do some very basic queries. These kinds of queries, I think you have already done. I'll probably increase the font size so it becomes a bit better. So our graph is ready, we have the CPG ready with us, everything is fine. We need to reach this stage. And once that's there, we start asking questions to the graph. And then we say, okay, give us all the methods which have parse in their name, go through them and just list them out. And it ran and it listed out all these methods which have parse in their name. Okay, you can tune how your output looks, but it's nothing super fancy. You can do all of this stuff in an IDE.
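A sketch of that first query, listing every method with parse in its name:

```scala
// Method names matching the regex ".*parse.*":
cpg.method.name(".*parse.*").name.l

// The same nodes with full details (file, line numbers, code):
cpg.method.name(".*parse.*").l
```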
So all of this is just basic features that you will have with your IDE. Okay, so I'm just trying to comment this out. So, all the basic features you'll have in your IDE. And we also have this thing called dump. So if you are on a method node, which means cpg.method is gonna return method nodes, and you then filter those nodes by names matching parse and sig, what you're gonna get after this dot is basically methods. So we take these method nodes, and I'll show you how these nodes look. If you do something like this, so let me look at one of these parse-sig methods or something, and if I do .l, it's gonna list all these method nodes. So this is what I was talking about, how individual nodes look. They have information about them. They have the code, like what's inside the code, the file name, in which file they are, where they are defined, where the definition ends, what's the exact name of this, et cetera. So we have the whole node here. So what we do here is the node is there with us, but now we just dump it. When you do a dump, it basically dumps the exact code attached to each of them. So this is one, this is another one. So it's like a list of all the dumps. Basically, that's what it is. Nothing fancy, you could do this with an IDE itself. But now we try to do something a little bit more complex. What we are gonna do here is look at all those methods which have parse and sig in the name. We got that list already, like there were 60, 70 or so. And then we can use this thing in Scala called map. What map allows you to do is take those methods which are being returned after this dot, all of those interesting methods that we have, and for each of them it's gonna create a list of tuples containing the file name and the code of that method. So we just do this, and we'll see now that the list has been enhanced into tuples. So each tuple contains the name of the file, but also the text of the method itself. And the text is like the string that you're seeing here.
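The dump and map steps described above might be sketched like this (the ".*parse.*sig.*" pattern is my reading of the method-name filter used in the session):

```scala
// Dump the source code attached to each matching method:
cpg.method.name(".*parse.*sig.*").dump

// Build a (filename, code) tuple for each matching method:
cpg.method.name(".*parse.*sig.*")
   .map(m => (m.filename, m.code))
   .l
```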
So this way you can quickly do some things. If you find something interesting, for example you wanna dump all these methods to a file, then you use this piping operator that we have. It's a pipe and a greater-than sign, |>. So it pipes the output to this file, and then suddenly you'll have this file which contains all these methods. What we have found is that when you're trying to work with Joern and you find something interesting, you have a nice query, you could just pipe that text, the raw text, to a file and save it for later. You could basically pipe any string. Anything that returns a string, like this thing returns a string, you could pipe that to a file. Quick comment: I'm using the pre-compiled binary, and when we're running this command, I'm getting an error reading from the home directory of whoever made the graph project. So it's using your home directory. I just wanna let everyone know that the path might be wrong if you're using the pre-compiled one. Okay, okay, yeah, that's possible. That's because I generated that binary, and the information about where to read from probably also gets encoded in the graph. Okay, at this point I think I would recommend... Does it mean that the line numbers are also gonna be messed up when you use somebody else's CPG? Yeah, I think so, I think so. But I think there is a way in which we can also provide the meta information. So what happens is that in the graph, let me go to the place where the graph is. Workspace, I think this is the workspace. We have VLC here. So this is how the graph is stored inside the workspace, and you have other information in project.json. So I think it reads from that also. And if you would... You can modify that. If you modify that, it becomes a bit easier, yeah. So ideally I should package this whole thing and give it to you, but I just took this vlc.bin and sent it, okay?
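The piping trick mentioned above, as a sketch (the output path is a placeholder; mkString joins the dumped method bodies into one string before piping):

```scala
// Pipe a string result to a file for later triage:
cpg.method.name(".*parse.*").dump.mkString("\n") |> "parse_methods.txt"
```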
So from next time you'll remember, but I didn't want to package this because it also has this temporary binary of the graph. So when you're moving your graphs, remember from now on, just move this whole directory. Copy this whole directory, target, and put it in someone else's workspace, adjust this, and I think it would work. Okay, yes. So thanks for that information. Thanks for bringing that up. So we were somewhere here, yeah. And I explained to you how you could dump these strings into a file and then use them. So same thing, nothing interesting. There is also a pager inside the Joern shell. So if you have some commands outputting a long bit of text, you could just do browse, just like you would use it. It opens this up in a pager, you can browse through it, and then you press q and it quits back. It's actually using the system's pager. So it's straightforward, nothing fancy there. You could do all of this stuff in an IDE. So we have not really reached nirvana at this point. From now we'll start going a little bit deeper and exploring links. What we'll do is look at one specific method, try to find all the local variables inside that method, and then list their names. So this is exactly what was happening somewhere here, where you have a method and that method, okay, there's a type here. So is there a method one? I guess not, I didn't create that. So if you had a method node, there is a link to that method node from a local variable. So we have those links, and what we are doing is exploring those links. So for this specific method, and maybe we can dump this method to actually see if this exists or not. So for this specific method, we can actually see that there is a local variable which was defined somewhere here. So we have this information encoded in this. We can go and do some similar stuff that we were doing before, where we look at this specific method.
We take its location, and then we map its line number and file name so that we can get information like that. Standard stuff, nothing fancy. Now we go a little bit deeper. We saw that local variable that was there. So this is the same as what we had before: look at the method, inside this graph find out what local variables are there, get the types of those local variables, and then list them. So we suddenly see that we can actually get the size also. Okay, yeah, that's the path issue that's there. Okay, so yeah. And now we can do some more stuff, where we actually want to look at all the outgoing calls from inside this method. So there's a method parse_public_key_packet, it has calls inside it, and we want to list those calls. It's as simple as doing .call, getting the names of those calls, and listing them. So we see there are a lot of these calls here. Okay, so you can now see that operators are also calls for us. Why? Because an assignment is essentially a call with arguments of a left-hand-side value and a right-hand-side value. So it's again a call. We call all of these individual call sites; all of the operators are also call sites. You could maybe filter them out by doing something like this: .call.whereNot, and then inside it a traversal filtering on the name, matching names that start with operator. Should work. Okay, so now what we have done is some basic filtering, where we looked at the calls but filtered out the ones which have operator in them. So these are the real calls which are happening inside this method that we were discussing, like read MPI calls and memcpy, these calls which are happening. So we removed all these calls of the equals operator, of the increment operator, this operator.
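The call listing and the operator filter just walked through, as a sketch (parse_public_key_packet is the method name used in the session):

```scala
// All call sites inside one method; assignments, increments and
// friends show up as calls named "<operator>.something":
cpg.method.name("parse_public_key_packet").call.name.l

// Keep only the "real" calls by filtering out operator names:
cpg.method.name("parse_public_key_packet")
   .call
   .whereNot(_.name("<operator>.*"))
   .name.l
```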
So we removed all of them, and we get the actual calls now. Okay, moving on, we can not only find who this method is calling, but also who is calling this specific method. And we can actually get that. So something is calling parse_public_key_packet, and we can go one level above and say caller, and now we know who is calling parse_public_key_packet. Then we go one level above again, say caller again, and keep on going, trying to find out who is calling this specific method. Okay, now this becomes a repetition. So what can be done is use the repeat constructs which are there. We could do something like this: cpg.method.name, so we get that method, and then we repeat caller, caller, caller, and every time we repeat it we emit the name of who the caller is. So we do something like that. And this is basically a call chain. So we can see that this was called by this, was called by this, was called by this, all the way to main. So we see a direct call chain from here to here. You could do the reverse also, where you find a method and then try to find who all they are calling, and get a list of it. But as you can see, in a graph if you go from top to down there would be so many branches, and it's gonna become very expensive to traverse all those branches and list all of them. So we do the reverse: we try to go from down to up. The down-to-up flow is usually easy because they're direct simple callers, a small set. But if you go top to down it's gonna be very deep, because there will be so many callees branching out from a high level. But you could try that out. So for example, if you try to go from download_key, the next query which is there, I don't wanna run it because it's gonna take time. So in the next query you are trying to get the callees of download_key, all the way until you reach parse_public_key_packet. Okay, until you reach this.
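The repeated caller traversal might be sketched as follows (the depth bound is my addition, to keep the upward walk from running forever on recursive code):

```scala
// Walk upward through callers, emitting each method on the way,
// which yields the call chain from this method up toward main:
cpg.method.name("parse_public_key_packet")
   .repeat(_.caller)(_.emit.times(10))
   .name.l

// The reverse direction (callees) fans out much faster:
cpg.method.name("download_key").callee.name.l
```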
If you try to do that, you'll get a big list, because it's not gonna give you just the straight list but all the possible ways in which you could reach that. So that's also something interesting. So this is repeating your graph traversals. Now it's become interesting: from just searching the graph for certain methods, searching the graph for certain call sites, we are now able to link them together, navigate the control flow, navigate the AST a little bit, find out things like that. And that's what we're gonna do next. So for example, let's think this through in another way, where we are now trying to navigate the types and variables and understand how filtering could be done. So we try to look at the types which have VLC in their names, and then find all the local variables of this type. So it's running through the whole graph. So there are all these local variables which have a type which starts with VLC_, and these kinds of things are interesting because these are user-defined types, not primitive types. Since they're user defined, we do not know what might happen to them. And that's why it's interesting: they start narrowing down your analysis, the manual part of the analysis. And we go a little deeper, and then we map and create a map of the name of that type and at the same time the names of the member variables of that type. Yeah, so we just get vlc_log_t, but probably there is nothing with vlc_log_t. Okay, let me see the next query. I think there was. So yeah, okay, there is, we should have seen message here at least. So what's happening here is we are now not just going from the type, we could go the other way around, where we are going from the local variables. So cpg.local is gonna list all the local variables, and we filter where the type has vlc_log_t in its name. Okay, we do that and we get a list of that. And then again, we could go a little bit deeper, because everything is linked now.
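The type-driven narrowing just described might look like this in the shell:

```scala
// User-defined types whose names start with VLC_:
cpg.typ.name("VLC_.*").name.l

// Map each matching type declaration to its member names:
cpg.typeDecl.name("VLC_.*")
   .map(t => (t.name, t.member.name.l))
   .l

// The other direction: locals whose type mentions vlc_log_t:
cpg.local.typeFullName(".*vlc_log_t.*").name.l
```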
So I'm trying to show that link, where you can start from a local which has a type named vlc_log_t, and then say: give me the method in which this is used. So it's basically gonna give you where that variable is. Does it make sense? What you're trying to do is start from the graph, pick one kind of thing you're interested in, a local, that specific local which has a type vlc_log_t, and then ask in which methods that local is. And the moment you do it, it tells you that. So it's only one. And then I think it also gives you a nice arrow, something, probably, okay, not there, but it gives you this arrow. It tells you that you have this object of vlc_log_t and it's in this specific method. So I think it would be somewhere here, okay. So this way you can see that we have a holistic graph there, from where we could go from a variable and find in which method it is, based on a certain condition being met. So we found that condition being met by doing this basic filtering and then finding that specific method. Again, same thing, but now we're gonna go one more step, and we want to find which file that method is in. We found that method, and now we want to find which file that method is in, and it just gives you that information directly. So all things are linked. Can I quickly interject and ask you a question about CPG generation? Yes. So people are getting this warning, member access linker cannot find type member. Does that affect the CPG generation, and will you still be able to perform analysis? Yes, they would be able to perform the analysis. So again, I had explained this once, I'll remind you again. What happens is there are multiple ways in which code is written, and the graph generation is fuzzy in nature. So it will try to cover all those cases, and in some cases you might miss a few things.
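Chaining from a local all the way to its file, as walked through above:

```scala
// From a local of type vlc_log_t to the method that declares it:
cpg.local.typeFullName(".*vlc_log_t.*").method.name.l

// ...and one more hop to the file that method lives in:
cpg.local.typeFullName(".*vlc_log_t.*").method.file.name.l
```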
So for example, in this case, it was able to find the member variable, but it could not 100% tell what type it is. So it's not gonna stop the graph generation. It will just proceed on, but you may miss a few things, and it's fine. It's not a big deal if you ask me, like it's not a very big deal. You still have a lot of coverage inside the graph. Okay, thank you. I think anyNY0 had asked this. Okay, so let's move on. We can do something more interesting with this as well. Since we know about methods, we can go inside each method and then filter them down to those methods which have more than four parameters. It could be interesting to list things like this, like these methods have more than four parameters. So again, stuff like this gives you a little bit more information about code complexity. So here we have methods, and we are filtering them by those methods which have more than four control structures. So if a method has more than four if statements, it will list all those methods which have more than four if-else statements. Things like this. It's a way in which you have information about ASTs, you have linked them to a specific method, and you can ask these in-depth questions of the whole code base. And for example, methods which have more than a single return. Okay, those can be interesting, and people have asked us this question: I'm concerned that there are many methods which have more than two returns, or which have more than three returns and the returns are of this type, I want to find all of them. So all the locations. So any kind of code violation, you can map them to these queries, run these queries, boom, you get results. And then sometimes if there are more results you will triage them. Sometimes you just get targeted results. So same with this: we now are able to just touch the AST, the raw AST, find all the methods which have more than one return statement, and list all of them out. So things like that.
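Sketches of the complexity queries described above:

```scala
// Methods with more than four parameters:
cpg.method.filter(_.parameter.size > 4).name.l

// Methods with more than four control structures:
cpg.method.filter(_.controlStructure.size > 4).name.l

// Methods with more than one return statement in their AST:
cpg.method.filter(_.ast.isReturn.size > 1).name.l
```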
So code complexity, going through them. This one is interesting, which I like, which basically goes through each method and then filters out those methods which have for, do or while control structures, and more than four of them. So there are more than four loop control structures inside that method. It goes through all of them and lists all of them. You could even find complexity in terms of depth. So you could look at the control structures that are more than three levels deep. So there is an if, and then there's another if inside it, and another if inside that, and another if inside that. So you could even find all the methods which have nested if-elses beyond a given depth that they should not have. So coding practices and stuff like that. We have still not touched security. It's still something super basic at this point, trying to find code complexity and also trying to understand that we can navigate the code like this. Now let's go one step further. The graph marks something which is called external, and it tries to give you all the methods which are external, which means if you have included stdio.h, the methods called from the standard input-output library are external. And what we can do is filter only those methods which you have not written, which are external, and just list them out from the whole graph. It's gonna take some time. So these are all external methods, and they also give you a good point to start investigating: okay, this is something which we didn't write but we are using from different libraries, we should investigate them. So this is also something super nice. Okay, so let's stick to some simple stuff here. We'll stick to, for example, trying to find all the call sites which have str in their name. It runs, and we have string compare and all that stuff. Nice places which you should look into, but you could do this with an IDE, not interesting.
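The loop-count, external-method, and str-call queries might be sketched as:

```scala
// Methods containing more than four loop constructs:
cpg.method
   .filter(_.controlStructure.controlStructureType("FOR|DO|WHILE").size > 4)
   .name.l

// Methods we did not write ourselves, e.g. pulled in from stdio.h:
cpg.method.external.name.l

// Call sites with str in their name (strcmp, strcpy, ...):
cpg.call.name(".*str.*").name.l
```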
But now, this is interesting, where we try to find wherever str is used, and all the methods in which it is being used. So these are all the places in which any str kind of thing is being used, okay. And what we will do now is go a little bit further and try to find all the places where sprintf is being used, look at its second argument, and that second argument should not be a literal. Okay, so there is in fact just one place where sprintf is used and the second argument is not a literal. Everywhere else, it's probably a literal. So it's an interesting place. It should not be there, which means somebody else is controlling what comes into sprintf. So an interesting point to look into. And then we could actually do something like this. Okay, I think I forgot, method maybe it should be. So we get the call site, and then we should get a method from here. What does it do? Okay, just give me a second. Oh, it's not filterNot, it's whereNot. Okay, so this just makes sure: this is not filterNot, this is whereNot. This is also interesting. whereNot expects that you should have a traversal inside it, whereas filterNot expects just a condition being met or not. So you can use either of them, but make sure you use the correct one. Here it should be whereNot. So what we did here is we went to this call site, looked at where the second argument is not a literal, and then we just did a dump, which dumps the method where that call happens, and it actually shows you. So this is a quick way to find out that, oh, we had a problem, and now where is it? How is the format being controlled? So you can quickly see how the format is, okay, it's still a string, it was defined here, and you can triage it fairly quickly. This brings us to the quiz. We'll also take a pause here while you think through this. We do a quiz where we try to create a query that finds recursive functions.
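The sprintf check walked through above, as a sketch (Joern's argument indices are 1-based, so argument(2) is the format string):

```scala
// Calls to sprintf whose second argument (the format string) is
// not a literal, i.e. the format may be controlled elsewhere.
// whereNot takes a traversal, which is why filterNot was wrong here:
cpg.call.name("sprintf")
   .whereNot(_.argument(2).isLiteral)
   .dump
```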
So think through this, don't go to the next slide, don't look at the answers, but think how you would create a query that finds recursive functions. Try to think it through in normal human language, and then see how you could map it down to something like Scala, or rather the language that we are using here, the query language, which is actually 100% Scala. Now we'll take a pause here also, and probably some folks have questions, we'll try to solve that. I think most questions are on Discord right now, so you can just take a look, that would be great. Yeah, yeah. Okay. Yeah. So, getting this error upon importing, let me see, an access denied exception. So, just before you dig into that screenshot: somehow I've borked the install, and I have Joern installed twice, and one of them is installed as root and that one actually works, the other install doesn't, and so right now it appears that it is loading the CPG finally. Okay. So sorry, sorry for all that, I don't know what I did. I did it at five o'clock this morning trying to set up for this, which was stupid. Okay, so I think it's also an indication for us that we should help a little bit more on how a proper install should be done. Yeah, I think, yeah, probably this is what must have happened: the files that were created were owned by root because you did a system-wide install, and then you tried to run it as your user. So I think that's what must have happened. Okay. And okay, Jay is asking something. Is there a plan to use some sort of compiler instrumentation while compiling, building a code base, instead of using fuzzy parsing? Yes, there is a plan. One plan is to just go all the way to the end, which means binary, and then try to work through this. This is in process. There is Ghidra to CPG, which is being developed by Niko. And then I think there is also a plan to... So LLVM to CPG is open source right now. I'm not sure. Let me just check it. Just give me a second. Or maybe it's not. Just give me a second.
Let me find out. So we did open source parts of LLVM to CPG, which allows you to take the LLVM IR and convert it back to the graph so that you can explore that graph. I need to find out if that piece is open source. Okay, I'll write in the... Jay, I'll get back to you and I'll let you know in this chat session whether this is open or not. Yes, and you're right. Fuzzy source parsing misses something, we have observed it does miss something. And as you are also seeing, it sometimes misses things depending on the way people have written pieces of code, because it's like us writing half a compiler. So we are only able to cover so much, but there's a large corpus of people who have already done that, looked at all the different variations of C, how it's written, all the variants that could be out there, and compiled all of them. So we might miss stuff, but it's still working, I would say. Some folks say that fuzzy parsing is enough for them, and some would say that they want complete control. Fuzzy parsing is good for some folks because they don't need to carry compiler instrumentation. They can just look at source code when they go to their customers or their clients, and they don't need it running and buildable at the same time. So some folks like that. So it's your choice, but yes, we are gonna build that. Okay, so recursive queries. How would you find a recursive function? A query for a recursive function. It's pretty straightforward. We want to find all the methods which have call sites having the same name as the method. It's as simple as that. And you could actually do this, and a nice story about this query is that someone in a workshop which I was running before asked this question, and I just converted it into a quiz. So you could actually create, oh, what did I do?
Okay, you could actually create a query where you look at all the methods and filter them by the ones where x, which is that method itself, has a call site, and the list of all those call sites contains the name of x, okay? It's probably not gonna be 100% recursive functions, because it's contains, and contains only checks partial strings and not complete strings, but it's still a good starting point inside the whole VLC code base. And you can actually see there are recursive functions. Look, recursive insert, copy, recursive add to parent, and many of these are recursive. Like locking and unlocking is recursive. So you will find mutexes, et cetera, maybe somewhere. We see taking and releasing mutexes usually, okay, like this, mutex lock and unlock. These are also recursive in nature. So we do find there are recursive functions. Nice, interesting sites to look into that might cause places to break. So people want to look into that when they do audits. Moving on, we go to module two. We can take, I think, a five-minute break, and if people have questions, you can come back to us. We will take some questions, and the questions will be on Discord, and here you can unmute yourself and ask as well. Five-minute break, but actually just give me two minutes. I'll move away from my desk and get some water. Vicky, can you please take over? Sure, yeah. Feel free to ask any questions while Suchakra is gone. And you can submit your questions to Discord as well. Let me just go through some of these questions right now. So yeah, most of what we've shown in Joern is in the shell right now, but we're gonna show you how to use it as an automated tool later in this workshop. Joern comes shipped with a tool called joern-scan, which is basically a prepackaged Joern where you can put queries in the query database and then run the entire query database by entering the joern-scan command. What's the way to, well, how can we visualize the CPG or the code as a whole?
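The quiz answer described above might be sketched like this; the contains check matches partial names, which is exactly the caveat mentioned:

```scala
// Methods containing a call site whose name contains the method's
// own name: an approximation of "recursive function":
cpg.method
   .filter(m => m.call.name.l.exists(_.contains(m.name)))
   .name.l
```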
Is it only through the commands, where we get somewhat raw output, or can we get at the graph database and navigate it using normal tools? What do you mean by normal tools? I mean, I've used TinkerPop and Gremlin a bit before, and we had tools to just visualize the database. That helps a lot when debugging, to confirm what the data types are when looking for specific things. So is this all abstract right now via the CPG and only accessible via the shell, or are there other tools — say, to dump the raw JSON of some part of it? Actually, I think Joern comes with a graphing tool that lets you print out the entire code property graph, if that's what you're looking for. Suchakra, what's the command for printing out the graphical representation of the CPG? Okay, super interesting — people are asking very interesting questions. I didn't show this because I didn't know whether people would like it or not. So I'll show it for one of these methods. We have a method, and there's something called plotDotAst, which plots the complete AST and shows it to you. You run it, it generates an SVG, and it opens that SVG. Since I have Inkscape, it shows me something like this. As you can see, this is the complete representation of the method. We have this method node — it even says what kind of node it is: a method node, and its name is parse_public_key — and then it has all these other nodes connected to it. The block contains all the information about the code; the parameter nodes are the independent parameters of that method. So this method has three parameters, one method return, and the block, and the block contains everything inside the method. This method also had a conditional statement — or, no — yeah, it had a conditional statement, so there's a control structure.
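The plotting step just shown might be sketched like this (the method name is illustrative; `plotDotAst` is the AST plotter being described, and sibling commands for the other graph views come up next):

```scala
// Sketch: render the AST of one method as an SVG and open it in the
// system image viewer (method name illustrative).
cpg.method.name("parse_public_key").plotDotAst
```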
It says it's an if control structure: if it's true, do this; if it's not true, do that — and then what comes next, an incrementation, and so on. Basically the whole AST. We have the whole AST, but as you can see, we can generate as many graphs as we want, and they will look very complex. What we need is an actual query engine that lets us traverse this properly, and that's what we have here. But it's nice sometimes to plot the graphs to quickly check whether things are mapped properly inside your code base. You can do the same thing with plotDotCfg, which prints only the control flow graph, and plotDotPdg, which prints the program dependence graph. The control flow graph looks a bit more linear in this case: you have a call, then a call to a less-than operator — some packet-length checking — then another call, still an operator, then a call to a field identifier, and maybe a condition somewhere, like a not-equals. If the condition is met, you do this; if it's not, you do that; then another condition, and so on. That's what the control flow graph looks like, and all of it comes back to the return of the method. When all the conditions are checked and met, there might be calls in between; they will be noted as CALL nodes — I don't know exactly how they're attached in the visual representation. This is mostly for our internal debugging, I would say, but still interesting to note. Okay, so let's move on. We'll do some security stuff — a little more interesting, and something that's easy to understand. In the repo that you have, there is something called alloc-party. And again — I don't know where you have cloned it, but don't use the temp path here.
Just use wherever that alloc-party directory is and then import the code. So we import alloc-party, and it imports — again, some things are missed out; some edges were not created for some nodes. It's fine, don't worry about them. Once you have the alloc-party code there, I'll also open the code itself so it becomes a bit easier to follow. It's a small piece of code, so importing is fast — no problems there. You import it, run the OSS dataflow pass — run.ossdataflow — and then save it, which writes the graph back to disk. Then we have the graph ready. It's fast because it's just a single file with not much inside. Okay, once that's done, I'll explain this simple memory-allocation bug. It's called a zero allocation or overflow bug. What might happen in a case like this: you have an allocHavoc method with an argument y, and inside it there's a memory allocation with an arithmetic operation in its size expression. z is fine — we know it's 10 — but y comes from outside, so anybody could call allocHavoc with some value of y. If the value of y is, for example, zero, it causes a zero-size allocation, and we get basically a null pointer at x — and then we use it. So we use zero-allocated memory, which might cause a problem, because this is dynamic in nature. At the same time, if y is, for example, around half of UINT64_MAX, you might have an overflow: with y * z + 1 — say z is two — or with y being UINT64_MAX, the size computation wraps around. Suddenly it tries to allocate a huge chunk, overflows, and you're left with a much smaller allocation than you expected. Imagine you start using that pointer.
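The import-and-prepare steps for the example might look like this in the Joern shell (a sketch — the path is wherever you cloned the workshop repo):

```scala
// Sketch of the shell session: import, run the dataflow pass, persist.
importCode("path/to/alloc-party")  // wherever the example lives on disk
run.ossdataflow                    // compute data-flow information
save                               // write the graph back to disk
```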
It's going to cause bugs. So it's as simple as that: you might end up in an overflow condition, or in a zero-allocation condition. And it's not a joke — this was suggested to us by some folks, I wrote some queries for it, and then we actually found a bug in VLC itself. Today you're going to see that exact bug, found through this silly mistake of an allocation that overflows and then gets used somewhere, which definitely causes things to break. With Joern we were able to catch it, and I hope you're able to catch things like this in your code bases too. So we do the same thing — nothing fancy — but now we step it up a little. First, let's find all the places where malloc is used. `call` gives you the call sites, so: all the call sites of malloc where the first argument is a call to a multiplication operator. Then list all the nodes, and we quickly get the locations — if you run this on VLC, you'll get all the VLC results as well. Next we try to find flows. This is the first time we'll do some data-flow analysis, and you'll be able to follow along. We take all the methods that have "alloc" in the name and take all their parameters. So anywhere an alloc-like function is defined, we're trying to find a parameter of any method with "alloc" in its name, and we mark those as sources — that's how you mark sources: you define a small function, source. Then we define a function, sink, which is what we were just describing: any argument of a call to malloc where the first argument contains a multiplication operator. We mark those as sinks.
What we're trying to do is find a flow from the parameter of any method in your code base with "alloc" in its name, all the way to the argument of a malloc that has a multiplication in it. We mark them as sources and sinks, and then we can do sink.reachableByFlows(source). Ctrl+R does reverse-i-search in the shell, so you can pull the earlier definitions back up. So: sink.reachableByFlows(source), and then .p, which prints the flows in a pretty, printable format. What we have here is a flow coming from allocHavoc and its parameter y — because that's what we started tracking — all the way to y * z at line 13, which was exactly this. And you have the whole flow right there; it tells you which method it's in — it's inside the allocHavoc method itself. This is a very simple data flow, not a complex one, but you can do the same kind of thing on the VLC code base as well. We'll do some of this — not a lot of data flow; it's basically up to you how you want to use it. But data flow is one of the most interesting aspects: you want to find out how data flows from one method to another. If you know data is tainted, you want to see whether it flows to a dangerous sink, and you can use this kind of semantics to mark sources and sinks and then look at the data flow. So, a quiz: in this same function, in this same file — allocHavoc — we have a double-free situation. Memory is allocated and then freed, but then we free it again. So there's a double free. Double frees are quite common: last year there was a big WhatsApp bug that was actually exposed by a double free in, I think, a GIF library or something — I don't exactly recall what happened at the time.
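The source/sink flow query described over the last two paragraphs might be sketched like this (hedged — the operator name and dataflow API follow Joern conventions, but check them against your version, and `allocHavoc`'s parameter is what gets tracked):

```scala
// Sketch: parameters of any *alloc* method as sources; malloc size
// expressions containing a multiplication as sinks.
def source = cpg.method.name(".*alloc.*").parameter
def sink = cpg.call("malloc")
  .where(_.argument(1).isCall.name("<operator>.multiplication"))
  .argument(1)

sink.reachableByFlows(source).p  // pretty-print the flows found
```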
There was a GIF library — you send these nice GIFs in WhatsApp — and that had a double free. I'm not exactly sure whether it was the GIF library itself, but there was a double free in WhatsApp, and the condition was very close to this: memory was allocated, freed once, and then freed again. That causes an unintended free — you suddenly free memory you were not supposed to free. So this is a good condition to look for. How would you search for situations like this? You can give your answers in plain English and I'll translate them into a query, or you can try to work out what the queries would be. Keep in mind the concept of sources and sinks: we want to find flows going from the return of a malloc to a free. Was Heartbleed something like this? Yeah, I think it was a mix of things — I'm not exactly sure; I'll check, but I think there were two or three things happening. I guess my question is: could you find Heartbleed with Joern — would you be able to construct queries complex enough? Yes, yes, yes. You would be able to construct queries complex enough. I'd say it would take you 70% of the way, because Heartbleed is one specific instance where one specific thing happens. It will point you at those cases, take you 70% there and give you 70% of the automation, and the remaining 30% is you triaging what you got — it narrows your scope of analysis. Awesome, thank you. So in this case it's pretty straightforward: we find all the calls to anything that looks like a malloc — or directly malloc itself — mark those as sources, and mark any argument of free as a sink.
Traditionally that should give us flows where one malloc reaches one free, meaning each malloc is cleanly paired with a single free, and we're good — because we're looking at data flows. But in this case, because there are two frees and we're tracking p, it gives us a data flow containing two frees. The moment you see that, we know the return of the malloc, tracked as p, reaches one free and then reaches another free. Two frees in a single data flow is a clear sign that there's more than one free. You could skip data flow and instead go all the way down to the control flow and check whether the same data goes into this free and into that free, but this is the simpler approach: just look at the data flows and see whether the data you're tracking — the return of the specific malloc you identified — hits more than one free along the way. The moment it does, we can say: yes, there's a double free. Pretty simple — but try it on some real use cases and see what comes out. Okay, so let's apply this knowledge to VLC. Since we already have the VLC graph, you can just run workspace and see that VLC is there. We'll open it again rather than generating the graph from scratch. I'll just copy this. So we open it, the graph is loaded again, and we can run similar queries on this graph too. VLC is ready. Let's do something very similar to what we just did, but on VLC. We define a source — I'll increase the font size a bit — which is the same as before, except we use the addition operator.
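The double-free hunt can be sketched as follows (a hypothetical query; as described above, the two-free pattern is spotted by reading the printed flows, and results need manual triage):

```scala
// Sketch: track allocation returns into free(); a single flow whose
// path passes through two distinct free() call sites is a
// double-free candidate.
def allocSrc = cpg.call(".*alloc.*")       // malloc/calloc/realloc-like
def freeSink = cpg.call("free").argument(1) // pointer handed to free

freeSink.reachableByFlows(allocSrc).p
```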
You could use any operator — or all the operators, with a call to any operator — but we're using addition because I'm only interested in the one condition where an addition happened. We do the same thing and mark all of this as a source. Nothing fancy; it's the same thing as before: find me all the mallocs in VLC where the first argument is a call to the addition operator. We're narrowing down what a source can be. Then, for all the locations where you had a malloc like this with an addition operator — this example shows a multiplication, but assume there was an addition here — try to find out whether this x is being used inside a memcpy. For that, you'd create a query something like this: cpg.call("memcpy") gives you all the memcpy calls; then all the memcpy calls where an argument is reachable by this source, which is all those mallocs whose first argument is a call to the addition operator. Run it and list out the locations. It's expensive and it's going to take time — it's going through all the memcpys, going through all the mallocs, and trying to link the condition between them. The condition is that the argument of the memcpy must come from a malloc call whose first argument is an addition operator. Does that make sense, or is it confusing? Let me explain it like this: imagine you had a memcpy where the first argument was this x, and the second argument was — is it source then destination, or destination then source? — something like that. In that case, a bad malloc is reaching a memcpy. That's what's happening.
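The memcpy query being run here is roughly the following (a sketch — argument handling and operator names follow Joern conventions; adjust to your version):

```scala
// Sketch: memcpy calls whose arguments are reachable from a malloc
// whose size expression contains an addition.
def badMalloc = cpg.call("malloc")
  .where(_.argument(1).isCall.name("<operator>.addition"))

cpg.call("memcpy")
  .where(_.argument.reachableBy(badMalloc))
  .code
  .l
```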
And we're trying to find those locations right now with this query, and it's running — it's running on the graph. Don't think it's stuck; it's just going through your whole graph. Question: what's the difference between the syntax and capabilities of where and filter? Where and filter, okay, yeah. where works on traversals and filter works on conditions. For example, if a condition is "this contains this" and it returns a Boolean true or false, filter works like that; where requires a traversal. Let me confirm this properly — we have it noted in the docs as well, docs.joern.io: when to use where, when to use filter. Filter steps — yeah, returns true. So filter is for Booleans, and where is where the traversal proceeds on non-empty results. Which means this is a traversal, because it goes through this call, takes the call — the underscore — and finds the arguments of the call. It's still traversing: it goes through the graph, and the node is kept when the traversal returns something. So you use where for that, and filter when you just have a Boolean true/false for a condition being met, like whether an argument equals something. Right — I imagine in most cases you can write the same query with both, just with slightly different syntax? Yeah, the syntax would change, but sometimes you have to use where because you want to continue the traversal. Right. Okay, so it ran for some time and found all the locations — all the memcpy locations inside VLC. The first argument of all these memcpy calls — okay, it's the source argument, actually — comes from mallocs that have an addition in their first argument. So all the mallocs with arithmetic operations are here, feeding these sets of memcpys.
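The where-versus-filter distinction might be illustrated like this (a sketch on a loaded CPG):

```scala
// filter: a plain Boolean predicate evaluated per node.
cpg.call.filter(_.name == "malloc").l

// where: a sub-traversal; the node is kept when the traversal
// yields at least one result.
cpg.call.where(_.argument(1).isCall).l
```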
There might be 10,000 memcpys in total, but these are the ones you'd want to investigate more deeply. Makes sense so far? To go deeper, you could define sources and sinks exactly as I said: mark as sources those mallocs whose first argument is an addition operator, mark as sinks the arguments of all the memcpys, and then find data flows. There will be a lot of data flows — one for each of these points — and you'll have to go through them, but at least you have something to go through, and it narrows the scope. You can narrow it further for each one as well. Someone asks: can you get the source location, so it's easier to see where to investigate? Of course. What changes is that instead of .code.l you could do .location.filename.l and things like that. But let's not go there — we'll make this easier for you so you don't have to type all this again and again. Let's move to the next module, which is scripting. What we'll do with the knowledge we've gained from this big query we've been writing — define a source, then define a sink — is stop repeating ourselves and create nice functions. The first one is a function where you say: okay, let me define a source which is a call to malloc with these arguments, run this query, and name the whole thing bufferOverflows. And bufferOverflows takes in the CPG — the CPG you already have — and then we have a function like this.
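The reusable function he's describing might look like this (a sketch — `Cpg` is Joern's graph type, and the body just reuses the earlier hedged query):

```scala
// Sketch: wrap the earlier query in a function so we stop retyping it.
def bufferOverflows(cpg: Cpg) =
  cpg.call("memcpy").where(
    _.argument.reachableBy(
      cpg.call("malloc")
        .where(_.argument(1).isCall.name("<operator>.addition"))))

// Shell usage: append steps such as .code.l to inspect the matches.
bufferOverflows(cpg).code.l
```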
Now all you have to do is call bufferOverflows, pass it the cpg, press enter, and it runs. But do note that the function does not include .code.l, so you'd append something like .code.l yourself, and it runs. The whole thing runs again and gives you the same output as before, because it's running on the exact same code base — I'm not going to run it again, but you can try it. You don't repeat yourself; you have functions, and you can press Ctrl+C to stop a traversal at any time. So with this information — now that we can create nice functions for ourselves and avoid repetition — let's create these functions. Okay, let's test it on VLC. The slide says to test it on VLC, but I'll test it with something more interesting, where I actually run it and filter for a given text. What you'll see is a location that looks something like this: a method called parse_text, and that's what this simple little function we created returns when we filter for a method containing parse_text. It returns a lot of these buffer-overflow locations, but we're only concerned with the parse_text one. And what we see here is this condition happening in real life in VLC. We dump the method so we can quickly read it. So this is happening in real life in VLC: a malloc with an arithmetic operation on its size, and this psz_subtitle being used as the first argument of a memcpy. A nice condition: if the value of p_block->i_buffer is ever UINT64_MAX, we'll have an overflow, because we're adding one to it there. So we looked at it, checked who was calling this, how this p_block->i_buffer was being populated. We went through the code and found out this one was okay — it was not a buffer overflow.
But then, a couple of weeks back, we found the exact same condition where it actually was causing a buffer overflow, and we reported it. This query can just run, and you'll find all the locations where this condition is met. Your scope is narrow now; you can hunt through all of them, and I'm pretty sure you're going to find bugs too. Not that I know of any — I don't know of any specific bugs as of now — but if you run this, you're going to find something for sure. Pardon? How are you testing this — are you then compiling it into a binary and doing full-on POCs? Yes, we have. Once we find these locations, they're pretty much straightforward — you cannot refute them: you have a flow, it's proven, and you can determine the value either by quickly auditing manually or with more queries. You can create data-flow queries saying: find me all the places where the p_block->i_buffer value is set and list them. Once you've seen all of them, you're pretty much sure there's a buffer overflow there, and what we do next is actually create a small POC. I'll show you what we did. If you go to joern.io — yes, this is the correct website — we wrote this up in the blog post where we found it, and this is the exact query you're seeing. It's very similar in nature; there's just one more check happening. We ran the whole query automatically, found many locations, and it generates a result like this. We even have a pretty-printer which tells you: look at this location — there might be a dangerous copy operation happening at this line. And we actually went and created a small POC. I think the POC is here — you can trigger it like this, and it actually causes a buffer overflow.
We reported it to them and they fixed it, so the next version of VLC, released a few days back, does not have this bug. These are real-life conditions, and it's not just us doing this. Forescout released this NAME:WRECK research, where they used Joern. They created this nice big script which runs automatically, and it gave them nine bugs out of the box. They didn't have to create POCs to find them, because it's a straightforward check for those bugs — I think they then followed up and actually built the POCs. What I'm trying to explain is that you can create these complex queries if you know the system well enough, and what we're doing right now in the workshop is enough to get started. Then you can go all the way down: go to the graph, work out the exact queries, match them properly, run them once on large code bases, and boom, boom, boom — you start getting results. Maybe they got 60 or 70 hits, triaged them, and found that nine were true. But still, on large code bases it's a very exciting thing to do. So... Right, someone in Discord asked a really interesting question — actually one of the questions I had when I first started learning Joern: what's the difference between reachableByFlows and reachableBy? reachableByFlows returns the flows, which you can print. If you do .l, it shows you the flow objects; if you do .p, it prints them in that nice format — not a graph, the table format you saw before. Yeah, something like this. So it gives you the flow object, which you can use to find out how the flow actually happened — this nice table — and you can print that flow.
reachableBy just tells you whether something is reachable or not, while reachableByFlows also returns the flows. We'll probably change the syntax a little at some point, so you can use reachableBy and then .flows.p to print the flows — that would make it a bit easier — but right now we have the two: reachableBy, which we used as a Boolean-style check here, and reachableByFlows, which returns the flows object containing the exact flow, so you can find out where it went, what the source was, what the sink was, and actually visually track it. That's basically the difference. You should get the same results. Yes, the results are the same. Actually, a while back I ran into an issue where, as you dedupe the results of reachableBy and reachableByFlows, you get different results, because they're deduping by different criteria. Ah, okay — a very, very interesting observation. What Vicky just mentioned is this: when we generate flows like this, in large code bases there might be multiple ways to reach a sink. This example was straightforward — a simple file, a straightforward flow — but what if there are multiple branches and multiple if-else statements on the way from a source to a specific sink? So internally we have a mechanism to deduplicate flows that are very similar in nature. We just remove them, because if there are three conditions all inside one if, it's straightforward — you are definitely reaching the sink — so we'd like to collapse those three. The deduplication logic is probably different for the two, so I'll figure that out and explain it to you.
The syntax for deduplication is .dedupBy — d-e-d-u-p — and it's very, very useful. You can do .dedup, or dedupBy with a criterion, and deduplicate multiple flows based on something like the source being different or the sink being different. Actually, this is basic Scala: if you have a list that returns many things, you can dedup it. For example, if three mains are returned by something and you only want to know whether a main is used here or not, you can just dedup and remove the duplicates in the list. What's the dedupBy criterion you can use? I don't even know — I've actually never used it before. Yeah, I've also never used it before. It takes a criterion — by name, by something else. Oh, nice. Let's find out — it's an interesting thing. Let me try to list those methods. We have this parse thing, yeah? So I'll do .name("parse_public.*").name.l, which should list the methods matching parse_public. Oh, what happened? Oh — the regex is incorrect. Okay, so we have two of them; there are no duplicates there. Oh yeah, nice — okay, we have many weird mains, and here they are. Maybe they're all in different files. Okay, now it's nice. So we have all these main methods. name.l just lists their names, but they may be in different locations, so let me look at their locations. location.file — no, it's not file — name? What's in location? Let's just leave it at location: there are all these location objects, each with a file name and a full name. So we can probably do filename — filename.l — which lists the file names for all the methods. So maybe I can do dedupBy(_.filename)? I don't know — let me try. Okay, still different methods.
Which means we don't have any duplicates for the same file name. Let's look at the size: it's 59, and without the dedup it's also 59, which means there were no duplicates in this list. But if there were duplicates, it would tell us. If you deduped by name instead of file name — since all the names are the same — you'd get only one result. Yeah, obviously, because we'd be deduping by its own name. Okay, so it removes duplicates from the list, and with dedupBy you can set a criterion. Okay, let's move on. Where were we? Yes, we were creating these nice DRY functions. What you can now do is go one level up: take that function and put it inside a file. We're not going to run this here; we're going to save it in a script, and I think I've already provided that script. Look at mytools.sc — you can create your own tools like this: one, two, three, four — nice scripts, all ready to go. Once they exist, you can load them directly: you do an import of the file, and I think I have mytools in this directory already, so it imports — it imports my script, my nice tool. Then I can run mytools.bufferOverflows and use it any way I want. If I run it, it gives me that whole list. So you can build this nice tooling with just these scripts and drop them straight in — nifty little tools. I was showing you how the Forescout people did it with NAME:WRECK: they put their queries into scripts like this and just run them on different code bases to get results.
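The dedup experiment above, condensed (a sketch — `dedup`/`dedupBy` per the Scala-collection-style API Joern's traversals expose; verify against your version):

```scala
cpg.method.name("main").name.l                  // every `main` definition
cpg.method.name("main").dedupBy(_.filename).l   // at most one per file
cpg.method.name("main").dedupBy(_.name).l       // same name => collapses to one
```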
Next, you can go one level above that and create scripts with the @main annotation. What this does — I think it's called bufferOverflows here, yeah — is define a function that gets executed when you use joern --script. I think Jay was already talking about that in the Discord channel: joern --script, you give it a path to the script, and the script runs. In this case we have an execute method — you can name it anything — annotated with @main, one main, and it takes an argument: a graph name as a string. Inside it you can do anything: open the graph, print it, generate PDFs, pipe it somewhere, call some API — because it's 100% Scala, you can use any other Scala library that makes API calls, take the JSON representation, put it in some other file, or do whatever you want. It's your own tooling — proper standalone tooling you can run on your own. You create scripts like this, and the way to run them is from outside the Joern shell, simply with something like this: you give the location of the script and you specify the parameters — say, use the graph param and give it the string value. What happens is it takes this "vlc" string value and runs open on the graph, assuming the graph is already there. If you don't have the graph — or if the string is the location of some code base — you can pass that code base instead, but then you'd call importCode: it takes the string, imports the code from that location, and then runs run.ossdataflow and all the rest, which creates the graph first.
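A standalone script of the kind described might look like this (a sketch — the names are illustrative, and the `--params` invocation follows Joern's CLI conventions; check your version's help output):

```scala
// mytools.sc — sketch of a standalone Joern script.
@main def exec(graphName: String) = {
  open(graphName)                       // assumes the CPG already exists;
                                        // use importCode(...) for raw source
  println(bufferOverflows(cpg).code.l)  // bufferOverflows defined earlier
}

// Run from outside the shell, roughly:
//   joern --script mytools.sc --params graphName=vlc
```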
Here we are assuming the graph already exists, but you could do graph generation, workspace management, everything from a script like this. It then runs, finds your buffer overflows, and generates a nice table, or prints anything else you like; again, it's 100% Scala, so you can do whatever you want. What I am trying to show you is that you can take a script and turn it into an external tool, a tool you can just run. Now assume you can create graphs of your code easily in a CI/CD system, say a GitHub Action. What you can then very quickly do is say that on each pull request you create the graph, run these small scripts that look for specific cases, and print whatever results come back. You put a single invocation with your script inside the CI/CD system, run it, and it quickly gives you results. That is how you move on from the manual query-based workflow: you perfect your query, turn it into your own tooling, and deploy that tooling in whatever system you want. That is the idea behind these standalone scripting tools. I will take a pause now, and Vicky will take you on the next part of the journey: building custom scanners with Joern. So far you have only worked with scripts, but you can do much more with Joern; you can build a complete scanner with it, and she is going to explain what those scanners are and what you can do with them. Thanks a lot, I'll stop sharing. Vicky, you can take over.

Yeah, let's take a five-minute break or so, and feel free to ask any questions in the meantime. When we come back, we'll talk about joern-scan, the scanner packaged with Joern that you can customize and extend.
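The CI/CD idea described above might look something like this as a pull-request step (all paths, script names, and the CPG file are hypothetical; only the `joern --script` invocation itself comes from the session):

```shell
# On each pull request: build the graph once, then run targeted scripts.
joern --script scripts/buffer-overflows.sc --params cpgFile=workspace/cpg.bin
joern --script scripts/format-strings.sc   --params cpgFile=workspace/cpg.bin
```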
Can y'all see my screen right now?

Yes, we can.

Okay, cool. So does anyone have any questions so far?

I've got a question. This may be incoherent because I've never really done this kind of source code analysis in an automated way before, but do questions of soundness emerge here the same way they emerge with, say, Z3, or with static binary analysis?

Can you elaborate a little on what you mean by soundness?

Soundness in the sense of false positives versus false negatives, and coverage.

Yeah, so there are concerns like this: situations where you are always worried, "have I covered everything?" With a binary, you can say that this is the truth that is going to execute for sure, but with source code we do not know whether it will even be compiled properly. And for static analysis in general, how much is it covering? In the dynamic case, we know exactly what got executed on the machine, and that is what we can analyze. In the static case, we may not know which pieces of code get executed, which get compiled out, which get optimized out. These questions will arise, and static analysis can only take you so far; it is not going to cover everything. But the benefit, I believe, is that it gives you an opportunity to find things very early on. In that sense it makes you more comfortable and more prepared. There will certainly be questions about soundness, if that is what you mean by it.
Yeah, it's a contested term, and I wasn't prepared to ask the right technical question there. I understand the limitations of static analysis, but I was curious what the particular dimensions of the problem are when you're working with these kinds of graphs.

Okay, I will actually lay down the problems, because I work with customers day and night. The main question they have, and that people in general have, is: is it going to give me too many results? That is the first question we get, because the static tools we see right now run over the weekend, give you 10,000 results, and those results are non-actionable. The thing we want to build with a tool like this is a workflow where you ask specific, targeted questions yourself; that is the direction we want to take. The other question concerns dynamic behavior: people say they do not know how the program will behave in real life, so should they really implement security checks in static form? And our answer is yes, you should, because you are catching things very early on, before they even show up in your system.

Thank you. Suchakra, do you know how Semmle works?

Yes. Semmle is actually very close to this. If someone asks me, I would say it is the closest tool to Joern, and to this graph technology, that exists at this time. It just does not use the same graph structures underneath, and its queries are written in its own DSL, which they defined themselves: CodeQL, the Code Query Language. What we use is Scala, 100% Scala.
So our queries are plain Scala with some sprinkling here and there. Someone says Semmle is Datalog-based; okay, so maybe CodeQL is Datalog-based. But the information you can gather from the code is the same with Joern as with CodeQL; it is just that the semantics are totally different.

Doesn't GitHub also have a graph query language for code?

Yes, they acquired Semmle, and that is how they got CodeQL; that is what they use.

Gotcha.

I have a few questions, if we are not starting just now.

Sure, go ahead.

What would be the best way to integrate Joern with complex frameworks? I am thinking of, say, WordPress in PHP, where you can have actions and hooks everywhere in the code that are called with custom strings, depending on whether a string matches something exactly. Would it be creating our own modified CPG to add that resolution, or a lot of custom methods to figure out what the paths are?

So the CPG is very, very flexible. If you have been able to generate the graph and you have some methods, you can start tagging and annotating those methods. You can even link them to other methods on the fly. You could create scripts that work with two or three graphs together and stitch them. This is something you can do, but it is all custom work; the APIs and mechanisms are there, it is just not out of the box at this time.

Awesome. Another question: how can we clean false positives out of the paths when there is data validation? For instance, with the buffer overflow example from before, suppose there was a check, either earlier in the code or even in the same method, that the value entering is not the maximum value of a uint64.

Yes. Very, very good question. I know exactly what you are asking.
If the data flow passes through one of these methods, how do you tell the tool "this flow is validated, so don't show it to me"? There is an API for that called `.passes`. Vicky is about to share, but I will give you a link about it afterwards. You had `sink.reachableByFlows(source)`, and on that you can call `.passes` and give it the name of a method. It will then check, across all the methods the flow passes through, whether the data flow is affected by that method. Since you define what your sanitization or validation methods are, you can use this to filter those flows out.

Awesome, thank you.

Let me check whether it is exposed in Joern; it should be in the docs. I will look quickly and write back on the channel. Okay, are we ready to start the next module? Suchakra, are you going to share anything for the open-source part?

No, nothing, you can go ahead.

Okay, sure. So let's now talk about how we can use Joern to build a custom scanner. Joern comes packaged with joern-scan, which is a code scanner built on top of Joern. It ships with quite a few pre-written queries that you can use to scan any code base you have. The syntax is `joern-scan` followed by the path to the project you want to scan. Let's just do the TACB thing that we did earlier. When you run joern-scan, everything is automated: joern-scan generates the CPG for you, runs the set of pre-written queries against the code base you are scanning, and finally prints the scan results to standard output. It can take a while to run... oh, that was very fast.
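Returning to the `.passes` filtering discussed a moment ago, a sketch might look like this. Note the hedges: the session left it as a doc check whether this API is exposed in open-source Joern, the exact signature may differ, and `validate_input` is a hypothetical sanitizer in the target code base:

```scala
// All flows from an unchecked read into a copy operation...
def source = cpg.call.name("gets")
def sink   = cpg.call.name("strcpy").argument
val flows  = sink.reachableByFlows(source)

// ...kept only when they pass through the named method; invert the check
// (a passesNot-style counterpart, if available) to drop validated flows.
flows.passes(cpg.method.name("validate_input"))
```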
You can see here that with this code base, joern-scan found quite a few issues. The format of each result is: the severity score, then the name of the query, then the file path, line number, and function name where the vulnerability was found. Oh, sorry, my computer seems to be hanging right now. Let's see what happens.

We can see you.

You can see it? Okay, I'm actually not able to do anything on my computer right now. Let's see... okay, we're back. So go ahead and try running joern-scan; it should already be installed if you have Joern installed. joern-scan also has quite a few options. The first is the update-DB option. joern-scan ships with a database of pre-written queries, and every time you run the update, it goes to the query-database repo on GitHub, checks whether any new queries have been written, and downloads them to your local copy. Here is the query-database repo; this is where all the open-source queries are stored, in the scanners/c folder under the source tree. You can see a lot of queries written by other people, and joern-scan uses these queries to scan your local code base. Another joern-scan option is overwrite, which overwrites the existing project CPG. To save computational resources, joern-scan does not regenerate the CPG for the same project every time you scan it; it reuses the existing CPG with the new queries. If you use overwrite, joern-scan generates a fresh CPG for you. This is best run after any major application change: whenever you change your application in a way you think will affect the query results, run joern-scan with overwrite. Finally, there is also an option to use tags with joern-scan.
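The options just described, gathered in one place. The flag spellings below are as recalled from the session, so check `joern-scan --help` on your install:

```shell
joern-scan path/to/project               # build the CPG (if absent) and run all queries
joern-scan --updatedb                    # pull new queries from the query-database repo
joern-scan --overwrite path/to/project   # regenerate the CPG after code changes
```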
Tags are a way to specify which queries you want to run, instead of running all the queries in the query database. For instance, if this time you only want to scan for buffer overflows, you can do that, or if you only want to scan for XSS, you can do that as well. If you don't specify a tag, joern-scan runs all of the queries in the query database for that language. Let's look at a sample query stored in the query database. This query finds dangerous calls to the function `gets`. `gets` is a dangerous function that can lead to buffer overflows because it reads from standard input with no bounds checking; it just keeps reading until a newline character is found. This query, written by Suchakra, scans for that dangerous function. You can see that it is very similar to how you write a normal query in the Joern shell. You put your query here, and the rest is basically markup: the name of the query, the author, the title, the description, the severity score, and then your tags. Once you upload it to the query database, it automatically produces results in this format. As I said, joern-scan ships with a default set of queries, called the Joern query database. It is stored in this GitHub repo, and you are welcome to submit pull requests to it. Every time a pull request is approved, anyone who runs the update gets the new queries in their local copy of joern-scan and can run them. To add your own queries, as this slide says, you clone a local copy of the query database and add your queries under the io/joern/scanners path.
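A sketch of what such a query-database entry looks like, modeled on the `gets` example just described (the field values are illustrative, and the exact builder API may differ between query-database versions):

```scala
@q
def callToGets(): Query =
  Query.make(
    name = "call-to-gets",
    author = Crew.suchakra,
    title = "Dangerous function gets() used",
    description = "gets() reads stdin without bounds checking; prefer fgets().",
    score = 8.0,
    withStrRep({ cpg => cpg.method("gets").callIn }),
    tags = List(QueryTags.badfn)
  )
```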
There are two folders there, one for the C queries and one for the Java queries. You can add your query directly in the right folder and then push it to the GitHub repo. Are there any questions for us so far?

At the same time, I posted a link to queries.joern.io, which contains a browsable version of the queries. You can look through it on your system to see how your queries will appear once they have been submitted. Vicky, when you add queries like this, do we need to install a new Joern, or can we work with the Joern that is already installed? I had that confusion before.

At this point in time, you would have to clone the query-database repo from this slide, because that repo contains all of the source code of the queries. It is easier to clone that repo, update the queries there, and reinstall the query database from there, instead of using the joern-scan that comes shipped with Joern.

Okay, I get it. So this is for development: you are developing new queries, so you use the query-database repo, develop the queries there, and then send a PR. And the joern-scan you get in this step, in module five on the slide you are seeing, is basically a locally installed Joern just to test out the new queries you have. Yeah?

Right. I think I was being unclear earlier. The joern-scan that comes shipped with Joern includes a set of queries from the query database that are already installed. But if you want to try building new queries and updating the query DB, you actually have to download the query-DB repo and update the queries from there.

Yeah, okay. So what we can do right now, I think, is open the floor for any Q&A you have. You can create your own queries and submit PRs; do try this.
You can also browse the queries that are already there, and if you have new ideas, please submit them to us. There was a question about which additional languages are being supported, and Niko is already here. Basically, binary and PHP support are in the pipeline right now. Java is already at a decent stage: just as you were doing `importCode` on the C source code, you can import a jar or war file, or compiled Java, directly. If you have a jar or war file ready, you just run `importCode` on it, it imports the file, and Java analysis begins.

Does this mean you support all JVM languages?

To an extent. For example, I have even submitted APKs, Android applications, and it decompiles and converts them; although if they are obfuscated, you do not get much out of it. And Scala: I have seen it working, but the engine is not stress-tested there. The Java support is the new development here, and it was added by David, who goes by Big Data Dave on the channel. If you have any Java-related questions, just ask him; he fixes things very quickly. He has plans to support the other JVM languages, like Kotlin, but those are not tested yet.

I have actually tried analyzing a few Java applications with Joern, and it worked pretty well. We ran into some issues with CPG generation, but I think those are all sorted out now.

Okay, nice.

I have seen in the technical-support channel that there is js2cpg, which is commercial, which brings me to the question: what is the plan for open source versus commercial? Will you have artificial limitations or restrictions, like proprietary parsers and generators?

No, no. The thing with open source is the whole open-source effort you see with C and Java.
Even though we are part of ShiftLeft, this is all done in the community; there is no internal involvement as of now. The C version existed well before the company we belong to was even started. It has been around since a research project, so it was always open. The Java version is developed by a student at Stellenbosch University, David, and he based it on Plume, an independent system he developed himself, which decompiles via Soot and then generates a graph adhering to the CPG spec. Similarly, another project is going on in Germany right now, and I do not know exactly which universities and industry partners are involved, where a PHP parser is being developed. It is all done independently. So if you, or a group you are in, want to build a JavaScript-to-CPG frontend in the open, start doing it and we will support you. It would be separate from the js2cpg and other tooling that is done internally, and we do not know what the plan for open-sourcing those is at this point.

Awesome, thank you.

Okay, shall we keep the floor open for another ten minutes, or close the session now? What would you suggest?

Maybe we can close it; the channel is always open. If you are analyzing anything right now, stick around and show what you have done. If you want to try some Java projects, please find bugs and file them. And just to remind everyone, this is a community effort, so things will come out when they come out. There is no timeline or plan, no commercial timeline associated with this; it is just community. If you have a nice project, you are working in a lab, and you find something super cool, just come to us, build something, and we will help you.
Yeah, I'm just wondering if anyone wrote any interesting queries during the workshop that they'd like to share. If not, Suchakra, let's just close the session then.

Yes, let's close the session. We'll carry on in the Query Corner.

Before you all head out, I just wanted to unmute and say thank you. I appreciated being at this, and it was really, really interesting. I can't wait to test the tool on some code bases I have an interest in, and I look forward to getting on Gitter and asking tons of questions like I did today. Thanks for putting this out there.

Okay, it's not Gitter anymore, it's Discord; the Gitter channel has been discontinued.

Gotcha, sorry, thank you.

Okay, thanks, Daniels. Very nice to hear these words.

Yeah, thank you. And feel free to ask any questions in the joern.io Discord as they come up; we're always hanging out there.

Okay then, bye folks. We'll close this out now, I think. Abdel, if you're here, you can close the stream. Thanks Hugo, and thanks to NSEC for hosting us. Bye.

Yeah, thank you.