Hello and welcome to this presentation from the GitHub Security Lab. My name is Joseph and I've had my dream job here at GitHub for the past four months. I say this because I'm very passionate about our mission, which is to inspire and enable the community to secure the open source software we all depend on. My part in that mission is to make security easy for developers, and this is why I'm here today: to use these 30 minutes to give you the superpower of securing your code like NASA did. This is not science fiction or a Netflix scenario. Ten years ago, while NASA's Curiosity rover was on its way to the surface of Mars, NASA engineers performed a code review mid-flight. They wanted to check the software responsible for opening the parachute of the Curiosity rover during landing. And that was when they found a bug. The snippet on the slide, which is written in C, is not the actual code, but a fair description of what was really happening. The little bug that NASA engineers discovered was that the function signature in line one expected an array of 12 elements, but an array of three elements was passed as an argument in line eight. This means that the loop in lines 2 and 3 will read the correct memory locations only for the first three elements, and then it will go out of bounds, leading to unpredictable behavior. The NASA team found out that this would prevent the parachute from opening during the landing phase and lead to the crash of the Curiosity rover. The smart thing that NASA engineers did was that they didn't just fix that vulnerable instance; they wrote a generic CodeQL query to look for variants of that vulnerability in all their code bases. What we call a variant is another occurrence of the same bug pattern at another place in the code, and they found 30 other variants. Their analysis showed that some of those would also result in catastrophic consequences, such as the crash of the rover.
They fixed all the variants and Curiosity was able to land safely on the surface of Mars. If we now look at the bigger picture and compare this scenario of fixing a bug mid-flight to that of fixing a bug in production, then I'm sure you will agree that it is very late in the software development lifecycle. Last year NASA sent another rover to Mars, but what did they do differently this time? They shifted security left by integrating CodeQL at the very beginning of the software development lifecycle by using GitHub. In two clicks, you can enable code scanning with CodeQL and get alerted about security vulnerabilities in your code. CodeQL is free for open source and you can benefit from the continuously growing query set contributed by GitHub, by the community, and by top security teams like NASA's. That's already nice, as you can consume the world's security knowledge to secure your code, but in today's presentation I will show you how you can not only consume but also take an active role in your use of CodeQL. Let's now introduce security as code through the lessons learned from DevOps and quality assurance. One of the main levers of DevOps adoption was the introduction of infrastructure as code, where developers use code to set up their own infrastructure without the need to open tickets for operations teams. The fact that developers were writing code empowered them with further benefits, such as reading, contributing to, and understanding what they were doing. The same goes for the world of testing. In the pre-agile days, developers and testers belonged to two separate teams. QA would find the bugs and report them back to devs. Nowadays, this methodology would not resonate with the vast majority of developers, as they have progressively taken ownership of the automation around code testing. So we believe that what worked for testing and operations should inspire us for security.
With security as code, we expect the security experts of an organization to codify security knowledge that is then shared with developers in a form that is both readable and executable. This sharing helps developers read, understand, and contribute to the code, which fosters a security culture. Therefore, think about security becoming a seamless observer of day-to-day DevOps that doesn't intervene or affect DevOps speed. Security as code will be integrated and automated in the pipelines so that every time a security-related violation exists, actionable feedback will be generated. By the way, you hear more and more people talking about DevSecOps nowadays, right? The vulnerability covered by our demo is an SQL injection, or "sequel" injection. It depends where you're from. And I just want to introduce it here for those who might not be familiar with it. As per the meme on screen, this happens when a user is able to execute arbitrary queries on a database using SQL. The root cause of this bug is insufficient or missing input sanitization, allowing users to execute whatever database operation they want. The backend of the software in the meme will be similar to this. The text in purple represents the user input, which is processed without being sanitized. The query at the top of the slide inserts the name Clio into the database. But the second example shows that by passing user input that contains the parenthesis and semicolon, a user is able to trick the backend into creating a second query that deletes the Students table once and for all. The third line, with two dashes, comments out anything that the backend code would add after the user input. This is a very simple example, but in real life, user input flows through different places of your code base, through files and functions, before reaching the SQL execution. In our demo, we will build a query that automatically finds SQL injections in those complex scenarios.
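The injection described above can be reproduced with plain string concatenation. The following is a minimal sketch, not the code from the slide: the class name, the method name, and the exact table and column names are assumptions made for illustration, and no database is involved; the program only prints the SQL text that a vulnerable backend would send.

```java
// Minimal sketch of the vulnerable pattern: SQL built by string concatenation.
// The class, method, table, and column names are assumptions for illustration.
class SqlInjectionSketch {

    // Builds the statement the way a vulnerable backend might: the user input
    // is pasted into the SQL text without any sanitization.
    static String buildInsert(String userInput) {
        return "INSERT INTO Students (name) VALUES ('" + userInput + "');";
    }

    public static void main(String[] args) {
        // Benign input: a single statement, as intended.
        System.out.println(buildInsert("Clio"));

        // Malicious input: the quote, parenthesis, and semicolon close the
        // intended statement, a second statement follows, and the two dashes
        // comment out whatever the backend appends after the input.
        System.out.println(buildInsert("Clio'); DROP TABLE Students;--"));
    }
}
```

With the second input, the printed text contains two statements, and the trailing characters that the backend added end up behind the comment marker. The usual remedy in Java is a parameterized query (for example java.sql.PreparedStatement), which keeps user input out of the SQL text altogether.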
Just before the demo, let's define two important concepts for our dataflow query: sources and sinks. Sources are places in the program that receive untrusted user input, for example, a field in a web form. Sinks are places in the program where something malicious can happen if the malicious input eventually reaches them. In our example, the sink is the place where the SQL query is executed. The question we need to ask is a dataflow one: does this untrusted data ever flow to the point of executing a potentially vulnerable action? We can answer this question by identifying all paths from sources to sinks using CodeQL. Notice that CodeQL allows users to query code in general, not necessarily for vulnerabilities. You can use it for any type of bug or just to explore your code. We try to make these queries generic to find variants of vulnerabilities, like NASA did. And the biggest benefit you get is that you will now be able to codify your knowledge of a whole security bug pattern in an expressive query language. CodeQL is declarative and logical. Declarative means that we describe what to find, not how to find it. Logical means that you will use operators like and and or to define conditions about your domain. CodeQL is object-oriented, taking advantage of features such as encapsulation, inheritance, and composition. However, as a total beginner myself a few weeks ago, the feature I found the most useful to get started with was the existence of a rich set of standard libraries with reusable logic, which made it quicker for me to be productive. For example, there are templates to use, like the one for dataflow analysis that we are going to see in our demo. Our demo is designed for total beginners. And while our examples are in Java, you don't need extensive experience with Java either, as what I will show you is transferable to other languages that are supported by CodeQL, such as JavaScript, Python, Go, Ruby, C, C++ and C#. Now, let's move to our demo.
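Before the demo, it may help to picture the source-to-sink question as a path search in a graph. The sketch below is a toy model under stated assumptions, not how CodeQL is implemented: the node names and the "flows to" edges are hypothetical, and the search is an ordinary breadth-first search that returns one path from a source to a sink if such a path exists.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the source-to-sink question. Node names and edges are
// hypothetical; this is not how CodeQL works internally.
class SourceToSinkSketch {

    // Breadth-first search over "data flows from A to B" edges. Returns one
    // path from source to sink, or an empty list when no path exists.
    static List<String> findPath(Map<String, List<String>> flows, String source, String sink) {
        Map<String, String> cameFrom = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(source, null); // the source has no predecessor
        queue.add(source);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            if (node.equals(sink)) {
                // Walk the predecessor links back to the source.
                List<String> path = new ArrayList<>();
                for (String n = node; n != null; n = cameFrom.get(n)) {
                    path.add(0, n);
                }
                return path;
            }
            for (String next : flows.getOrDefault(node, List.of())) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, node);
                    queue.add(next);
                }
            }
        }
        return List.of();
    }

    // Hypothetical flow edges: a web form field reaches the SQL execution,
    // while a value read from a config file only reaches a log message.
    static Map<String, List<String>> exampleFlows() {
        Map<String, List<String>> flows = new HashMap<>();
        flows.put("webFormField", List.of("username"));
        flows.put("username", List.of("loginArgument"));
        flows.put("loginArgument", List.of("queryString"));
        flows.put("queryString", List.of("sqlExecution"));
        flows.put("configFile", List.of("logMessage"));
        return flows;
    }

    public static void main(String[] args) {
        Map<String, List<String>> flows = exampleFlows();
        // Untrusted input reaches the sink: the path is the finding.
        System.out.println(findPath(flows, "webFormField", "sqlExecution"));
        // No flow from this node to the sink: nothing to report.
        System.out.println(findPath(flows, "configFile", "sqlExecution"));
    }
}
```

In this model, a non-empty result plays the role of a reported finding, and the listed nodes play the role of the steps shown when exploring a path in the results.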
So this is VS Code, and I have the CodeQL extension installed. Let's first check our vulnerable code base, which is the intentionally vulnerable Security Shepherd from OWASP. We have an SQL injection vulnerability in a mobile app. The source is in lines 98 and 99, where the program receives a username and a password for user authentication. There is no sanitization happening, so the username and password variables are able to maliciously alter our database, like we've seen in our meme. And where this happens is in line 147, where we have the rawQuery method accepting a query. So in that line, the first argument is essentially our SQL execution. If you're familiar with SQL, the structure of CodeQL may look familiar. We have the import clause at the top that allows us to reuse logic defined in other libraries, in this example, the Java standard library. Then we have the query clause that describes what we are trying to find. It is normally made up of three parts: the from, where, and select. Let's first start with two of them, the from and select, before adding the where in a moment. From specifies the variables that are going to be used in the query. Every declaration in the from clause has a variable type, like MethodAccess here, and a variable name, like call here, while select specifies what the result should be by referring to the variables above. As per our SQL injection explanation, we need to arrive at those methods or functions in the vulnerable code base that receive user input. How do we do this? We first need to get the set of all methods in the program and then filter only those that receive user input. In the CodeQL Java library, to find method invocations, we can use the type MethodAccess in line three. And then we can use a variable that I called call here. You can use any variable name. And if we run that, we are expecting CodeQL to provide us with all method invocations in the program.
So if I click here, for example, we have the makeText function being called. And if I click here, we can see where the show function is being called. But the problem is that these are all the functions. We just need to arrive at those that receive user input. How do we do this? By using the where clause. So by using the where clause, I'm going to search specifically for those methods that receive user input. And to do this, I'm going to use my variable from above and the function called getArgument, sorry, getMethod, because we are looking for methods, followed by hasQualifiedName in order to get the specific method that I'm looking for. Look how I'm making use of the auto-completion and how the inline doc helps me find the right methods so that I can be productive with CodeQL. Inside the where clause, we can also see the object-oriented nature of CodeQL, because getMethod is an operation provided by the type MethodAccess, which through chaining provides further options, for example, to look for a function with a specific name. And this is another feature that CodeQL brings on top of SQL, which is expressivity with chaining. We can see from the signature of hasQualifiedName that it expects a package name, which is android.widget in this case because the vulnerable code base I'm using is based on Android, then a type, which in this case is going to be EditText. Everything I use is visible in the code base, which is going to be your code base when you use CodeQL for your code. And finally getText, because getText is the method that receives user input, and we are interested in that method because it's the source of an SQL injection vulnerability. So if we run this, we arrive at the instances of getText in our code base. So far what you see is like a grep, Command-F, Control-F, but the true power of CodeQL is going to be visible in the dataflow. So let's continue towards that. Let's now move to sinks.
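As an analogy only, the from, where, and select steps so far can be pictured in Java as filtering a list of method calls. This is a sketch under stated assumptions: the Call class, its hasQualifiedName helper, and the sample data are hypothetical stand-ins written for this example and are not part of the CodeQL Java library; only lines 98 and 99 come from the demo, the other line numbers are made up.

```java
import java.util.List;
import java.util.stream.Collectors;

// Analogy only: "from / where / select" modelled as a filter over a list of
// method calls. Call and its sample data are hypothetical, not CodeQL types.
class FromWhereSelectSketch {

    // One method invocation: package, type, and name of the called method,
    // plus the source line where the call appears.
    static final class Call {
        final String pkg;
        final String type;
        final String method;
        final int line;

        Call(String pkg, String type, String method, int line) {
            this.pkg = pkg;
            this.type = type;
            this.method = method;
            this.line = line;
        }

        boolean hasQualifiedName(String p, String t, String m) {
            return pkg.equals(p) && type.equals(t) && method.equals(m);
        }
    }

    // "from": every call in the program (hypothetical sample data).
    static List<Call> sampleCalls() {
        return List.of(
            new Call("android.widget", "Toast", "makeText", 40),
            new Call("android.widget", "Toast", "show", 41),
            new Call("android.widget", "EditText", "getText", 98),
            new Call("android.widget", "EditText", "getText", 99));
    }

    // "where" keeps only the calls to the named method; "select" returns them.
    static List<Call> select(List<Call> calls, String pkg, String type, String method) {
        return calls.stream()
            .filter(c -> c.hasQualifiedName(pkg, type, method))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (Call c : select(sampleCalls(), "android.widget", "EditText", "getText")) {
            System.out.println(c.type + "." + c.method + " at line " + c.line);
        }
    }
}
```

A filter like this is the "grep" part. It says where user input enters, but nothing about where that input ends up, which is the question the dataflow step answers.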
To find sinks, we can use the same strategy, with the difference being that we are looking for a different method in a different package, like we do in line four. As we saw, rawQuery takes two arguments. Let me show you again, line 147 here. rawQuery takes two arguments, of which only the first is of interest to us because it's the sink; it is where the SQL injection is going to be executed. So it's where the vulnerability is going to explode and cause problems in our database. We can do this by using another type, which is going to be the expression type. And I will use a variable called argument here in order to arrive at the very first argument of the rawQuery method, followed by the logical operator and in order to impose even more restrictions on my variables. And here I can impose this restriction by saying that I'm only interested in the first argument. As we are developers, the first one is at index zero. If we run this, we expect to arrive at the rawQuery instances that have the first argument defined. Now, let's continue with dataflow. If I zoom a bit, cool. So far, we've defined how to find sources and sinks. Let's now move to the dataflow functionality of CodeQL, which is what is going to provide us with confirmed SQL injection findings. Luckily, the language comes with a rich set of standard libraries that have ready-made templates we just have to fill in, like the one in front of us. At the top of the file, we have some metadata that will help CodeQL understand what we are trying to do. Ignore it for now. We then import the TaintTracking library, which is a template configuration to track untrusted user input, followed by the DataFlow PathGraph library, which is all about the visualization of results at the end. We are defining a class here, on line 11, to help us out, as the TaintTracking configuration is boilerplate.
So this class is extending something, to help ourselves with inheritance, composition, and the expressiveness of the language. And this is actually an example of how users can benefit from extensibility. Through classes, the expressiveness of the language is highlighted. Another important feature on top of SQL is code reusability with predicates, like in lines 14 and 18. Predicates provide a way to encapsulate portions of logic in a program so that they can be reused. Think of them as the functions of CodeQL. What's important is how we are going to define the isSource predicate and the isSink predicate. We just have to override them using the code that we have already written before, the code that we've written here and here, which we are essentially going to transfer into the boilerplate. Just before filling in the predicates, let's briefly talk about this idea of dataflow being represented by a graph with nodes, so that when there's flow from one node to another node, you know that these two nodes are connected. And I'm saying this because the two predicates here, the one in line 14 and the one in line 18, namely isSource and isSink, take a variable of type node as input. And this is exactly to find out whether there exists a dataflow from sources to sinks, so that if one exists, we know that we have a confirmed vulnerability, which is critical because it's an SQL injection. We can now fill in the predicates with the code we have just written. I'm going to use the exists keyword here to introduce you to that as well. And I'm going to explain how we can read that. So far, we can say that there exists a method call, held in the variable call, such that if that method call is specifically getText, we have untrusted user input entering with the potential of being malicious. Let's use the code from before here, copying and pasting really.
So the way that we can read this is that there exists a method call such that, when that method is specifically getText, we know that we have untrusted user input entering our code base. And we know that when untrusted user input enters our code base, there's the potential for it to be malicious. So that was the source. Let's continue with the sink with the exact same strategy. There exists a method call such that, when rawQuery is called with an argument at index zero, you know that you have the sink. Let's copy and paste again. So it should become the node expression. Okay. So what I've done was just fill the two template placeholders with the code we've used before. And if we run that, we expect CodeQL to tell us if we indeed have SQL injections happening in our code base. So let's analyze what this says. If we click on the first SQL injection finding, we have two pathways to explore. In the first one, we have the username, typed by the user, being passed into the code base, then passed into the login method, followed by the definition of the login method, before being executed in the database. And here we have another pathway for the same variable, which is again the username variable. This time, instead of going through line 102, it went through line 116. So observe here line 102 and line 116. I'm highlighting them both. You can see that in the first pathway, we have something equaling true, while in the other one, we have something equaling false. But those two pathways both lead to an SQL injection vulnerability. And if we have a look at our other occurrence, we can see that instead of the username variable, we have the password variable, which follows the exact same pathway as the username. Sometimes the path from user input to the actual SQL injection can be very long, with more than 10 steps across several files, functions, different places of the code base, and libraries.
Imagine how difficult it would be to find those manually. Let CodeQL do it for you. Back to our presentation and the final slide. Finally, you can start your CodeQL journey by visiting the following URLs, which are also shared with you in the description of this stream. And here we are. Everyone, thanks for your time.