Hi, everyone. My name is Xavier René-Corail. As you can hear from my accent, I'm French. Today we'll talk about security as code. I'm the senior director of the GitHub Security Lab. What we do at the GitHub Security Lab is help secure open source software, all open source software. Our core activity is security audits of open source projects. We find vulnerabilities in these projects, we disclose them to the maintainers, and we help the maintainers fix them. And we do that for any open source project, not only those that we use at GitHub, not even only those that are hosted on GitHub.
But of course, we cannot secure open source by ourselves. So, being at GitHub, we leverage the power of the community. How do we do that? One, we educate the community. We share our security techniques and our findings with the security research community, and we also try to educate the open source community: we give them secure coding tips and security best practices. We also host office hours for maintainers to answer their questions and help them adopt a better security posture. Two, we amplify security research. We do variant analysis: whenever we find a notable CVE or security vulnerability, we try to encode its pattern as a CodeQL query, and we run that query on open source projects to find other instances of the vulnerability. And finally, we notify the ecosystem. We are a CNA, so we assign CVEs for security vulnerabilities in open source. And we also curate the GitHub Advisory Database, which is a free and open source database of security vulnerabilities in open source.
What I want to do today is convince you of the benefits of security as code. I will first introduce the concept, then show you what it looks like concretely with CodeQL. And then hopefully we'll have time for questions afterwards.
My story begins far, far away, on planet Mars, 10 years ago. Meet the rover. It's, well, Curiosity, sorry. It's a rover developed by NASA JPL, and 10 years ago this small rover was en route to Mars. Well, not very small: here you can see a few humans next to the rover, which gives you its real size. What happened, once it was already on its way to Mars, is that the NASA JPL engineering team found a bug in the piece of code in charge of the parachute during the landing phase. So, a pretty critical piece of code.
Let's dig into this bug. This is not NASA's real code. This is pseudocode that I've written for you to explain what it's about. In C, if you declare a function parameter as an array of doubles of size 12, as here, nothing prevents you from calling this function with an array of a different size. This is what's happening on line 8: we are calling this function with an array of doubles of size 3. What happens, if you look at line 3, is that while this function runs, the program will access memory beyond the allocated space, and the result will be unpredictable. You don't know what's in that memory.
So they found this bug. They checked that it was not very harmful. But they thought: what if this bug is happening in other places in our software? After a quick code review, they found another place where it was happening. So they said, OK, hold on, we need to find all instances of this bug, and we will use an automated static analysis tool to do it. And so they used CodeQL to find all instances of this pattern in their code.
This is an example of that query. If you look at the third line, you see that we are looking for the argument in position i, and that this argument is an array of size a. And then we are comparing this to the parameter of the function declaration in the same position i.
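The pattern described so far, a call site passing an array smaller than the one the function declares, can be sketched outside CodeQL as a query over a tiny hand-built model of the code. The Python below is only an illustration under stated assumptions: it is neither NASA's code nor CodeQL, and the names `Param`, `Arg`, `undersized_array_args`, `deploy_parachute`, and `landing.c` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    """A declared array parameter (hypothetical model, not a CodeQL class)."""
    function: str    # name of the declared function
    index: int       # parameter position i
    array_size: int  # declared array size b


@dataclass(frozen=True)
class Arg:
    """An array passed at a call site (hypothetical model)."""
    function: str    # name of the function being called
    index: int       # argument position i
    array_size: int  # size a of the array actually passed
    location: str    # where the call appears


def undersized_array_args(params, args):
    """Return the arguments whose array is smaller than the declared parameter."""
    declared = {(p.function, p.index): p.array_size for p in params}
    return [
        a for a in args
        if (a.function, a.index) in declared
        and a.array_size < declared[(a.function, a.index)]
    ]


# Made-up facts mirroring the talk: a parameter declared with 12 doubles,
# one call passing 3 doubles, and one call passing the full 12.
PARAMS = [Param("deploy_parachute", 0, 12)]
ARGS = [
    Arg("deploy_parachute", 0, 3, "landing.c:8"),
    Arg("deploy_parachute", 0, 12, "landing.c:20"),
]

findings = undersized_array_args(PARAMS, ARGS)
for finding in findings:
    print(f"{finding.location}: array of size {finding.array_size} is too small")
```

The point of the sketch is the shape of the check: match argument and parameter by position, then compare the two sizes. The model says nothing about how those facts are gathered from real C code.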
And you see that it's also an array, of size b. We compare the two sizes and check that a is smaller than b. We have coded the pattern of our bug.
So they ran this query on their code, and they found more than 20 instances of the same bug. A handful of these instances were critical and would have caused a crash. And here I'm talking about a physical crash of the rover on the surface, not a software crash. So they fixed that and, of course, they deployed the code remotely to Curiosity. Curiosity was already on its way to Mars, but they deployed remotely to the rover. And the rest is history.
Fast forward nine years: NASA sent another rover to Mars, Perseverance. Here again, they used CodeQL in their security testing routine, and Perseverance also landed safely on Mars. But they did things a bit differently this time. What did they do differently? No more late code reviews, no more late security testing, no more patching in production: they included all of that in their software development life cycle.
As much as we would all love to follow NASA's example and shift security left to include all of that in the development life cycle, some organizations apparently struggle with it. In a recent survey about the state of DevSecOps, 43% of respondents said they were frustrated that security testing was done late in the software development life cycle. So what can these companies do to effectively include security testing in their SDLC? I propose that you look back at the success of DevOps and take the lessons learned from it. How did DevOps succeed in being deployed in organizations? I think the key success factor is empowering developers. You need to empower your developers to adopt these practices, right? And to do that, you need three things.
In his book Drive, Daniel Pink says that every one of us, to be motivated to do something, needs three things. Well, in fact, we first need to be paid enough, and once that is off the table, we need three things: autonomy, mastery, and purpose. This is what drives our motivation. What would that mean for security testing? Let's see.
Autonomy would mean that you're in control. As a developer, you can run your tests when you want, and you can act on the results. The opposite would be that another team runs the tests for you, creates a bunch of issues, and sends all the issues to you. I remember a former colleague telling me, oh, wow, my security tool finally moved from generating a PDF to integrating with Jira, developers will love it. Well, as much as we can all agree that moving from PDFs to issues is an improvement, developers didn't love it. They were still not autonomous.
Mastery is being able to learn in the process, so that you are acquiring a new skill. For security testing, that would mean that developers learn about security practices during the testing process, so that they do not repeat the same mistakes in the future. The opposite would be that the expertise stays within the security team and doesn't benefit the developers.
And finally, purpose. The purpose of a developer is to create delightful software for their users, and we all know that means high quality and high security. Developers should be able to relate what they are doing to this purpose. They should be able to say, oh, okay, I know why I need to fix this thing, I can relate it to this purpose.
But if you just do what you're told, hey, fix this thing, you don't have this relation, and it doesn't work. So yeah, I think that empowering developers is a key success factor for deploying DevSecOps practices. And how do we empower developers? Well, we give them code. This is what they do, right?
Let's look at some examples from the past. When I needed to deploy practices such as functional testing with developers, what worked was when they were able to code the tests themselves with tools like FitNesse and Selenium. With DevOps, when we wanted to include deployment testing and probability testing, what worked was infrastructure as code. This is really what clicked and made them adopt these practices.
So why not security as code? I took this definition from the web, but basically what I want is for developers to be able to code their security testing, their security checks. And with that, you get automation, repeatability, reusability, and documentation. Same as infrastructure as code, but with security. You get all of these benefits.
And this is where I introduce CodeQL. CodeQL is a way to code your security checks. It's a SAST tool, a static application security testing tool, which means that it runs static analysis on your code. It does not run your program. QL stands for query language. It's kind of SQL for code: it queries your code as if it were data. With this language, you describe what to find and not how to find it, so it's very expressive. It's a logical, declarative language based on Datalog, and it's also object-oriented, which is super useful for reusability. We will see that later.
In a bit more detail, CodeQL works in two phases. In the first phase, it extracts your code into a relational database. It extracts all aspects of your code, like the abstract syntax tree, some semantics, and even the control flow graph.
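The idea of extracting code into a relational database and then querying it as data can be sketched in a few lines of Python, using only the standard `ast` and `sqlite3` modules. This is a toy under stated assumptions, not how the CodeQL extractor works: the schema (`function` and `call` tables), the `extract` helper, and the sample `SOURCE` program are all hypothetical, and calls are matched to functions by name only, which a real analysis would not rely on.

```python
import ast
import sqlite3

# Hypothetical sample program to analyze (it is parsed, never executed).
SOURCE = '''
def log(message):
    lookup(message)

def lookup(name):
    return name

log("hello")
'''


def extract(source: str) -> sqlite3.Connection:
    """Phase 1: walk the syntax tree and store facts about it in tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE function (name TEXT, line INTEGER)")
    db.execute("CREATE TABLE call (target TEXT, line INTEGER, first_arg TEXT)")
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            db.execute("INSERT INTO function VALUES (?, ?)",
                       (node.name, node.lineno))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Keep the source text of the first argument, if there is one.
            first_arg = ast.unparse(node.args[0]) if node.args else None
            db.execute("INSERT INTO call VALUES (?, ?, ?)",
                       (node.func.id, node.lineno, first_arg))
    return db


# Phase 2: query the code as data. The join binds each call to the
# function it targets and reads the call's first argument.
db = extract(SOURCE)
rows = db.execute(
    "SELECT f.name, c.line, c.first_arg "
    "FROM function AS f JOIN call AS c ON c.target = f.name "
    "ORDER BY c.line"
).fetchall()

for name, line, first_arg in rows:
    print(f"line {line}: call to {name} with first argument {first_arg}")
```

Writing the join by hand in SQL is exactly the kind of plumbing a dedicated query language can hide behind classes and member predicates.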
And then it gives you an optimized object-oriented language that abstracts SQL to query this DB. On the diagram on the left, you can see these two phases in a bit more detail: extraction, and then query.
Now, let's have a look at the first example. What happens under the hood is that in the DB, a table named Function has been created. In this table, you have a column called Name, and all the function declarations are stored in this table. CodeQL created a class on top of that: a class Function that maps to the table called Function. In this class, you have a member name, accessible with getName, that maps to the Name column.
If you look at the second example, you'll see that we are able, like in SQL, to do a join between two tables, here the functions and the function calls. This join is made super easily with the line "and c.target equals f": we are binding the target of the call to the function. We can also directly access the first argument of the call with c.getArgument. So you see that it abstracts SQL and gives you a language that is more expressive, something that you can read almost as plain English.
With CodeQL, you have support for all of these languages. C# is also supported but is missing on this slide, and Swift and Kotlin are in beta right now. Each language needs a bit of work to support, because the CodeQL team needs to design the optimal data schema, design the extractor that transforms your code into a DB, and provide the fundamental libraries, like the control flow graph. So each new language requires a bit of work. It's not one size fits all.
Then, once you have the CodeQL queries, which are open source, you have two possible paths. You can just consume them: the queries are there, open source, and you can just run them on your code. Or you can be a writer.
If you choose to consume, well, this is super easy. You can benefit from these queries immediately, because if you are an open source project, it's free for you. It's free for all open source projects, not only the ones hosted on GitHub. If you are on GitHub, though, you can also include this in your CI/CD for free. And it's one-click easy: you go to your GitHub project and you enable CodeQL. Once you do that, CodeQL runs on your project automatically.
For example, CodeQL will analyze all your new pull requests and comment on them. Here, you have new code coming in, and a CodeQL analysis telling you that there is a possible client-side cross-site scripting (XSS) vulnerability in it. As you can see, it really acts as if you had a peer reviewer telling you, hey, I think there is a problem here in your code. It doesn't change anything about what you usually do. It's included in your software development life cycle. You can also get these alerts in your IDE and code the fix immediately.
If you click on Show More Details here, you get a detailed explanation that educates you about this vulnerability, along with remediation advice. So as a developer, you are able to act on it immediately and autonomously. And yes, this documentation is also customizable because, as I said, the queries are open source. You can adapt them to the particular case of your code, your project, or your organization, so that your developers really do the right thing for this particular vulnerability.
And now, if you are a writer, I've got a demo for you. Imagine that you're the open source maintainer of a popular Java library, let's say Log4j, completely at random. And you have heard about this vulnerability pattern, JNDI injection.
So you want to make sure that your library is not vulnerable to that. What would that be in your case? It would be that an attacker can use your logging functionality, where they can pass a string, and that this string then triggers a JNDI lookup against a remote server. What does that look like? It means that here you have an attacker who can control this message. In this message, they pass a JNDI lookup, and it ends up in this file, JndiManager.java, in this call to context.lookup, in this argument, name. Okay, so I will try to show you how we find this pattern with CodeQL in Log4j.
I'm in Visual Studio Code, where I installed the CodeQL extension. All of this is free for open source. As I told you, with CodeQL you work in two phases: you generate the database, and then you query this database. So I've used the CodeQL CLI to generate the database for the Log4j library, and I imported it into my workspace. Now I can run queries on it. For example, here I've got a query that looks for empty statement blocks, and I can run it. But that's not what is interesting for me. What I want to find is this message here.
One thing that is super useful is that this extension also has the AST viewer, which gives you the names of the classes that CodeQL maps your code to. For example, here, the class that I need to query is called Method. Cool. This parameter here, well, it's called a Parameter. And this, for example, is an Annotation. Very useful, right? Some of the names are pretty intuitive, but in some cases it's very useful to use the AST viewer to know what you are querying.
Okay, so let's do that. I want to find the first parameter of a method that overrides Logger.info. So, look for... Here, as you can see, my friend Copilot is kicking in and helping me.
I won't lie, I will use it because it's super useful. So I'm looking for a method that overrides Logger.info. Yes, it suggests that because I practiced, of course. Okay, and has a CharSequence parameter. Yes, please. Okay, so from Method m and Parameter p. Let's see. My method is overriding Logger.info. My method has one parameter. This parameter is p, and the type of p is CharSequence. Okay, yes, I think this is what I want. And then select... No, I don't want to select the method. I want to select the parameter. Okay, let's run this query. One result. If I click on it, boom. This is exactly what I was looking for. Cool.
Now, what I want is this one. I want the first argument of a call to a method called lookup, where this method is a member of a class that implements javax.naming.Context. We will ask Copilot to write that for us. So, look for a method call of a method called lookup, where the method is declared in a class that is a subtype of javax.naming.Context. Yes, okay. And select the first argument. Okay, let's see: from MethodAccess and Method, I'm binding the two. The name of the method is lookup, and the declaring type, which is the class where this member lives, implements javax.naming.Context. Well, this is exactly what I want. And I will select the first argument. Here, I need to comment out this bit. Okay, and I will run this query. I've got two results. If I click on the first one, it's in DataSourceConnectionSource.java. This is not what I'm looking for. But this one is exactly the one I'm looking for, in JndiManager.java, and I've got my first argument. Cool.
So now I have found these two places in the code, but what I want to know is: is it possible for this untrusted data, which an attacker can pass to the logging functionality, to flow to the JNDI lookup?
Let's ask CodeQL about that. How do we do that? CodeQL has a default taint tracking library that you can reuse. This library tells you whether there is a potential flow between a source and what we call a sink. The library comes with boilerplate code in which you just have to complete the definition of what you consider an untrusted source, with the predicate isSource, and what you consider a dangerous sink, by implementing the predicate isSink. So let's use that. I will copy this boilerplate code into my file. I will comment out this part, and I will copy the code here.
Okay, now let's implement isSource and isSink. Well, we have already implemented them, in fact. The source is this parameter of Logger.info. So let me copy that here. Boom, I will uncomment. So: there exists a Method m and a Parameter p such that this method overrides Logger.info, this method has one parameter, and this parameter is p and is of type CharSequence. And I have to say that my source is my parameter. Here I've got a problem because source is not of the right type, so I need to cast it as a parameter. Then I will evaluate this predicate to make sure that I get the same result as before, just to be sure. Yes, we have the same result. Okay, cool. Let's go back. We have implemented our source.
Now we have to implement our sink. What do we do? Well, we could copy what is in here into the sink. But no, there is a simpler way. Remember that the queries are open source and that this is an object-oriented language. Because of that, people from the community provide not only queries but also libraries and classes that we can reuse. It happens that a community contributor created a class called JndiInjectionSink. I think I will use this one. So I will say, yeah, my sink is an instance of this class JndiInjectionSink.
Let's quickly evaluate it to make sure, because this class comes from the community and I'm not really sure that it does what I want. Oh, see, it's not doing exactly the same thing: it gives me five results instead of the two that I had. So it is less precise than what I had before. But this one here is exactly the one that I'm looking for. So let's trade this precision for simplicity and use it. Now I have implemented my source and my sink. I will run the query to see what I get. And boom, I've got a result.
Let's look at the result. CodeQL found four different paths between this source and this sink. Let's look at one of them. Indeed, in the first step of this path, I can see my parameter here. If I take the second step, it goes just one line below. And then it will certainly go to logIfEnabled, the definition of logIfEnabled. Yes, et cetera, et cetera. It goes across multiple steps, multiple functions, even multiple files, down to my JNDI lookup call. So you see more than 100 steps. If we look at this other path, same thing, you have 150 steps. This is something that is impossible to find manually. Even if you're Elon Musk, you have code review superpowers, and you can read pages and pages of printed code, you cannot do that. You cannot beat that. So this is how you can use CodeQL to find paths across the whole of your code.
Here, I want to mention one thing. The minimal thing that you have to do for this taint tracking configuration is to define a source and a sink. The library is already pretty good at giving you nice results. But you have total control of your taint tracking. How? Because you can also implement two other predicates. With the first, you can define sanitizers. You can say, hey, when you pass through this call, the data is sanitized. So stop your taint tracking here. Don't go through.
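The shape of such a configuration (sources, sinks, sanitizers, plus the additional taint steps the talk turns to next) can be sketched as a path search over a small data-flow graph. This is a simplified Python model under stated assumptions, not CodeQL's taint tracking library: the function `find_taint_path`, the node names other than `logIfEnabled`, and the edges are hypothetical, and real analyses work on far richer program representations.

```python
from collections import deque


def find_taint_path(edges, sources, sinks, sanitizers=frozenset(), extra_steps=()):
    """Breadth-first search for a flow from any source to any sink.

    Nodes listed in `sanitizers` stop the taint. `extra_steps` are edges
    added by hand for code the analysis cannot see into.
    """
    graph = {}
    for src, dst in [*edges, *extra_steps]:
        graph.setdefault(src, []).append(dst)

    queue = deque([s] for s in sources if s not in sanitizers)
    seen = set(sources)
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in sinks:
            return path  # first (shortest) tainted path found
        for nxt in graph.get(node, []):
            if nxt in seen or nxt in sanitizers:
                continue  # already visited, or the taint stops here
            seen.add(nxt)
            queue.append(path + [nxt])
    return None  # no flow from a source to a sink


# Hypothetical flow graph loosely modeled on the demo.
EDGES = [
    ("Logger.info(message)", "logIfEnabled(message)"),
    ("logIfEnabled(message)", "format(message)"),
    ("lookupName", "Context.lookup(name)"),
]
SOURCES = {"Logger.info(message)"}
SINKS = {"Context.lookup(name)"}
# Hypothetical step through a call that the graph above does not model.
EXTRA = [("format(message)", "lookupName")]

no_model = find_taint_path(EDGES, SOURCES, SINKS)
with_step = find_taint_path(EDGES, SOURCES, SINKS, extra_steps=EXTRA)
sanitized = find_taint_path(
    EDGES, SOURCES, SINKS, sanitizers={"format(message)"}, extra_steps=EXTRA
)

print("without extra step:", no_model)
print("with extra step:", with_step)
print("with sanitizer:", sanitized)
```

In this model, a missing edge hides a real flow (a false negative) until an extra step is added, and a sanitizer node cuts a path that would otherwise be reported.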
On the other side, you can say the opposite. Imagine, for example, that you are calling an external framework that CodeQL doesn't know about. Then you can say, hey, I'm adding a taint step: I want you to continue the taint when I pass through this call, because it taints my data. For example, you can say, I want the taint to propagate from the first argument of this call to its return value, or from the first argument of this call to another argument of this call. That really gives you total control of your taint tracking. With sanitizers, you can reduce false positives. With additional taint steps, you can reduce false negatives. So the minimal configuration, just defining a source and a sink, is already pretty good, but know that you have full control over the taint tracking feature. So that's it. We found Log4Shell with CodeQL. That's pretty cool. Let's go back to our slides.
So yeah, time for the conclusion. With security as code and with CodeQL, you get automated, repeatable, reusable security checks that you can include seamlessly in your software development life cycle. And if you're a developer, when you get that, you know it's code. So I can read it. I can understand it. I can learn from it. There is documentation attached to it, so I can learn from it even more.
And there is also a bonus: because the CodeQL queries are open source, you get community-driven security checks. What I mean by that is that these queries are written by the CodeQL team at GitHub, by my team, the Security Lab, by security teams at our customers, and by dozens of independent security researchers who contribute to these CodeQL queries. So you get knowledge coming from the community that is wider and more diverse than what any in-house team could give you.
So that's the bonus of having these queries be community-driven. If you want to know more, you can browse CodeQL.com. You can go to the website of the Security Lab. On this website, you will find access to our public community Slack, some examples of how we use CodeQL, and some CodeQL CTFs that you can play to learn about CodeQL. And you can also visit GitHub at booth G17. Thank you.
I think we're right on time, but perhaps we have time for one or two questions. Yes?
So how does that play? Yeah, this isn't you. Oh, no, no.
So the question is, how is GPT playing into CodeQL? No, CodeQL is pretty deterministic. It just analyzes your code, puts all of that into a database, and then analyzes your control flow graph, et cetera. I was just using Copilot to write the code of the query. Now, if you go to the CodeQL site, you will see that I think we run a beta that uses machine learning to identify some sources and sinks more automatically. So we have a bit of machine learning that helps with defining sources and sinks automatically, but it's still in beta. And no, we're not using GPT inside the query itself.
That makes sense.
No more questions? Okay, so see you around. Thank you.